
java - A Dataset with a String column whose values LOOK numeric is partitioned and stored. When read back, the data is still "string" but has lost its leading zeros

Reposted. Author: 行者123. Updated: 2023-12-04 07:54:23

With Spark 3.0.2, I am writing a Dataset to a Parquet file. The code that writes it ends like this:

etablissements = etablissements.repartition(col("codeDepartement"));
etablissements = etablissements.sortWithinPartitions(col("siret"));
etablissements = etablissements.persist();

// Write it to a file whose name contains the year of the data, the selections, and the sorting.
// The underlying statement that writes the Parquet file is:
// ds.write().partitionBy(colonnesPartionnement /* = codeDepartement */)
saveToStore(etablissements, new String[] {"codeDepartement"},
"{0}_{1,number,#0}_{2}_{3}", "etablissements", anneeSIRENE, actifsSeulement,
communesValides);
codeDepartement has a StringType, because a French department code is a code of up to three characters.
# schema() :
|-- codeDepartement: string (nullable = true)
It is visible in the last third of this show() output (three columns before the city name in uppercase), and it has the value "01":
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+---------------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+
|siren |nic |siret |statutDiffusionEtablissement|dateCreationEtablissement|trancheEffectifSalarie|anneeEffectifsEtablissement|activiteArtisanRegistreDesMetiers|dateDernierTraitement|etablissementSiege|nombrePeriodesEtablissement|complementAdresse |numeroVoie|indiceRepetition|typeDeVoie|libelleVoie |codePostal|nomCommuneEtrangere|distributionSpeciale|codeCommune|cedex|libelleCedex |codePaysEtranger|nomPaysEtranger|complementAdresseSecondaire|numeroVoieSecondaire|indiceRepetitionSecondaire|typeDeVoieSecondaire|libelleVoieSecondaire|codePostalSecondaire|nomCommuneSecondaire|nomCommuneEtrangereSecondaire|distributionSpecialeSecondaire|codeCommuneSecondaire|cedexSecondaire|libelleCedexSecondaire|codePaysEtrangerSecondaire|nomPaysEtrangerSecondaire|dateDebutHistorisation|etatAdministratifEtablissement|enseigne1 |enseigne2|enseigne3|denominationEtablissement|activitePrincipale|nomenclatureActivitePrincipale|caractereEmployeurEtablissement|active|anneeValiditeEffectifSalarie|caractereEmployeur|siege|nombrePeriodes|typeCommune|codeRegion|codeDepartement|arrondissement|typeNomEtCharniere|nomMajuscules |nomCommune |libelle |codeCanton|codeCommuneParente|strateCommune|sirenCommune|populationTotale|populationMunicipale|populationCompteApart|sirenCommuneMembre|codeEPCI |nomEPCI |libelleNAF |
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+---------------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+
|015850944|00024|01585094400024|O |2007-04-01 |11 |2017 |null |2019-11-14T14:00:12 |false |2 |ZONE INDUSTRIELLE |null |null |CHE |DE THIL |01700 |null |null |01376 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |25.73B |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |01 |012 |0 |SAINT MAURICE DE BEYNOST|Saint-Maurice-de-Beynost|Saint-Maurice-de-Beynost|0113 |null |5 |210103768 |4006 |3967 |39 |210103768 |240100800|CC de Miribel et du Plateau |Fabrication d'autres outillages |
|015851793|00479|01585179300479|O |2005-01-01 |11 |2017 |null |2019-06-24T13:04:28 |false |2 |null |null |null |null |ZONE INDUST LA FONTAINE |01290 |null |null |01134 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |46.73A |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |01 |012 |0 |CROTTET |Crottet |Crottet |0123 |null |3 |210101341 |1777 |1734 |43 |210101341 |200070555|CC de la Veyle |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |
|015851793|00743|01585179300743|O |2012-09-01 |02 |2017 |null |2019-06-24T13:04:28 |false |1 |ZA ACTIPARC |null |null |null |PRE LION |01190 |null |null |01057 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2012-09-01 |A |null |null |null |DORAS |46.73A |NAFRev2 |O |true |2017 |true |false|1 |COM |84 |01 |012 |0 |BOZ |Boz |Boz |0117 |null |3 |210100574 |519 |512 |7 |210100574 |200071371|CC Bresse et Saône |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |
|015851793|00917|01585179300917|O |2020-01-01 |null |null |null |2020-01-31T16:13:25 |false |1 |null |28 |null |AV |DE MARBOZ |01000 |null |null |01053 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2020-01-01 |A |CLEAU |null |null |null |46.73A |NAFRev2 |O |true |null |true |false|1 |COM |84 |01 |012 |0 |BOURG EN BRESSE |Bourg-en-Bresse |Bourg-en-Bresse |0199 |null |8 |210100533 |43306 |41527 |1779 |210100533 |200071751|CA du Bassin de Bourg-en-Bresse |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |

I can see that the folders under my Parquet file look fine:
codeDepartement=01
codeDepartement=2A
codeDepartement=75
codeDepartement=971
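Note: because of values such as 2A (for Corse), department codes can never all be converted to numeric values. The snappy.parquet chunks are stored one per folder, under /data/tmp/etablissements_2020_true_true/codeDepartement=01 and so on: that is fine.
When reading, I try to read the content of that store. I search for cities whose city code (which in France begins with the department code) starts with "01". The expected Parquet file and chunk are read: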
2021-03-24 07:14:33.825  INFO 13860 --- [er for task 106] o.a.s.s.e.datasources.FileScanRDD        : Reading File path: file:/data/tmp/etablissements_2020_true_true/codeDepartement=01/part-00024-f7d33eea-6d79-4f1a-bf35-0666dcc5e0f5.c000.snappy.parquet, range: 0-5246504, partition values: [1]
But when the department is displayed (it is now the last column of the dataset's show() output), it has the value "1" instead of "01":
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+---------------+
|siren |nic |siret |statutDiffusionEtablissement|dateCreationEtablissement|trancheEffectifSalarie|anneeEffectifsEtablissement|activiteArtisanRegistreDesMetiers|dateDernierTraitement|etablissementSiege|nombrePeriodesEtablissement|complementAdresse |numeroVoie|indiceRepetition|typeDeVoie|libelleVoie |codePostal|nomCommuneEtrangere|distributionSpeciale|codeCommune|cedex|libelleCedex |codePaysEtranger|nomPaysEtranger|complementAdresseSecondaire|numeroVoieSecondaire|indiceRepetitionSecondaire|typeDeVoieSecondaire|libelleVoieSecondaire|codePostalSecondaire|nomCommuneSecondaire|nomCommuneEtrangereSecondaire|distributionSpecialeSecondaire|codeCommuneSecondaire|cedexSecondaire|libelleCedexSecondaire|codePaysEtrangerSecondaire|nomPaysEtrangerSecondaire|dateDebutHistorisation|etatAdministratifEtablissement|enseigne1 |enseigne2|enseigne3|denominationEtablissement|activitePrincipale|nomenclatureActivitePrincipale|caractereEmployeurEtablissement|active|anneeValiditeEffectifSalarie|caractereEmployeur|siege|nombrePeriodes|typeCommune|codeRegion|arrondissement|typeNomEtCharniere|nomMajuscules |nomCommune |libelle |codeCanton|codeCommuneParente|strateCommune|sirenCommune|populationTotale|populationMunicipale|populationCompteApart|sirenCommuneMembre|codeEPCI |nomEPCI |libelleNAF |codeDepartement|
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+---------------+
|015850944|00024|01585094400024|O |2007-04-01 |11 |2017 |null |2019-11-14T14:00:12 |false |2 |ZONE INDUSTRIELLE |null |null |CHE |DE THIL |01700 |null |null |01376 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |25.73B |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |012 |0 |SAINT MAURICE DE BEYNOST|Saint-Maurice-de-Beynost|Saint-Maurice-de-Beynost|0113 |null |5 |210103768 |4006 |3967 |39 |210103768 |240100800|CC de Miribel et du Plateau |Fabrication d'autres outillages |1 |
|015851793|00479|01585179300479|O |2005-01-01 |11 |2017 |null |2019-06-24T13:04:28 |false |2 |null |null |null |null |ZONE INDUST LA FONTAINE |01290 |null |null |01134 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |46.73A |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |012 |0 |CROTTET |Crottet |Crottet |0123 |null |3 |210101341 |1777 |1734 |43 |210101341 |200070555|CC de la Veyle |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |1 |
|015851793|00743|01585179300743|O |2012-09-01 |02 |2017 |null |2019-06-24T13:04:28 |false |1 |ZA ACTIPARC |null |null |null |PRE LION |01190 |null |null |01057 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2012-09-01 |A |null |null |null |DORAS |46.73A |NAFRev2 |O |true |2017 |true |false|1 |COM |84 |012 |0 |BOZ |Boz |Boz |0117 |null |3 |210100574 |519 |512 |7 |210100574 |200071371|CC Bresse et Saône |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |1 |
|015851793|00917|01585179300917|O |2020-01-01 |null |null |null |2020-01-31T16:13:25 |false |1 |null |28 |null |AV |DE MARBOZ |01000 |null |null |01053 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2020-01-01 |A |CLEAU |null |null |null |46.73A |NAFRev2 |O |true |null |true |false|1 |COM |84 |012 |0 |BOURG EN BRESSE |Bourg-en-Bresse |Bourg-en-Bresse |0199 |null |8 |210100533 |43306 |41527 |1779 |210100533 |200071751|CA du Bassin de Bourg-en-Bresse |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |1 |
This happens even though the column is still declared as StringType when the Parquet file is read:
|-- codeDepartement: string (nullable = true)
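What happened?
I am tempted to blame the repartition() statement for this trouble, but I don't see how it would cause it. If that command is misbehaving and cannot partition by string values, how would a program partition data by color names such as red, blue, and yellow?
I don't understand the overall behavior (problem?) I am facing.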

Best answer

I was able to reproduce the problem.

spark.sql("select '01' key, 123 val union all select 'ab', 456").show()
+---+---+
|key|val|
+---+---+
| 01|123|
| ab|456|
+---+---+

spark.sql("select '01' key, 123 val union all select 'ab', 456").write().partitionBy("key").parquet("test")

spark.read().parquet("test").show()
+---+---+
|val|key|
+---+---+
|456| ab|
|123| 1|
+---+---+
To work around this, you can supply the schema explicitly when reading:
spark.read().schema(spark.read().parquet("test").schema()).parquet("test").show()
+---+---+
|val|key|
+---+---+
|456| ab|
|123| 01|
+---+---+
(Tested in PySpark; hopefully it works in Java as well.)
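Another option, not part of the original answer, is the documented Spark SQL setting spark.sql.sources.partitionColumnTypeInference.enabled. According to the Spark SQL guide, partition columns are read as strings when it is set to false. The Java sketch below assumes Spark is on the classpath, a local master, and the "test" directory written by the reproduction above. The class name and application name are hypothetical, and it has not been run against Spark 3.0.2.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadPartitionsAsStrings {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("read-partitions-as-strings") // hypothetical application name
            .master("local[*]")                    // assumption: local run for the demo
            // Documented option: when false, partition columns are read as strings.
            .config("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
            .getOrCreate();

        // "test" is the directory written by the reproduction above.
        Dataset<Row> df = spark.read().parquet("test");
        df.printSchema();
        df.show();

        spark.stop();
    }
}
```

With inference disabled, the key column should come back from the directory names as raw text, so "01" is expected to keep its leading zero. The trade-off is that every partition column is then read as a string, including ones that really are numeric, so those would need an explicit cast.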

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/66776106/
