
apache-spark - Casting a string to int yields null

Reposted. Author: 行者123. Updated: 2023-12-02 08:15:10

I have a Spark DataFrame, results, with two string columns that I want to cast to numeric:

>>> results.show()
+--------------------+-----------------+------------------------+
| Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+--------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC...| "43"| "20"|
|"BAYLOR MEDICAL C...| "32"| "20"|
|"GOOD SHEPHERD ME...| "25"| "20"|
|"GOOD SHEPHERD ME...| "25"| "20"|
|"MASONIC HOME AND...| "Not Available"| "Not Available"|
|"ST HELENA HOSPITAL"| "41"| "20"|
| "TOURO INFIRMARY"| "15"| "18"|
|"WAHIAWA GENERAL ...| "17"| "10"|
|"ANNA JAQUES HOSP...| "27"| "18"|
| "CMC-BLUE RIDGE"| "31"| "18"|
|"EVANSTON REGIONA...| "15"| "15"|
|"OKLAHOMA SPINE H...| "79"| "20"|
|"PICKENS COUNTY M...| "Not Available"| "Not Available"|
|"PORTNEUF MEDICAL...| "11"| "17"|
|"PRESENCE SAINT J...| "20"| "17"|
|"RIVERSIDE MEDICA...| "39"| "20"|
|"RIVERSIDE MEDICA...| "39"| "20"|
|"RIVERSIDE MEDICA...| "39"| "20"|
|"SOUTH GEORGIA ME...| "3 out of 10"| "24"|
|"TAMPA GENERAL HO...| "23"| "16"|
+--------------------+-----------------+------------------------+

Attempting the cast gives me a table full of nulls:

>>> results2 = results.select( results["Hospital Name"], results["HCAHPS Base Score"].cast(IntegerType()).alias("HCAHPS Base Score"), results["HCAHPS Consistency Score"].cast(IntegerType()).alias("HCAHPS Consistency Score") )
>>> results2.show()
+--------------------+-----------------+------------------------+
| Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+--------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC...| null| null|
|"BAYLOR MEDICAL C...| null| null|
|"GOOD SHEPHERD ME...| null| null|
|"GOOD SHEPHERD ME...| null| null|
|"MASONIC HOME AND...| null| null|
|"ST HELENA HOSPITAL"| null| null|
| "TOURO INFIRMARY"| null| null|
|"WAHIAWA GENERAL ...| null| null|
|"ANNA JAQUES HOSP...| null| null|
| "CMC-BLUE RIDGE"| null| null|
|"EVANSTON REGIONA...| null| null|
|"OKLAHOMA SPINE H...| null| null|
|"PICKENS COUNTY M...| null| null|
|"PORTNEUF MEDICAL...| null| null|
|"PRESENCE SAINT J...| null| null|
|"RIVERSIDE MEDICA...| null| null|
|"RIVERSIDE MEDICA...| null| null|
|"RIVERSIDE MEDICA...| null| null|
|"SOUTH GEORGIA ME...| null| null|
|"TAMPA GENERAL HO...| null| null|
+--------------------+-----------------+------------------------+

only showing top 20 rows

Is it not possible to cast a string column to an integer in PySpark?

Best answer

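The values contain literal double-quote characters, so strip those first; after that you should be able to cast to IntegerType. You can do it with the udf below.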

>>> def stripDQ(string):
...     return string.replace('"', "")
...
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType, IntegerType
>>> udf_stripDQ = udf(stripDQ, StringType())
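
As a quick check outside Spark, the plain-Python sketch below illustrates why the cast fails while the quotes are still present: each value contains literal `"` characters, so it is not parseable as an integer. The helper `to_int_or_none` is hypothetical and only mimics a cast that yields null on failure; it is not Spark's implementation.

```python
def stripDQ(string):
    # Same helper as above: drop every double-quote character.
    return string.replace('"', "")


def to_int_or_none(value):
    # Hypothetical stand-in for a cast that yields null (None) on failure.
    try:
        return int(value)
    except ValueError:
        return None


# Sample values copied from the question's table, quotes included.
raw_values = ['"43"', '"Not Available"', '"3 out of 10"']

# With the quotes still present, nothing parses.
before = [to_int_or_none(v) for v in raw_values]

# After stripping, only the purely numeric value parses.
after = [to_int_or_none(stripDQ(v)) for v in raw_values]

print(before)  # [None, None, None]
print(after)   # [43, None, None]
```

Values such as Not Available remain unparseable even after stripping, so they would still be expected to come out as null once cast.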

Let's put it to use.

Your actual DataFrame:

>>> results.show()
+------------------+-----------------+------------------------+
| Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"| "43"| "20"|
|"BAYLOR MEDICAL C"| "32"| "20"|
|"GOOD SHEPHERD ME"| "25"| "20"|
|"GOOD SHEPHERD ME"| "25"| "20"|
|"MASONIC HOME AND"| "Not Available"| "Not Available"|
+------------------+-----------------+------------------------+

Now we use our udf to strip the double quotes from both columns.

>>> results1 = results.withColumn("HCAHPS Base Score", udf_stripDQ(results["HCAHPS Base Score"]) ).withColumn("HCAHPS Consistency Score", udf_stripDQ(results["HCAHPS Consistency Score"]) )
>>> results1.show()
+------------------+-----------------+------------------------+
| Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"| 43| 20|
|"BAYLOR MEDICAL C"| 32| 20|
|"GOOD SHEPHERD ME"| 25| 20|
|"GOOD SHEPHERD ME"| 25| 20|
|"MASONIC HOME AND"| Not Available| Not Available|
+------------------+-----------------+------------------------+

Now cast to integer:

>>> results2 = results1.select( results1["Hospital Name"], results1["HCAHPS Base Score"].cast(IntegerType()).alias("HCAHPS Base Score"), results1["HCAHPS Consistency Score"].cast(IntegerType()).alias("HPS Consistency Score") )
>>> results2.show()
+------------------+-----------------+---------------------+
| Hospital Name|HCAHPS Base Score|HPS Consistency Score|
+------------------+-----------------+---------------------+
|"ADIRONDACK MEDIC"| 43| 20|
|"BAYLOR MEDICAL C"| 32| 20|
|"GOOD SHEPHERD ME"| 25| 20|
|"GOOD SHEPHERD ME"| 25| 20|
|"MASONIC HOME AND"| null| null|
+------------------+-----------------+---------------------+

Regarding "apache-spark - Casting a string to int yields null", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/42709279/
