
python - PySpark - converting a JSON string to a DataFrame

Reposted · Author: 太空狗 · Updated: 2023-10-30 01:52:51

I have a test2.json file containing a simple JSON document:

{  "Name": "something",  "Url": "https://stackoverflow.com",  "Author": "jangcy",  "BlogEntries": 100,  "Caller": "jangcy"}

I uploaded the file to blob storage and created a DataFrame from it:

df = spark.read.json("/example/data/test2.json")

I can then view it without any problems:

df.show()
+------+-----------+------+---------+--------------------+
|Author|BlogEntries|Caller|     Name|                 Url|
+------+-----------+------+---------+--------------------+
|jangcy|        100|jangcy|something|https://stackover...|
+------+-----------+------+---------+--------------------+

Second scenario: I have the same JSON declared as a string in my notebook:

newJson = '{  "Name": "something",  "Url": "https://stackoverflow.com",  "Author": "jangcy",  "BlogEntries": 100,  "Caller": "jangcy"}'

I can print it and so on. But if I now try to create a DataFrame from it:

df = spark.read.json(newJson)

I get a "Relative path in absolute URI" error:

'java.net.URISyntaxException: Relative path in absolute URI: {  "Name":%20%22something%22,%20%20%22Url%22:%20%22https:/stackoverflow.com%22,%20%20%22Author%22:%20%22jangcy%22,%20%20%22BlogEntries%22:%20100,%20%20%22Caller%22:%20%22jangcy%22%7D'
Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 249, in json
    return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'java.net.URISyntaxException: Relative path in absolute URI: { "Name":%20%22something%22,%20%20%22Url%22:%20%22https:/stackoverflow.com%22,%20%20%22Author%22:%20%22jangcy%22,%20%20%22BlogEntries%22:%20100,%20%20%22Caller%22:%20%22jangcy%22%7D'

Should I apply additional transformations to the newJson string? If so, what should they be? Please forgive me if this is too trivial, as I'm new to Python and Spark.

I'm using a Jupyter notebook with the PySpark3 kernel.

Thanks in advance.

Best Answer

You can do the following:

newJson = '{"Name":"something","Url":"https://stackoverflow.com","Author":"jangcy","BlogEntries":100,"Caller":"jangcy"}'
df = spark.read.json(sc.parallelize([newJson]))
df.show(truncate=False)

which should give:

+------+-----------+------+---------+-------------------------+
|Author|BlogEntries|Caller|Name     |Url                      |
+------+-----------+------+---------+-------------------------+
|jangcy|100        |jangcy|something|https://stackoverflow.com|
+------+-----------+------+---------+-------------------------+
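The wrapping in sc.parallelize works because spark.read.json accepts an RDD of JSON strings, whereas a bare string argument is interpreted as a file path (hence the URISyntaxException). As a hedged alternative sketch: the standard-library json module can parse the string into a plain dict first, and spark.createDataFrame accepts a list of such dicts. The Spark calls are commented out here because they need a live session; the spark name is assumed from the question.

```python
import json

# Same JSON string as in the question
newJson = '{"Name":"something","Url":"https://stackoverflow.com","Author":"jangcy","BlogEntries":100,"Caller":"jangcy"}'

# Parse it into a plain Python dict -- no Spark involved yet
record = json.loads(newJson)
print(record["Name"])         # -> something
print(record["BlogEntries"])  # -> 100

# With a live session, this should yield the same single-row DataFrame
# as the sc.parallelize approach above:
# df = spark.createDataFrame([record])
# df.show(truncate=False)
```

Parsing with json.loads first also fails fast with a clear ValueError if the string is not valid JSON, instead of silently producing a _corrupt_record column.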

Regarding python - PySpark - converting a JSON string to a DataFrame, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49675860/
