
xml - How to add header information to each row when parsing XML with Spark

Reposted. Author: 可可西里. Updated: 2023-11-01 16:28:51

I have an XML structure like this:

<root>
  <bookinfo>
    <time>1232314973</time>
    <requestID>233</requestID>
    <supplier>asd123</supplier>
  </bookinfo>

  <books>
    <book>
      <name>book1</name>
      <pages>124</pages>
    </book>
    <book>
      <name>book2</name>
      <pages>456</pages>
    </book>
    <book>
      <name>book4</name>
      <pages>789</pages>
    </book>
  </books>
</root>

I know I can parse the books like this:

val xml = sqlContext.read.format("com.databricks.spark.xml")
  .option("rowTag", "book").load("FILENAME")

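But I want to add header information, such as supplier, to every row.

Is there a way to add this "header info" to all rows with Spark, without loading the file twice and without storing the information in a global variable/value?

Thanks in advance!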

Best answer

You can read the whole XML starting from the "root" tag, and then explode the tags you need:

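import org.apache.spark.sql.functions.{col, explode}

// Read the whole document as a single row, starting at the "root" tag.
val df = hiveContext.read.format("xml").option("rowTag", "root").load("books.xml")
df.printSchema()
df.show(false)

// Header field: reach into the bookinfo struct.
println("-- supplier --")
val supplierDF = df.select(col("bookinfo.supplier"))
supplierDF.printSchema()
supplierDF.show(false)

// One row per <book>: explode the books.book array.
println("-- books --")
val booksDF = df.select(explode(col("books.book")).alias("bookDetails"))
booksDF.printSchema()
booksDF.show(false)

// Flatten the bookDetails struct into plain columns.
println("-- bookDetails --")
val booksDetailsDF = booksDF.select(col("bookDetails.name"), col("bookDetails.pages"))
booksDetailsDF.printSchema()
booksDetailsDF.show(false)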

Output:

root
 |-- bookinfo: struct (nullable = true)
 |    |-- requestID: long (nullable = true)
 |    |-- supplier: string (nullable = true)
 |    |-- time: long (nullable = true)
 |-- books: struct (nullable = true)
 |    |-- book: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- pages: long (nullable = true)

+-----------------------+-----------------------------------------------------+
|bookinfo               |books                                                |
+-----------------------+-----------------------------------------------------+
|[233,asd123,1232314973]|[WrappedArray([book1,124], [book2,456], [book4,789])]|
+-----------------------+-----------------------------------------------------+

-- supplier --
root
 |-- supplier: string (nullable = true)

+--------+
|supplier|
+--------+
|asd123  |
+--------+

-- books --
root
 |-- bookDetails: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- pages: long (nullable = true)

+-----------+
|bookDetails|
+-----------+
|[book1,124]|
|[book2,456]|
|[book4,789]|
+-----------+

-- bookDetails --
root
 |-- name: string (nullable = true)
 |-- pages: long (nullable = true)

+-----+-----+
|name |pages|
+-----+-----+
|book1|124  |
|book2|456  |
|book4|789  |
+-----+-----+
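
The steps above print the supplier and the books as separate DataFrames. To get what the question asks for, the header fields on every book row from a single read, the header columns can be selected in the same select as the explode. The following is only a minimal sketch and not part of the original answer: the object name BooksWithHeader, the helper addHeader and the extra requestID column are hypothetical, and it assumes Spark 2.x or later (SparkSession instead of hiveContext) with the spark-xml package on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical names for illustration only (not from the original answer).
object BooksWithHeader {

  // Expects one row per <root> with the schema printed above:
  // bookinfo: struct<requestID, supplier, time>, books: struct<book: array<struct<name, pages>>>
  def addHeader(rootDF: DataFrame): DataFrame =
    rootDF
      // Header fields are repeated for every element produced by explode.
      .select(
        col("bookinfo.supplier").alias("supplier"),
        col("bookinfo.requestID").alias("requestID"),
        explode(col("books.book")).alias("book"))
      // Flatten the exploded struct into plain columns.
      .select(col("supplier"), col("requestID"), col("book.name"), col("book.pages"))

  def main(args: Array[String]): Unit = {
    // Assumption: local session for trying this out.
    val spark = SparkSession.builder()
      .appName("books-with-header")
      .master("local[*]")
      .getOrCreate()

    // Single read of the file, starting at the "root" tag as in the answer.
    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "root")
      .load("books.xml")

    addHeader(df).show(false)
    spark.stop()
  }
}
```

For the sample file this should give one row per book, each carrying the supplier and requestID from bookinfo. The file is loaded once and nothing is kept in a global value, because the header and the books come from the same row of the root-level DataFrame.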

Regarding "xml - How to add header information to each row when parsing XML with Spark", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46057776/
