
scala - How to read the output of the show operator back into a Dataset?

Reposted. Author: 行者123. Last updated: 2023-12-04 19:29:07

Suppose we have the following text file (the output of a df.show() command):

+----+---------+--------+
|col1|     col2|    col3|
+----+---------+--------+
|   1|pi number|3.141592|
|   2| e number| 2.71828|
+----+---------+--------+

Now I want to read/parse it back into a DataFrame/Dataset. What is the most idiomatic Spark way to do this?

P.S. I am interested in solutions for both scala and pyspark, which is why both tags are used.

Best answer

Update: using the "UNIVOCITY" parser library, I can drop the line in which I stripped the spaces from the column names:

Scala:

// read Spark Output Fixed width table:
def readSparkOutput(filePath: String): org.apache.spark.sql.DataFrame = {
  val t = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("delimiter", "|")
    .option("parserLib", "UNIVOCITY")
    .option("ignoreLeadingWhiteSpace", "true")
    .option("ignoreTrailingWhiteSpace", "true")
    .option("comment", "+")
    .csv(filePath)
  // drop the empty _cN columns produced by the leading and trailing '|'
  t.select(t.columns.filterNot(_.startsWith("_c")).map(t(_)): _*)
}

PySpark:
def read_spark_output(file_path):
    t = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .option("delimiter", "|") \
        .option("parserLib", "UNIVOCITY") \
        .option("ignoreLeadingWhiteSpace", "true") \
        .option("ignoreTrailingWhiteSpace", "true") \
        .option("comment", "+") \
        .csv(file_path)
    # drop the empty _cN columns produced by the leading and trailing '|'
    return t.select([c for c in t.columns if not c.startswith("_c")])

Usage example:
scala> val df = readSparkOutput("file:///tmp/spark.out")
df: org.apache.spark.sql.DataFrame = [col1: int, col2: string ... 1 more field]

scala> df.show
+----+---------+--------+
|col1|     col2|    col3|
+----+---------+--------+
|   1|pi number|3.141592|
|   2| e number| 2.71828|
+----+---------+--------+


scala> df.printSchema
root
 |-- col1: integer (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: double (nullable = true)
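
To make explicit what the reader options in the answer are doing, here is a minimal plain-Python sketch of the same parsing steps, without Spark. The function name `parse_show_output` is hypothetical and is not part of any Spark API. It assumes that no cell value contains a `|` and that the text has at least a header row, and it does no type inference, so all values stay strings.

```python
def parse_show_output(text):
    """Parse the text printed by DataFrame.show() into (header, rows)."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        # Like option("comment", "+"): skip the +----+ border lines (and blanks).
        if not line or line.startswith("+"):
            continue
        # Like option("delimiter", "|"): the leading and trailing "|" yield
        # empty first and last fields, which are dropped here.
        fields = line.split("|")[1:-1]
        # Like ignoreLeadingWhiteSpace / ignoreTrailingWhiteSpace: trim padding.
        records.append([f.strip() for f in fields])
    header, rows = records[0], records[1:]
    return header, rows


sample = """\
+----+---------+--------+
|col1|     col2|    col3|
+----+---------+--------+
|   1|pi number|3.141592|
|   2| e number| 2.71828|
+----+---------+--------+
"""

header, rows = parse_show_output(sample)
print(header)  # ['col1', 'col2', 'col3']
print(rows)    # [['1', 'pi number', '3.141592'], ['2', 'e number', '2.71828']]
```

Dropping the first and last split fields plays the same role as filtering out the `_c`-prefixed columns in the Spark versions.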

Old answer:

Here is my attempt in Scala (Spark 2.2):
// read Spark Output Fixed width table:
val t = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "|")
  .option("comment", "+")
  .csv("file:///temp/spark.out")
// drop the empty _cN columns produced by the leading and trailing '|'
val cols = t.columns.filterNot(c => c.startsWith("_c")).map(a => t(a))
// trim spaces from column names
val colsTrimmed = t.columns.filterNot(c => c.startsWith("_c")).map(c => c.replaceAll("\\s+", ""))
// rename columns using 'colsTrimmed'
val df = t.select(cols: _*).toDF(colsTrimmed: _*)

It works, but I have a feeling there must be a more elegant way to do it.
scala> df.show
+----+---------+--------+
|col1|     col2|    col3|
+----+---------+--------+
| 1.0|pi number|3.141592|
| 2.0| e number| 2.71828|
+----+---------+--------+

scala> df.printSchema
root
 |-- col1: double (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: double (nullable = true)

Regarding "scala - How to read the output of the show operator back into a Dataset?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46868820/
