v-6ren">
gpt4 book ai didi

scala - Spark - Load a CSV file as a DataFrame?


I want to read a CSV in Spark, convert it to a DataFrame, and store it in HDFS with df.registerTempTable("table_name")

I have tried:

scala> val df = sqlContext.load("hdfs:///csv/file/dir/file.csv")

The error I got:

java.lang.RuntimeException: hdfs:///csv/file/dir/file.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [49, 59, 54, 10]
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:418)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:277)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:276)
at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
at scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:165)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

What is the right command to load a CSV file as a DataFrame in Apache Spark?

Best Answer

The error above occurs because sqlContext.load defaults to the Parquet data source when no format is specified, hence the "expected magic number" complaint. As of Spark 2.x, spark-csv is part of core Spark functionality and doesn't require a separate library, so you can simply do, for example:

df = spark.read.format("csv").option("header", "true").load("csvfile.csv")
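
The question also asks about registering a temp table and writing back to HDFS. Here is a minimal Scala sketch of the whole flow, assuming Spark 2.x where spark-shell provides a SparkSession named spark; the HDFS paths and table name are placeholders:

// read the CSV, treating the first line as a header and sampling rows to infer column types
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///csv/file/dir/file.csv")

// registerTempTable was deprecated in Spark 2.0 in favour of createOrReplaceTempView
df.createOrReplaceTempView("table_name")

// persist back to HDFS, e.g. as Parquet
df.write.parquet("hdfs:///csv/file/dir/output.parquet")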

In Scala (this works for any delimiter: use "," for csv, "\t" for tsv, etc.):

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .load("csvfile.csv")

Regarding "scala - Spark - Load a CSV file as a DataFrame?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29704333/
