
apache-spark - Scala Spark: Multiple sources found for json

Reposted · Author: 行者123 · Updated: 2023-12-02 19:53:22

I get an exception when running spark2-submit on my Hadoop cluster while reading a directory of .json files from HDFS, and I don't know how to resolve it.
I have found questions about this on several boards, but none of them got much attention or an answer.
I tried explicitly importing org.apache.spark.sql.execution.datasources.json.JsonFileFormat, but that seems redundant next to importing SparkSession, so the import is not picked up.
I can, however, confirm that both of these classes are available:

val json: org.apache.spark.sql.execution.datasources.json.JsonDataSource
val json: org.apache.spark.sql.execution.datasources.json.JsonFileFormat
Stack trace:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Multiple sources found for json (org.apache.spark.sql.execution.datasources.json.JsonFileFormat, org.apache.spark.sql.execution.datasources.json.DefaultSource), please specify the fully qualified class name.;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:670)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:397)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:340)
at jsonData.HdfsReader$.readJsonToDataFrame(HdfsReader.scala:45)
at jsonData.HdfsReader$.process(HdfsReader.scala:52)
at exp03HDFS.StartExperiment03$.main(StartExperiment03.scala:41)
at exp03HDFS.StartExperiment03.main(StartExperiment03.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

HdfsReader:
import java.net.URI
import org.apache.hadoop.fs.{LocatedFileStatus, RemoteIterator}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SparkSession
import pipelines.ContentPipeline

object HdfsReader {

...

def readJsonToDataFrame(inputDir: String, multiline: Boolean = true, verbose: Boolean = false)
: DataFrame = {
  // Honor the multiline parameter instead of hard-coding true,
  // and only show the DataFrame when verbose output was requested.
  val df = spark.read.option("multiline", multiline).json(inputDir)
  if (verbose) df.show(truncate = false)
  df
}

def process(path: URI) = {
val dataFrame = readJsonToDataFrame(path.toString, verbose = true)
val contentDataFrame = ContentPipeline.getContentOfText(dataFrame)
val newDataFrame = dataFrame.join(contentDataFrame, "text").distinct()
JsonFileUtils.saveAsJson(newDataFrame, outputFolder)
}

}
build.sbt
version := "0.1"
scalaVersion := "2.11.8" //same version hadoop uses


libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.0", // same version hadoop uses
  "com.johnsnowlabs.nlp" %% "spark-nlp" % "2.3.0",
  "org.apache.spark" %% "spark-sql" % "2.3.0",
  "org.apache.spark" %% "spark-mllib" % "2.3.0",
  "org.scalactic" %% "scalactic" % "3.2.0",
  "org.scalatest" %% "scalatest" % "3.2.0" % "test",
  "com.lihaoyi" %% "upickle" % "0.7.1")
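If the conflict comes from a transitive dependency (for example, one of the libraries above pulling in a newer spark-sql), pinning the Spark artifacts in build.sbt can keep a single version on the classpath. A sketch, assuming you want to stay on the Spark 2.3.0 line declared above:

```scala
// build.sbt additions (sketch): force every transitive Spark artifact to 2.3.0
// so no dependency can pull a newer spark-sql onto the classpath.
dependencyOverrides ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.3.0",
  "org.apache.spark" %% "spark-sql"   % "2.3.0",
  "org.apache.spark" %% "spark-mllib" % "2.3.0"
)
```

Running `sbt dependencyTree` afterwards shows where any remaining 3.x artifact is coming from.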

Best answer

It looks like you have both Spark 2.x and 3.x jars on the classpath. According to the sbt file, Spark 2.x should be used; however, JsonFileFormat was added in Spark 3.x by this issue.
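As the exception message itself suggests, a temporary workaround is to pass the fully qualified data source class name to `format(...)`, so Spark does not have to disambiguate the short name "json". A minimal sketch (the path and SparkSession setup are illustrative, not from the question; the real fix remains removing the conflicting jars):

```scala
import org.apache.spark.sql.SparkSession

object JsonWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("json-workaround").getOrCreate()

    // Name the data source class explicitly instead of the ambiguous "json".
    val df = spark.read
      .format("org.apache.spark.sql.execution.datasources.json.JsonFileFormat")
      .option("multiline", value = true)
      .load("/path/to/input") // hypothetical HDFS path

    df.show(truncate = false)
  }
}
```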

Regarding apache-spark - Scala Spark: Multiple sources found for json, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62743053/
