gpt4 book ai didi

apache-spark - 与 aws-java-sdk 链接时读取 json 文件时 Spark 崩溃

转载 作者:行者123 更新时间:2023-12-04 06:34:43 24 4
gpt4 key购买 nike

config.json是一个小的json文件:

{
"toto": 1
}

我做了一个简单的代码,用 sc.textFile 读取 json 文件(因为文件可以在S3、本地或HDFS上,所以textFile很方便)
import org.apache.spark.{SparkContext, SparkConf}

object testAwsSdk {
def main( args:Array[String] ):Unit = {
val sparkConf = new SparkConf().setAppName("test-aws-sdk").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val json = sc.textFile("config.json")
println(json.collect().mkString("\n"))
}
}

SBT 文件仅拉取 spark-core图书馆
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.5.1" % "compile"
)

该程序按预期工作,将 config.json 的内容写入标准输出。

现在我还想与亚马逊的 sdk aws-java-sdk 链接以访问 S3。
libraryDependencies ++= Seq(
"com.amazonaws" % "aws-java-sdk" % "1.10.30" % "compile",
"org.apache.spark" %% "spark-core" % "1.5.1" % "compile"
)

执行相同的代码,spark 抛出以下异常。
Exception in thread "main" com.fasterxml.jackson.databind.JsonMappingException: Could not find creator property with name 'id' (in class org.apache.spark.rdd.RDDOperationScope)
at [Source: {"id":"0","name":"textFile"}; line: 1, column: 1]
at com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:148)
at com.fasterxml.jackson.databind.DeserializationContext.mappingException(DeserializationContext.java:843)
at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.addBeanProps(BeanDeserializerFactory.java:533)
at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.buildBeanDeserializer(BeanDeserializerFactory.java:220)
at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.createBeanDeserializer(BeanDeserializerFactory.java:143)
at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer2(DeserializerCache.java:409)
at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer(DeserializerCache.java:358)
at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCache2(DeserializerCache.java:265)
at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCacheValueDeserializer(DeserializerCache.java:245)
at com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:143)
at com.fasterxml.jackson.databind.DeserializationContext.findRootValueDeserializer(DeserializationContext.java:439)
at com.fasterxml.jackson.databind.ObjectMapper._findRootDeserializer(ObjectMapper.java:3666)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3558)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2578)
at org.apache.spark.rdd.RDDOperationScope$.fromJson(RDDOperationScope.scala:82)
at org.apache.spark.rdd.RDDOperationScope$$anonfun$5.apply(RDDOperationScope.scala:133)
at org.apache.spark.rdd.RDDOperationScope$$anonfun$5.apply(RDDOperationScope.scala:133)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:133)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1012)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:827)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:825)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
at org.apache.spark.SparkContext.textFile(SparkContext.scala:825)
at testAwsSdk$.main(testAwsSdk.scala:11)
at testAwsSdk.main(testAwsSdk.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)

读栈,好像是aws-java-sdk链接时, sc.textFile检测到该文件是一个 json 文件,并尝试使用 jackson 假设某种格式来解析它,当然它找不到。我需要与 aws-java-sdk 链接,所以我的问题是:

1- 为什么添加 aws-java-sdk修改 spark-core 的行为?

2- 是否有解决方法(文件可以在 HDFS、S3 或本地)?

最佳答案

与亚马逊支持人员交谈。这是 jackson 图书馆的依赖问题。在 SBT 中,覆盖 jackson:

libraryDependencies ++= Seq( 
"com.amazonaws" % "aws-java-sdk" % "1.10.30" % "compile",
"org.apache.spark" %% "spark-core" % "1.5.1" % "compile"
)

dependencyOverrides ++= Set(
"com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4"
)

他们的回答:
我们已经在 Mac、Ec2 (redhat AMI) 实例和 EMR (Amazon Linux) 上完成了这项工作。 3 不同的环境。问题的根本原因是 sbt 构建了一个依赖图,然后通过驱逐旧版本并选择依赖库的最新版本来处理版本冲突问题。在这种情况下,spark 取决于 jackson 库的 2.4 版本,而 AWS SDK 需要 2.5。所以存在版本冲突,sbt 驱逐 spark 的依赖版本(旧版本)并选择 AWS SDK 版本(最新版本)。

关于apache-spark - 与 aws-java-sdk 链接时读取 json 文件时 Spark 崩溃,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/33465686/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com