
java - Spark/Scala - Error creating DataFrame from JSON: java.lang.NoSuchMethodError: org.apache.spark.sql.DataFrameReader.json


I'm new to Spark and Scala. I'm trying to create a DataFrame from a JSONArray. Below is my code:

import java.io.FileReader;

import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;

public class JsonParse {
    public JSONArray actionItems() {
        JSONParser parser = new JSONParser();
        JSONArray results = null;
        try {
            // Parse the file, then drill down to d.results
            JSONObject obj = (JSONObject) parser.parse(new FileReader("/data/home/actionitems.json"));
            JSONObject obj2 = (JSONObject) obj.get("d");
            results = (JSONArray) obj2.get("results");
            System.out.println(results);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return results;
    }
}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.hive.HiveContext

object driver {
  val parse = new JsonParse
  val conf = new SparkConf().setAppName("test")
  val sc = new SparkContext(conf)
  sc.setLogLevel("ERROR")
  val hiveContext = new HiveContext(sc)
  val sqlContext = new SQLContext(sc)

  def main(args: Array[String]): Unit = {
    // Wrap the JSON string in an RDD and let Spark infer the schema
    val actionItemsRDD = sc.parallelize(Seq(parse.actionItems.toString))
    val df: DataFrame = hiveContext.read.json(actionItemsRDD)
    df.show
    println("number of records: " + df.count)
  }
}

The Java class JsonParse reads the JSON from a file and returns the JSONArray to the Scala object driver. In driver, I convert the JSON string into an RDD and then create the DataFrame with hiveContext.read.json(actionItemsRDD). I build with Maven and there are no build errors.

However, when I run the jar, I get the following error:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.DataFrameReader.json(Lorg/apache/spark/rdd/RDD;)Lorg/apache/spark/sql/Dataset;

The exception is thrown at the hiveContext.read.json line. I have done this before without any problems, and I am using the same dependencies as in that earlier attempt. Below is my pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>json</groupId>
  <artifactId>test</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>${project.artifactId}</name>

  <build>
    <sourceDirectory>src</sourceDirectory>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.2</version>
        <executions>
          <execution>
            <id>scala-compile-first</id>
            <phase>process-resources</phase>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>
          <execution>
            <id>scala-test-compile</id>
            <phase>process-test-resources</phase>
            <goals>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.1</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <relocations>
                <relocation>
                  <pattern>org.apache.http</pattern>
                  <shadedPattern>org.shaded.apache.http</shadedPattern>
                </relocation>
              </relocations>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
              <shadedArtifactAttached>true</shadedArtifactAttached>
              <shadedClassifierName>shaded</shadedClassifierName>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>

  <dependencies>
    <dependency>
      <groupId>com.databricks</groupId>
      <artifactId>spark-csv_2.11</artifactId>
      <version>1.4.0</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.10 -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.10</artifactId>
      <version>1.6.0</version>
      <exclusions>
        <exclusion>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-api</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.10</artifactId>
      <version>1.6.0</version>
      <exclusions>
        <exclusion>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-api</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.slf4j</groupId>
          <artifactId>jcl-over-slf4j</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>2.10.6</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.jodd/jodd -->
    <dependency>
      <groupId>org.jodd</groupId>
      <artifactId>jodd</artifactId>
      <version>3.4.0</version>
      <type>pom</type>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.json/json -->
    <dependency>
      <groupId>org.json</groupId>
      <artifactId>json</artifactId>
      <version>20170516</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.5.3</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpcore -->
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpcore</artifactId>
      <version>4.4.4</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/com.googlecode.json-simple/json-simple -->
    <dependency>
      <groupId>com.googlecode.json-simple</groupId>
      <artifactId>json-simple</artifactId>
      <version>1.1.1</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.threeten/threetenbp -->
    <dependency>
      <groupId>org.threeten</groupId>
      <artifactId>threetenbp</artifactId>
      <version>1.3.3</version>
    </dependency>
  </dependencies>
</project>

I'm not sure why this error occurs and I haven't been able to resolve it. Any help would be appreciated. Thanks!

Best Answer

First point - don't parse the data yourself. Spark has built-in JSON support:

val df = spark.read.json("file:///data/home/actionitems.json")
val newDataset = df.select("d.results")

If there is more JSON nested inside your JSON fields, you can also use built-in functions such as from_json ;) - see the sketch below.
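
A rough sketch of from_json (available since Spark 2.1); the column name rawJson and the schema fields here are hypothetical, not from the question:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Schema of the JSON string stored inside the (hypothetical) rawJson column.
val innerSchema = new StructType()
  .add("id", LongType)
  .add("title", StringType)

// Parse the string column into a struct column, then pull fields out of it.
val parsed = df.withColumn("item", from_json(df("rawJson"), innerSchema))
parsed.select("item.id", "item.title").show()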

If your JSON is not line-delimited - one object per line - use the multiLine option and set it to true; otherwise your Dataset will end up with only one column. For example:
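
A minimal sketch of reading one pretty-printed JSON document (the multiLine option needs Spark 2.2+; variable names are illustrative):

// Without multiLine, a multi-line JSON file collapses into a single
// unusable column; with it, Spark parses the whole document.
val multiLineDf = spark.read
  .option("multiLine", "true")
  .json("file:///data/home/actionitems.json")

val results = multiLineDf.select("d.results")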

Second point - the Spark version on the cluster appears to be wrong, so Spark cannot find the correct method.
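
A quick way to confirm this is to print the runtime's version from the driver (sc.version is a standard SparkContext member):

// If this prints a different version than the one you compiled against,
// a runtime NoSuchMethodError is the typical symptom.
println(sc.version)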

Third point - it is best to upgrade to at least Spark 2.2, which brings many improvements.
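
A minimal sketch of the Spark 2.x entry point, which replaces SQLContext and HiveContext with a single SparkSession (the app name simply mirrors the question's code):

import org.apache.spark.sql.SparkSession

// One SparkSession replaces SparkContext + SQLContext + HiveContext;
// enableHiveSupport() covers what HiveContext used to do.
val spark = SparkSession.builder()
  .appName("test")
  .enableHiveSupport()
  .getOrCreate()

// The spark.read.json example above then works unchanged.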

Fourth point - there is a Scala version mismatch; all components should be built against the same Scala version. Your pom declares 2.10 for some artifacts (spark-hive_2.10, spark-sql_2.10, scala-library 2.10.6) but 2.11 for another (spark-csv_2.11).

Regarding java - Spark/Scala - Error creating DataFrame from JSON: java.lang.NoSuchMethodError: org.apache.spark.sql.DataFrameReader.json, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48406282/
