apache-spark - Why does format("kafka") fail with "Failed to find data source: kafka." (even with an uber-jar)?


I am using HDP-2.6.3.0 with the Spark2 package, i.e. Spark 2.2.0.

I am trying to write a Kafka consumer using the Structured Streaming API, but I get the following error after submitting the job to the cluster:

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:553)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:89)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:89)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:198)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:90)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:90)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:150)
at com.example.KafkaConsumer.main(KafkaConsumer.java:21)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22$$anonfun$apply$14.apply(DataSource.scala:537)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22$$anonfun$apply$14.apply(DataSource.scala:537)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22.apply(DataSource.scala:537)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22.apply(DataSource.scala:537)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:537)
... 17 more

The spark-submit command follows:

$SPARK_HOME/bin/spark-submit \
    --master yarn \
    --deploy-mode client \
    --class com.example.KafkaConsumer \
    --executor-cores 2 \
    --executor-memory 512m \
    --driver-memory 512m \
    sample-kafka-consumer-0.0.1-SNAPSHOT.jar

My Java code:

package com.example;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaConsumer {

    public static void main(String[] args) {

        SparkSession spark = SparkSession
                .builder()
                .appName("kafkaConsumerApp")
                .getOrCreate();

        Dataset<Row> ds = spark
                .readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "dog.mercadoanalitico.com.br:6667")
                .option("subscribe", "my-topic")
                .load();  // KafkaConsumer.java:21 -- the lookup of the "kafka" source fails here (see stack trace above)
    }
}

pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <artifactId>sample-kafka-consumer</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>

    <dependencies>

        <!-- spark -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>

        <!-- kafka -->
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.11</artifactId>
            <version>0.10.1.0</version>
        </dependency>

    </dependencies>

    <repositories>
        <repository>
            <id>local-maven-repo</id>
            <url>file:///${project.basedir}/local-maven-repo</url>
        </repository>
    </repositories>

    <build>

        <!-- Include resources folder in the .jar -->
        <resources>
            <resource>
                <directory>${basedir}/src/main/resources</directory>
            </resource>
        </resources>

        <plugins>

            <!-- Plugin to compile the source. -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>

            <!-- Plugin to include all the dependencies in the .jar and set the main class. -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.0.0</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <!-- This filter is to work around the problem caused by included signed jars.
                                     java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
                                -->
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>com.example.KafkaConsumer</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>

        </plugins>
    </build>
</project>

[UPDATE] UBER-JAR

Below is the configuration used in the pom.xml to generate the uber-jar:

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.0.0</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <!-- This filter is to work around the problem caused by included signed jars.
                                     java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
                                -->
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>com.example.KafkaConsumer</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>

Best Answer

The kafka data source is an external module and is not available to Spark applications by default.

You have to define it as a dependency in your pom.xml (as you have done), but that is only the very first step to get it into your Spark application (a quick runtime check is sketched right after the snippet below).

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
        <version>2.2.0</version>
    </dependency>
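
Before going further with either approach, a quick runtime check can confirm whether the module actually reached the application classpath. A minimal diagnostic sketch (the class name KafkaClasspathCheck is hypothetical; the provider class is the one the module registers, as described at the end of this answer):

public class KafkaClasspathCheck {
    public static void main(String[] args) throws ClassNotFoundException {
        // Throws ClassNotFoundException when the spark-sql-kafka-0-10 module is
        // not on the classpath -- the root cause of "Failed to find data source: kafka".
        Class.forName("org.apache.spark.sql.kafka010.KafkaSourceProvider");
        System.out.println("kafka data source provider found on the classpath");
    }
}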

With that dependency in place, you have to decide whether to create a so-called uber-jar that bundles all the dependencies together (which results in a fairly big jar file and makes submission take longer) or to use the --packages option (or the less flexible --jars) to add the dependency at spark-submit time.

(There are other options, like storing the required jars on Hadoop HDFS or using Hadoop-distribution-specific ways of defining dependencies for Spark applications, but let's keep things simple.)

I recommend using --packages first, and only once it works should you consider the other options.

Use spark-submit --packages to include the spark-sql-kafka-0-10 module as follows:

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0

Include the other command-line options as you wish.
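
For example, combined with the settings from the original command in the question, the full invocation could look like this (a sketch assuming the same YARN settings, class, and jar):

$SPARK_HOME/bin/spark-submit \
    --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 \
    --master yarn \
    --deploy-mode client \
    --class com.example.KafkaConsumer \
    --executor-cores 2 \
    --executor-memory 512m \
    --driver-memory 512m \
    sample-kafka-consumer-0.0.1-SNAPSHOT.jar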

Uber-Jar Approach

Including all the dependencies in a so-called uber-jar may not always work, due to how the META-INF directory is handled.

For the kafka data source to work (and for other data sources in general), you have to make sure that the META-INF/services/org.apache.spark.sql.sources.DataSourceRegister files of all the data sources are merged (not replaced by "first" or whatever strategy you use).

The kafka data source uses its own META-INF/services/org.apache.spark.sql.sources.DataSourceRegister to register org.apache.spark.sql.kafka010.KafkaSourceProvider as the data source provider for the kafka format.
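
With the maven-shade-plugin configuration shown in the question, that merge can be achieved by adding a ServicesResourceTransformer (which concatenates META-INF/services files from all jars) next to the existing ManifestResourceTransformer. A sketch of the relevant <transformers> section:

<transformers>
    <!-- Merges META-INF/services files instead of letting one jar's copy
         overwrite another's, so the kafka DataSourceRegister entry survives -->
    <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
    <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
        <mainClass>com.example.KafkaConsumer</mainClass>
    </transformer>
</transformers>

After rebuilding, you can verify the merge by printing the file from inside the uber-jar, e.g. unzip -p sample-kafka-consumer-0.0.1-SNAPSHOT.jar META-INF/services/org.apache.spark.sql.sources.DataSourceRegister, and checking that the kafka provider is listed among the registered data sources.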

Regarding apache-spark - Why does format("kafka") fail with "Failed to find data source: kafka." (even with an uber-jar)?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48011941/
