
apache-spark - How do I read Avro files in spark2-shell on Spark 2.4?

Reposted. Author: 行者123. Updated: 2023-12-04 10:41:53

We are having trouble reading Avro files in spark2-shell on Spark 2.4.
Any pointers would be a great help.

On Spark 2.3 we read Avro files with the approach below, but this support was removed in Spark 2.4:

spark2-shell --jars /tmp/spark/spark-avro_2.11-4.0.0.jar
import org.apache.avro.Schema
spark.sqlContext.sparkContext.hadoopConfiguration.set("avro.mapred.ignore.inputs.without.extension", "true")
val df = spark.read.format("com.databricks.spark.avro").option("header", "true").option("mode", "DROPMALFORMED").load("<DIR_PATH_FOR_AVRO>")

The Spark 2.4 documentation ( https://spark.apache.org/docs/latest/sql-data-sources-avro.html ) gives the following command:

    ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.4

But this approach fails with the following exception:

    Exception in thread "main" java.lang.RuntimeException: 
    [unresolved dependency: org.apache.spark#spark-avro_2.12;2.4.4: not found]
    at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1306)
    at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:315)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

We also tried:

    spark2-shell --packages org.apache.spark:spark-avro_2.12:2.4.4 --jars /tmp/spark/spark-avro_2.12-2.4.0.jar

Best answer

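The error "Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: org.apache.spark#spark-avro_2.12;2.4.4: not found] ..." looks like a problem reaching the central Maven repository at https://repo1.maven.org/maven2/ , possibly because your environment is behind a proxy.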

So I think you are on the right track: you can manually download a spark-avro_2.1x-2.4.x.jar from https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.xx/2.4.x/ , transfer it to your node, and then start the REPL with spark2-shell --jars spark-avro_2.xx-2.4.x.jar.

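It looks like you are using the Cloudera distribution of Spark 2.4. Its latest maintenance release is 2.4.2, which is still based on Scala 2.11, so I think the jar you are looking for is spark-avro_2.11-2.4.2.jar.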
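
The jar name and download location described above follow the usual Maven repository layout, so they can be derived from two values: the Scala binary version and the Spark version. The following is a minimal sketch of that derivation; the function names avro_jar_name and avro_jar_url are made up for illustration, and the script only prints the URL, it does not check that the artifact exists or download it.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Spark or Cloudera tooling).
# $1 = Scala binary version (e.g. 2.11), $2 = Spark version (e.g. 2.4.2)

avro_jar_name() {
  # Maven artifact file name: <artifactId>_<scalaBinary>-<version>.jar
  printf 'spark-avro_%s-%s.jar' "$1" "$2"
}

avro_jar_url() {
  # Maven Central path: <groupId as dirs>/<artifactId>/<version>/<file>
  printf 'https://repo1.maven.org/maven2/org/apache/spark/spark-avro_%s/%s/%s' \
    "$1" "$2" "$(avro_jar_name "$1" "$2")"
}

# Print the URL for the Scala 2.11 build of spark-avro 2.4.2
avro_jar_url 2.11 2.4.2
echo
```

Once the jar has been fetched on a machine with internet access and copied to the node, it would be passed to the shell with --jars, as in the session shown in this answer.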

With that jar, things seem to work fine for me:

    $ spark2-shell --jars ~/.m2/repository/org/apache/spark/spark-avro_2.11/2.4.2/spark-avro_2.11-2.4.2.jar
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://xxxxxxx.xxxnet:4056
    Spark context available as 'sc' (master = yarn, app id = application_xxxxxxxxxxxxx_xxxxx).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.4.0.cloudera2
          /_/

    Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_221)
    Type in expressions to have them evaluated.
    Type :help for more information.

    scala> val df = spark.read.format("avro").load("/some/hdfs/path/kilo_sample.avro")
    df: org.apache.spark.sql.DataFrame = [registration_dttm: string, id: bigint ... 11 more fields]

    scala> df.show(false)
    +--------------------+---+----------+---------+------------------------+------+---------------+-------------------+----------------------+----------+---------+----------------------------+----------------------------+
    |registration_dttm |id |first_name|last_name|email |gender|ip_address |cc |country |birthdate |salary |title |comments |
    +--------------------+---+----------+---------+------------------------+------+---------------+-------------------+----------------------+----------+---------+----------------------------+----------------------------+
    |2016-02-03T07:55:29Z|1 |Amanda |Jordan |ajordan0@com.com |Female|1.197.201.2 |6759521864920116 |Indonesia |3/8/1971 |49756.53 |Internal Auditor |1E+02 |
    |2016-02-03T17:04:03Z|2 |Albert |Freeman |afreeman1@is.gd |Male |218.111.175.34 |null |Canada |1/16/1968 |150280.17|Accountant IV | |
    ...
    |2016-02-03T10:30:36Z|20 |Rebecca |Bell |rbellj@bandcamp.com |Female|172.215.104.127|null |China | |137251.19| | |
    +--------------------+---+----------+---------+------------------------+------+---------------+-------------------+----------------------+----------+---------+----------------------------+----------------------------+
    only showing top 20 rows

    scala>
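
The startup banner in the session above reports the Scala version the shell was built with (2.11.12), and the suffix of the spark-avro artifact has to match its binary version (2.11). As a rough sketch, the suffix can be pulled out of that banner line with standard tools. The function name scala_binary_version and the variables banner and full are illustrative only; the banner text is copied from the session above.

```shell
#!/bin/sh
# Hypothetical helper: reduce a full Scala version to its binary version.
scala_binary_version() {
  # Keep only major.minor, e.g. 2.11.12 -> 2.11
  printf '%s\n' "$1" | cut -d. -f1,2
}

# Banner line as printed by the shell in the session above
banner='Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_221)'

# Extract the full version number that follows "Using Scala version"
full=$(printf '%s\n' "$banner" | sed -n 's/^Using Scala version \([0-9.]*\).*/\1/p')

# Artifact id whose suffix matches this shell
echo "spark-avro_$(scala_binary_version "$full")"
```

A jar built for a different Scala binary version, such as the spark-avro_2.12 artifact from the question, does not match a shell that reports Scala 2.11.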

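If you still have problems after trying this version, please update your question with the full stack trace so we can see exactly what is going wrong.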

Regarding "apache-spark - How do I read Avro files in spark2-shell on Spark 2.4?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59898154/
