
java - Data type mismatch when transforming data in a Spark Dataset

Reposted. Author: 行者123. Updated: 2023-11-30 06:47:49

I created a Parquet structure from a CSV file using Spark:

Dataset<Row> df = spark.read().format("com.databricks.spark.csv").option("inferSchema", "true")
    .option("header", "true").load("sample.csv");
df.write().parquet("sample.parquet");

I am reading the Parquet structure back and trying to transform the data in the dataset:

Dataset<org.apache.spark.sql.Row> df = spark.read().parquet("sample.parquet");
df.createOrReplaceTempView("tmpview");
Dataset<Row> namesDF = spark.sql("SELECT *, md5(station_id) as hashkey FROM tmpview");

Unfortunately, I get a data type mismatch error. Do I have to assign data types explicitly?

17/04/12 09:21:52 INFO SparkSqlParser: Parsing command: SELECT *, md5(station_id) as hashkey FROM tmpview
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'md5(tmpview.station_id)' due to data type mismatch: argument 1 requires binary type, however, 'tmpview.station_id' is of int type.; line 1 pos 10;
'Project [station_id#0, bikes_available#1, docks_available#2, time#3, md5(station_id#0) AS hashkey#16]
+- SubqueryAlias tmpview, tmpview
   +- Relation[station_id#0,bikes_available#1,docks_available#2,time#3] parquet

Best Answer

Yes. According to the Spark documentation, the md5 function requires a binary argument (string columns are implicitly cast to binary, but int columns are not), so you need to convert station_id to a string before applying md5. In Spark SQL you can chain md5 and cast together, for example:

Dataset<Row> namesDF = spark.sql("SELECT *, md5(cast(station_id as string)) as hashkey FROM tmpview");
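For reference, Spark's md5 returns the 32-character lowercase hex digest of the column's UTF-8 bytes. As a minimal plain-JDK sketch (no Spark needed, not part of the original answer), this is the value md5(cast(station_id as string)) would produce for a hypothetical station_id of 3:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class Md5Demo {
    public static void main(String[] args) throws Exception {
        // Equivalent of md5(cast(station_id as string)) for station_id = 3
        String stationId = String.valueOf(3);
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(stationId.getBytes(StandardCharsets.UTF_8));
        // Format as a 32-char lowercase hex string, matching Spark's md5 output
        String hashkey = String.format("%032x", new BigInteger(1, digest));
        System.out.println(hashkey);
        // prints "eccbc87e4b5ce2fe28308fd9f2a7baf3"
    }
}
```

This is why the cast matters: the hash is computed over the string "3", not over the 4-byte int representation.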

Alternatively, you can add a new column to the DataFrame and apply md5 to that, for example:

// requires import org.apache.spark.sql.types.DataTypes
Dataset<Row> newDf = df.withColumn("station_id_str", df.col("station_id").cast(DataTypes.StringType));
newDf.createOrReplaceTempView("tmpview");
Dataset<Row> namesDF = spark.sql("SELECT *, md5(station_id_str) as hashkey FROM tmpview");

Regarding "java - Data type mismatch when transforming data in a Spark Dataset", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43363059/
