gpt4 book ai didi

java - Apache Spark 和 scala,执行查询时出错

转载 作者:行者123 更新时间:2023-12-01 16:17:33 26 4
gpt4 key购买 nike

我正在使用一个数据集,其示例如下:

"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
44;"technician";"single";"secondary";"no";29;"yes";"no";"unknown";5;"may";151;1;-1;0;"unknown";"no"

我已成功执行以下命令:

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import spark.sqlContext.implicits._
val data = sc.textFile(“file:///C:/Users/Desktop/bank-full-Copy.csv")
data.map(x => x.split(";(?=([^\"]*\"[^\"]*\")*[^\"]*$)",-1))
val header = data.first()
val filtered = data.filter(x => x(0)!= header(0))
val rdds = filtered.map(x => Row(x(0).toInt,
x(1),
x(2),
x(3),
x(4),
x(5).toInt,
x(6),
x(7),
x(8),
x(9).toInt,
x(10),
x(11).toInt,
x(12).toInt,
x(13).toInt,
x(14).toInt,
x(15),
x(16) ))
val schema = StructType( List(StructField("age", IntegerType, true),
StructField("job", StringType, true) ,
StructField("marital", StringType, true),
StructField("education", StringType, true) ,
StructField("default", StringType, true),
StructField("balance", IntegerType, true) ,
StructField("housing", StringType, true) ,
StructField("loan", StringType, true) ,
StructField("contact", StringType, true) ,
StructField("day", IntegerType, true) ,
StructField("month", StringType, true) ,
StructField("duration", IntegerType, true) ,
StructField("campaign", IntegerType, true) ,
StructField("pdays", IntegerType, true) ,
StructField("previous", IntegerType, true) ,
StructField("poutcome", StringType, true) ,
StructField("y", StringType, true)) )
val df = spark.sqlContext.createDataFrame(rdds, schema)

我收到以下错误:

df.groupBy("age","y").count.show()*,

java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.Character is not a valid external type for schema of string

在对数据执行任何查询时,我遇到了相同的错误。您能看一下并为我提供解决方案吗?

最佳答案

如果您想跳过 RDD 额外代码,您可以使用以下代码

输入文件 csv(; 分隔,每个记录由下一行分隔)

"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
44;"technician";"single";"secondary";"no";29;"yes";"no";"unknown";5;"may";151;1;-1;0;"unknown";"no"
  • 定义结构架构
  • 读取;分隔文件
  • 直接读取 header=true 和预定义架构为 Dataframe 的 csv
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object ProcessSemiColonCsv {

def main(args: Array[String]): Unit = {

val spark = Constant.getSparkSess

val schema = StructType( List(StructField("age", IntegerType, true),
StructField("job", StringType, true) ,
StructField("marital", StringType, true),
StructField("education", StringType, true) ,
StructField("default", StringType, true),
StructField("balance", IntegerType, true) ,
StructField("housing", StringType, true) ,
StructField("loan", StringType, true) ,
StructField("contact", StringType, true) ,
StructField("day", IntegerType, true) ,
StructField("month", StringType, true) ,
StructField("duration", IntegerType, true) ,
StructField("campaign", IntegerType, true) ,
StructField("pdays", IntegerType, true) ,
StructField("previous", IntegerType, true) ,
StructField("poutcome", StringType, true) ,
StructField("y", StringType, true)) )

val df = spark.read
.option("delimiter", ";")
.option("header", "true")
.schema(schema)
.csv("src/main/resources/SemiColon.csv")

df.show()
df.printSchema()
}

}

关于java - Apache Spark 和 scala,执行查询时出错,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/62363787/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com