
r - sparklyr exception when reading a csv file with schema inference: double

Reposted. Author: 行者123. Updated: 2023-12-04 20:00:19

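I am trying to read a csv into Spark using the spark_read_csv function. I get an exception during schema inference, that is, when infer_schema=TRUE (the default used in the call below).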

spark_read_csv(sc,"myDf",DatasetUrl)

I get the following exception:

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 90.0 failed 1 times, most recent failure: Lost task 0.0 in stage 90.0 (TID 151, localhost): java.text.ParseException: Unparseable number: "cr1_fd_dttm" at java.text.NumberFormat.parse(NumberFormat.java:385) at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$4.apply$mcD$sp(CSVInferSchema.scala:259)

However, when I set infer_schema=FALSE, everything is read as chr type, as expected.

This is what the data in the cr1_fd_dttm column looks like:

      cr1_fd_dttm
<chr>
1 0.0
2 1.45679112E12
3 1.45679166E12
4 1.45679154E12
5 1.45679274E12
6 0.0
7 0.0
8 0.0
9 0.0
10 1.45679118E12
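
As a quick sanity check outside Spark, here is a minimal base R sketch using a few of the values shown above. It only illustrates that the values themselves are valid doubles in scientific notation, while the string quoted in the error message is the column name rather than a data value. Base R's as.numeric() is not the parser Spark uses, so this is an illustration, not a reproduction of the failure.

```r
# Sample values copied from the cr1_fd_dttm column shown above.
values <- c("0.0", "1.45679112E12", "1.45679166E12", "0.0")

# Every sample value converts cleanly to a double.
parsed <- as.numeric(values)

# The string from the error message is the column name, not a number;
# as.numeric() returns NA for it (the warning is suppressed here).
header_parsed <- suppressWarnings(as.numeric("cr1_fd_dttm"))
```

If the column name is reaching the numeric cast, one possibility (an assumption, not confirmed by the question) is that a header line is being treated as data somewhere in the read.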

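Can anyone help me?

Thanks

Best answer

I read the file without loading it into memory right away, coerced the fields to numeric types, and then loaded those results into memory. The key is to set memory to FALSE and infer_schema to FALSE, pass a list of columns, coerce them, and then use compute() to save the result into Spark memory. Here is a lengthy but working example: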

library(sparklyr)

# Assumes `sc` is an existing connection created with spark_connect().
# Read every column as a string and do not cache the table yet.
mapped_flights <- spark_read_csv(sc, "mapped_flights",
  path = "s3a://flights-data/full",
  memory = FALSE,
  infer_schema = FALSE,
  columns = list(
    Year = "character",
    Month = "character",
    DayofMonth = "character",
    DayOfWeek = "character",
    DepTime = "character",
    CRSDepTime = "character",
    ArrTime = "character",
    CRSArrTime = "character",
    UniqueCarrier = "character",
    FlightNum = "character",
    TailNum = "character",
    ActualElapsedTime = "character",
    CRSElapsedTime = "character",
    AirTime = "character",
    ArrDelay = "character",
    DepDelay = "character",
    Origin = "character",
    Dest = "character",
    Distance = "character",
    TaxiIn = "character",
    TaxiOut = "character",
    Cancelled = "character",
    CancellationCode = "character",
    Diverted = "character",
    CarrierDelay = "character",
    WeatherDelay = "character",
    NASDelay = "character",
    SecurityDelay = "character",
    LateAircraftDelay = "character")
)
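
Since every column is mapped to the same type, the columns list could also be built from a vector of names instead of being typed out. The sketch below uses only base R; all_character() is a hypothetical helper introduced here for illustration and is not part of sparklyr, and the name vector is shortened.

```r
# Hypothetical helper (not part of sparklyr): map each column name to "character".
all_character <- function(col_names) {
  setNames(as.list(rep("character", length(col_names))), col_names)
}

# Shortened for illustration; the full vector would list all 29 column names.
flight_cols <- c("Year", "Month", "DayofMonth")
columns <- all_character(flight_cols)
```

The resulting list has the same shape as the hand-written one and could be passed as spark_read_csv(..., columns = columns).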


library(dplyr)

# Cast the string columns to numeric types, then materialize the result
# in Spark memory as the table "flights".
flights <- mapped_flights %>%
  mutate(
    Year = as.integer(Year),
    Month = as.integer(Month),
    DayofMonth = as.integer(DayofMonth),
    DayOfWeek = as.integer(DayOfWeek),
    DepTime = as.integer(DepTime),
    CRSDepTime = as.integer(CRSDepTime),
    CRSArrTime = as.integer(CRSArrTime),
    ArrTime = as.integer(ArrTime),
    ActualElapsedTime = as.integer(ActualElapsedTime),
    CRSElapsedTime = as.integer(CRSElapsedTime),
    AirTime = as.integer(AirTime),
    ArrDelay = as.double(ArrDelay),
    DepDelay = as.double(DepDelay),
    Distance = as.integer(Distance),
    TaxiIn = as.integer(TaxiIn),
    TaxiOut = as.integer(TaxiOut),
    Cancelled = as.integer(Cancelled),
    Diverted = as.integer(Diverted),
    CarrierDelay = as.integer(CarrierDelay),
    WeatherDelay = as.integer(WeatherDelay),
    NASDelay = as.integer(NASDelay),
    SecurityDelay = as.integer(SecurityDelay),
    LateAircraftDelay = as.integer(LateAircraftDelay)
  ) %>%
  compute("flights")
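
Applied to the data in the question, the same pattern might look like the sketch below. It is untested against the asker's file and makes several assumptions: `sc` is an open spark_connect() connection, `path` is the question's DatasetUrl, cr1_fd_dttm is the only column that needs casting, and the table names "myDf_raw" and "myDf" as well as the wrapper function read_with_cast() are chosen here for illustration. Calls are namespaced so the function can be defined without attaching the packages.

```r
# Sketch only: read all columns as strings, cast one column, then cache.
read_with_cast <- function(sc, path) {
  # infer_schema = FALSE keeps every column as a string;
  # memory = FALSE postpones caching until after the cast.
  raw_df <- sparklyr::spark_read_csv(sc, "myDf_raw", path,
                                     infer_schema = FALSE,
                                     memory = FALSE)

  # Cast the problem column to double inside Spark.
  typed_df <- dplyr::mutate(raw_df, cr1_fd_dttm = as.double(cr1_fd_dttm))

  # Materialize the typed table in Spark under the name "myDf".
  dplyr::compute(typed_df, "myDf")
}
```

A call would then be myDf <- read_with_cast(sc, DatasetUrl). Because no columns list is supplied, the column names are taken from the file's header row, which is the default for spark_read_csv.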

Regarding "r - sparklyr exception when reading a csv file with schema inference: double", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/42922766/
