
r - How to build a logistic regression model in SparkR


I am new to Spark and SparkR. I have installed both Spark and SparkR successfully.

When I try to build a logistic regression model with R and Spark from a csv file stored in HDFS, I get the error "incorrect number of dimensions".

My code is:

points <- cache(lapplyPartition(textFile(sc, "hdfs://localhost:54310/Henry/data.csv"), readPartition))

collect(points)

w <- runif(n=D, min = -1, max = 1)

cat("Initial w: ", w, "\n")

# Compute logistic regression gradient for a matrix of data points
gradient <- function(partition) {
  partition <- partition[[1]]
  Y <- partition[, 1]   # point labels (first column of input file)
  X <- partition[, -1]  # point coordinates
  # For each point (x, y), compute the gradient function
  dot <- X %*% w
  logit <- 1 / (1 + exp(-Y * dot))
  grad <- t(X) %*% ((logit - 1) * Y)
  list(grad)
}


for (i in 1:iterations) {
  cat("On iteration ", i, "\n")
  w <- w - reduce(lapplyPartition(points, gradient), "+")
}

The error message is:

On iteration  1 
Error in partition[, 1] : incorrect number of dimensions
Calls: do.call ... func -> FUN -> FUN -> Reduce -> <Anonymous> -> FUN -> FUN
Execution halted
14/09/27 01:38:13 ERROR Executor: Exception in task 0.0 in stage 181.0 (TID 189)
java.lang.NullPointerException
at edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:125)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:701)
14/09/27 01:38:13 WARN TaskSetManager: Lost task 0.0 in stage 181.0 (TID 189, localhost): java.lang.NullPointerException:
edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:125)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:701)
14/09/27 01:38:13 ERROR TaskSetManager: Task 0 in stage 181.0 failed 1 times; aborting job
Error in .jcall(getJRDD(rdd), "Ljava/util/List;", "collect") : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 181.0 failed 1 times, most recent failure: Lost task 0.0 in stage 181.0 (TID 189, localhost): java.lang.NullPointerException: edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:125) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:701) Driver stacktrace:

Data dimensions (sample):

data <- read.csv("/home/Henry/data.csv")

dim(data)

[1] 17 541

What could be causing this error?

Best answer

The problem is that textFile() reads some text data and returns a distributed collection of strings, each of which corresponds to one line of the text file. That is why partition[, -1] fails later in the program. The real intent of the program seems to be to treat points as a distributed collection of data frames. We are working to provide data frame support in SparkR soon (SPARKR-1).
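As a quick illustration of why this particular error appears (this snippet is mine, not part of the original answer): indexing a plain character string with a matrix subscript is exactly what raises "incorrect number of dimensions" in R.

# Each element produced by textFile() is one line of the csv as a character
# string, not a row of a matrix (the line value below is a made-up example):
line <- "1,0.5,0.3,0.7"
partition <- list(line)
p <- partition[[1]]
p[, 1]
# Error in p[, 1] : incorrect number of dimensions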

To work around this, you can manipulate your partition with string operations to extract X and Y correctly. Some other approaches (which I think you may have seen before) include producing a different kind of distributed collection from the start, as is done here: examples/logistic_regression.R.
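A minimal sketch of the string-manipulation workaround, assuming readPartition (not shown in the question) leaves each partition as a list of raw csv lines; the parsing code is illustrative, not the original poster's readPartition:

gradient <- function(partition) {
  # partition is a list of character strings, one per csv line;
  # parse them into a numeric matrix before indexing columns
  mat <- do.call(rbind, lapply(partition, function(line) {
    as.numeric(strsplit(line, ",")[[1]])
  }))
  Y <- mat[, 1]    # labels in the first column
  X <- mat[, -1]   # feature columns
  dot <- X %*% w
  logit <- 1 / (1 + exp(-Y * dot))
  grad <- t(X) %*% ((logit - 1) * Y)
  list(grad)
}

The same parsing could instead live in readPartition, so that points already holds numeric matrices; the linked examples/logistic_regression.R takes yet another route by building a numeric distributed collection from the start.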

For r - How to build a logistic regression model in SparkR, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/26067715/
