scala - Failed to execute user defined function($anonfun$9: (string) => double) on using String Indexer for multiple columns


Reposted by: 行者123. Updated: 2023-12-04 22:52:48

I am trying to apply a StringIndexer to multiple columns. Here is my code:

import org.apache.spark.ml.feature.StringIndexer

// One StringIndexer per categorical column; Categorical_Model holds the column names.
val stringIndexers = Categorical_Model.map { colName =>
  new StringIndexer().setInputCol(colName).setOutputCol(colName + "_indexed")
}

var dfStringIndexed = stringIndexers(0).fit(df3).transform(df3) // fit a model, then transform the data
for (x <- 1 until stringIndexers.length) {
  dfStringIndexed = stringIndexers(x).fit(dfStringIndexed).transform(dfStringIndexed)
}
// Drop the original string columns, keeping only the indexed ones.
dfStringIndexed = dfStringIndexed.drop(Categorical_Model: _*)

The schema shows nullable = false for every column.

The stringIndexers array looks like this:

stringIndexers: Array[org.apache.spark.ml.feature.StringIndexer] = Array(strIdx_c53c3bdf464c, strIdx_61e685c520f7, strIdx_d6e59b2fc69d, ......)


dfStringIndexed.show(10)

This raises the following error:

org.apache.spark.SparkException: Failed to execute user defined function($anonfun$9: (string) => double)

Why does printing the schema work while no data can be shown?

Update: if I apply the string indexers by hand, one after another, instead of in a loop, the code works. This is strange.

var dfStringIndexed = stringIndexers(0).fit(df3).transform(df3) // 'fit's a model then 'transform's data
dfStringIndexed = stringIndexers(1).fit(dfStringIndexed).transform(dfStringIndexed)
dfStringIndexed = stringIndexers(2).fit(dfStringIndexed).transform(dfStringIndexed)
dfStringIndexed = stringIndexers(3).fit(dfStringIndexed).transform(dfStringIndexed)
dfStringIndexed = stringIndexers(4).fit(dfStringIndexed).transform(dfStringIndexed)

Adding the stack trace as requested:

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:363)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3273)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
  at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
  ... 63 elided
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$9: (string) => double)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  ... 3 more
Caused by: org.apache.spark.SparkException: StringIndexer encountered NULL value. To handle or skip NULLS, try setting StringIndexer.handleInvalid.
  at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$9.apply(StringIndexer.scala:251)
  at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$9.apply(StringIndexer.scala:246)
  ... 19 more
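
The innermost cause in the trace reports that StringIndexer met a NULL value, so a reasonable first step is to count the nulls in each column being indexed, whatever the schema's nullable flags say. The following is a minimal sketch of such a check, not code from the thread: the sample DataFrame, the column names `colour` and `size`, and the helper `nullCounts` are all hypothetical stand-ins for `df3` and `Categorical_Model`.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Local session for the sketch; in spark-shell the existing `spark` can be used instead.
val spark = SparkSession.builder().master("local[1]").appName("null-count-sketch").getOrCreate()

// Hypothetical stand-in for df3: two categorical columns, one of them holding a null.
val sampleSchema = StructType(Seq(
  StructField("colour", StringType, nullable = true),
  StructField("size", StringType, nullable = true)))
val sampleDf = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("red", "S"), Row("blue", null), Row("red", "M"))),
  sampleSchema)

// Counts the null rows per column (hypothetical helper; one Spark job per column,
// which is acceptable for a one-off diagnostic).
def nullCounts(df: DataFrame, columns: Seq[String]): Map[String, Long] =
  columns.map(c => c -> df.filter(col(c).isNull).count()).toMap

val counts = nullCounts(sampleDf, Seq("colour", "size"))
counts.foreach { case (name, n) => println(s"$name: $n null row(s)") }
```

Against the real data this would be called as `nullCounts(df3, Categorical_Model)`; any non-zero count points at the column that triggers the exception.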

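Best answer

I ran into a similar problem, even on a tiny subset of 50 rows with no null values in the columns I was string indexing. In my case it failed even when I ran the indexers by hand.

I was able to avoid the error by adding .setHandleInvalid("keep"). I checked the output and it does nothing strange, such as setting everything to 0 or to one identical value. I am still uneasy about this fix, because it seems unsafe. I would love to know whether you found a more sensible explanation and solution!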

dfStringIndexed = stringIndexers(1).setHandleInvalid("keep").fit(dfStringIndexed).transform(dfStringIndexed)
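
The same setting can be applied to every column at once by placing the indexers in a Pipeline instead of chaining fit and transform by hand. This is a sketch under assumptions, not the answerer's code: the sample data and the column names `colour` and `size` are hypothetical, and "keep" (available from Spark 2.2) places invalid values, including nulls, in an extra index bucket rather than failing.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Local session for the sketch; in spark-shell the existing `spark` can be used instead.
val spark = SparkSession.builder().master("local[1]").appName("indexer-pipeline-sketch").getOrCreate()

// Hypothetical categorical columns standing in for Categorical_Model.
val categoricalCols = Array("colour", "size")
val schema = StructType(categoricalCols.map(name => StructField(name, StringType, nullable = true)))
val data = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("red", "S"), Row("blue", null), Row("red", "M"))),
  schema)

// One indexer per column, each told to keep invalid values in an extra bucket.
val indexers = categoricalCols.map { name =>
  new StringIndexer()
    .setInputCol(name)
    .setOutputCol(name + "_indexed")
    .setHandleInvalid("keep")
}

// The pipeline fits and applies the stages in order, then the string columns are dropped.
val indexed = new Pipeline()
  .setStages(indexers)
  .fit(data)
  .transform(data)
  .drop(categoricalCols: _*)

indexed.show()
```

Because "keep" hides the offending rows instead of explaining them, it is worth inspecting which rows land in the extra bucket before relying on the result.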

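I think it could also be fixed by changing the nullability of the columns, even though they contain no nulls, as I did by following this question: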

Can I change the nullability of a column in my Spark dataframe?
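
One common way to change nullability is to rebuild the DataFrame on the same rows with a relaxed schema. The sketch below shows that idea under assumptions: the helper `withAllNullable` and the sample data are hypothetical, only top-level columns are handled, and whether this removes the error in a given job is not confirmed by the thread.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Local session for the sketch; in spark-shell the existing `spark` can be used instead.
val spark = SparkSession.builder().master("local[1]").appName("nullability-sketch").getOrCreate()

// Hypothetical helper: same rows, but every top-level column is marked nullable.
def withAllNullable(df: DataFrame): DataFrame = {
  val relaxed = StructType(df.schema.map(_.copy(nullable = true)))
  df.sparkSession.createDataFrame(df.rdd, relaxed)
}

// Hypothetical data whose schema claims the column can never be null.
val strictSchema = StructType(Seq(StructField("colour", StringType, nullable = false)))
val strictDf = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("red"), Row("blue"))),
  strictSchema)

val relaxedDf = withAllNullable(strictDf)
relaxedDf.printSchema()
```

Going through `df.rdd` leaves the DataFrame's optimized plan, so this is better suited to a one-off repair than to a hot path.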

Regarding "scala - Failed to execute user defined function($anonfun$9: (string) => double) on using String Indexer for multiple columns", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57143187/
