
apache-spark - IllegalArgumentException: Column must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually double.

Reposted. Author: 行者123. Updated: 2023-12-03 15:27:36

I have a dataframe with several categorical columns. I am trying to compute the chi-squared statistic between two of them using the built-in function:

from pyspark.ml.stat import ChiSquareTest

r = ChiSquareTest.test(df, 'feature1', 'feature2')

However, it gives me the error:
IllegalArgumentException: 'requirement failed: Column feature1 must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually double.'
The data type of feature1 is:
feature1: double (nullable = true)

Could you help me with this?

Best Answer

spark-ml is not a typical statistics library; it is heavily ML-oriented. It therefore assumes you will run the test between a label and one feature, or a set of features, just as you would when training a model: the features to be tested must first be assembled into a vector column.
In your case, you can assemble feature1 as follows:

from pyspark.ml.stat import ChiSquareTest
from pyspark.ml.feature import VectorAssembler

data = [(1, 2), (3, 4), (2, 1), (4, 3)]
df = spark.createDataFrame(data, ['feature1', 'feature2'])
assembler = VectorAssembler().setInputCols(['feature1']).setOutputCol('features')

ChiSquareTest.test(assembler.transform(df), 'features', 'feature2').show(truncate=False)
Just in case, here is the same code in Scala:
import org.apache.spark.ml.stat.ChiSquareTest
import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._ // required for .toDF on a Seq

val df = Seq((1, 2, 3), (1, 2, 3), (4, 5, 6), (6, 5, 4))
  .toDF("feature1", "feature2", "feature3")
val assembler = new VectorAssembler()
  .setInputCols(Array("feature1"))
  .setOutputCol("features")

ChiSquareTest.test(assembler.transform(df), "features", "feature2").show(false)
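For intuition about what ChiSquareTest reports, the statistic is the classic Pearson chi-squared computed over the contingency table of the two columns. A minimal pure-Python sketch of that computation (the helper name `chi_square_statistic` is ours, not part of Spark):

```python
from collections import Counter

def chi_square_statistic(xs, ys):
    """Pearson chi-squared statistic over the contingency table of xs vs ys."""
    n = len(xs)
    observed = Counter(zip(xs, ys))   # cell counts of the contingency table
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    stat = 0.0
    for x in row_totals:
        for y in col_totals:
            # expected count under the independence hypothesis
            expected = row_totals[x] * col_totals[y] / n
            obs = observed.get((x, y), 0)
            stat += (obs - expected) ** 2 / expected
    return stat

# Perfectly associated columns give a large statistic...
print(chi_square_statistic([1, 1, 2, 2], [1, 1, 2, 2]))  # → 4.0
# ...while independent columns give 0.
print(chi_square_statistic([1, 1, 2, 2], [1, 2, 1, 2]))  # → 0.0
```

Spark additionally converts this statistic into a p-value and degrees of freedom, which you can read from the `pValues`, `degreesOfFreedom`, and `statistics` columns of the result DataFrame.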

Regarding apache-spark - IllegalArgumentException: Column must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually double., we found a similar question on Stack Overflow: https://stackoverflow.com/questions/61056160/
