python - PySpark: combining the output of two VectorAssemblers

Reposted. Author: 行者123. Updated: 2023-12-05 03:41:27

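I created two VectorAssemblers with pyspark: the first over several numeric columns ('colA', 'colB', 'colC'), the second over several categorical columns ('colD', 'colE', with a OneHotEncoder applied to each column).

I can create each of these VectorAssemblers on its own. How do I combine their outputs into a single vector column (so that I can feed it to an Xgboost model)?

I tried the following, but got "TypeError: can only concatenate str (not "list") to str":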

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# my dataframe with all columns is df

# VectorAssembler 1: with 3 numeric columns
numeric_cols = ['colA', 'colB', 'colC']
assembler = VectorAssembler(
    inputCols=numeric_cols,
    outputCol="numericFeatures"
)


# VectorAssembler 2: with 2 categorical columns
categ_cols = ['colD', 'colE']
# Map each categorical string column to a numeric index column
indexers = [
    StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
    for c in categ_cols
]
# One-hot encode each indexed column
encoders = [
    OneHotEncoder(
        inputCol=indexer.getOutputCol(),
        outputCol="{0}_encoded".format(indexer.getOutputCol()))
    for indexer in indexers
]
assemblerCateg = VectorAssembler(
    inputCols=[encoder.getOutputCol() for encoder in encoders],
    outputCol="categFeatures"
)


# Stages run in list order, so the encoders must precede assemblerCateg
pipeline = Pipeline(stages=[assembler] + indexers + encoders + [assemblerCateg])
df2 = pipeline.fit(df).transform(df)
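
The line that actually raised the TypeError is not shown in the question. Python produces that message when a bare string appears on the left of `+` with a list on the right, so one plausible cause (an assumption, not something the question confirms) is an attempt to build a combined inputCols value from a column name and a list of column names. The sketch below is plain Python with no Spark needed; `concat_error` is a hypothetical helper used only to capture the message, and the encoded column names follow the naming scheme in the code above.

```python
# Encoded column names as produced by the question's naming scheme:
# "<col>_indexed" from StringIndexer, then "<col>_indexed_encoded".
categ_cols = ["colD", "colE"]
encoded_cols = ["{0}_indexed_encoded".format(c) for c in categ_cols]


def concat_error(left, right):
    """Return the TypeError message raised by left + right, or None."""
    try:
        left + right
    except TypeError as exc:
        return str(exc)
    return None


# Assumed failing pattern: str + list.
message = concat_error("numericFeatures", encoded_cols)
print(message)

# Wrapping the string in a list turns it into list + list, which works.
input_cols = ["numericFeatures"] + encoded_cols
print(input_cols)
```

A list built this way is a valid inputCols value, since VectorAssembler accepts numeric, boolean and vector columns side by side.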

Best answer

Solved! Just add one more VectorAssembler, defined before the pipeline is built and appended as its last stage:

assemblerAll = VectorAssembler(inputCols= ["numericFeatures", "categFeatures"], outputCol="allFeatures")
pipeline = Pipeline(stages = [assembler] + indexers + encoders + [assemblerCateg] + [assemblerAll])
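
Conceptually, the final assembler only concatenates the per-row values of its input columns in the order they are listed. The following is a plain-Python sketch of that idea, not Spark code: `assemble` is a hypothetical helper, the row values are made up, and real Spark output is stored as dense or sparse vectors rather than lists.

```python
def assemble(row, input_cols):
    """Concatenate the values of input_cols, in order, into one flat list."""
    out = []
    for name in input_cols:
        value = row[name]
        # A vector-like column contributes all its entries; a scalar adds one.
        out.extend(value if isinstance(value, list) else [value])
    return out


# Hypothetical row after the earlier pipeline stages have run.
row = {
    "numericFeatures": [1.0, 2.5, 3.0],      # colA, colB, colC
    "categFeatures": [1.0, 0.0, 0.0, 1.0],   # one-hot blocks for colD, colE
}

all_features = assemble(row, ["numericFeatures", "categFeatures"])
print(all_features)
```

Because the order of inputCols fixes the layout of the combined vector, keeping it stable between training and scoring keeps feature positions consistent for the downstream model.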

Regarding "python - PySpark: combining the output of two VectorAssemblers", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/67753426/
