
python - Fillna PySpark Dataframe with numpy array gives error

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:54:32

Below is a sample of my Spark DataFrame, with its printSchema underneath:

+--------------------+---+------+------+--------------------+
| device_id|age|gender| group| apps|
+--------------------+---+------+------+--------------------+
|-9073325454084204615| 24| M|M23-26| null|
|-8965335561582270637| 28| F|F27-28|[1.0,1.0,1.0,1.0,...|
|-8958861370644389191| 21| M| M22-|[4.0,0.0,0.0,0.0,...|
|-8956021912595401048| 21| M| M22-| null|
|-8910497777165914301| 25| F|F24-26| null|
+--------------------+---+------+------+--------------------+
only showing top 5 rows

root
|-- device_id: long (nullable = true)
|-- age: integer (nullable = true)
|-- gender: string (nullable = true)
|-- group: string (nullable = true)
|-- apps: vector (nullable = true)

I am trying to fill the null values in the 'apps' column with np.zeros(19237). But when I run

df.fillna({'apps': np.zeros(19237)})

I get an error

Py4JJavaError: An error occurred while calling o562.fill.
: java.lang.IllegalArgumentException: Unsupported value type java.util.ArrayList

Or if I try

df.fillna({'apps': DenseVector(np.zeros(19237))})

I get an error

AttributeError: 'numpy.ndarray' object has no attribute '_get_object_id'

Any ideas?

Best Answer

DataFrameNaFunctions supports only a subset of native (non-UDT) types, so you need a UDF.

from pyspark.sql.functions import coalesce, col, udf
from pyspark.ml.linalg import Vectors, VectorUDT

def zeros(n):
    def zeros_():
        return Vectors.sparse(n, {})
    return udf(zeros_, VectorUDT())()

Example usage:

df = spark.createDataFrame(
    [(1, Vectors.dense([1, 2, 3])), (2, None)],
    ("device_id", "apps"))

df.withColumn("apps", coalesce(col("apps"), zeros(3))).show()
+---------+-------------+
|device_id| apps|
+---------+-------------+
| 1|[1.0,2.0,3.0]|
| 2| (3,[],[])|
+---------+-------------+

Regarding "python - Fillna PySpark Dataframe with numpy array gives error", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44396366/
