
python - Comparison of a `float` to `np.nan` in a Spark DataFrame

Reposted. Author: 太空狗. Updated: 2023-10-29 22:28:24

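Is this expected behavior? I was going to raise an issue with Spark, but this seems like such basic functionality that it is hard to imagine there is a bug here. What am I missing?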

Python

import numpy as np

>>> np.nan < 0.0
False

>>> np.nan > 0.0
False

PySpark

from pyspark.sql.functions import col

df = spark.createDataFrame([(np.nan, 0.0),(0.0, np.nan)])
df.show()
#+---+---+
#| _1| _2|
#+---+---+
#|NaN|0.0|
#|0.0|NaN|
#+---+---+

df.printSchema()
#root
# |-- _1: double (nullable = true)
# |-- _2: double (nullable = true)

df.select(col("_1") > col("_2")).show()
#+---------+
#|(_1 > _2)|
#+---------+
#|     true|
#|    false|
#+---------+

Best answer

This is both expected and documented behavior. Quoting the NaN Semantics section of the official Spark SQL Guide (emphasis mine):

There is specially handling for not-a-number (NaN) when dealing with float or double types that does not exactly match standard floating point semantics. Specifically:

  • NaN = NaN returns true.
  • In aggregations, all NaN values are grouped together.
  • NaN is treated as a normal value in join keys.
  • NaN values go last when in ascending order, larger than any other numeric value.
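The rules quoted above can be modeled in plain Python. This is only a rough sketch of the documented semantics, not Spark's actual implementation, and the helper names `spark_compare`, `spark_gt` and `spark_eq` are made up for illustration. It reproduces the two rows from the question: NaN > 0.0 comes out true and 0.0 > NaN comes out false.

```python
import math
from functools import cmp_to_key


def spark_compare(a: float, b: float) -> int:
    """Three-way comparison modeled on the quoted NaN rules (hypothetical helper).

    NaN is equal to NaN and larger than any other value.
    Returns -1, 0 or 1.
    """
    a_nan, b_nan = math.isnan(a), math.isnan(b)
    if a_nan and b_nan:
        return 0
    if a_nan:
        return 1
    if b_nan:
        return -1
    return (a > b) - (a < b)


def spark_gt(a: float, b: float) -> bool:
    return spark_compare(a, b) > 0


def spark_eq(a: float, b: float) -> bool:
    return spark_compare(a, b) == 0


nan = float("nan")
print(spark_gt(nan, 0.0))  # True
print(spark_gt(0.0, nan))  # False
print(spark_eq(nan, nan))  # True

# NaN sorts last in ascending order, even after infinity.
print(sorted([nan, 1.0, -1.0, float("inf")], key=cmp_to_key(spark_compare)))
# [-1.0, 1.0, inf, nan]
```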

As you can see, the ordering behavior is not the only difference compared to Python NaN. In particular, Spark considers NaN values equal to each other:

spark.sql("""
WITH table AS (SELECT CAST('NaN' AS float) AS x, cast('NaN' AS float) AS y)
SELECT x = y, x != y FROM table
""").show()
+-------+-------------+
|(x = y)|(NOT (x = y))|
+-------+-------------+
|   true|        false|
+-------+-------------+

while plain Python

float("NaN") == float("NaN"), float("NaN") != float("NaN")
(False, True)

and NumPy

np.nan == np.nan, np.nan != np.nan
(False, True)

don't.

You can check the eqNullSafe docstring for more examples.

So, to get the desired result, you have to check for NaN explicitly:

from pyspark.sql.functions import col, isnan, when

when(isnan("_1") | isnan("_2"), False).otherwise(col("_1") > col("_2"))

Regarding python - Comparison of a `float` to `np.nan` in a Spark DataFrame, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55227625/
