python - Slow filtering of pyspark dataframes


I have a question about the difference in execution time when filtering pandas and pyspark dataframes:

import time
import numpy as np
import pandas as pd
from random import shuffle

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame(np.random.randint(1000000, size=400000).reshape(-1, 2))
list_filter = list(range(10000))
shuffle(list_filter)

# pandas is fast
t0 = time.time()
df_filtered = df[df[0].isin(list_filter)]
print(time.time() - t0)
# 0.0072

df_spark = spark.createDataFrame(df)

# pyspark is slow
t0 = time.time()
df_spark_filtered = df_spark[df_spark[0].isin(list_filter)]
print(time.time() - t0)
# 3.1232

If I increase the length of list_filter to 10000, the execution times become 0.01353 and 17.6768 seconds. The pandas implementation of isin seems computationally efficient. Can you explain why filtering a pyspark dataframe is so slow, and how can I perform this kind of filtering quickly?
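Note that Spark transformations are lazy, so the timing above measures only driver-side construction of the isin expression (one literal per list element), not an actual scan of the data. A minimal diagnostic sketch, reusing df_spark and list_filter from the code above, that separates the two costs:

# Building the condition alone: Spark converts every element of list_filter
# into a literal expression, which is where much of the time goes.
t0 = time.time()
cond = df_spark[df_spark.columns[0]].isin(list_filter)
print(time.time() - t0)

# Forcing execution with count(): only now does Spark actually scan the data.
t0 = time.time()
df_spark.filter(cond).count()
print(time.time() - t0)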

Best Answer

To speed up the filter operation in pyspark, use a join instead of a filter with an isin clause:

import time
import numpy as np
import pandas as pd
from random import shuffle
import pyspark.sql.functions as F

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame(np.random.randint(1000000, size=400000).reshape(-1, 2))

df_spark = spark.createDataFrame(df)

list_filter = list(range(10000))
list_filter_df = spark.createDataFrame([[x] for x in list_filter], df_spark.columns[:1])
shuffle(list_filter)

# pandas is fast because everything is in memory
t0 = time.time()
df_filtered = df[df[0].isin(list_filter)]
print(time.time() - t0)
# 0.0227580165863
# 0.0127580165863

# pyspark has some fixed overhead, but a broadcast join makes it fast
# compared to isin with a long list
t0 = time.time()
df_spark_filtered = df_spark.join(F.broadcast(list_filter_df), df_spark.columns[:1])
print(time.time() - t0)
# 0.0571971035004
# 0.0471971035004
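If you only need the rows of df_spark (with no extra key column merged in from the filter table), a left-semi join is a minimal sketch of the same broadcast-based filtering, using the same names as above:

# "left_semi" returns only df_spark's columns and keeps each matching row
# at most once, even if the filter table contained repeated keys.
df_spark_filtered = df_spark.join(
    F.broadcast(list_filter_df),
    on=df_spark.columns[:1],
    how="left_semi",
)

Here the plain inner join happens to give the same result because list_filter contains unique values, but left_semi states the intent of an isin-style filter explicitly.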

Regarding python - slow filtering of pyspark dataframes, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53737952/
