
python - Dropping highly correlated pairwise features from a DataFrame using Dask?

Reposted. Author: 行者123. Updated: 2023-12-05 03:51:35

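I'm having a hard time finding an example of this, but I'd like to use Dask to drop pairwise correlated columns whenever their correlation is above 0.99. I can't use Pandas' correlation function, because my dataset is too large and it eats up my memory in a hurry. What I have now is a slow double for loop. It starts with the first column, computes the correlation between it and every other column one by one, and drops the second (compared) column whenever the correlation is above 0.99. It then moves on to the new second column, and so on, somewhat like the solution found here. Doing this iteratively across all columns is unbearably slow, although it does run without hitting memory issues.

I've read the API here, and looked at how to drop columns using Dask here, but I need some help getting this figured out. Is there a faster, yet memory-friendly, way of dropping highly correlated columns from a Pandas DataFrame using Dask? I'd like to feed a Pandas DataFrame into a function and have it return a Pandas DataFrame once the correlation dropping is done.

Does anyone have resources I can look at, or an example of how to do this?

Thanks!

UPDATE: As requested, here is my current correlation-dropping routine, as described above: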

print("Checking correlations of all columns...")

cols_to_drop_from_high_corr = []
corr_threshold = 0.99

for j in df.iloc[:,1:]: # Skip column 0

    try: # encompass the below in a try/except, cuz dropping a col in the 2nd 'for' loop below will screw with this
         # original list, so if a feature is no longer in there from dropping it prior, it'll throw an error

        for k in df.iloc[:,1:]: # Start 2nd loop at first column also...

            # If comparing the same column to itself, skip it
            if (j == k):
                continue

            else:
                try: # second try/except mandatory
                    correlation = abs(df[j].corr(df[k])) # Get the correlation of the first col and second col

                    if correlation > corr_threshold: # If they are highly correlated...
                        cols_to_drop_from_high_corr.append(k) # Add the second col to list for dropping when round is done before next round.

                except:
                    continue

        # Once we have compared the first col with all of the other cols...
        if len(cols_to_drop_from_high_corr) > 0:
            df = df.drop(cols_to_drop_from_high_corr, axis=1) # Drop all the 2nd highly corr'd cols
            cols_to_drop_from_high_corr = [] # Reset the list for next round
            # print("Dropped all cols from most recent round. Continuing...")

    except: # Now, if the first for loop tries to find a column that's been dropped already, just continue on
        continue

print("Correlation dropping completed.")

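UPDATE: Using the solution below, I'm running into a few errors, and given how limited my knowledge of dask syntax is, I'm hoping to get some insight. I'm running Windows 10, Python 3.6 and the latest version of dask.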

Using the code as is on my own dataset (the dataset in the link says "file not found"), I ran into the first error:

ValueError: Exactly one of npartitions and chunksize must be specified.

So I specified npartitions=2 in from_pandas, and then got this error:

AttributeError: 'Array' object has no attribute 'compute_chunk_sizes'

I tried changing that call to .rechunk('auto'), but then got this error:

ValueError: Can not perform automatic rechunking with unknown (nan) chunk sizes

My original dataframe has a shape of 1275 rows by 3045 columns. The dask array reports shape=(nan, 3045). Does this help to diagnose the issue?

Best answer

I'm not sure if this helps, but maybe it can serve as a starting point.

Pandas

import pandas as pd
import numpy as np

url = "https://raw.githubusercontent.com/dylan-profiler/heatmaps/master/autos.clean.csv"

df = pd.read_csv(url)

# we check correlation for these columns only
cols = df.columns[-8:]

# columns in this df don't have a big
# correlation coefficient
corr_threshold = 0.5

corr = df[cols].corr().abs().values

# we take the strict upper triangle only: k=1 excludes the diagonal,
# so each pair is seen once and no column is compared with itself
# (testing corr != 1 instead would also skip perfectly correlated pairs)
corr = np.triu(corr, k=1)

# bool matrix marking the highly correlated pairs
out = corr > corr_threshold

# for every row we want only the True columns
cols_to_remove = []
for o in out:
    cols_to_remove += cols[o].to_list()

cols_to_remove = list(set(cols_to_remove))

df = df.drop(cols_to_remove, axis=1)
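
The question asks for a function that takes a Pandas DataFrame and returns one, so the steps above can be wrapped roughly as follows. This is only a sketch: the function name `drop_correlated` and the toy data are made up for illustration, and it assumes every column is numeric. A column is dropped when it is highly correlated with any earlier column.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float = 0.99) -> pd.DataFrame:
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs().to_numpy()
    # k=1 keeps only the entries strictly above the diagonal,
    # so every pair of columns is considered exactly once
    upper = np.triu(corr, k=1)
    # column j is dropped if any earlier column i has |corr(i, j)| > threshold
    to_drop = df.columns[(upper > threshold).any(axis=0)]
    return df.drop(columns=to_drop)


# Toy data (hypothetical): "b" is an exact multiple of "a", "c" is unrelated
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
reduced = drop_correlated(demo, threshold=0.99)
print(list(reduced.columns))
```

Note that this is not identical to the loop in the question. Here a column is also dropped when its only highly correlated partner is itself being dropped, whereas the loop never uses an already dropped column as a reference.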

Dask

Here I comment only on the steps that differ from pandas

import dask.dataframe as dd
import dask.array as da

url = "https://raw.githubusercontent.com/dylan-profiler/heatmaps/master/autos.clean.csv"

df = dd.read_csv(url)

cols = df.columns[-8:]

corr_threshold = 0.5

corr = df[cols].corr().abs().values

# .values gives a dask array with unknown (nan) chunk sizes, so we
# compute them first (needs a dask version that has compute_chunk_sizes)
corr = corr.compute_chunk_sizes()

# strict upper triangle: k=1 excludes the diagonal
corr = da.triu(corr, k=1)

out = corr > corr_threshold

# dask is lazy
out = out.compute()

cols_to_remove = []
for o in out:
    cols_to_remove += cols[o].to_list()

cols_to_remove = list(set(cols_to_remove))

df = df.drop(cols_to_remove, axis=1)

Regarding "python - Dropping highly correlated pairwise features from a DataFrame using Dask?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62805288/
