python - Removing highly correlated pairwise features in a Dataframe using Dask?

Examples of this are hard to find, but I'd like to use Dask to drop pairwise-correlated columns whose correlation is above a 0.99 threshold. I can't use Pandas' correlation function directly because my dataset is too large and it quickly exhausts my memory. What I have now is a slow double for loop that starts with the first column, finds the correlation between it and every other column one at a time, and if a pair is above 0.99, drops the second column of the comparison, then starts again from the new second column, and so on, somewhat like the solution found here. Doing this iteratively across all columns is extremely slow, although it does run without hitting memory problems.

I've read the API here, and seen how to drop columns with Dask here, but I need some help getting this sorted. I'm wondering if there is a faster yet memory-friendly way to drop highly correlated columns from a Pandas Dataframe using Dask. I'd like to feed a Pandas dataframe into the function and have it return a Pandas dataframe once the correlation dropping is done.

Does anyone have any resources I could look at, or an example of how to do this?

Thanks!

Update: As requested, here is my current correlation-dropping routine described above:

print("Checking correlations of all columns...")

cols_to_drop_from_high_corr = []
corr_threshold = 0.99

for j in df.iloc[:,1:]: # Skip column 0

try: # encompass the below in a try/except, cuz dropping a col in the 2nd 'for' loop below will screw with this
# original list, so if a feature is no longer in there from dropping it prior, it'll throw an error

for k in df.iloc[:,1:]: # Start 2nd loop at first column also...

# If comparing the same column to itself, skip it
if (j == k):
continue

else:
try: # second try/except mandatory
correlation = abs(df[j].corr(df[k])) # Get the correlation of the first col and second col

if correlation > corr_threshold: # If they are highly correlated...
cols_to_drop_from_high_corr.append(k) # Add the second col to list for dropping when round is done before next round.")

except:
continue

# Once we have compared the first col with all of the other cols...
if len(cols_to_drop_from_high_corr) > 0:
df = df.drop(cols_to_drop_from_high_corr, axis=1) # Drop all the 2nd highly corr'd cols
cols_to_drop_from_high_corr = [] # Reset the list for next round
# print("Dropped all cols from most recent round. Continuing...")

except: # Now, if the first for loop tries to find a column that's been dropped already, just continue on
continue

print("Correlation dropping completed.")

Update: Using the solution below, I'm running into a few errors, and with my limited knowledge of dask syntax I'd appreciate some insight. I'm running Windows 10, Python 3.6, and the latest version of dask.

Using the code on my own dataset (the dataset in the link says "file not found"), I hit the first error:

ValueError: Exactly one of npartitions and chunksize must be specified.

So I specified npartitions=2 in from_pandas, and then got this error:

AttributeError: 'Array' object has no attribute 'compute_chunk_sizes'

I tried changing it to .rechunk('auto'), but then got this error:

ValueError: Can not perform automatic rechunking with unknown (nan) chunk sizes

My original dataframe has a shape of 1275 rows by 3045 columns. The dask array shape reports shape=(nan, 3045). Does that help diagnose the problem?
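
For reference, here is a minimal sketch of the pandas-to-dask conversion step that typically triggers these errors; the toy frame, its column count, and npartitions=4 are illustrative assumptions, not values from the actual code:

import numpy as np
import pandas as pd
import dask.dataframe as dd

# hypothetical stand-in for the real 1275 x 3045 dataframe
df = pd.DataFrame(np.random.rand(1275, 10),
                  columns=[f"col{i}" for i in range(10)])

# from_pandas requires exactly one of npartitions or chunksize
ddf = dd.from_pandas(df, npartitions=4)

# .values on a dask dataframe yields an array whose row chunk sizes are
# unknown, hence shape=(nan, ...); to_dask_array(lengths=True) computes the
# partition lengths up front, so rechunk('auto') is never needed
arr = ddf.to_dask_array(lengths=True)
print(arr.shape)  # (1275, 10) instead of (nan, 10)

As for the AttributeError, compute_chunk_sizes was only added in later dask releases, so it usually indicates an older dask installation.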

Best Answer

I'm not sure if this helps, but maybe it can serve as a starting point.

Pandas

import pandas as pd
import numpy as np

url = "https://raw.githubusercontent.com/dylan-profiler/heatmaps/master/autos.clean.csv"

df = pd.read_csv(url)

# we check correlation for these columns only
cols = df.columns[-8:]

# the columns in this df are not very strongly
# correlated, so use a lower threshold here
corr_threshold = 0.5

corr = df[cols].corr().abs().values

# we take the upper triangle only
corr = np.triu(corr)

# we want high correlation but not diagonal elements
# it returns a bool matrix
out = (corr != 1) & (corr > corr_threshold)

# for every row we want only the True columns
cols_to_remove = []
for o in out:
    cols_to_remove += cols[o].to_list()

cols_to_remove = list(set(cols_to_remove))

df = df.drop(cols_to_remove, axis=1)

Dask

Here I'll only comment on the steps that differ from pandas:

import dask.dataframe as dd
import dask.array as da

url = "https://raw.githubusercontent.com/dylan-profiler/heatmaps/master/autos.clean.csv"

df = dd.read_csv(url)

cols = df.columns[-8:]

corr_threshold = 0.5

corr = df[cols].corr().abs().values

# with dask we need to rechunk
corr = corr.compute_chunk_sizes()

corr = da.triu(corr)

out = (corr != 1) & (corr > corr_threshold)

# dask is lazy
out = out.compute()

cols_to_remove = []
for o in out:
    cols_to_remove += cols[o].to_list()

cols_to_remove = list(set(cols_to_remove))

df = df.drop(cols_to_remove, axis=1)
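
To get the pandas-in/pandas-out behavior the question asks for, the same idea can be wrapped into a function. This is only a sketch: the name drop_correlated, the default npartitions=4, and the demo frame are illustrative assumptions, and it assumes all columns are numeric with no missing values (da.corrcoef, unlike DataFrame.corr, does not do pairwise NaN handling):

import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd

def drop_correlated(df, corr_threshold=0.99, npartitions=4):
    """Drop the second column of every pair correlated above the threshold."""
    ddf = dd.from_pandas(df, npartitions=npartitions)

    # lengths=True materializes partition sizes, avoiding the
    # unknown (nan) chunk errors described in the question
    arr = ddf.to_dask_array(lengths=True)

    # columns become the variables after transposing
    corr = da.triu(abs(da.corrcoef(arr.T)))

    # high correlation, excluding the diagonal; compute() because dask is lazy
    out = ((corr != 1) & (corr > corr_threshold)).compute()

    cols = df.columns
    cols_to_remove = sorted({c for row in out for c in cols[row]})
    return df.drop(cols_to_remove, axis=1)

# illustrative usage: "f" is nearly collinear with "a" and should be dropped
np.random.seed(0)
demo = pd.DataFrame(np.random.rand(100, 5), columns=list("abcde"))
demo["f"] = demo["a"] + np.random.normal(scale=0.01, size=100)
print(drop_correlated(demo).columns.tolist())

One caveat carried over from the answer: corr != 1 masks the diagonal, but it would also mask a pair whose correlation is exactly 1.0, so perfectly duplicated columns can slip through; the small noise added in the demo sidesteps that edge case.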

This question, "python - Removing highly correlated pairwise features in a Dataframe using Dask?", is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/62805288/
