python - Fast information gain computation

Reposted by: 太空狗 | Updated: 2023-10-29 20:22:55

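I need to compute information gain scores for >100k features in >10k documents for text classification. The code below works correctly, but it is very slow on the full dataset: it takes more than an hour on a laptop. The dataset is 20newsgroup and I am using scikit-learn; the chi2 function provided in scikit-learn runs extremely fast.

Any idea how to compute information gain faster for a dataset like this?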

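import numpy as np

def information_gain(x, y):
    # x: sparse document-term matrix (documents x features); y: integer class labels.

    def _entropy(values):
        # Shannon entropy (natural log) of the class distribution in `values`.
        counts = np.bincount(values)
        probs = counts[np.nonzero(counts)] / float(len(values))
        return - np.sum(probs * np.log(probs))

    def _information_gain(feature, y):
        # `feature` is one row of x.T, so the column indices of its
        # nonzero entries are the documents that contain the feature.
        feature_set_indices = np.nonzero(feature)[1]
        # Every `not in` test scans the whole index array.
        feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices]
        entropy_x_set = _entropy(y[feature_set_indices])
        entropy_x_not_set = _entropy(y[feature_not_set_indices])

        # Entropy before the split minus the weighted entropy after it.
        return entropy_before - (((len(feature_set_indices) / float(feature_size)) * entropy_x_set)
                                 + ((len(feature_not_set_indices) / float(feature_size)) * entropy_x_not_set))

    feature_size = x.shape[0]  # number of documents
    feature_range = range(0, feature_size)
    entropy_before = _entropy(y)
    information_gain_scores = []

    # One pass per feature (a row of x.T).
    for feature in x.T:
        information_gain_scores.append(_information_gain(feature, y))
    return information_gain_scores, []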
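
For reference, the quantity the code above computes can be written out as follows. The notation is my own and does not appear in the question: D is the set of all documents, D_j the documents in which feature j is nonzero, and p_c(S) the fraction of documents in S that belong to class c.

```latex
H(S) = -\sum_{c} p_c(S)\,\ln p_c(S)
\qquad
IG(j) = H(D) - \left( \frac{|D_j|}{|D|}\,H(D_j)
      + \frac{|D \setminus D_j|}{|D|}\,H(D \setminus D_j) \right)
```

Here H(D) corresponds to entropy_before, H(D_j) to entropy_x_set, H(D \ D_j) to entropy_x_not_set, and |D| to feature_size.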

Edit:

I merged the inner functions and ran cProfile as follows (on a dataset limited to ~15k features and ~1k documents):

cProfile.runctx(
    """for feature in x.T:
    feature_set_indices = np.nonzero(feature)[1]
    feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices]

    values = y[feature_set_indices]
    counts = np.bincount(values)
    probs = counts[np.nonzero(counts)] / float(len(values))
    entropy_x_set = - np.sum(probs * np.log(probs))

    values = y[feature_not_set_indices]
    counts = np.bincount(values)
    probs = counts[np.nonzero(counts)] / float(len(values))
    entropy_x_not_set = - np.sum(probs * np.log(probs))

    result = entropy_before - (((len(feature_set_indices) / float(feature_size)) * entropy_x_set)
                             + ((len(feature_not_set_indices) / float(feature_size)) * entropy_x_not_set))
    information_gain_scores.append(result)""",
    globals(), locals())

Top 20 results by tottime:

ncalls   tottime  percall  cumtime  percall  filename:lineno(function)
1        60.27    60.27    65.48    65.48    <string>:1(<module>)
16171    1.362    0        2.801    0        csr.py:313(_get_row_slice)
16171    0.523    0        0.892    0        coo.py:201(_check)
16173    0.394    0        0.89     0        compressed.py:101(check_format)
210235   0.297    0        0.297    0        {numpy.core.multiarray.array}
16173    0.287    0        0.331    0        compressed.py:631(prune)
16171    0.197    0        1.529    0        compressed.py:534(tocoo)
16173    0.165    0        1.263    0        compressed.py:20(__init__)
16171    0.139    0        1.669    0        base.py:415(nonzero)
16171    0.124    0        1.201    0        coo.py:111(__init__)
32342    0.123    0        0.123    0        {method 'max' of 'numpy.ndarray' objects}
48513    0.117    0        0.218    0        sputils.py:93(isintlike)
32342    0.114    0        0.114    0        {method 'sum' of 'numpy.ndarray' objects}
16171    0.106    0        3.081    0        csr.py:186(__getitem__)
32342    0.105    0        0.105    0        {numpy.lib._compiled_base.bincount}
32344    0.09     0        0.094    0        base.py:59(set_shape)
210227   0.088    0        0.088    0        {isinstance}
48513    0.081    0        1.777    0        fromnumeric.py:1129(nonzero)
32342    0.078    0        0.078    0        {method 'min' of 'numpy.ndarray' objects}
97032    0.066    0        0.153    0        numeric.py:167(asarray)

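It looks like most of the time is spent in _get_row_slice. I am not entirely sure about the first row: it seems to cover the whole block I passed to cProfile.runctx, but I don't know why there is such a large gap between the first row's tottime=60.27 and the second row's tottime=1.362. Where is the difference spent? Is it possible to check this in cProfile?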
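
One likely explanation for the gap, offered as an assumption rather than something the profile output proves: cProfile records function calls, so work done by operators inside the profiled string (such as the `i not in feature_set_indices` test in the list comprehension) gets no row of its own and is charged to the tottime of `<string>:1(<module>)`. A minimal sketch of how such a cost could be made visible is to move it into a named function. Here `complement_indices` and `profile_complement` are hypothetical helpers, and a plain list stands in for the NumPy index array.

```python
import cProfile
import io
import pstats


def complement_indices(n, set_indices):
    # Same pattern as feature_not_set_indices in the question: every
    # "not in" test is a linear scan over set_indices.
    return [i for i in range(n) if i not in set_indices]


def profile_complement(n=2000):
    # Every second index counts as "set", the rest must be found by scanning.
    set_indices = list(range(0, n, 2))

    profiler = cProfile.Profile()
    profiler.enable()
    result = complement_indices(n, set_indices)
    profiler.disable()

    # Render the top entries, sorted by tottime, into a string.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("tottime").print_stats(5)
    return result, stream.getvalue()


if __name__ == "__main__":
    _, report = profile_complement()
    print(report)
```

With the scan wrapped in a function, its cost shows up in a row for `complement_indices` (or for its list comprehension, depending on the Python version) instead of being folded into the enclosing module-level row.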

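Basically, the problem seems to lie in the sparse matrix operations (slicing, getting elements). The solution is probably to compute information gain with matrix algebra (the way chi2 is implemented in scikit-learn). But I have no idea how to express this computation in terms of matrix operations... Does anyone have an idea?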
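
One way the computation might be expressed with matrix operations is sketched below. It is not taken from the thread and rests on a few assumptions: `information_gain_vectorized` is a hypothetical name, `y` holds non-negative integer class labels, a feature counts as "set" whenever its value is nonzero, and `X` is either a dense NumPy array or a scipy.sparse matrix. The idea is that a single product of the binarized `X.T` with a one-hot label matrix yields the per-feature, per-class document counts, from which both entropies follow.

```python
import numpy as np


def information_gain_vectorized(X, y):
    """Information gain of "feature set vs. not set" for every column of X."""
    y = np.asarray(y)
    n_docs = X.shape[0]
    n_classes = int(y.max()) + 1

    # One-hot encode the labels: Y[i, c] = 1 if document i has class c.
    Y = np.zeros((n_docs, n_classes))
    Y[np.arange(n_docs), y] = 1.0

    # Binarize X; one matrix product then gives, per feature and class,
    # the number of documents in which the feature is set.
    X_bin = (X != 0).astype(np.float64)
    present = np.asarray(X_bin.T @ Y)      # shape (n_features, n_classes)
    absent = Y.sum(axis=0) - present       # class totals minus "present"

    def _entropy_rows(counts):
        totals = counts.sum(axis=1, keepdims=True)
        # Rows with no documents get probability 0 everywhere (entropy 0).
        probs = counts / np.maximum(totals, 1.0)
        # Take the log only where probs > 0; other entries stay 0.
        logs = np.log(probs, where=probs > 0, out=np.zeros_like(probs))
        return -(probs * logs).sum(axis=1)

    n_present = present.sum(axis=1)
    entropy_before = _entropy_rows(Y.sum(axis=0, keepdims=True))[0]
    return entropy_before - (
        (n_present / n_docs) * _entropy_rows(present)
        + ((n_docs - n_present) / n_docs) * _entropy_rows(absent)
    )
```

The result is an array with one score per feature. The intermediate count matrices are dense with shape (n_features, n_classes), which should stay manageable for ~100k features and the 20 classes of 20newsgroup.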

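Best answer

A year has passed, so I don't know whether this is still useful. But I now happen to face the same text classification task. I rewrote your code using the nonzero() function provided for sparse matrices. I then simply scan nz, count the corresponding y values, and compute the entropy.

The following code takes only a few seconds to run on the news20 dataset (loaded using the libsvm sparse matrix format).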

import numpy as np

def information_gain(X, y):

    def _calIg():
        # Information gain of the current feature, from the counts in classCnt.
        entropy_x_set = 0
        entropy_x_not_set = 0
        for c in classCnt:
            probs = classCnt[c] / float(featureTot)
            entropy_x_set = entropy_x_set - probs * np.log(probs)
            rest = classTotCnt[c] - classCnt[c]
            if rest > 0:  # skip empty terms: 0 * log(0) would give nan
                probs = rest / float(tot - featureTot)
                entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        for c in classTotCnt:
            if c not in classCnt:
                probs = classTotCnt[c] / float(tot - featureTot)
                entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        return entropy_before - ((featureTot / float(tot)) * entropy_x_set
                                 + ((tot - featureTot) / float(tot)) * entropy_x_not_set)

    # Class totals and the entropy of the whole dataset.
    tot = X.shape[0]
    classTotCnt = {}
    entropy_before = 0
    for i in y:
        if i not in classTotCnt:
            classTotCnt[i] = 1
        else:
            classTotCnt[i] = classTotCnt[i] + 1
    for c in classTotCnt:
        probs = classTotCnt[c] / float(tot)
        entropy_before = entropy_before - probs * np.log(probs)

    # nz[0] holds feature indices, nz[1] document indices.
    # The scan below relies on nz[0] being grouped by feature index.
    nz = X.T.nonzero()
    pre = 0
    classCnt = {}
    featureTot = 0
    information_gain = []
    for i in range(0, len(nz[0])):
        if (i != 0 and nz[0][i] != pre):
            # A new feature starts: features skipped in between never occur and score 0.
            for notappear in range(pre+1, nz[0][i]):
                information_gain.append(0)
            ig = _calIg()
            information_gain.append(ig)
            pre = nz[0][i]
            classCnt = {}
            featureTot = 0
        featureTot = featureTot + 1
        yclass = y[nz[1][i]]
        if yclass not in classCnt:
            classCnt[yclass] = 1
        else:
            classCnt[yclass] = classCnt[yclass] + 1
    # Score for the last feature seen.
    ig = _calIg()
    information_gain.append(ig)

    return np.asarray(information_gain)

Regarding python - fast information gain computation, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25462407/
