
python - Strange results from entropy calculation


I am trying to write a function that correctly calculates the entropy of a given dataset. However, I am getting very strange entropy values.

My understanding has been that all entropy calculations must fall between 0 and 1, yet I consistently get values above 2.

Note: I have to use log base 2 for this.

Can someone explain why I am producing incorrect entropy results? The dataset I am testing with is the ecoli dataset from the UCI Machine Learning Repository.

import numpy
import math


#################### DATA HANDLING LIBRARY ####################
def csv_to_array(file):
    # Open the file and load it, delimiting on ',' for a comma-separated-value file
    data = open(file, 'r')
    data = numpy.loadtxt(data, delimiter=',')

    # Loop through the rows of the array
    for index in range(len(data)):
        # Try to convert the row to floats; if a value can't be converted, zero the row
        try:
            data[index] = [float(x) for x in data[index]]
        except ValueError:
            data[index] = 0

    # Return the now type-formatted data
    return data


# Function that utilizes the numpy library to randomize the dataset.
def randomize_data(csv):
    # numpy.random.shuffle shuffles in place and returns None,
    # so shuffle first and then return the (now shuffled) array
    numpy.random.shuffle(csv)
    return csv


# Function to split the data into test, training, and validation sets
def split_data(csv):
    # Call the randomize data function
    randomize_data(csv)
    # Grab the number of rows and calculate where to split
    num_rows = csv.shape[0]
    validation_split = int(num_rows * 0.10)
    training_split = int(num_rows * 0.72)
    testing_split = int(num_rows * 0.18)  # the test set is the remainder below

    # Validation set as the first 10% of the data
    validation_set = csv[:validation_split]
    # Training set as the next 72%
    training_set = csv[validation_split:training_split + validation_split]
    # Testing set as the last 18%
    testing_set = csv[training_split + validation_split:]

    # Split the data into classes vs. actual data
    training_cols = training_set.shape[1]
    testing_cols = testing_set.shape[1]
    validation_cols = validation_set.shape[1]
    training_classes = training_set[:, training_cols - 1]
    testing_classes = testing_set[:, testing_cols - 1]
    validation_classes = validation_set[:, validation_cols - 1]

    # Take the sets and remove the last (classification) column
    # (slice columns with [:, :-1], not rows)
    training_set = training_set[:, :-1]
    testing_set = testing_set[:, :-1]
    validation_set = validation_set[:, :-1]

    # Return the datasets
    return testing_set, testing_classes, training_set, training_classes, validation_set, validation_classes


#################### DATA HANDLING LIBRARY ####################

# This function returns the list of classes, and their associated weights (i.e. distributions)
# for a given dataset
def class_distribution(dataset):
    # Ensure the dataset is a numpy array
    dataset = numpy.asarray(dataset)
    # Collect # of total rows and columns, using numpy
    num_total_rows = dataset.shape[0]
    num_columns = dataset.shape[1]
    # Create a numpy array of just the classes
    classes = dataset[:, num_columns - 1]
    # Use numpy.unique to remove duplicates
    classes = numpy.unique(classes)
    # Create an empty list for the class weights
    class_weights = []

    # Loop through the classes one by one
    for aclass in classes:
        # Create storage variables
        total = 0
        weight = 0
        # Now loop through the dataset
        for row in dataset:
            # If the class of the row equals the class being evaluated, increase the total
            if numpy.array_equal(aclass, row[-1]):
                total = total + 1
            # If not, continue
            else:
                continue
        # Divide the # of occurrences by total rows
        weight = float(total / num_total_rows)
        # Add that weight to the list of class weights
        class_weights.append(weight)

    # Turn the weights into a numpy array
    class_weights = numpy.asarray(class_weights)
    # Return the array
    return classes, class_weights

# This function returns the entropy for a given dataset
# Can be used across an entire csv, or just for a column of data (feature)
def get_entropy(dataset):
    # Set initial entropy
    entropy = 0.0
    # Determine the classes and their frequencies (weights) of the dataset
    classes, class_freq = class_distribution(dataset)
    # numpy.sort returns a sorted copy (it does not sort in place);
    # the order of summation does not change the result
    class_freq = numpy.sort(class_freq)
    # Determine the max entropy for the dataset
    max_entropy = math.log(len(classes), 2)
    print("MAX ENTROPY FOR THIS DATASET: ", max_entropy)
    # Loop through the frequencies and use the given formula to calculate entropy
    for freq in class_freq:
        entropy += float(-freq * math.log(freq, 2))

    # Return the entropy value
    return entropy



def main():
    ecol = csv_to_array('ecoli.csv')
    testing_set, testing_classes, training_set, training_classes, validation_set, validation_classes = split_data(ecol)

    entropy = get_entropy(ecol)
    print(entropy)

main()

Best Answer

The following function is used to calculate the entropy:

import math

# Function to return Shannon's entropy of the target attribute's column
def entropy(attributes, dataset, targetAttr):
    freq = {}
    entropy = 0.0
    # Find the column index of the target attribute
    index = attributes.index(targetAttr)

    # Count how often each value appears in the target column
    for item in dataset:
        if item[index] in freq:
            # Increase the count
            freq[item[index]] += 1.0
        else:
            # Initialize the count to 1
            freq[item[index]] = 1.0

    # Sum -p * log2(p) over the relative frequency of each value
    for count in freq.values():
        entropy = entropy + (-count / len(dataset)) * math.log(count / len(dataset), 2)
    return entropy
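
For instance, a call might look like the following. This is a minimal sketch with an invented toy dataset; the attribute names and values are hypothetical, not taken from the question or the ecoli data.

attributes = ['feature_1', 'feature_2', 'class']
dataset = [
    [0.1, 0.2, 'cp'],
    [0.3, 0.1, 'im'],
    [0.2, 0.4, 'cp'],
    [0.5, 0.6, 'pp'],
]
# Class counts are cp: 2, im: 1, pp: 1, so the entropy is
# -(0.5*log2(0.5) + 0.25*log2(0.25) + 0.25*log2(0.25)) = 1.5
print(entropy(attributes, dataset, 'class'))  # prints 1.5

Note that with just three classes the result already exceeds 1, which is exactly the behavior the question observed.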

As @MattTimmermans pointed out, the range of the entropy value actually depends on the number of classes. For strictly 2 classes it is contained in the range 0 to 1 (inclusive). For more than 2 classes (which is what is being tested here), however, the same formula (translated into the Pythonic code above) is bounded by log base 2 of the number of classes rather than by 1. This post here explains the math and the calculation in more detail.
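
To make the bound concrete: the Shannon entropy is H = -Σ p_i * log2(p_i), and its maximum over k classes is log2(k), reached when all classes are equally likely. Here is a minimal standalone sketch (not part of the question's code) showing the bound growing with the class count:

import math

# H = -sum(p * log2(p)) over the non-zero class probabilities
def shannon_entropy(probs):
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))   # 1.0 -- two classes cap out at log2(2) = 1
print(shannon_entropy([0.25] * 4))   # 2.0 -- four uniform classes give log2(4) = 2
print(shannon_entropy([0.125] * 8))  # 3.0 -- eight classes can reach log2(8) = 3

Since the ecoli dataset has 8 class labels, entropy values above 2 (up to 3) are expected rather than a sign of a bug.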

Regarding python - Strange results from entropy calculation, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55290793/
