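I am trying to write a function that correctly computes the entropy of a given dataset. However, I am getting very strange entropy values.
My understanding is that every entropy calculation must fall between 0 and 1, yet I consistently get values above 2.
Note: I must use log base 2 for this.
Can someone explain why I am getting incorrect entropy results? The dataset I am testing with is the ecoli dataset from the UCI Machine Learning Repository.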
import numpy
import math

#################### DATA HANDLING LIBRARY ####################
def csv_to_array(file):
    # Open the file and load it, splitting on ',' for a comma-separated value file
    with open(file, 'r') as handle:
        data = numpy.loadtxt(handle, delimiter=',')
    # Loop through the rows of the array
    for index in range(len(data)):
        # Try to convert the row to float; if it can't be converted, set it to 0
        try:
            data[index] = [float(x) for x in data[index]]
        except ValueError:
            data[index] = 0
    # Return the now type-formatted data
    return data
# Function that uses the numpy library to randomize the dataset.
def randomize_data(csv):
    # numpy.random.shuffle shuffles in place and returns None,
    # so return the (now shuffled) array itself
    numpy.random.shuffle(csv)
    return csv
# Function to split the data into testing, training, and validation sets
def split_data(csv):
    # Call the randomize data function (shuffles the rows in place)
    randomize_data(csv)
    # Grab the number of rows and calculate where to split
    num_rows = csv.shape[0]
    validation_split = int(num_rows * 0.10)
    training_split = int(num_rows * 0.72)
    testing_split = int(num_rows * 0.18)
    # Validation set as the first 10% of the data
    validation_set = csv[:validation_split]
    # Training set as the next 72%
    training_set = csv[validation_split:training_split + validation_split]
    # Testing set as the remaining rows (about 18%)
    testing_set = csv[training_split + validation_split:]
    # Split the data into classes vs. actual data; the class is the last column
    training_classes = training_set[:, -1]
    testing_classes = testing_set[:, -1]
    validation_classes = validation_set[:, -1]
    # Remove the last (classification) column.
    # Note: [:-1] would drop the last row, not the last column.
    training_set = training_set[:, :-1]
    testing_set = testing_set[:, :-1]
    validation_set = validation_set[:, :-1]
    # Return the datasets
    return testing_set, testing_classes, training_set, training_classes, validation_set, validation_classes
#################### DATA HANDLING LIBRARY ####################
# This function returns the list of classes, and their associated weights (i.e. distributions)
# for a given dataset
def class_distribution(dataset):
    # Ensure the dataset is a numpy array
    dataset = numpy.asarray(dataset)
    # Collect the total number of rows and columns, using numpy
    num_total_rows = dataset.shape[0]
    num_columns = dataset.shape[1]
    # Create a numpy array of just the classes (the last column)
    classes = dataset[:, num_columns - 1]
    # Use numpy.unique to remove duplicates
    classes = numpy.unique(classes)
    # Create an empty list for the class weights
    class_weights = []
    # Loop through the classes one by one
    for aclass in classes:
        # Count the rows that belong to the current class
        total = 0
        for row in dataset:
            if numpy.array_equal(aclass, row[-1]):
                total = total + 1
        # Divide the number of occurrences by the total number of rows
        weight = float(total) / num_total_rows
        # Add that weight to the list of class weights
        class_weights.append(weight)
    # Turn the weights into a numpy array
    class_weights = numpy.asarray(class_weights)
    # Return the classes and their weights
    return classes, class_weights
# This function returns the entropy for a given dataset
# Can be used across an entire csv, or just for a column of data (feature)
def get_entropy(dataset):
    # Set initial entropy
    entropy = 0.0
    # Determine the classes and their frequencies (weights) of the dataset
    classes, class_freq = class_distribution(dataset)
    # numpy.sort returns a sorted copy, so keep the result;
    # the order of the terms does not change the entropy sum
    class_freq = numpy.sort(class_freq)
    # Determine the max entropy for the dataset: log2(number of classes)
    max_entropy = math.log(len(classes), 2)
    print("MAX ENTROPY FOR THIS DATASET: ", max_entropy)
    # Loop through the frequencies and sum -p * log2(p) to calculate entropy
    for freq in class_freq:
        entropy += float(-freq * math.log(freq, 2))
    # Return the entropy value
    return entropy
def main():
    ecol = csv_to_array('ecoli.csv')
    testing_set, testing_classes, training_set, training_classes, validation_set, validation_classes = split_data(ecol)
    entropy = get_entropy(ecol)
    print(entropy)

main()
Best answer
The following function computes the entropy:
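# Function to return Shannon's Entropy
def entropy(attributes, dataset, targetAttr):
    freq = {}
    entropy = 0.0
    # Find the position of the target attribute in the attribute list
    index = 0
    for item in attributes:
        if (targetAttr == item):
            break
        else:
            index = index + 1
    # NOTE: shifting the index back by one only selects the target column if
    # `attributes` has one extra leading entry relative to the dataset rows;
    # drop this line if the two are aligned
    index = index - 1
    # Count how often each value of the target attribute occurs
    for item in dataset:
        if ((item[index]) in freq):
            # Increase the count for this value
            freq[item[index]] += 1.0
        else:
            # Initialize the count for this value to 1
            freq[item[index]] = 1.0
    # Sum -p * log2(p) over the observed values
    for count in freq.values():
        entropy = entropy + (-count / len(dataset)) * math.log(count / len(dataset), 2)
    return entropy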
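As @MattTimmermans pointed out, the range of the entropy depends on the number of classes. For exactly 2 classes it lies between 0 and 1 (inclusive). For more than 2 classes (as in the dataset being tested), the same formula applies, H = -sum(p_i * log2(p_i)), which is what the Python code above implements, but its maximum is log2(k) for k classes, so values above 1 are expected. This post here explains the math and the calculation in more detail.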
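To make the range argument concrete, here is a minimal sketch (not the asker's or the answerer's code) that computes base-2 Shannon entropy from a plain list of class labels and, optionally, divides it by log2(k) so the result falls in [0, 1]. The function names `shannon_entropy` and `normalized_entropy` and the toy label lists are hypothetical, chosen only for illustration; Python 3 is assumed.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Base-2 Shannon entropy of a sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    # H = -sum(p * log2(p)) over the observed classes
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def normalized_entropy(labels):
    """Entropy divided by its maximum, log2(k), so the result lies in [0, 1]."""
    k = len(set(labels))
    if k < 2:
        # A single class carries no uncertainty
        return 0.0
    return shannon_entropy(labels) / math.log2(k)


# Hypothetical toy data: perfectly balanced 2-class and 8-class label lists
two_classes = ["a", "b"] * 4
eight_classes = list("abcdefgh")

print(shannon_entropy(two_classes))       # 1.0
print(shannon_entropy(eight_classes))     # 3.0
print(normalized_entropy(eight_classes))  # 1.0
```

With eight equally likely classes the entropy reaches log2(8) = 3, so a value above 2 on a dataset with more than four classes is not by itself a sign of a bug. If a score between 0 and 1 is needed, dividing by log2(k) is one common way to get it.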
Regarding "python - Strange results from entropy calculation", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55290793/