
Python: denormalizing a dataset into a view-like matrix

Reposted. Author: 行者123. Updated: 2023-12-04 18:05:36

I have a normalized (in the database sense) dataset with 3 columns (~5000 rows), for example:

user        phrase       tfw
517187571   able         1
517187571   abroad       0.4
1037767202  abuse        0.272727
517187571   accuse       0.8
803230586   acknowledge  0.4
...

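I need to transform it into a view-like matrix in which the rows are users, the columns are phrases, and each cell holds the tfw value for that user/phrase pair. Does anyone know how to do this efficiently in Python? The desired output (for the example above) would be: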

user/phrase   able   abroad   abuse      accuse   acknowledge
517187571     1      0        0          0        0
517187571     0      0.4      0          0        0
1037767202    0      0        0.272727   0        0
...

I tried to do this with a SQL query on a MySQL DB, but all I came up with was this "genius" query, which does not work:

SELECT
CONCAT('SELECT user,',
GROUP_CONCAT(sums),
' FROM clustering_normalized_dataset GROUP BY user')
FROM (
SELECT CONCAT('SUM(phrase=\'', phrase, '\') AS `', phrase, '`') sums
FROM clustering_normalized_dataset
GROUP BY phrase
ORDER BY COUNT(*) DESC
) s
INTO @sql;

PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

Best answer

With the pandas library, this is a one-liner using a simple pivot.

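import pandas as pd

data = [
    [517187571, "able", 1],
    [517187571, "abroad", 0.4],
    [1037767202, "abuse", 0.272727],
    [517187571, "accuse", 0.8],
    [803230586, "acknowledge", 0.4],
]

# Long format: one row per (user, phrase) pair
df = pd.DataFrame(data, columns=["user", "phrase", "tfw"])

# Wide format: users as rows, phrases as columns, tfw as cell values.
# Keyword arguments are used because recent pandas versions require them.
print(df.pivot(index="user", columns="phrase", values="tfw"))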

This gives

phrase      able  abroad     abuse  accuse  acknowledge
user
517187571      1     0.4       NaN     0.8          NaN
803230586    NaN     NaN       NaN     NaN          0.4
1037767202   NaN     NaN  0.272727     NaN          NaN

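Replacing the NaNs with 0.0 is trivial, but sometimes it is useful to keep them to indicate that you have no data for that entry. Either way, you can always sum over the valid range. The big advantage over the other method you proposed is that no extra data is stored in memory.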
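As a minimal sketch of the NaN replacement mentioned above, the snippet below fills missing cells with 0.0 using `fillna`. It also shows `pivot_table` as an alternative; this is not part of the original answer, and it assumes you may have repeated (user, phrase) pairs that you want to sum, a case in which plain `pivot` raises an error. The variable names `matrix`, `dense`, and `dense_alt` are illustrative only.

```python
import pandas as pd

data = [
    [517187571, "able", 1],
    [517187571, "abroad", 0.4],
    [1037767202, "abuse", 0.272727],
    [517187571, "accuse", 0.8],
    [803230586, "acknowledge", 0.4],
]
df = pd.DataFrame(data, columns=["user", "phrase", "tfw"])

# Pivot to the wide user x phrase matrix; missing pairs become NaN.
matrix = df.pivot(index="user", columns="phrase", values="tfw")

# Replace NaN with 0.0 to get the dense matrix asked for in the question.
dense = matrix.fillna(0.0)

# Alternative (an assumption about your data, not from the answer):
# pivot_table aggregates repeated (user, phrase) pairs instead of failing,
# and fill_value fills the missing cells in the same step.
dense_alt = df.pivot_table(
    index="user", columns="phrase", values="tfw",
    aggfunc="sum", fill_value=0.0,
)

print(dense)
```

Whether to keep the NaNs or fill them depends on whether "no data" and "zero" mean the same thing in your downstream clustering step.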

Regarding "Python: denormalizing a dataset into a view-like matrix", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/27487002/
