gpt4 book ai didi

python - 如何计算具有多个逗号分隔值的列中某个单词的实例数?

转载 作者:太空宇宙 更新时间:2023-11-04 02:48:10 25 4
gpt4 key购买 nike

所以我基本上是在分析调查数据集。数据集如下所示:

   Respondent       Country           HaveWorkedLanguage
0 1 United States Swift
1 2 United Kingdom JavaScript; Python; Ruby; SQL
2 3 United Kingdom Java; PHP; Python
3 4 United States Matlab; Python; R; SQL
4 5 Switzerland NaN
5 6 New Zealand JavaScript; PHP; Rust

如您所见,列 HaveWorkedLanguage 在每个单元格中都有具有单个值或多个值的实例。我想做的是分析每个国家最著名的语言。为此,我首先执行了这样的 groupby:

stu=students.groupby(['Country','HaveWorkedLanguage'])['Respondent'].count().reset_index()
stu.columns=[['Country','Known_Languages','Count']]

我得到了这样一个数据框:

    Country         Known_Languages                             Count
0 Afghanistan Assembly; C; C++; Hack; Java; JavaScript 1
1 Afghanistan C 1
2 Albania C#; Java; Python; SQL 1
3 Albania C++; C#; Java; JavaScript; PHP 1
4 Albania C++; C#; JavaScript; SQL 1
5 Albania C++; Java; JavaScript; PHP; SQL 2

我实际上想要一个数据框来显示国家和每种语言的数量,以便最高数量显示最著名的语言。数据框应该是这样的:

      Country           Known_Languages     Count
0 United States Java 100
1 United States Python 80

早些时候,我能够使用以下代码找到整体著名语言:

for i in ['C','C++','C#','Java','Python','R','JavaScript']:
print(i,':',survey['HaveWorkedLanguage'].apply(lambda x: i in str(x).split('; ')).value_counts()[1])

输出是:

C : 6974
C++ : 8155
C# : 12476
Java : 14524
Python : 11704
R : 1634
JavaScript : 22875

但现在我也想把这个国家和它联系起来。我该怎么做?

最佳答案

hwl = students.HaveWorkedLanguage
cty = students.Country
stu = hwl.str.get_dummies('; ').groupby(cty).sum()
pd.concat(
[stu.idxmax(1), stu.max(1)],
axis=1, keys=['Lang', 'Count']
)

Lang Count
Country
New Zealand JavaScript 1
Switzerland Java 0
United Kingdom Python 2
United States Matlab 1

项目/杀死
numpy 技术

mask = students.HaveWorkedLanguage.notnull().values
fc, uc = pd.factorize(students.Country.values.astype(str))
hwl = students.HaveWorkedLanguage.values.astype(str)
lol = np.core.defchararray.split(hwl, '; ')
lol[np.flatnonzero(~mask)] = [[]]
i = fc.repeat([len(l) for l in lol])
j, ul = pd.factorize(np.concatenate(lol))
n = uc.size
m = ul.size
counts = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
x = counts.argmax(1)
pd.DataFrame(
np.column_stack([ul[x], counts[np.arange(n), x]]),
uc, ['Lang', 'Count'])

Lang Count
United States Swift 1
United Kingdom Python 2
Switzerland Swift 0
New Zealand JavaScript 1

时间

%%timeit
hwl = students.HaveWorkedLanguage
cty = students.Country
stu = hwl.str.get_dummies('; ').groupby(cty).sum()
pd.concat(
[stu.idxmax(1), stu.max(1)],
axis=1, keys=['Lang', 'Count']
)
100 loops, best of 3: 3.22 ms per loop

%%timeit
mask = students.HaveWorkedLanguage.notnull().values
fc, uc = pd.factorize(students.Country.values.astype(str))
hwl = students.HaveWorkedLanguage.values.astype(str)
lol = np.core.defchararray.split(hwl, '; ')
lol[np.flatnonzero(~mask)] = [[]]
i = fc.repeat([len(l) for l in lol])
j, ul = pd.factorize(np.concatenate(lol))
n = uc.size
m = ul.size
counts = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
x = counts.argmax(1)
pd.DataFrame(np.column_stack([ul[x], counts[np.arange(n), x]]), uc, ['Lang', 'Count'])
1000 loops, best of 3: 570 µs per loop

关于python - 如何计算具有多个逗号分隔值的列中某个单词的实例数?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/44580587/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com