python - How can I improve performance when iterating over a pandas DataFrame?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:10:00

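I have two pandas DataFrames. The first contains a list of unigrams extracted from a text, along with each unigram's count and probability in that text. Its structure looks like this: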

unigram_df
           word  count      prob
0            we    109  0.003615
1  investigated     20  0.000663
2           the   1125  0.037315
3     potential     36  0.001194
4            of   1122  0.037215

The second contains a list of skipgrams extracted from the same text, along with each skipgram's count and probability in that text. It looks like this:

skipgram_df
                        word  count      prob
0         (we, investigated)      5  0.000055
1                  (we, the)     31  0.000343
2            (we, potential)      2  0.000022
3        (investigated, the)     11  0.000122
4  (investigated, potential)      3  0.000033

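Now I want to compute the pointwise mutual information (PMI) of each skipgram, which is basically the logarithm of the skipgram's probability divided by the product of the probabilities of its two unigrams. I wrote a function for this that iterates over the skipgram DataFrame, and it works exactly the way I want, but I have a serious performance problem. Is there a way to improve my code so that it computes the PMI faster?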
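Written out, the definition above can be stated as follows. The base-10 logarithm is an assumption taken from the `math.log10` call in the code; the symbols P(x), P(y) and P(x, y) are introduced here only for illustration and stand for the `prob` values of the two unigrams and of the skipgram.

```latex
\mathrm{PMI}(x, y) = \log_{10} \frac{P(x, y)}{P(x)\, P(y)}
```

For the first skipgram in the sample data, (we, investigated), this gives log10(0.000055 / (0.003615 × 0.000663)) ≈ log10(22.95) ≈ 1.36.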

Here is my code:

import math

def calculate_pmi(row):
    # row comes from itertuples(): (Index, word, count, prob)
    skipgram_prob = float(row[3])
    x_unigram_prob = float(unigram_df.loc[unigram_df['word'] == row[1][0]]
                           ['prob'])
    y_unigram_prob = float(unigram_df.loc[unigram_df['word'] == row[1][1]]
                           ['prob'])
    pmi = math.log10(float(skipgram_prob / (x_unigram_prob * y_unigram_prob)))
    result = str(str(row[1][0]) + ' ' + str(row[1][1]) + ' ' + str(pmi))
    return result

pmi_list = list(map(calculate_pmi, skipgram_df.itertuples()))

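The function currently runs at about 483.18 it/s, which is very slow, since I have hundreds of thousands of skipgrams to iterate over. Any suggestions are welcome. Thank you.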

Best answer

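This is a good question and a good exercise for new pandas users. Use df.iterrows only as a last resort, and even then consider the alternatives. The cases where it is the right choice are relatively rare.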
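If a Python-level loop has to stay for some reason, one alternative worth considering is to stop scanning the unigram frame with a boolean mask for every skipgram. The sketch below is only an illustration of that idea, not part of the original answer: `unigram_prob` and `calculate_pmi_lookup` are hypothetical names, and the sample frames are rebuilt from the question so the block runs on its own.

```python
import math
import pandas as pd

# Sample data copied from the question.
unigram_df = pd.DataFrame({
    'word': ['we', 'investigated', 'the', 'potential', 'of'],
    'count': [109, 20, 1125, 36, 1122],
    'prob': [0.003615, 0.000663, 0.037315, 0.001194, 0.037215],
})
skipgram_df = pd.DataFrame({
    'word': [('we', 'investigated'), ('we', 'the'), ('we', 'potential'),
             ('investigated', 'the'), ('investigated', 'potential')],
    'count': [5, 31, 2, 11, 3],
    'prob': [0.000055, 0.000343, 0.000022, 0.000122, 0.000033],
})

# Build the word -> probability lookup once (hypothetical helper name),
# instead of filtering unigram_df inside the loop for every skipgram.
unigram_prob = dict(zip(unigram_df['word'], unigram_df['prob']))

def calculate_pmi_lookup(words, skipgram_prob):
    """Return 'x y pmi' for one skipgram, using dict lookups."""
    x, y = words
    pmi = math.log10(skipgram_prob / (unigram_prob[x] * unigram_prob[y]))
    return f'{x} {y} {pmi}'

# Iterate over plain column values rather than over DataFrame rows.
pmi_list = [calculate_pmi_lookup(w, p)
            for w, p in zip(skipgram_df['word'], skipgram_df['prob'])]
```

This keeps the output format of the question's function (one string per skipgram) while replacing each per-row scan of the unigram frame with a constant-time dictionary lookup. How much it helps on real data would need to be measured.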

Below is an example of how to vectorize the calculation.

import pandas as pd
import numpy as np

uni = pd.DataFrame([['we', 109, 0.003615], ['investigated', 20, 0.000663],
                    ['the', 1125, 0.037315], ['potential', 36, 0.001194],
                    ['of', 1122, 0.037215]], columns=['word', 'count', 'prob'])

skip = pd.DataFrame([[('we', 'investigated'), 5, 0.000055],
                     [('we', 'the'), 31, 0.000343],
                     [('we', 'potential'), 2, 0.000022],
                     [('investigated', 'the'), 11, 0.000122],
                     [('investigated', 'potential'), 3, 0.000033]],
                    columns=['word', 'count', 'prob'])

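# first split the column of tuples in skip into two columns
# (building one DataFrame from the list avoids creating a Series per row)
skip[['word1', 'word2']] = pd.DataFrame(skip['word'].tolist(), index=skip.index)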

# set index of uni to 'word'
uni = uni.set_index('word')

# look up prob1 & prob2 in uni (indexed by word) and add them to skip
skip['prob1'] = skip['word1'].map(uni['prob'])
skip['prob2'] = skip['word2'].map(uni['prob'])

# perform calculation and filter columns
# (log10 matches math.log10 in the question; np.log would give the natural log)
skip['result'] = np.log10(skip['prob'] / (skip['prob1'] * skip['prob2']))
skip = skip[['word', 'count', 'prob', 'result']]
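The steps above can also be wrapped into a reusable function. This is a minimal sketch under a few assumptions: the function name `add_pmi` and the column name `pmi` are made up for illustration, base 10 is used to match `math.log10` in the question, and every skipgram word is assumed to appear in the unigram frame (a missing word would produce NaN rather than an error).

```python
import numpy as np
import pandas as pd

def add_pmi(skipgram_df, unigram_df):
    """Return a copy of skipgram_df with a base-10 'pmi' column (hypothetical helper)."""
    # Series mapping each unigram to its probability.
    prob_by_word = unigram_df.set_index('word')['prob']
    # Split the tuple column into two plain columns in one step.
    words = pd.DataFrame(skipgram_df['word'].tolist(),
                         index=skipgram_df.index, columns=['x', 'y'])
    # Vectorized lookups and arithmetic over whole columns.
    px = words['x'].map(prob_by_word)
    py = words['y'].map(prob_by_word)
    out = skipgram_df.copy()
    out['pmi'] = np.log10(out['prob'] / (px * py))
    return out

# Sample data copied from the question.
unigram_df = pd.DataFrame({
    'word': ['we', 'investigated', 'the', 'potential', 'of'],
    'count': [109, 20, 1125, 36, 1122],
    'prob': [0.003615, 0.000663, 0.037315, 0.001194, 0.037215],
})
skipgram_df = pd.DataFrame({
    'word': [('we', 'investigated'), ('we', 'the'), ('we', 'potential'),
             ('investigated', 'the'), ('investigated', 'potential')],
    'count': [5, 31, 2, 11, 3],
    'prob': [0.000055, 0.000343, 0.000022, 0.000122, 0.000033],
})

result = add_pmi(skipgram_df, unigram_df)

# If the 'x y pmi' strings from the question are still needed,
# they can be assembled from the finished columns.
pmi_list = [f'{x} {y} {pmi}'
            for (x, y), pmi in zip(result['word'], result['pmi'])]
```

Returning a copy leaves the caller's frame untouched, and keeping the PMI as a numeric column makes sorting or filtering by PMI straightforward before any string formatting.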

Regarding "python - How can I improve performance when iterating over a pandas DataFrame?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48579924/