
python - Pandas DataFrame: limit the number of rows with a common subset value

Reposted. Author: 行者123. Updated: 2023-12-02 19:48:57

I have this dataset (output as a .csv file):

email, link
0,,
1, hello@dog.com, dog.com
2, bark@dog.com, dog.com
3, growl@dog.com, dog.com
4, meow@cat.net, cat.net
5, purr@cat.net, cat.net,
6, sleep@cat.net, cat.net
7, scream@monkey.eu, monkey.eu
8, run@horse.com, horse.com

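As you can see, some links are repeated, while the emails are always unique. I want to keep at most 2 rows with the same link and drop the third and any later ones, like this: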

email, link
0,,
1, hello@dog.com, dog.com
2, bark@dog.com, dog.com
3, meow@cat.net, cat.net
4, purr@cat.net, cat.net,
5, scream@monkey.eu, monkey.eu
6, run@horse.com, horse.com

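How can I do this? I tried the solution below, but it only outputs the links. Because the resulting list has a different length from the DataFrame, merging it back with the email addresses scrambles everything: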

from collections import Counter

def keep_n_dupes(remove_from, how_many):
    # Yield each item until it has been seen how_many times
    counts = Counter()
    for item in remove_from:
        counts[item] += 1
        if counts[item] <= how_many:
            yield item

new_links = list(keep_n_dupes(df['link'], 2))
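
The length mismatch described above comes from yielding only the link values. As a rough sketch (not part of the original question), the same counting idea can produce one boolean per row, so the result lines up with the DataFrame. The helper name `keep_n_mask` is hypothetical, and the sample frame assumes the eight non-empty rows shown earlier with a default integer index:

```python
from collections import Counter

import pandas as pd


def keep_n_mask(values, how_many):
    """Return one boolean per value: True for the first `how_many` occurrences."""
    counts = Counter()
    mask = []
    for item in values:
        counts[item] += 1
        mask.append(counts[item] <= how_many)
    return mask


# Sample data assumed from the question (empty first row left out)
df = pd.DataFrame({
    "email": [
        "hello@dog.com", "bark@dog.com", "growl@dog.com",
        "meow@cat.net", "purr@cat.net", "sleep@cat.net",
        "scream@monkey.eu", "run@horse.com",
    ],
    "link": [
        "dog.com", "dog.com", "dog.com",
        "cat.net", "cat.net", "cat.net",
        "monkey.eu", "horse.com",
    ],
})

# The mask has the same length as df, so whole rows are kept or dropped
limited = df.loc[keep_n_mask(df["link"], 2)]
print(limited)
```

Because the mask has exactly one entry per row, the email column stays attached to its link and nothing needs to be merged back afterwards.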

Best answer

Use groupby.head:

df.groupby('link').head(2)

              email       link
0     hello@dog.com    dog.com
1      bark@dog.com    dog.com
3      meow@cat.net    cat.net
4      purr@cat.net    cat.net
6  scream@monkey.eu  monkey.eu
7     run@horse.com  horse.com
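
For anyone who wants to reproduce the result, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the rows in the question (an assumption, since the original CSV is not available), and the `cumcount` variant is shown only as an equivalent alternative, not as something the answer itself proposes:

```python
import pandas as pd

# Sample data assumed from the question (empty first row left out)
df = pd.DataFrame({
    "email": [
        "hello@dog.com", "bark@dog.com", "growl@dog.com",
        "meow@cat.net", "purr@cat.net", "sleep@cat.net",
        "scream@monkey.eu", "run@horse.com",
    ],
    "link": [
        "dog.com", "dog.com", "dog.com",
        "cat.net", "cat.net", "cat.net",
        "monkey.eu", "horse.com",
    ],
})

# Keep the first two rows of every link group; original order and index are preserved
limited = df.groupby("link").head(2)

# Equivalent boolean mask: cumcount numbers the rows inside each group from 0
mask = df.groupby("link").cumcount() < 2
same = df[mask]

print(limited)
```

The kept rows retain their original index labels (0, 1, 3, 4, 6, 7). If consecutive numbering is wanted, as in the desired output in the question, `limited.reset_index(drop=True)` renumbers them.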

This question, "python - Pandas DataFrame: limit the number of rows with a common subset value", comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/58692016/
