I have this dataset (output as a .csv file):
email, link
0,,
1, hello@dog.com, dog.com
2, bark@dog.com, dog.com
3, growl@dog.com, dog.com
4, meow@cat.net, cat.net
5, purr@cat.net, cat.net,
6, sleep@cat.net, cat.net
7, scream@monkey.eu, monkey.eu
8, run@horse.com, horse.com
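As you can see, some links repeat, while the emails are always unique. I want to keep at most 2 rows for each link and drop the third and any later rows, like this: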
email, link
0,,
1, hello@dog.com, dog.com
2, bark@dog.com, dog.com
3, meow@cat.net, cat.net
4, purr@cat.net, cat.net,
5, scream@monkey.eu, monkey.eu
6, run@horse.com, horse.com
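How can I do this? I tried this solution, but it only outputs the links. Since the resulting subset (a list) has a different length, merging it back with the email addresses gets confusing: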
from collections import Counter

def keep_n_dupes(remove_from, how_many):
    # Count how often each item has been seen so far.
    counts = Counter()
    for item in remove_from:
        counts[item] += 1
        # Yield an item only for its first `how_many` occurrences.
        if counts[item] <= how_many:
            yield item

new_links = list(keep_n_dupes(df['link'], 2))
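For illustration, the generator above can be adapted to return one boolean per row instead of the surviving links, so the result stays aligned with the email column. This is only a sketch under assumptions: `keep_mask` is a hypothetical helper name, and the DataFrame is rebuilt by hand from the sample data without the empty first row.

```python
from collections import Counter

import pandas as pd


def keep_mask(values, how_many):
    """Return one bool per value: True for its first `how_many` occurrences."""
    counts = Counter()
    mask = []
    for value in values:
        counts[value] += 1
        mask.append(counts[value] <= how_many)
    return mask


# Sample data from the question (assumed here without the empty row).
df = pd.DataFrame({
    "email": ["hello@dog.com", "bark@dog.com", "growl@dog.com",
              "meow@cat.net", "purr@cat.net", "sleep@cat.net",
              "scream@monkey.eu", "run@horse.com"],
    "link": ["dog.com", "dog.com", "dog.com",
             "cat.net", "cat.net", "cat.net",
             "monkey.eu", "horse.com"],
})

# Wrap the mask in a Series that shares df's index, then filter whole rows.
mask = pd.Series(keep_mask(df["link"], 2), index=df.index)
filtered = df[mask]
print(filtered)
```

Because the mask has exactly one entry per row, both columns are filtered together and no merge is needed afterwards.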
Best answer
Use groupby.head:
df.groupby('link').head(2)
              email       link
0     hello@dog.com    dog.com
1      bark@dog.com    dog.com
3      meow@cat.net    cat.net
4      purr@cat.net    cat.net
6  scream@monkey.eu  monkey.eu
7     run@horse.com  horse.com
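A self-contained version of this answer might look like the sketch below. The DataFrame is an assumption: it is rebuilt by hand from the sample data, without the empty first row, so that the index runs from 0 to 7 as in the output above. The `cumcount` variant is an alternative spelling of the same filter, not part of the original answer.

```python
import pandas as pd

# Sample data from the question (assumed here without the empty row).
df = pd.DataFrame({
    "email": ["hello@dog.com", "bark@dog.com", "growl@dog.com",
              "meow@cat.net", "purr@cat.net", "sleep@cat.net",
              "scream@monkey.eu", "run@horse.com"],
    "link": ["dog.com", "dog.com", "dog.com",
             "cat.net", "cat.net", "cat.net",
             "monkey.eu", "horse.com"],
})

# Keep the first two rows of every link group; the original row order
# and index labels are preserved.
kept = df.groupby("link").head(2)

# Alternative: number the rows within each group (0, 1, 2, ...) and keep
# those numbered below 2.
kept_alt = df[df.groupby("link").cumcount() < 2]

print(kept)
```

The `cumcount` form can be handy when the per-group position is needed for something else as well, for example to keep rows 2 through 4 of each group.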
Regarding python - Pandas dataframe: limit the number of rows with a common subset value, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58692016/