
python - Beautiful Soup - How to clean up extracted data?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 17:29:51

My question is really trivial, but as a Python beginner I still can't find an answer.

I use the following code to extract some data from the web:

from bs4 import BeautifulSoup
import urllib2

teams = ("http://walterfootball.com/fantasycheatsheet/2015/traditional")
page = urllib2.urlopen(teams)
soup = BeautifulSoup(page, "html.parser")

f = open('output.txt', 'w')

nfl = soup.findAll('li', "player")
lines = [span.get_text(strip=True) for span in nfl]

lines = str(lines)
f.write(lines)
f.close()

But the output is very messy.

Is there an elegant way to get a result like this?

1. Eddie Lacy, RB, Green Bay Packers. Bye: 7 $60
2. LeVeon Bell, RB, Pittsburgh Steelers. Bye: 11 $60
3. Marshawn Lynch, RB, Seattle Seahawks. Bye: 9 $59
...

Best answer

Just use str.join on the list, and strip the trailing + with .rstrip("+"):

nfl = soup.findAll('li', "player")
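# Build a list rather than a one-shot generator, so that lines can still be
# iterated after the print call when writing the file later
lines = ["{}. {}\n".format(ind, span.get_text(strip=True).rstrip("+"))
         for ind, span in enumerate(nfl, 1)]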
print("".join(lines))

This gives you:

1. Eddie Lacy, RB, Green Bay Packers. Bye: 7$60
2. LeVeon Bell, RB, Pittsburgh Steelers. Bye: 11$60
3. Marshawn Lynch, RB, Seattle Seahawks. Bye: 9$59
4. Adrian Peterson, RB, Minnesota Vikings. Bye: 5$59
5. Jamaal Charles, RB, Kansas City Chiefs. Bye: 9$54
..................
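
The numbering step can be tried offline, without fetching the page. The following is a minimal sketch, not the answerer's code: raw_items is a hypothetical stand-in for the strings that span.get_text(strip=True) returns, the trailing + on each entry is an assumption inferred from the .rstrip("+") call above, and number_lines is an illustrative helper name.

```python
# Hypothetical sample data standing in for the scraped <li class="player"> text.
# The trailing "+" is an assumption based on the .rstrip("+") call in the answer.
raw_items = [
    "Eddie Lacy, RB, Green Bay Packers. Bye: 7$60+",
    "LeVeon Bell, RB, Pittsburgh Steelers. Bye: 11$60+",
    "Marshawn Lynch, RB, Seattle Seahawks. Bye: 9$59+",
]


def number_lines(items):
    """Strip trailing '+' characters and prefix each item with a 1-based index."""
    return ["{}. {}\n".format(ind, text.rstrip("+"))
            for ind, text in enumerate(items, 1)]


numbered = number_lines(raw_items)
# Each entry already ends with a newline, so join with an empty separator.
print("".join(numbered), end="")
```

Returning a list keeps the result reusable: it can be printed first and written to a file afterwards.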

To separate the price, we can split, or use re.sub to add a space before the dollar sign, and write each line:

import re

with open('output.txt', 'w') as f:
    for line in lines:
        # Insert a space before the trailing "$<digits>" price
        line = re.sub(r"(\$\d+)$", r" \1", line, count=1)
        f.write(line)

The output is now:

1. Eddie Lacy, RB, Green Bay Packers. Bye: 7 $60
2. LeVeon Bell, RB, Pittsburgh Steelers. Bye: 11 $60
3. Marshawn Lynch, RB, Seattle Seahawks. Bye: 9 $59
4. Adrian Peterson, RB, Minnesota Vikings. Bye: 5 $59
5. Jamaal Charles, RB, Kansas City Chiefs. Bye: 9 $54
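
To check the substitution on its own, here is a small sketch. add_price_space and PRICE_RE are hypothetical names, not part of the original answer. The sketch relies on the fact that $ at the end of a pattern also matches just before a final newline, which is why lines that still end in \n are handled.

```python
import re

# Matches a trailing price such as "$60". The final "$" anchor also matches
# just before a newline at the very end of the string.
PRICE_RE = re.compile(r"(\$\d+)$")


def add_price_space(line):
    """Insert one space before the trailing "$<digits>" price, if present."""
    return PRICE_RE.sub(r" \1", line, count=1)


print(add_price_space("1. Eddie Lacy, RB, Green Bay Packers. Bye: 7$60\n"), end="")
print(add_price_space("no price here"))
```

Lines without a trailing price pass through unchanged, so the helper is safe to apply to every line.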

You can do the same with str.rsplit, splitting once on $ and rejoining with a space:

with open('output.txt', 'w') as f:
    for line in lines:
        # Split once on the last "$" and rejoin with a space before it
        line, p = line.rsplit("$", 1)
        f.write("{} ${}".format(line, p))
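
One caveat with the unpacking above: if a line contains no $, rsplit returns a single element and the two-name assignment raises ValueError. A hedged variant using str.rpartition (a swapped-in method, not the one from the answer) leaves such lines unchanged. split_price is a hypothetical helper name.

```python
def split_price(line):
    """Put a space before the last "$" in line; return line unchanged if none."""
    head, sep, tail = line.rpartition("$")
    if not sep:  # no "$" found: rpartition returns ("", "", line)
        return line
    return "{} ${}".format(head, tail)


print(split_price("2. LeVeon Bell, RB, Pittsburgh Steelers. Bye: 11$60"))
print(split_price("no price here"))
```

Whether the guard is worth having depends on the data: if every scraped entry is known to carry a price, the shorter rsplit form is enough.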

Regarding python - Beautiful Soup - How to clean up extracted data?, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/32046792/
