python - How do I remove all alignment and indentation from BeautifulSoup output using Python?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 17:53:53

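I'm trying to pull information out of a number of different tables on an HTML page, without any of the HTML's indentation or tab formatting. get_text gives me the content I want, but it prints with a lot of blank space and tabs. I tried .strip, but that didn't do what I wanted.

Here is the Python script I'm using: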

import urllib.request

from bs4 import BeautifulSoup

url = "http://www.thecomedystudio.com/schedule.html"
response = urllib.request.urlopen(url)
# Parse the page and dump all of its text, whitespace included
soup = BeautifulSoup(response.read(), "html.parser")
text = soup.get_text()
print(text)

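Ultimately I'd like to create a csv of the events calendar, but first I'd like to produce a .txt or something else that doesn't need much manual cleanup.

Any help is appreciated.

Best answer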

You don't need to "clean up" the HTML in order to parse it with BeautifulSoup.
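
As an aside on the whitespace itself, here is a minimal sketch that is not part of the original answer. It shows why calling .strip() on the whole get_text() result only trims the two ends, and how stripped_strings yields each text fragment already trimmed. The HTML snippet is made up for illustration and is only assumed to resemble the indentation on the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the heavily indented page
html = """
<table>
    <tr>
        <td>   Fri 2.27.15   </td>
        <td>
            Rick Canavan hosts.
        </td>
    </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# .strip() only trims the ends; the inner newlines and spaces survive
raw = soup.get_text().strip()

# stripped_strings yields every non-empty text node with its
# leading and trailing whitespace removed
lines = list(soup.stripped_strings)

print("\n".join(lines))
```

If one plain-text line per fragment is enough, writing "\n".join(lines) to a .txt file may already be the low-cleanup output the question asks for.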

Parse the dates and events directly into a csv file:

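import csv
from urllib.request import urlopen

from bs4 import BeautifulSoup


url = "http://www.thecomedystudio.com/schedule.html"
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(urlopen(url), "html.parser")

# newline='' lets the csv module control line endings itself
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)

    # Select the bold tag directly inside each centered div within a table cell (the date)
    for item in soup.select('td div[align=center] > b'):
        # Join the text fragments of the date with a single space
        date = ' '.join(el.strip() for el in item.find_all(string=True))
        # The event description is in the next cell of the same row
        event = item.parent.parent.find_next_sibling('td').get_text(strip=True)

        writer.writerow([date, event])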

Contents of output.csv after running the script:

Fri 2.27.15,"Rick Canavan hosts with Christine An, Rachel Bloom, Dan Crohn, Wes Hazard, James Huessy, Kelly MacFarland, Peter Martin, Ted Pettingell."
Sat 2.28.15,"Rick Jenkins hosts Taylor Connelly, Lilian DeVane, Andrew Durso, Nate Johnson, Peter Martin, Andrew Mayer, Kofi Thomas, Tim Willis."
Sun 3.1.15,"Peter Martin hosts Sunday Funnies with Nonye Brown-West, Ryan Donahue, Joe Kozlowski, Casey Malone, Etrane Martinez, Kwasi Mensah, Anthony Zonfrelli, Christa Weiss and Sam Jay closing."
Tue 3.3.15,Mystery Lounge! The old-est and only-est magic show in New England! with guest comedian Ryan Donahue.
...
Thu 12.31.15,"New Year's Eve! with Rick Jenkins, Nathan Burke."
Fri 1.1.16,Rick Canavan hosts New Year's Day.
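
To try the same selector logic without fetching the live page, the sketch below runs it against a small inline HTML string and writes the CSV to an in-memory buffer. The markup is an assumption inferred from the selector (a bold date inside a centered div, followed by a sibling cell holding the event); it is not copied from the real schedule page, and the shortened event texts are placeholders.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup, assumed from the selector used in the answer
html = """
<table>
  <tr>
    <td><div align="center"><b>Fri<br>2.27.15</b></div></td>
    <td>Rick Canavan hosts with Christine An, Rachel Bloom.</td>
  </tr>
  <tr>
    <td><div align="center"><b>Sat<br>2.28.15</b></div></td>
    <td>
        Rick Jenkins hosts Taylor Connelly.
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.select('td div[align=center] > b'):
    # The <br> splits the date into two text nodes; join them with a space
    date = ' '.join(el.strip() for el in item.find_all(string=True))
    # b -> div -> td, then the next <td> in the same row holds the event
    event = item.parent.parent.find_next_sibling('td').get_text(strip=True)
    rows.append([date, event])

# Write to an in-memory buffer instead of output.csv
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)

print(buffer.getvalue())
```

Events that contain commas come out quoted, which is the csv module's default behavior and matches the quoting seen in the output above.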

Regarding "python - How do I remove all alignment and indentation from BeautifulSoup output using Python?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/28775044/
