python - Beautiful Soup - Strange characters returned when stripping HTML tags

Reposted. Author: 行者123. Updated: 2023-12-01 09:31:15

I borrowed most of the code from this accepted Stack Overflow answer and put together the following (running on Python 2.7):

import sys

import SelectProxy
from bs4 import BeautifulSoup, NavigableString
import requests
import json

sys.path.append("G:\\Python27\\Kodi")

session = requests.Session()

url = 'http://www.tvguide.co.uk/mobile/channellisting.asp?ch=66'


headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.',
'Connection': 'keep-alive',
'Host': 'www.tvguide.co.uk',
'Referer': 'http://www.tvguide.co.uk/mobile/',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}

r = session.get(url, headers=headers)

print r.text



def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html, "lxml")

    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""

            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    # Recurse into nested tags; this re-parses the fragment
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)

            tag.replaceWith(s)

    return soup

invalid_tags = ['td', 'tr', 'div', 'a', 'span', 'br']
# Pass the downloaded page text (no variable named html exists at this level)
print strip_tags(r.text, invalid_tags)

...this does remove the tags, but now a lot of strange text is printed to the screen:

</body></html>
<html><body>

The latest national and international stories as they break

<html><body>
</body></html>
<html><body></body></html>
<html><body>Rating: <html><body>3.1</body></html></body></html>
</body></html>
</body></html>
</body></html>

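...can anyone tell me what I am doing wrong?

Thanks

Best answer

The tags can help you find the text you want. Most of the text on that page sits inside an HTML table, which can be extracted as follows: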

from bs4 import BeautifulSoup
import requests
import re

r = requests.get('http://www.tvguide.co.uk/mobile/channellisting.asp?ch=66')
soup = BeautifulSoup(r.text, "html.parser")

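# Walk every table row, skipping rows that contain a <script> tag
for tr in soup.select('table tr'):
    if not tr.script:
        # Collapse runs of whitespace in each cell, then join the cells
        print ' -'.join(re.sub(r'\s+', ' ', td.text) for td in tr.find_all('td'))

This gives you output that begins: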

6:00am - Breakfast A round-up of national and international news, plus sports reports, weather forecasts and arts and entertainment features. Including NewsWatch at 7.45 Rating: 1.4 
7:00am - Breakfast A round-up of national and international news, plus sports reports, weather forecasts and arts and entertainment features. Including NewsWatch at 7.45 Rating: 1.4
8:00am - Breakfast A round-up of national and international news, plus sports reports, weather forecasts and arts and entertainment features. Including NewsWatch at 7.45 Rating: 1.4
9:00am - BBC News The latest national and international stories as they break Rating: 3.1
10:00am - BBC News The latest national and international stories as they break Rating: 3.1
10:30am - The Travel Show 20/04/2018 Join the team on their journey of discovery as they explore new destinations around the globe and uncover hidden sides to some of the world's favourite holiday hotspots Rating: 4
11:00am - BBC News The latest national and international stories as they break Rating: 3.1
11:30am - Dateline London 21/04/2018 Foreign correspondents currently posted to London look at events in the UK through outsiders' eyes, and at how the issues of the week are being tackled around the world Rating: 6.3
12:00pm - BBC News The latest national and international stories as they break Rating: 3.1
12:30pm - Click 20/04/2018 A guide to the latest gadgets, websites, games and computer industry news Rating: 3.3
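
As for the stray `<html><body>` wrappers in the question's output, a likely cause is that the recursive `strip_tags` call re-parses each fragment with the `lxml` parser, which wraps a fragment in a full document. The following is a minimal sketch of an alternative, not part of the original answer: it parses once with `html.parser` and uses Beautiful Soup's `unwrap()` to drop a tag while keeping its contents. It is written for Python 3 (the code above targets Python 2.7), and the `sample` string is a made-up fragment for illustration only.

```python
from bs4 import BeautifulSoup


def strip_tags(html, invalid_tags):
    """Remove the tags named in invalid_tags but keep whatever they contain."""
    # Parse once; html.parser does not add <html>/<body> around a fragment.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(invalid_tags):
        # unwrap() replaces the tag with its own children, in place.
        tag.unwrap()
    return soup


# Hypothetical fragment, loosely modelled on the listing page's markup.
sample = '<div><p>Rating: <span>3.1</span></p><br/></div>'
result = str(strip_tags(sample, ['div', 'span', 'br']))
print(result)
```

Because nothing is re-parsed, no extra document wrappers are introduced, and tags that are not in `invalid_tags` (here `<p>`) are left alone.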

Regarding python - Beautiful Soup - Strange characters returned when stripping HTML tags, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49955536/
