
python - Beautiful Soup returns a closing tag instead of the tag text

Reposted. Author: 太空宇宙. Updated: 2023-11-04 02:48:25

I have the following RSS feed (SoundCloud), http://feeds.soundcloud.com/users/soundcloud:users:7393028/sounds.rss :

<item>
<pubDate>Mon, 05 Jun 2017 00:00:00 +0000</pubDate>
<link>https://example.com</link>
</item>

I tried to get the content of the link tag with the following:

from bs4 import BeautifulSoup

# response holds the raw feed content fetched earlier
soup = BeautifulSoup(response, "lxml")

items = soup.findAll("item")
for i in items:
    print(i)
    created_at = i.find('pubdate')
    created_at = created_at.contents[0][:16]

    url = i.find('link')

Printing url gives:

<link/>

If I try url = i.find('link').string or url = i.find('link').content

I get

None

When I print the item i, it prints a closing tag for the link first:

<link/>https://soundcloud.com/daptone-records/sharon-jones-the-dap-kings-white-christmas 00:02:23 Daptone Records no The first holiday album from Sharon Jones and the Dap-Kings is out now!

How can I get the link correctly?

Best answer

You can do something like this, which will do the job:

from bs4 import BeautifulSoup as bs
from urllib.request import urlopen

url = 'http://feeds.soundcloud.com/users/soundcloud:users:7393028/sounds.rss'
data = urlopen(url).read()

# The 'xml' parser (provided by lxml) keeps <link>...</link> intact
parsed = bs(data, 'xml')
items = parsed.findAll('item')

for k in items:
    # Access the tags inside each item tag as attributes
    print("Link:", k.link.text)
    print("pubDate:", k.pubDate.text)
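
If an XML parser for BeautifulSoup is not available, roughly the same extraction can be sketched with the standard library's xml.etree.ElementTree. This is a swapped-in alternative, not part of the original answer. SAMPLE_FEED and read_items are hypothetical names, and the inline feed is a made-up stand-in for the real one so the example runs without network access.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline feed, shaped like the <item> excerpt in the question.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item>
<pubDate>Mon, 05 Jun 2017 00:00:00 +0000</pubDate>
<link>https://example.com/track-1</link>
</item>
</channel></rss>"""


def read_items(xml_text):
    """Return a (link, pubDate) tuple for every <item> in the feed."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext("link"), item.findtext("pubDate"))
        for item in root.iter("item")
    ]


for link, pub_date in read_items(SAMPLE_FEED):
    print("Link:", link)
    print("pubDate:", pub_date)
```

Because this is a real XML parser, tag names stay case-sensitive (pubDate, not pubdate) and link keeps its text.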

Edit: using lxml

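When I tried to parse the <link>...</link> tag with BeautifulSoup and lxml, I got an invalid tag. The "lxml" parser is an HTML parser, and HTML treats link as an empty (void) element, so each link comes out as <link/> with the URL left outside the tag, and BeautifulSoup cannot read its data.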

So a simple hack is to use a regex. Here is an example:

from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import re

url = 'http://feeds.soundcloud.com/users/soundcloud:users:7393028/sounds.rss'
data = urlopen(url).read()

soup = bs(data, 'lxml')
aa = soup.findAll('item')

for k in aa:
    # The URL follows the empty <link/> tag, so capture up to the next whitespace
    link = re.findall(r'<link/>(.*?)\s+', str(k))
    pubdate = k.find('pubdate').string
    print("Link: {}\npubDate: {}".format(' '.join(link), pubdate))

Both approaches output:

Link: https://soundcloud.com/daptone-records/move-upstairs
pubDate: Tue, 21 Mar 2017 20:30:49 +0000
...
Link: https://soundcloud.com/daptone-records/the-frightnrs-id-rather-go-blind-1
pubDate: Sun, 28 Jun 2015 00:00:00 +0000
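
As an alternative to the regex, the URL can also be read from the text node that follows the empty <link/> tag. This is a minimal sketch, assuming the "lxml" HTML parser leaves the URL as the next sibling of <link/>, as in the output shown in the question. SAMPLE_FEED and extract_links are hypothetical names, and the inline feed is a made-up stand-in so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical inline feed, shaped like the <item> excerpt in the question.
SAMPLE_FEED = """<rss><channel>
<item>
<pubDate>Mon, 05 Jun 2017 00:00:00 +0000</pubDate>
<link>https://example.com/track-1</link>
</item>
</channel></rss>"""


def extract_links(markup):
    """Collect the text that follows each item's empty <link/> tag."""
    soup = BeautifulSoup(markup, "lxml")
    links = []
    for item in soup.find_all("item"):
        link_tag = item.find("link")
        sibling = link_tag.next_sibling if link_tag is not None else None
        # NavigableString subclasses str; skip the item if a tag follows instead.
        if isinstance(sibling, str):
            links.append(sibling.strip())
    return links


print(extract_links(SAMPLE_FEED))
```

This avoids turning the item back into a string, but it still depends on how the HTML parser happens to place the URL, so the 'xml' parser remains the more direct fix.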

Regarding "python - Beautiful Soup returns a closing tag instead of the tag text", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44534715/
