
python - Get HTML href links whose text matches a string from a list of strings with Beautiful Soup

Reposted. Author: 行者123. Updated: 2023-12-02 00:48:25

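I am trying to get URLs from a web page that contains a list of URLs. I don't want all of the URLs, only those whose link text matches a string in my list. The list of strings is a subset of the link texts on the page, which I obtained by scraping the page and removing the texts I don't want. The list of strings is stored in filenames.
I am trying to extract the links whose text is one of the strings in the list. The following returns an empty list: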

import requests
from bs4 import BeautifulSoup

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html5lib')
links = soup.findAll('a', string = filenames[0])
file_links = [link['href'] for link in links if "export" in link['href']]
The tags look like this:
<p><a href="https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi">
ECZ Mathematics Paper 2 2019.</a></p>

<p><a href="https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf">
ECZ Mathematics Paper 1 2019.</a></p>

<p><a href="https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp">
ECZ Science Paper 3 2009.</a></p>

<p><a href="https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc">
ECZ Civic Education Paper 2 2009.</a></p>
I want the href links for the first three but not the last one, because the string 'ECZ Civic Education Paper 2 2009.' is not part of my list of strings.
My list of strings looks like this:

filenames = ['ECZ Mathematics Paper 2 2019.', 'ECZ Mathematics Paper 2 2019.',
'ECZ Science Paper 3 2009.']
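I only want the first three links, because their link text is in my list (filenames). I don't want the fourth link, because the text next to its href (ECZ Civic Education Paper 2 2009.) is not in my list; I don't want to download that file.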

Best answer

Try this approach and see if it works:

from bs4 import BeautifulSoup as bs

html = """
<p><a href="https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi">
ECZ Mathematics Paper 2 2019.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf">
ECZ Mathematics Paper 1 2019.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp">
ECZ Science Paper 3 2009.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc">
ECZ Civic Education Paper 2 2009.</a></p>
"""
filenames = ['ECZ Mathematics Paper 2 2019.', 'ECZ Mathematics Paper 2 2019.',
'ECZ Science Paper 3 2009.']

soup = bs(html, 'html5lib')

all_links = soup.findAll('a')

# Compare each link's text, with surrounding whitespace removed, to every wanted name
for link in all_links:
    for nam in filenames:
        if link.text.strip() == nam:
            print(link['href'])

Output:
https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi
https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi
https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp
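The first link is printed twice because 'ECZ Mathematics Paper 2 2019.' appears twice in filenames. The question's string = filenames[0] lookup most likely returns nothing because each link's text starts with a newline, so it never equals the list entry exactly. Below is a minimal sketch of a variant, not the answerer's code, that strips the link text and tests membership in a set, so each matching link is collected once. It assumes the built-in 'html.parser' in place of html5lib to avoid an extra dependency. The names wanted, exact, and file_links are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<p><a href="https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi">
ECZ Mathematics Paper 2 2019.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf">
ECZ Mathematics Paper 1 2019.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp">
ECZ Science Paper 3 2009.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc">
ECZ Civic Education Paper 2 2009.</a></p>
"""
filenames = ['ECZ Mathematics Paper 2 2019.', 'ECZ Mathematics Paper 2 2019.',
             'ECZ Science Paper 3 2009.']

# Assumption: 'html.parser' (standard library) instead of 'html5lib'.
soup = BeautifulSoup(html, 'html.parser')

# Exact string matching compares against the raw text, which begins with "\n",
# so this lookup finds no tags.
exact = soup.find_all('a', string=filenames[0])

# A set removes the duplicate name and gives fast membership tests.
wanted = set(filenames)

# Keep the href of every <a> whose whitespace-stripped text is a wanted name.
file_links = [a['href'] for a in soup.find_all('a', href=True)
              if a.get_text(strip=True) in wanted]

for href in file_links:
    print(href)
```

Iterating over the links once and checking a set keeps the links in page order and avoids the nested loop over filenames.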

This question, "python - Get HTML href links whose text matches a string from a list of strings with Beautiful Soup", corresponds to a similar question found on Stack Overflow: https://stackoverflow.com/questions/59620715/
