
python - How to scrape web pages using a search term without knowing the tags/classes?

Reposted. Author: 行者123. Updated: 2023-11-28 16:57:57

I'm working on a project that uses Python (3.7) and BeautifulSoup (4) to implement a scraping solution.

Note: I have searched for a solution to my problem but couldn't find one, because it differs from what we usually need for scraping. So please don't mark this as a duplicate!

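The project has two parts:

  1. We get the URLs of the Google search results for a search term (for example, the top 5).
  2. Then we have to scrape those result URLs to get information related to the search term from those pages, so we don't know the actual classes/tags used on those result pages.

So, how can we scrape information related to the search term from a web page without knowing the actual tags/classes?

Here is what I have done so far: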

from bs4 import BeautifulSoup
from bs4.element import Tag

# `driver` is assumed to be a Selenium WebDriver that has already loaded the Google results page
soup = BeautifulSoup(driver.page_source, 'lxml')
result_div = soup.find_all('div', attrs={'class': 'g'})

links = []
titles = []
descriptions = []
for r in result_div:
    # Try to extract each element; on any error, print it and move on
    try:
        link = r.find('a', href=True)
        title = r.find('h3')

        if isinstance(title, Tag):
            title = title.get_text()

        description = r.find('span', attrs={'class': 'st'})

        if isinstance(description, Tag):
            description = description.get_text()

        # Check to make sure everything is present before appending
        # (find() returns None for a missing element, never '')
        if link is not None and title and description:
            links.append(link['href'])
            titles.append(title)
            descriptions.append(description)
    # Next loop if one element is not present
    except Exception as e:
        print(e)
        continue

Best answer

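It is easy to find the elements in an HTML string that contain a keyword or match a regular expression. Here is how you can do it.

This returns every element in the HTML page that contains the keyword you are looking for.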

from bs4 import BeautifulSoup
import re

html_text = """
<h2>some other text</h2>
<p>text you want to find with keyword</p>
<h1>foo bar foo bar</h1>
<h2>text you want to find with keyword</h2>
<a href="someurl">No idea what is going on</a>
<div> text you want to find with keyword</div>
"""

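# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_text, 'html.parser')

# No spaces around "|": inside the pattern they would be matched literally
pattern = re.compile(r'\bkeyword\b|\bkey_word\b|\bsomething else\b|\bone_more_maybe\b')
for elem in soup(string=pattern):  # `string=` was called `text=` before Beautiful Soup 4.4.0
    print(elem.parent)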
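
For the second part of the project, where the search terms are only known at run time, the same idea can be wrapped in a helper. The following is a minimal sketch, not part of the original answer: the function name `find_elements_with_terms` and the sample HTML are hypothetical, and it assumes the page HTML has already been downloaded into a string. It escapes each term, matches case-insensitively, and skips `<script>` and `<style>` contents so only page text is searched.

```python
import re
from bs4 import BeautifulSoup


def find_elements_with_terms(html, terms):
    """Return the parent tags of text nodes that mention any search term.

    Hypothetical helper for illustration; `terms` is a list of plain strings.
    """
    # re.escape keeps characters such as "." or "?" in a term literal.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    soup = BeautifulSoup(html, "html.parser")
    # Remove script/style tags so their contents are not searched as text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    matches = []
    seen = set()
    for node in soup.find_all(string=pattern):
        parent = node.parent
        # Track identity so a tag with several matching text nodes appears once.
        if parent is not None and id(parent) not in seen:
            seen.add(id(parent))
            matches.append(parent)
    return matches


if __name__ == "__main__":
    sample_html = """
    <h2>some other text</h2>
    <p>text you want to find with Keyword</p>
    <script>var keyword = 1;</script>
    """
    for tag in find_elements_with_terms(sample_html, ["keyword", "key_word"]):
        print(tag.name, "->", tag.get_text(strip=True))
```

Because the pattern relies on `\b`, terms that begin or end with a non-word character (for example `C++`) may not match as expected and would need a different boundary rule.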

Regarding "python - How to scrape web pages using a search term without knowing the tags/classes?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56573937/
