
google-search - Google: get the search "featured snippet"?

Reposted. Author: 行者123. Updated: 2023-12-05 06:39:59

How can I extract the featured snippet from a Google search results page?

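Best answer

If you want to scrape Google search result snippets, you can use the BeautifulSoup web scraping library, but with this solution you will run into problems if you send a large number of requests.

You can try to work around blocking by adding headers that specify a user-agent. This helps Google recognize the request as coming from a user rather than from a bot, so that it is less likely to be blocked: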

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

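An additional step could be to rotate user-agents.

The code example below shows a solution that uses pagination to get more results. You can paginate through all pages with an infinite while loop. Pagination continues as long as the "next" button exists, which is determined by the presence of its selector on the page (in our case the CSS selector ".d6cvqb a[id=pnnext]"). If the button exists, increase params["start"] by 10 to reach the next page; otherwise, exit the while loop: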
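A minimal sketch of user-agent rotation, assuming a small hand-maintained pool of strings. The name random_headers and the second and third pool entries are illustrative and not part of the original answer; replace them with real, current user-agent values.

```python
import random

# Hypothetical pool: the first string comes from the answer above,
# the other two are illustrative placeholders.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
]


def random_headers(rng=random):
    """Return a headers dict with a user-agent picked at random from the pool."""
    return {"User-Agent": rng.choice(USER_AGENTS)}
```

It could then be passed per request, for example requests.get(url, params=params, headers=random_headers(), timeout=30), so that consecutive requests do not all carry the same user-agent.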

if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
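The same pagination idea can be sketched separately from the HTTP request, with an upper bound on the number of pages so the loop cannot run indefinitely. Here paginate, fetch_page, max_pages and fake_fetch are hypothetical names introduced for illustration; fetch_page(start) is assumed to return the results of one page together with a flag saying whether a next page exists.

```python
def paginate(fetch_page, page_size=10, max_pages=5):
    """Collect results page by page until there is no next page or max_pages is reached.

    fetch_page(start) is a hypothetical callable returning (results, has_next)
    for the given result offset.
    """
    collected = []
    start = 0
    for _ in range(max_pages):
        results, has_next = fetch_page(start)
        collected.extend(results)
        if not has_next:
            break
        start += page_size  # same step as params["start"] += 10
    return collected


# Stand-in for a real request: three fake pages with two results each.
def fake_fetch(start):
    pages = {0: ["a", "b"], 10: ["c", "d"], 20: ["e", "f"]}
    return pages[start], start < 20
```

In a real scraper, fetch_page would send the request with the given offset and report whether the "next" button selector was found on the page.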

Check the code in the online IDE.

from bs4 import BeautifulSoup
import requests, json, lxml  # lxml is imported only to make sure the parser is installed

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "python",   # query example
    "hl": "en",      # language
    "gl": "us",      # country of the search, US -> USA
    "start": 0,      # result offset: 0 is the first page, 10 the second, and so on
    # "num": 100     # maximum number of results to return
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

page_num = 0

website_data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf").text
        except AttributeError:  # select_one() returned None: this result has no snippet
            snippet = None

        website_data.append({
            "title": title,
            "snippet": snippet
        })

    # move to the next page while the "next" button exists, otherwise stop
    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break

print(json.dumps(website_data, indent=2, ensure_ascii=False))
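The try/except around the snippet lookup guards against select_one() returning None. A small helper can express the same check for any element, including the title; text_or_none is a hypothetical name, not part of the original answer, and the sketch assumes the argument is either None or a BeautifulSoup element.

```python
def text_or_none(node):
    """Return the stripped text of a parsed element, or None if it was not found."""
    return node.get_text(strip=True) if node is not None else None
```

With it, the lookup could read snippet = text_or_none(result.select_one(".lEBKkf")), and a result without a title element would yield None instead of raising an AttributeError.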

Example output:

[
  {
    "title": "Welcome to Python.org",
    "snippet": "The official home of the Python Programming Language."
  },
  {
    "title": "Python (programming language) - Wikipedia",
    "snippet": "Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation."
  },
  {
    "title": "Python Courses & Tutorials - Codecademy",
    "snippet": "Python is a general-purpose, versatile, and powerful programming language. It's a great first language because Python code is concise and easy to read."
  },
  {
    "title": "Python - GitHub",
    "snippet": "Repositories related to the Python Programming language - Python. ... Collection of library stubs for Python, with static types. Python 3.3k 1.4k."
  },
  {
    "title": "Learn Python - Free Interactive Python Tutorial",
    "snippet": "learnpython.org is a free interactive Python tutorial for people who want to learn Python, fast."
  },
  # ...
]

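Alternatively, you can use the Google Search Engine Results API from SerpApi. It is a paid API with a free plan. The difference is that it bypasses blocks from Google (including CAPTCHA), so you do not need to create and maintain a parser.

Code example: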

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

params = {
    "api_key": os.getenv("API_KEY"),  # serpapi key
    "engine": "google",               # serpapi parser engine
    "q": "python",                    # search query
    "num": "100"                      # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)         # where data extraction happens

organic_results_data = []

while True:
    results = search.get_dict()       # JSON -> Python dictionary

    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet")
        })

    # if there is a next page, copy its query parameters into the search; otherwise stop
    if "next_link" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))

The output is exactly the same as in the bs4 answer.

Regarding google-search - Google: get the search "featured snippet"?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43977215/
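The pagination step in the SerpApi example packs three standard-library calls into one line. The sketch below unpacks it using only urllib.parse; next_page_params is a hypothetical helper name, and the example URL is made up for illustration rather than taken from a real response.

```python
from urllib.parse import urlsplit, parse_qsl


def next_page_params(next_link):
    """Turn the query string of a pagination URL into a dict of request parameters."""
    query = urlsplit(next_link).query   # e.g. "engine=google&q=python&start=100&num=100"
    return dict(parse_qsl(query))       # list of (key, value) pairs -> dict


# Hypothetical next_link value, for illustration only.
example = "https://serpapi.com/search.json?engine=google&q=python&start=100&num=100"
print(next_page_params(example))
# {'engine': 'google', 'q': 'python', 'start': '100', 'num': '100'}
```

All values come back as strings, which is why the result can be merged into the existing parameters with dict.update() without further conversion.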
