
python - How to extract descriptions from Google search using Python?

Reposted. Author: 行者123. Updated: 2023-12-05 07:39:57

I want to extract the descriptions from Google search results. Right now I have this code:

# Python 3 version; cssselect() needs the "cssselect" package installed
from urllib.parse import urlparse, parse_qs

from lxml.html import fromstring
from requests import get


url = 'https://www.google.com/search?q=Gotham'
raw = get(url).text
pg = fromstring(raw)
for result in pg.cssselect(".r a"):
    url = result.get("href")
    if url.startswith("/url?"):
        # Google wraps result links as /url?q=<target>; unwrap the target
        url = parse_qs(urlparse(url).query)['q']
        print(url[0])

This extracts the URLs related to the search. How can I extract the description that appears under each URL?

Best answer

You can scrape the descriptions from Google search results with the BeautifulSoup web scraping library.

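To collect information from all pages, you can paginate with a while True loop. On its own the loop would run forever, so in our case it exits as soon as the button that switches to the next page is no longer present. That button is matched by the CSS selector ".d6cvqb a[id=pnnext]":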

if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
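
The loop logic can be tried offline. The sketch below replaces the real request with a hypothetical fake_fetch function that pretends there are three result pages. The names fake_fetch and collect_all_pages and the page contents are assumptions made for illustration, not part of the original answer.

```python
def fake_fetch(start):
    """Hypothetical stand-in for one Google request: returns (results, has_next)."""
    pages = {
        0: ["result 1", "result 2"],
        10: ["result 3", "result 4"],
        20: ["result 5"],
    }
    results = pages.get(start, [])
    has_next = (start + 10) in pages  # plays the role of the "next page" button
    return results, has_next


def collect_all_pages(fetch):
    """Walk the pages with a `while True` loop until no next page is reported."""
    params = {"start": 0}
    collected = []
    while True:
        results, has_next = fetch(params["start"])
        collected.extend(results)
        if has_next:
            params["start"] += 10  # same step as in the snippet above
        else:
            break
    return collected


all_results = collect_all_pages(fake_fetch)
print(all_results)
```

Keeping the fetch step behind a function argument is only a convenience for this sketch: it lets the loop be checked without sending any request.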

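You can find everything you need (description, title, and so on) with CSS selectors, which are easy to identify on the page with the SelectorGadget Chrome extension (it does not always work well if the website is rendered with JavaScript).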

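Make sure you send a user-agent request header so that the request looks like a visit from a "real" user. The default requests user-agent is python-requests, so a website can tell that the request most likely comes from a script. You can check what your user-agent is.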
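
As a rough illustration of that point, the sketch below uses only the standard library, so nothing is actually sent. looks_like_script is a hypothetical check and not how Google detects scripts; it only shows why an identifier such as "python-requests/2.31.0" (the version number is an example) stands out next to a browser string.

```python
from urllib.request import Request

BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
)


def looks_like_script(user_agent):
    """Hypothetical server-side check: flag well-known HTTP-library identifiers."""
    markers = ("python-requests", "python-urllib", "curl")
    return user_agent.lower().startswith(markers)


# Build (but do not send) a request that carries a browser-like User-Agent.
req = Request(
    "https://www.google.com/search?q=gotham",
    headers={"User-Agent": BROWSER_UA},
)
sent_ua = req.get_header("User-agent")  # urllib stores the header name as "User-agent"

print(looks_like_script("python-requests/2.31.0"))
print(looks_like_script(sent_ua))
```

The same headers dictionary is what gets passed to requests.get through its headers argument.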

Check the code in the online IDE.

from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "gotham",   # query
    "hl": "en",      # language
    "gl": "us",      # country of the search, US -> USA
    "start": 0,      # result offset, 0 is the first page
    # "num": 100     # maximum number of results to return
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

page_num = 0

website_data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    # each organic result sits in its own ".tF2Cxc" container
    for result in soup.select(".tF2Cxc"):
        website_name = result.select_one(".yuRUbf a")["href"]
        try:
            description = result.select_one(".lEBKkf").text
        except AttributeError:
            # select_one() returned None: this result has no description
            description = None

        website_data.append({
            "website_name": website_name,
            "description": description
        })

    # keep going while the "next page" button is present
    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break

print(json.dumps(website_data, indent=2, ensure_ascii=False))

Example output:

[
  {
    "website_name": "https://www.imdb.com/title/tt3749900/",
    "description": "The show follows Jim as he cracks strange cases whilst trying to help a young Bruce Wayne solve the mystery of his parents' murder. It seemed each week for a ..."
  },
  {
    "website_name": "https://www.netflix.com/watch/80023082",
    "description": "When the key witness in a homicide ends up dead while being held for questioning, Gordon suspects an inside job and seeks details from an old friend."
  },
  {
    "website_name": "https://www.gothamknightsgame.com/",
    "description": "Gotham Knights is an open-world, action RPG set in the most dynamic and interactive Gotham City yet. In either solo-play or with one other hero, ..."
  },
  # ...
]

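Alternatively, you can use the Google Search Engine Results API from SerpApi. It is a paid API with a free plan. The difference is that it bypasses blocks from Google (including CAPTCHA), so there is no need to create and maintain a parser.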

Code example:

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

params = {
    "api_key": os.getenv("API_KEY"),  # serpapi key
    "engine": "google",               # serpapi parser engine
    "q": "gotham",                    # search query
    "num": "100"                      # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)  # where data extraction happens

organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()  # JSON -> Python dictionary

    page_num += 1

    # .get() avoids a KeyError when a page has no organic results
    for result in results.get("organic_results", []):
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet")
        })

    if "next_link" in results.get("serpapi_pagination", {}):
        # copy the query parameters of the next page's URL into the search
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))

Output:

[
  {
    "title": "Gotham (TV Series 2014–2019) - IMDb",
    "snippet": "The show follows Jim as he cracks strange cases whilst trying to help a young Bruce Wayne solve the mystery of his parents' murder. It seemed each week for a ..."
  },
  {
    "title": "Gotham (TV series) - Wikipedia",
    "snippet": "Gotham is an American superhero crime drama television series developed by Bruno Heller, produced by Warner Bros. Television and based on characters from ..."
  },
  # ...
]
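
The "next_link" handling in the SerpApi loop is dense, so here is a standalone sketch of what that one line does. The next_link value below is a made-up example URL, not real SerpApi output, and only urlsplit and parse_qsl from the standard library are used.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical pagination URL, made up for illustration.
next_link = "https://serpapi.com/search.json?engine=google&q=gotham&num=100&start=100"

# Parameters of the current search (the API key is left out of this sketch).
params = {"engine": "google", "q": "gotham", "num": "100"}

# Take the query string of the URL and turn it into a dict of parameters.
next_params = dict(parse_qsl(urlsplit(next_link).query))

# Merging overwrites the existing keys and adds the new "start" offset.
params.update(next_params)
print(params)
```

Note that parse_qsl returns every value as a string, so the offset arrives as "100" rather than 100.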

Regarding "python - How to extract descriptions from Google search using Python?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/46641941/
