gpt4 book ai didi

python - 无法从网页上抓取所有公司名称

转载 作者:行者123 更新时间:2023-12-03 16:03:01 25 4
gpt4 key购买 nike

我正在尝试从此 webpage 解析所有公司名称。周围有2431公司。但是,我在下面尝试过的方法可以获取1000结果。
通过开发工具,这是我可以看到的关于响应结果数量的信息:

hitsPerPage: 1000
index: "YCCompany_production"
nbHits: 2431 <------------------------
nbPages: 1
page: 0

How can I get the rest of the results using requests?


到目前为止,我已经尝试过:
import requests

url = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries?'

params = {
'x-algolia-agent': 'Algolia for JavaScript (3.35.1); Browser; JS Helper (3.1.0)',
'x-algolia-application-id': '45BWZJ1SGC',
'x-algolia-api-key': 'NDYzYmNmMTRjYzU4MDE0ZWY0MTVmMTNiYzcwYzMyODFlMjQxMWI5YmZkMjEwMDAxMzE0OTZhZGZkNDNkYWZjMHJlc3RyaWN0SW5kaWNlcz0lNUIlMjJZQ0NvbXBhbnlfcHJvZHVjdGlvbiUyMiU1RCZ0YWdGaWx0ZXJzPSU1QiUyMiUyMiU1RCZhbmFseXRpY3NUYWdzPSU1QiUyMnljZGMlMjIlNUQ='
}
payload = {"requests":[{"indexName":"YCCompany_production","params":"hitsPerPage=1000&query=&page=0&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="}]}

with requests.Session() as s:
s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
r = s.post(url,params=params,json=payload)
print(len(r.json()['results'][0]['hits']))

最佳答案

解决方法是,您可以使用字母作为搜索模式来模拟搜索。使用下面的代码,您将获得所有2431家公司作为字典,ID为键,完整的公司数据字典为值。

import requests
import string

params = {
'x-algolia-agent': 'Algolia for JavaScript (3.35.1); Browser; JS Helper (3.1.0)',
'x-algolia-application-id': '45BWZJ1SGC',
'x-algolia-api-key': 'NDYzYmNmMTRjYzU4MDE0ZWY0MTVmMTNiYzcwYzMyODFlMjQxMWI5YmZkMjEwMDAxMzE0OTZhZGZkNDNkYWZjMHJl'
'c3RyaWN0SW5kaWNlcz0lNUIlMjJZQ0NvbXBhbnlfcHJvZHVjdGlvbiUyMiU1RCZ0YWdGaWx0ZXJzPSU1QiUyMiUy'
'MiU1RCZhbmFseXRpY3NUYWdzPSU1QiUyMnljZGMlMjIlNUQ='
}

url = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries'
result = dict()
for letter in string.ascii_lowercase:
print(letter)

payload = {
"requests": [{
"indexName": "YCCompany_production",
"params": "hitsPerPage=1000&query=" + letter + "&page=0&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="
}]
}

r = requests.post(url, params=params, json=payload)
result.update({h['id']: h for h in r.json()['results'][0]['hits']})

print(len(result))

关于python - 无法从网页上抓取所有公司名称,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/65527354/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com