python - Keep getting connection reset error 10054 when scraping Amazon job results

Reposted. Author: 太空宇宙. Updated: 2023-11-03 21:02:42
As my code probably shows, I am still quite new to Python and am working through this by trial and error.

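I am scraping Amazon job search results, but after about 50 requests to the URL I keep getting connection reset error 10054. I added the Crawlera proxy network to avoid being banned, but it still does not work. I know the URL is long, but it seems to work without my having to split it into many separate parts. The results cover about 12,000 jobs in total at 10 jobs per page, so I don't even know whether scraping that much data is itself the problem. Amazon shows each page in the URL as "result_limit=10", so I step through the offset in increments of 10 rather than using one page number per request. I'm not sure that is right. Also, the last page stops at 9,990.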

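The code works, but I'm not sure how to get past the connection error. As you can see, I've added things like a user agent, but I'm not sure it does anything. Any help would be greatly appreciated, as I have been stuck on this for countless days and hours. Thanks!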

# Imports inferred from the names used below
import csv
import json
from datetime import datetime
from random import randint
from time import sleep, time
from warnings import warn

from fake_useragent import UserAgent
from IPython.display import clear_output
from requests import get


def get_all_jobs(pages):
    requests = 0
    start_time = time()
    total_runtime = datetime.now()

    for page in pages:
        try:
            # A new random User-Agent is picked for every request
            ua = UserAgent()
            header = {
                'User-Agent': ua.random
            }
            response = get('https://www.amazon.jobs/en/search.json?base_query=&city=&country=USA&county=&'
                           'facets%5B%5D=location&facets%5B%5D=business_category&facets%5B%5D=category&'
                           'facets%5B%5D=schedule_type_id&facets%5B%5D=employee_class&facets%5B%5D=normalized_location'
                           '&facets%5B%5D=job_function_id&job_function_id%5B%5D=job_function_corporate_80rdb4&'
                           'latitude=&loc_group_id=&loc_query=USA&longitude=&'
                           'normalized_location%5B%5D=Seattle%2C+Washington%2C+USA&'
                           'normalized_location%5B%5D=San+Francisco'
                           '%2C+California%2C+USA&normalized_location%5B%5D=Sunnyvale%2C+California%2C+USA&'
                           'normalized_location%5B%5D=Bellevue%2C+Washington%2C+USA&'
                           'normalized_location%5B%5D=East+Palo+Alto%2C+California%2C+USA&'
                           'normalized_location%5B%5D=Santa+Monica%2C+California%2C+USA&offset={}&query_options=&'
                           'radius=24km&region=&result_limit=10&schedule_type_id%5B%5D=Full-Time&'
                           'sort=relevant'.format(page),
                           headers=header,
                           # Only an "http" proxy is listed; requests selects the proxy by
                           # URL scheme, so this https:// URL bypasses it
                           proxies={
                               "http": "http://1ea01axxxxxxxxxxxxxxxxxxx:@proxy.crawlera.com:8010/"
                           }
                           )
            # Monitor the frequency of requests
            requests += 1

            # Pauses the loop between 8 and 15 seconds
            sleep(randint(8, 15))
            current_time = time()
            elapsed_time = current_time - start_time
            print("Amazon Request:{}; Frequency: {} request/s; Total Run Time: {}".format(requests,
                  requests / elapsed_time, datetime.now() - total_runtime))
            clear_output(wait=True)

            # Throw a warning for non-200 status codes
            if response.status_code != 200:
                warn("Request: {}; Status code: {}".format(requests, response.status_code))

            # Break the loop if number of requests is greater than expected
            if requests > 999:
                warn("Number of requests was greater than expected.")
                break

            yield from get_job_infos(response)

        # Only AttributeError is caught; a connection reset is raised by requests
        # as requests.exceptions.ConnectionError and is not handled here
        except AttributeError as e:
            print(e)
            continue


def get_job_infos(response):

    amazon_jobs = json.loads(response.text)

    for website in amazon_jobs['jobs']:
        site = website['company_name']
        title = website['title']
        location = website['normalized_location']
        job_link = 'https://www.amazon.jobs' + website['job_path']
        yield site, title, location, job_link


def main():
    # Offsets start at 0 and increase by 10 per page.
    # range() excludes its end value, so the last offset requested is 9980.
    pages = [str(i) for i in range(0, 9990, 10)]

    with open('amazon_jobs.csv', "w", newline='', encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Website", "Title", "Location", "Job URL"])
        writer.writerows(get_all_jobs(pages))


if __name__ == "__main__":
    main()

Best answer

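I'm no expert on Amazon's anti-bot policy, but if they have flagged you once, your IP may stay flagged for a while, and they may limit how many similar requests you can make within a given time frame. Search for a patch for urllib that lets you see the request headers in real time. Besides the number of requests per IP/domain within a time frame, Amazon will look at your request headers to decide whether you are a human. Compare what you are sending with the request headers of a regular browser.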
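
For that comparison, a minimal sketch, assuming the requests library that the question already uses: a request can be prepared without being sent, which shows the headers requests itself would attach. The helper name `outgoing_headers` is made up for illustration.

```python
import requests


def outgoing_headers(url, headers=None):
    """Return the headers requests would attach to a GET, without sending it."""
    session = requests.Session()
    request = requests.Request("GET", url, headers=headers)
    # prepare_request merges the session's default headers with the ones passed in
    prepared = session.prepare_request(request)
    return dict(prepared.headers)


if __name__ == "__main__":
    sent = outgoing_headers("https://www.amazon.jobs/en/search.json")
    for name, value in sent.items():
        print("{}: {}".format(name, value))
```

With no custom headers, the User-Agent is the library default (it starts with `python-requests/`) and no Referer is present. After a real request, the same information is available as `response.request.headers`.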

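As standard practice, keep cookies for a normal length of time, and use a proper referer and a popular user agent. All of this can be done with the requests library (pip install requests); see the Session object.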

It looks like you are sending requests to an internal Amazon URL without a Referer header, which does not happen in a normal browser.
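
A rough sketch of what that could look like with a requests `Session`. The header values are assumptions chosen for illustration: the User-Agent string is just an example of a common browser, and the Referer is a guess at the search page a browser would be on. The names `BROWSER_HEADERS`, `make_session` and `fetch_page` are made up, and only three of the question's query parameters are kept to keep the example short.

```python
import requests

# Illustrative values only; copy the real ones from a browser's network tab
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.amazon.jobs/en/search",  # assumed referring page
}


def make_session():
    """Create one session whose headers and cookies are reused for every request."""
    session = requests.Session()
    session.headers.update(BROWSER_HEADERS)
    return session


def fetch_page(session, offset):
    """Fetch one page of results through the shared session."""
    params = {"country": "USA", "result_limit": 10, "offset": offset}
    return session.get("https://www.amazon.jobs/en/search.json",
                       params=params, timeout=30)
```

Creating the session once, outside the loop, means cookies set by the server are sent back on later requests and the User-Agent stays the same for the whole run.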

As another example, keeping the cookies from one user agent and then switching to a different one is also not something a browser does.
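
On the question's point about getting past the connection error: the code only catches `AttributeError`, while requests reports a reset connection as `requests.exceptions.ConnectionError`, so the first reset ends the whole run. One possible approach, sketched here under assumptions, is to retry with a growing pause. The helper `call_with_retries` and its parameters are hypothetical and use only the standard library; with requests, pass `retry_on=(requests.exceptions.ConnectionError,)`. Retrying does not address the reason the server resets the connection.

```python
import time


def call_with_retries(fetch, retry_on=(ConnectionError,), attempts=4,
                      base_delay=5.0, sleep=time.sleep):
    """Call fetch(); on a listed exception, wait and try again.

    The wait doubles after each failure (base_delay, 2x, 4x, ...).
    The last failure is re-raised so the caller can decide what to do.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

For example, `call_with_retries(lambda: session.get(url, timeout=30), retry_on=(requests.exceptions.ConnectionError,))`, where `session` and `url` are whatever the scraper already uses.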

For "python - Keep getting connection reset error 10054 when scraping Amazon job results", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55622939/
