
Web crawler for testing and learning




Hi, I wanted to try writing a crawler.


I started with a very simple script, but I already get an error message when I run it.


What is wrong with the code?



I get this error at the `source = requests.get(url)` line:


Exception has occurred: ConnectTimeout
HTTPSConnectionPool(host='www.anisearch.de', port=443): Max retries exceeded with url: /anime/2788,naruto (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000002757873E090>, 'Connection to www.anisearch.de timed out. (connect timeout=None)'))
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

The above exception was the direct cause of the following exception:

urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x000002757873E090>, 'Connection to www.anisearch.de timed out. (connect timeout=None)')

The above exception was the direct cause of the following exception:

urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.anisearch.de', port=443): Max retries exceeded with url: /anime/2788,naruto (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000002757873E090>, 'Connection to www.anisearch.de timed out. (connect timeout=None)'))

During handling of the above exception, another exception occurred:

File "C:\Users\admin\Documents\Crawler\anisearch_crawler.py", line 5, in <module>
source = requests.get(url)
^^^^^^^^^^^^^^^^^
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='www.anisearch.de', port=443): Max retries exceeded with url: /anime/2788,naruto (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000002757873E090>, 'Connection to www.anisearch.de timed out. (connect timeout=None)'))

It is clear to me that it cannot reach the page, but I do not understand why. Here is my simple code:


```python
from bs4 import BeautifulSoup
import requests

url = "https://www.anisearch.de/anime/2788,naruto"
source = requests.get(url)
soup = BeautifulSoup(source.content, 'html.parser')

def info_anime(soup):
    # Extracting the name of the anime from the <meta> tag
    anime_name = soup.find('meta', {'name': 'title'})['content']
    print("Anime : " + anime_name)


info_anime(soup)
```

I debugged it with VS Code, but I don't really understand where the problem is. It says it cannot make the request, but why, and how do I solve this so that I can get the information back?


The site has no API, so I am trying this approach.
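
As a first debugging step, it can help to set an explicit timeout and a browser-like User-Agent header, and to catch the timeout so the script reports what actually happened. This is only a minimal diagnostic sketch (the header value and the 10-second timeout are arbitrary choices, not taken from the original script); if the block is at the network level, it will still time out:

```python
import requests

url = "https://www.anisearch.de/anime/2788,naruto"

# Browser-like User-Agent header; the exact value is an arbitrary example.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/58.0.3029.110 Safari/537.36"
}

try:
    # Explicit timeout so the request fails fast instead of hanging.
    source = requests.get(url, headers=headers, timeout=10)
    print("HTTP status:", source.status_code)
except requests.exceptions.ConnectTimeout:
    # The TCP connection never succeeded -- the problem is at the network
    # level (site down, firewall, or an IP-level block), not in the parsing code.
    print("Connection timed out before the server answered.")
```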


More replies

Either the site is/was down, or they blocked your IP address when they detected that you were trying to scrape the site.

What is the solution against that? I have only made a single request, nothing more.
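
One way to narrow down which of the two it is: check whether a plain TCP connection to the host works at all. A minimal sketch using only the standard library; if even this times out, the request never reaches the web server, so no header tweak in the scraping code will help:

```python
import socket

# Low-level reachability check for the host from the question.
try:
    with socket.create_connection(("www.anisearch.de", 443), timeout=10):
        print("TCP connection to port 443 succeeded; the host is reachable.")
except OSError as exc:
    # Site down, firewall, or an IP-level block.
    print(f"TCP connection failed: {exc}")
```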

Recommended answer

You could try using HTMLSession instead of requests; that worked for me.


```python
from bs4 import BeautifulSoup
from requests_html import HTMLSession

url = "https://www.anisearch.de/anime/2788,naruto"
source = HTMLSession().get(url)
# source.raise_for_status()  # good habit in general

soup = BeautifulSoup(source.content, 'html.parser')
```
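
Note that `requests_html` is a separate package (`pip install requests-html`), not part of `requests` itself. As far as I can tell, `HTMLSession` is built on top of `requests.Session` and sends a browser-style User-Agent by default, which may be why it sometimes gets past simple scraper checks; a connection-level timeout like the one in the question can still happen with it.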



More replies

Thanks for the help, but it still doesn't work for me. Maybe I should try it with a User-Agent, like this? `headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}` and then `source = HTMLSession().get(url, headers=headers)`
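
Written out as a runnable sketch, that suggestion looks roughly like this (the `timeout` is an extra addition; whether the header actually helps depends on how the site filters requests):

```python
from requests_html import HTMLSession

url = "https://www.anisearch.de/anime/2788,naruto"

# User-Agent string taken from the comment above; the timeout is a
# safeguard so a blocked connection fails quickly instead of hanging.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/58.0.3029.110 Safari/537.36"
}

source = HTMLSession().get(url, headers=headers, timeout=10)
print(source.status_code)
```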

@D1skanime does it work for you then? I usually use HTMLSession because I'm not good with headers, but it doesn't work 100% of the time.

No, it doesn't work for me; that's why I asked whether it works for you with the header. Are there other options? Should I perhaps try a proxy that I start directly from Python? What other possibilities are there?
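
For completeness, routing the request through a proxy with plain `requests` looks roughly like this. The proxy address below is only a placeholder, and a proxy only helps if the problem really is an IP-level block:

```python
import requests

url = "https://www.anisearch.de/anime/2788,naruto"

# Placeholder proxy address -- replace it with a proxy you actually have
# access to; this one will not work as-is.
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}

response = requests.get(url, proxies=proxies, timeout=10)
print(response.status_code)
```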
