python - Scraping dynamic elements

Reposted. Author: 太空宇宙. Last updated: 2023-11-04 00:07:19

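Below is my code. It works, but sometimes it does not. The problem is intermittent; could it be caused by dynamic elements in the page? What is the solution for dynamic elements?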

import requests
from bs4 import BeautifulSoup


def collect_bottom_url(product_string):
    """
    collect_bottom_url:
    Accept a product name as an argument, build the search URL for that
    product, and collect all the URLs given at the bottom of the page.

    :return: list_of_urls
    """

    url = 'https://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=' + product_string
    # Download the main search page for the product
    webpage = requests.get(url)

    # Store the main URL of the product in a list
    list_of_urls = list()
    list_of_urls.append(url)

    # Parse the downloaded page using the lxml parser
    my_soup = BeautifulSoup(webpage.text, "lxml")

    # Find all elements with class "pagnLink" in the page
    urls_at_bottom = my_soup.find_all(class_='pagnLink')

    empty_list = list()
    for b_url in urls_at_bottom:
        empty_list.append(b_url.find('a')['href'])

    for item in empty_list:
        item = "https://www.amazon.in/" + item
        list_of_urls.append(item)
    print(list_of_urls)
    return list_of_urls


collect_bottom_url('book')

This is output 1, which is correct:

['https://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=book', 'https://www.amazon.in//book/s?ie=UTF8&page=2&rh=i%3Aaps%2Ck%3Abook', 'https://www.amazon.in//book/s?ie=UTF8&page=3&rh=i%3Aaps%2Ck%3Abook']

This is output 2, which is incorrect:

['https://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=book']

Best answer

The page is not dynamic. The site responds with a CAPTCHA page because you are sending the default user agent, so change it.

headers = {"User-Agent": 'Mozilla/5.0.............'}

def collect_bottom_url(product_string):
    .....
    webpage = requests.get(url, headers=headers)
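
To tell a CAPTCHA response apart from a normal result page, something like the following might help. It is only a sketch: the User-Agent value, the marker phrases in `CAPTCHA_MARKERS`, and the helper names `looks_like_captcha` and `fetch_search_page` are assumptions made for illustration and are not part of the original answer.

```python
# Hypothetical example value; substitute the User-Agent string of a real browser.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )
}

# Assumed marker phrases; verify them against the page you actually receive.
CAPTCHA_MARKERS = ("captcha", "enter the characters you see")


def looks_like_captcha(html):
    """Return True if the HTML contains any of the assumed CAPTCHA markers."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


def fetch_search_page(url, headers=HEADERS, timeout=10):
    """Download a page with a browser-like User-Agent and fail loudly on a CAPTCHA."""
    import requests  # third-party; imported here so looks_like_captcha needs no extras

    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    if looks_like_captcha(response.text):
        raise RuntimeError("Received a CAPTCHA page instead of search results")
    return response.text
```

Raising an error here makes the intermittent case visible, rather than silently returning a list that contains only the first URL.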

For pages that really are dynamic, use Selenium.
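
A minimal Selenium sketch for the dynamic case could look like this. It assumes Chrome and a matching driver are installed, reuses the `pagnLink` class and the search URL from the question, and introduces the hypothetical helpers `build_search_url` and `collect_rendered_links`; the 10 second timeout is an arbitrary choice.

```python
from urllib.parse import quote_plus

SEARCH_URL = (
    "https://www.amazon.in/s/ref=nb_sb_noss_2"
    "?url=search-alias%3Daps&field-keywords="
)


def build_search_url(product_string):
    """Build the search URL, escaping spaces and other special characters."""
    return SEARCH_URL + quote_plus(product_string)


def collect_rendered_links(product_string, css_class="pagnLink", timeout=10):
    """Open the search page in a real browser and read the links after rendering."""
    # Third-party imports kept local so build_search_url works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(build_search_url(product_string))
        # Wait until at least one element with the class exists in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, css_class))
        )
        anchors = driver.find_elements(By.CSS_SELECTOR, f".{css_class} a")
        hrefs = [anchor.get_attribute("href") for anchor in anchors]
    finally:
        driver.quit()  # always close the browser, even if the wait times out
    return [href for href in hrefs if href]
```

The explicit wait is what distinguishes this from `requests`: the links are read only after the browser has built the page, so elements added by JavaScript are included.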

Regarding python - scraping dynamic elements, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/53655327/
