
python - Loop problem: BeautifulSoup only collecting some elements per page

Reposted by 行者123, updated 2023-12-01 07:15:51

I am crawling multiple pages to collect some HTML, but BeautifulSoup seems to collect only a seemingly random subset of the information. I am also using Selenium with geckodriver on Ubuntu 16.04 to click through to the next page.

# import libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
from selenium import webdriver
import time

# This URL is ok according to eBay's robots.txt:
urlpage = 'https://www.ebay.com/sch/i.html?_nkw=lululemon&_sacat=15724&rt=nc&LH_Sold=1&LH_Complete=1&_pgn=6'

page = urlopen(urlpage).read()
soup = BeautifulSoup(page, 'html.parser')

# Specify containers
item_containers = soup.find_all('div', {'class': 's-item__info clearfix'})
print(len(item_containers))  # should be about 4 dozen

driver = webdriver.Firefox()

# get web page
driver.get(urlpage)

summaries = []
prices = []

# Loop through
for container in item_containers:
    # If the item has a summary, then extract...:
    summary_tag = container.find('h3', class_='s-item__title s-item__title--has-tags')
    if summary_tag is not None:
        # The summary
        summaries.append(summary_tag.text)
        # The color
        #color = container.find('span', {'class': 's-item__dynamic s-item__dynamicAttributes2'})
        #colors.append(color)
        # The price
        price_tag = container.find('span', attrs={'class': 'POSITIVE'})
        if price_tag is not None:
            prices.append(price_tag.text)

# Click through to the next page
button = driver.find_elements_by_class_name('x-pagination__control')[1]
button.click()

driver.refresh()
time.sleep(20)

# driver.quit()

For each tag I specify there should be about four dozen elements per page to collect, but after a few pages I may end up with only a dozen or so. My loop logic must be off somewhere. Please advise; I am trying to improve my Python!

Best Answer

You can do this without Selenium; use requests together with Beautiful Soup.

from bs4 import BeautifulSoup
import requests

url = "https://www.ebay.com/sch/i.html?_nkw=lululemon&_sacat=15724&rt=nc&LH_Sold=1&LH_Complete=1&_pgn=6"
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

summary = []
price = []
for item in soup.select('div.s-item__info.clearfix'):
    if item.select_one("h3.s-item__title"):
        summary.append(item.select_one("h3.s-item__title").text)
    if item.select_one("span.s-item__price"):
        price.append(item.select_one("span.s-item__price").text)

print(summary)
print(price)
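One caveat with collecting titles and prices into two separate lists: whenever a listing is missing one of the fields, the lists drift out of alignment. A minimal sketch of collecting one dict per listing instead; the inline HTML snippet is a made-up stand-in for eBay's markup, using the same class names as the selectors above:

```python
from bs4 import BeautifulSoup

# Made-up snippet mimicking eBay's listing markup; the second listing
# deliberately has no price element.
html = """
<div class="s-item__info clearfix">
  <h3 class="s-item__title">Lululemon tank</h3>
  <span class="s-item__price">$25.00</span>
</div>
<div class="s-item__info clearfix">
  <h3 class="s-item__title">Lululemon leggings</h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for item in soup.select("div.s-item__info.clearfix"):
    title = item.select_one("h3.s-item__title")
    price = item.select_one("span.s-item__price")
    # One row per listing; a missing field becomes None instead of
    # silently shifting two parallel lists out of alignment.
    rows.append({
        "title": title.text if title else None,
        "price": price.text if price else None,
    })

print(rows)
```

With this shape, each title stays paired with its own price, which also makes it trivial to load the results into a pandas DataFrame later.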

For pagination you can use a while loop, with page_num tracking which page you are on. As an example, I fetch up to 10 pages here.

import requests
from bs4 import BeautifulSoup

page_num = 1
baseurl = "https://www.ebay.com/sch/i.html?_nkw=lululemon&_sacat=15724&rt=nc&LH_Sold=1&LH_Complete=1&_pgn={}"

summary = []
price = []
while page_num <= 10:
    html = requests.get(baseurl.format(page_num)).text
    soup = BeautifulSoup(html, 'html.parser')

    for item in soup.select('div.s-item__info.clearfix'):
        if item.select_one("h3.s-item__title"):
            summary.append(item.select_one("h3.s-item__title").text)
        if item.select_one("span.s-item__price"):
            price.append(item.select_one("span.s-item__price").text)

    page_num = page_num + 1

print(summary)
print(price)
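If the total number of result pages is unknown, the loop can instead stop as soon as a page comes back with no listings. A minimal sketch, assuming the same class names as above; `scrape_all_pages` and its `fetch_html` callable are hypothetical names introduced here (in real use you would pass something like `lambda n: requests.get(baseurl.format(n)).text`), and the canned `pages` dict below stands in for live eBay responses:

```python
import time
from bs4 import BeautifulSoup

def scrape_all_pages(fetch_html, max_pages=50, delay=0.0):
    """Collect (title, price) pairs page by page, stopping at the first
    page with no listings. fetch_html(page_num) must return HTML text."""
    results = []
    for page_num in range(1, max_pages + 1):
        soup = BeautifulSoup(fetch_html(page_num), "html.parser")
        items = soup.select("div.s-item__info.clearfix")
        if not items:  # an empty page means we ran past the last results page
            break
        for item in items:
            title = item.select_one("h3.s-item__title")
            price = item.select_one("span.s-item__price")
            results.append((title.text if title else None,
                            price.text if price else None))
        if delay:
            time.sleep(delay)  # be polite between real HTTP requests
    return results

# Demo with canned HTML standing in for requests.get(...).text;
# page 3 is missing, so the loop stops after two pages.
pages = {
    1: '<div class="s-item__info clearfix"><h3 class="s-item__title">A</h3>'
       '<span class="s-item__price">$1</span></div>',
    2: '<div class="s-item__info clearfix"><h3 class="s-item__title">B</h3>'
       '<span class="s-item__price">$2</span></div>',
}
print(scrape_all_pages(lambda n: pages.get(n, "")))  # [('A', '$1'), ('B', '$2')]
```

The `max_pages` cap is a safety net so a selector change on eBay's side cannot turn this into an unbounded crawl.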

Regarding "python - Loop problem: BeautifulSoup only collecting some elements per page", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57962413/
