
python - How to iterate through pages on Ebay

Reposted · Author: 行者123 · Updated: 2023-12-01 00:09:11

I am building a scraper for Ebay. I am trying to figure out a way to manipulate the page-number portion of the Ebay URL to go to the next page until there are no more pages. If you are on page 2, the page-number portion looks like "_pgn=2". I noticed that if you enter a number greater than the listing's maximum number of pages, the page reloads to the last page instead of returning a "page does not exist" error (if a search has 5 pages, _pgn=100 routes to the same page as _pgn=5).

How can I start at the first page, get the page's HTML soup, pull all the relevant data from that soup, then load the next page with the new page number and repeat the process until there are no new pages left to scrape?

I tried to get the number of results with a Selenium XPath lookup and use math.ceil of the result count divided by 50 (the default maximum number of listings per page) as my max_page, but I get an error saying the element does not exist, even though it does: self.driver.findxpath('xpath').text. That 243 result count is what I am trying to get with the XPath.
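For reference, a minimal sketch of that page-count calculation done with requests/BeautifulSoup instead of Selenium. The srp-controls__count-heading class name is an assumption about eBay's markup, not a verified selector, so check it against the live page source:

import math
import re

import requests
from bs4 import BeautifulSoup


def get_max_page(url, per_page=50):
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # Assumed selector for the "243 results" heading; eBay's markup
    # changes, so verify the class name before relying on it.
    heading = soup.find('h1', {'class': 'srp-controls__count-heading'})
    count = int(re.search(r'[\d,]+', heading.text).group().replace(',', ''))
    return math.ceil(count / per_page)  # e.g. ceil(243 / 50) == 5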

import requests
from bs4 import BeautifulSoup
from selenium import webdriver


class EbayScraper(object):

    def __init__(self, item, buying_type):
        self.base_url = "https://www.ebay.com/sch/i.html?_nkw="
        self.driver = webdriver.Chrome(r"chromedriver.exe")
        self.item = item
        self.buying_type = buying_type + "=1"
        self.url_seperator = "&_sop=12&rt=nc&LH_"
        self.url_seperator2 = "&_pgn="
        self.page_num = "1"

    def getPageUrl(self):
        if self.buying_type == "Buy It Now=1":
            self.buying_type = "BIN=1"

        self.item = self.item.replace(" ", "+")

        url = self.base_url + self.item + self.url_seperator + self.buying_type + self.url_seperator2 + self.page_num
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup

    def getInfo(self, soup):
        for listing in soup.find_all("li", {"class": "s-item"}):
            raw = listing.find_all("a", {"class": "s-item__link"})
            if raw:
                raw_price = listing.find_all("span", {"class": "s-item__price"})[0]
                raw_title = listing.find_all("h3", {"class": "s-item__title"})[0]
                raw_link = listing.find_all("a", {"class": "s-item__link"})[0]
                raw_condition = listing.find_all("span", {"class": "SECONDARY_INFO"})[0]
                condition = raw_condition.text
                price = float(raw_price.text[1:])
                title = raw_title.text
                link = raw_link['href']
                print(title)
                print(condition)
                print(price)
                if self.buying_type != "BIN=1":
                    raw_time_left = listing.find_all("span", {"class": "s-item__time-left"})[0]
                    time_left = raw_time_left.text[:-4]
                    print(time_left)
                print(link)
                print('\n')



if __name__ == '__main__':
    item = input("Item: ")
    buying_type = input("Buying Type (e.g, 'Buy It Now' or 'Auction'): ")

    instance = EbayScraper(item, buying_type)
    page = instance.getPageUrl()
    instance.getInfo(page)
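As an aside, the error described above likely comes from the call itself: findxpath is not a Selenium method. A sketch of the usual pattern, with an explicit wait so the element has time to render; the XPath here is a placeholder built on the same assumed class name as above, not a verified locator:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.ebay.com/sch/i.html?_nkw=laptop")
# Wait up to 10 seconds for the results-count element before reading it;
# adapt the XPath to the real page before use.
count_el = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.XPATH, "//h1[contains(@class, 'srp-controls__count-heading')]")
    )
)
print(count_el.text)  # e.g. "243 results"
driver.quit()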

Best Answer

If you want to iterate over all pages and gather all results, your script needs to check whether there is a next page after it visits the current one.

import requests
from bs4 import BeautifulSoup


class EbayScraper(object):

    def __init__(self, item, buying_type):
        ...
        self.currentPage = 1

    def get_url(self, page=1):
        if self.buying_type == "Buy It Now=1":
            self.buying_type = "BIN=1"

        self.item = self.item.replace(" ", "+")
        # _ipg=200 requests 200 items per page
        return '{}{}{}{}{}{}&_ipg=200'.format(
            self.base_url, self.item, self.url_seperator, self.buying_type,
            self.url_seperator2, page
        )

    def page_has_next(self, soup):
        # Find the pagination list and the currently selected page,
        # then check whether another <li> follows it.
        container = soup.find('ol', 'x-pagination__ol')
        currentPage = container.find('li', 'x-pagination__li--selected')
        next_sibling = currentPage.next_sibling
        if next_sibling is None:
            print(container)
        return next_sibling is not None

    def iterate_page(self):
        # loop while there are more pages, otherwise stop
        while True:
            page = self.getPageUrl(self.currentPage)
            self.getInfo(page)
            if self.page_has_next(page) is False:
                break
            else:
                self.currentPage += 1

    def getPageUrl(self, pageNum):
        url = self.get_url(pageNum)
        print('page: ', url)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup

    def getInfo(self, soup):
        ...


if __name__ == '__main__':
    item = input("Item: ")
    buying_type = input("Buying Type (e.g, 'Buy It Now' or 'Auction'): ")

    instance = EbayScraper(item, buying_type)
    instance.iterate_page()

The important functions here are page_has_next and iterate_page:

  • page_has_next - checks whether the page's pagination has another li element next to the selected one. For example, with < 1 2 3 >, if we are on page 1 it checks whether a 2 follows (see the sketch after this list).

  • iterate_page - loops until page_has_next reports that no next page exists.
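To see the page_has_next logic in isolation, here is a tiny self-contained check against hand-written pagination HTML; the class names mirror the answer's assumptions about eBay's markup rather than verified selectors:

from bs4 import BeautifulSoup

html = ('<ol class="x-pagination__ol">'
        '<li class="x-pagination__li--selected">1</li>'
        '<li>2</li><li>3</li></ol>')
soup = BeautifulSoup(html, 'html.parser')
selected = soup.find('li', 'x-pagination__li--selected')
# On the last page, the selected <li> has no next sibling.
print(selected.next_sibling is not None)  # True -> a next page exists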

Also note that you don't need selenium unless you need to mimic user clicks or need a browser to navigate the page.
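If the pagination markup turns out to be unreliable, the rerouting behavior noted in the question suggests another stop condition: eBay sends any out-of-range _pgn back to the last page, so when a page's first listing repeats the previous page's, you are done. A sketch of such an alternative loop on the same class (iterate_until_repeat is a hypothetical method, not part of the original answer):

    def iterate_until_repeat(self):
        # Alternative to page_has_next: stop when two consecutive
        # pages start with the same listing, since eBay reroutes
        # out-of-range _pgn values back to the last page.
        previous_first_link = None
        page_num = 1
        while True:
            soup = self.getPageUrl(page_num)
            first = soup.find("a", {"class": "s-item__link"})
            link = first["href"] if first else None
            if link is None or link == previous_first_link:
                break
            self.getInfo(soup)
            previous_first_link = link
            page_num += 1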

For python - How to iterate through pages on Ebay, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59744646/
