
python - How do I scrape all products from a random website?


I'm trying to get all of the products from this website, but somehow I don't think I've chosen the best approach, because some of them are missing and I can't figure out why. It's not the first time I've run into this problem, either.

The way I'm doing it at the moment is like this:

  • go to the index page of the website
  • get all the categories from there (A-Z 0-9)
  • access each of the above categories and recursively go through all of the subcategories from there until I reach the products page
  • when I reach the products page, check whether the product has more SKUs. If it does, get the links; otherwise, that's the only SKU.

Now, the code below works, but it doesn't get all of the products, and I can't see any reason why it would skip some. Maybe the way I'm approaching everything is wrong.

from lxml import html
from random import randint
from string import ascii_uppercase
from time import sleep
from requests import Session


INDEX_PAGE = 'https://www.richelieu.com/us/en/index'
session_ = Session()


def retry(link):
    wait = randint(0, 10)
    try:
        return session_.get(link).text
    except Exception as e:
        print('Retrying product page in {} seconds because: {}'.format(wait, e))
        sleep(wait)
        return retry(link)


def get_category_sections():
    au = list(ascii_uppercase)
    au.remove('Q')
    au.remove('Y')
    au.append('0-9')
    return au


def get_categories():
    html_ = retry(INDEX_PAGE)
    page = html.fromstring(html_)
    sections = get_category_sections()

    for section in sections:
        for link in page.xpath("//div[@id='index-{}']//li/a/@href".format(section)):
            yield '{}?imgMode=m&sort=&nbPerPage=200'.format(link)


def dig_up_products(url):
    html_ = retry(url)
    page = html.fromstring(html_)

    for link in page.xpath(
        '//h2[contains(., "CATEGORIES")]/following-sibling::*[@id="carouselSegment2b"]//li//a/@href'
    ):
        yield from dig_up_products(link)

    for link in page.xpath('//ul[@id="prodResult"]/li//div[@class="imgWrapper"]/a/@href'):
        yield link

    for link in page.xpath('//*[@id="ts_resultList"]/div/nav/ul/li[last()]/a/@href'):
        if link != '#':
            yield from dig_up_products(link)


def check_if_more_products(tree):
    more_prods = [
        all_prod
        for all_prod in tree.xpath("//div[@id='pm2_prodTableForm']//tbody/tr/td[1]//a/@href")
    ]
    if not more_prods:
        return False
    return more_prods


def main():
    for category_link in get_categories():
        for product_link in dig_up_products(category_link):
            product_page = retry(product_link)
            product_tree = html.fromstring(product_page)
            more_products = check_if_more_products(product_tree)
            if not more_products:
                print(product_link)
            else:
                for sku_product_link in more_products:
                    print(sku_product_link)


if __name__ == '__main__':
    main()

Now, this question may be too generic, but I'm wondering whether there's a rule of thumb to follow when someone wants to get all of the data (products, in this case) from a website. Could someone walk me through the whole process of discovering the best way to approach a scenario like this?

Best Answer

If your end goal is to scrape the entire product listing for each category, it may make sense to target the full product listings for each category on the index page. This program uses BeautifulSoup to find each category on the index page and then iterates over every product page under each category. The final output is a list of namedtuples, one per category, storing the category name, the current page link, and the full product titles for that link:

url = "https://www.richelieu.com/us/en/index"
import urllib
import re
from bs4 import BeautifulSoup as soup
from collections import namedtuple
import itertools

s = soup(str(urllib.urlopen(url).read()), 'lxml')
blocks = s.find_all('div', {'id': re.compile('index\-[A-Z]')})
results_data = {[c.text for c in i.find_all('h2', {'class': 'h1'})][0]: [b['href'] for b in i.find_all('a', href=True)] for i in blocks}
final_data = []
category = namedtuple('category', 'abbr, link, products')
for category1, links in results_data.items():
    for link in links:
        page_data = str(urllib.urlopen(link).read())
        print "link: ", link
        page_links = re.findall(';page\=(.*?)#results">(.*?)</a>', page_data)
        if not page_links:
            final_page_data = soup(page_data, 'lxml')
            final_titles = [i.text for i in final_page_data.find_all('h3', {'class': 'itemHeading'})]
            new_category = category(category1, link, final_titles)
            final_data.append(new_category)
        else:
            page_numbers = set(itertools.chain(*list(map(list, page_links))))
            full_page_links = ["{}?imgMode=m&sort=&nbPerPage=48&page={}#results".format(link, num) for num in page_numbers]
            for page_result in full_page_links:
                new_page_data = soup(str(urllib.urlopen(page_result).read()), 'lxml')
                final_titles = [i.text for i in new_page_data.find_all('h3', {'class': 'itemHeading'})]
                new_category = category(category1, link, final_titles)
                final_data.append(new_category)

print final_data

The output yields results in the following format:

[category(abbr=u'A', link='https://www.richelieu.com/us/en/category/tools-and-shop-supplies/workshop-accessories/tool-accessories/sander-accessories/1058847', products=[u'Replacement Plate for MKT9924DB Belt Sander', u'Non-Grip Vacuum Pads', u'Sandpaper Belt 2\xbd " x 14" for Compact Belt Sander PC371 or PC371K', u'Stick-on Non-Vacuum Pads', u'5" Non-Vacuum Disc Pad Hook-Face', u'Sanding Filter Bag', u'Grip-on Vacuum Pads', u'Plates for Non-Vacuum (Grip-On) Dynabug II Disc Pads - 7.62 cm x 10.79 cm (3" x 4-1/4")', u'4" Abrasive for Finishing Tool', u'Sander Backing Pad for RO 150 Sander', u'StickFix Sander Pad for ETS 125 Sander', u'Sub-Base Pad for Stocked Sanders', u'(5") Non-Vacuum Disc Pad Vinyl-Face', u'Replacement Sub-Base Pads for Stocked Sanders', u"5'' Multi-Hole Non-Vaccum Pad", u'Sander Backing Pad for RO 90 DX Sander', u'Converting Sanding Pad', u'Stick-On Vacuum Pads', u'Replacement "Stik It" Sub Base', u'Drum Sander/Planer Sandpaper'])....

To access each attribute, call it like this:

categories = [i.abbr for i in final_data]
links = [i.link for i in final_data]
products = [i.products for i in final_data]
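
If you want to persist these results instead of just printing them, one option is to flatten the namedtuples into a CSV file. This is a minimal sketch (the products.csv file name and the column layout are my own assumptions, and it targets Python 2 to match the answer's code):

import csv

# Flatten final_data (list of category namedtuples) into one row per product.
with open('products.csv', 'wb') as f:  # on Python 3: open('products.csv', 'w', newline='')
    writer = csv.writer(f)
    writer.writerow(['category', 'link', 'product'])
    for entry in final_data:
        for title in entry.products:
            # repeat the category abbreviation and page link on every product row
            writer.writerow([entry.abbr, entry.link, title.encode('utf-8')])  # drop .encode on Python 3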

I believe the benefit of using BeautifulSoup in this instance is that it offers a higher level of control over the scraping and is easy to modify. For instance, should the OP change his mind about which aspects of the products/index he would like to scrape, only simple changes to the find_all parameters are needed, since the general structure of the code above is centered around each product category from the index page.
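
As a concrete illustration, switching from scraping product titles to scraping product detail links only means swapping the find_all call. This is a sketch against a parsed listing page such as final_page_data from the snippet above; the "prodResult" id and "imgWrapper" class are taken from the OP's XPath expressions and are assumptions about the live page markup, not verified here:

# Collect product detail links instead of titles from a parsed listing page.
prod_list = final_page_data.find('ul', {'id': 'prodResult'})  # id taken from the OP's XPath, assumed correct
product_links = []
if prod_list is not None:
    for wrapper in prod_list.find_all('div', {'class': 'imgWrapper'}):
        for a in wrapper.find_all('a', href=True):
            product_links.append(a['href'])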

Regarding python - How do I scrape all products from a random website?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48015149/
