
python - Web scraping issue while scraping pagination links from a website


I am trying to scrape data from all the category URLs listed on the home page (done), and then from the site's sub-category pages along with their pagination links. The website URL is here.

I have written the Python script to extract the data in a modular structure, because I need the output URLs of each step saved to a separate file before moving on to the next step. But now I am stuck on extracting all the pagination URLs, from which the data will be extracted afterwards. Also, I am only getting data from the first sub-category URL, not from all the listed sub-category URLs.

For example, in my script below the data currently only comes from:

General practice (main category page) - http://www.medicalexpo.com/cat/general-practice-K.html, and further down from Stethoscopes (sub-category page) - http://www.medicalexpo.com/medical-manufacturer/stethoscope-2.html.

I want the data from all of the sub-category links listed on that page.

Any help in getting the required output, containing the product URLs from all the listed sub-category pages, would be appreciated.

Below is the code:

import re
import time
import random
import selenium.webdriver.support.ui as ui
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from lxml import html
from bs4 import BeautifulSoup
from datetime import datetime
import csv
import os
from fake_useragent import UserAgent

# Function to write data to a file:
def write_to_file(file, mode, data, newline=None, with_tab=None):
    with open(file, mode, encoding='utf-8') as l:
        if with_tab == True:
            data = ''.join(data)
        if newline == True:
            data = data + '\n'
        l.write(data)

# Function for data from Module 1:
def send_link(link1):
    browser = webdriver.Chrome()
    browser.get(link1)
    current_page = browser.current_url
    print(current_page)
    soup = BeautifulSoup(browser.page_source, "lxml")
    tree = html.fromstring(str(soup))

    # Added try and except in order to skip/pass attributes without any value.
    try:
        main_category_url = browser.find_elements_by_xpath("//li[@class=\"univers-group-item\"]/span/a[1][@href]")
        main_category_url = [i.get_attribute("href") for i in main_category_url[4:]]
        print(len(main_category_url))
    except NoSuchElementException:
        main_category_url = ''

    for index, data in enumerate(main_category_url):
        with open('Module_1_OP.tsv', 'a', encoding='utf-8') as outfile:
            data = (main_category_url[index] + "\n")
            outfile.write(data)

    # Data Extraction for Categories under HEADERS:
    try:
        sub_category_url = browser.find_elements_by_xpath("//li[@class=\"category-group-item\"]/a[1][@href]")
        sub_category_url = [i.get_attribute("href") for i in sub_category_url[:]]
        print(len(sub_category_url))
    except NoSuchElementException:
        sub_category_url = ''

    for index, data in enumerate(sub_category_url):
        with open('Module_1_OP.tsv', 'a', encoding='utf-8') as outfile:
            data = (sub_category_url[index] + "\n")
            outfile.write(data)

    csvfile = open("Module_1_OP.tsv")
    csvfilelist = csvfile.readlines()
    send_link2(csvfilelist)

# Function for data from Module 2:
def send_link2(links2):
    browser = webdriver.Chrome()
    start = 7
    end = 10
    for link2 in (links2[start:end]):
        print(link2)

        ua = UserAgent()
        try:
            ua = UserAgent()
        except FakeUserAgentError:
            pass

        ua.random == 'Chrome'

        proxies = []

        t0 = time.time()
        response_delay = time.time() - t0
        time.sleep(10*response_delay)
        time.sleep(random.randint(2,5))
        browser.get(link2)
        current_page = browser.current_url
        print(current_page)
        soup = BeautifulSoup(browser.page_source, "lxml")
        tree = html.fromstring(str(soup))

        # Added try and except in order to skip/pass attributes without value.
        try:
            product_url = browser.find_elements_by_xpath('//ul[@class=\"category-grouplist\"]/li/a[1][@href]')
            product_url = [i.get_attribute("href") for i in product_url]
            print(len(product_url))
        except NoSuchElementException:
            product_url = ''

        try:
            product_title = browser.find_elements_by_xpath("//ul[@class=\"category-grouplist\"]/li/a[1][@href]")  # Use find_elements for extracting multiple section data
            product_title = [i.text for i in product_title[:]]
            print(product_title)
        except NoSuchElementException:
            product_title = ''

        for index, data2 in enumerate(product_title):
            with open('Module_1_2_OP.tsv', 'a', encoding='utf-8') as outfile:
                data2 = (current_page + "\t" + product_url[index] + "\t" + product_title[index] + "\n")
                outfile.write(data2)

        for index, data3 in enumerate(product_title):
            with open('Module_1_2_OP_URL.tsv', 'a', encoding='utf-8') as outfile:
                data3 = (product_url[index] + "\n")
                outfile.write(data3)

    csvfile = open("Module_1_2_OP_URL.tsv")
    csvfilelist = csvfile.readlines()
    send_link3(csvfilelist)

# Function for data from Module 3:
def send_link3(csvfilelist):
    browser = webdriver.Chrome()
    for link3 in csvfilelist[:3]:
        print(link3)
        browser.get(link3)
        time.sleep(random.randint(2,5))
        current_page = browser.current_url
        print(current_page)
        soup = BeautifulSoup(browser.page_source, "lxml")
        tree = html.fromstring(str(soup))

        try:
            pagination = browser.find_elements_by_xpath("//div[@class=\"pagination-wrapper\"]/a[@href]")
            pagination = [i.get_attribute("href") for i in pagination]
            print(pagination)
        except NoSuchElementException:
            pagination = ''

        for index, data2 in enumerate(pagination):
            with open('Module_1_2_3_OP.tsv', 'a', encoding='utf-8') as outfile:
                data2 = (current_page + "\n" + pagination[index] + "\n")
                outfile.write(data2)

    dataset = open("Module_1_2_3_OP.tsv")
    dataset_dup = dataset.readlines()
    duplicate(dataset_dup)

# Used to remove duplicate records from a List:
def duplicate(dataset):
    dup_items = set()
    uniq_items = []
    for x in dataset:
        if x not in dup_items:
            uniq_items.append(x)
            dup_items.add(x)
    write_to_file('Listing_pagination_links.tsv', 'w', dup_items, newline=True, with_tab=True)

    csvfile = open("Listing_pagination_links.tsv")
    csvfilelist = csvfile.readlines()
    send_link4(csvfilelist)

# Function for data from Module 4:
def send_link4(links3):
    browser = webdriver.Chrome()
    for link3 in links3:
        print(link3)
        browser.get(link3)
        t0 = time.time()
        response_delay = time.time() - t0
        time.sleep(10*response_delay)
        time.sleep(random.randint(2,5))
        sub_category_page = browser.current_url
        print(sub_category_page)
        soup = BeautifulSoup(browser.page_source, "lxml")
        tree = html.fromstring(str(soup))

        # Added try and except in order to skip/pass attributes without value.
        try:
            product_url1 = browser.find_elements_by_xpath('//div[@class=\"inset-caption price-container\"]/a[1][@href]')
            product_url1 = [i.get_attribute("href") for i in product_url1]
            print(len(product_url1))
        except NoSuchElementException:
            product_url1 = ''

        for index, data in enumerate(product_url1):
            with open('Final_Output_' + datestring + '.tsv', 'a', encoding='utf-8') as outfile:
                data = (sub_category_page + "\t" + product_url1[index] + "\n")
                outfile.write(data)

# PROGRAM STARTS EXECUTING FROM HERE...
# Added to attach Real Date and Time field to Output filename
datestring = datetime.strftime(datetime.now(), '%Y-%m-%d-%H-%M-%S')  # For filename
#datestring2 = datetime.strftime(datetime.now(), '%H-%M-%S')  # For each record

send_link("http://www.medicalexpo.com/")

Best answer

Actually, you don't need Selenium at all. The code below will fetch the category, sub-category and item links, names and descriptions for everything on the site.

The only tricky part is the while loop that handles the pagination. The principle is that as long as the site shows a "next" button, there is more content to load. In this case the site actually exposes the "next" link in a next element, so it is easy to keep iterating until there are no more next links to retrieve.

Keep in mind that this can take a while to run. Also keep in mind that you should probably insert a sleep - for example 1 second - between each request inside the while loop, to treat the server nicely.

Doing so lowers the risk of getting banned or the like.
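As an aside, here is a minimal sketch of that "be nice to the server" idea, assuming requests is used as in the answer code below; the polite_get helper and the one-second default delay are illustrative choices, not part of the original answer:

import time
import requests

def polite_get(url, session=None, delay=1.0):
    # Fetch a URL, then pause so consecutive requests in the
    # pagination loop are spaced out instead of hammering the server.
    session = session or requests.Session()
    response = session.get(url)
    time.sleep(delay)  # assumed 1-second pause between requests
    return response

# Inside the while loop below you would call r = polite_get(item_url)
# instead of r = requests.get(item_url).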

import requests
from bs4 import BeautifulSoup
from time import sleep

items_list = []  # list of dictionaries with this content: category, sub_category, item_description, item_name, item_link

r = requests.get("http://www.medicalexpo.com/")
soup = BeautifulSoup(r.text, "lxml")
cat_items = soup.find_all('li', class_="category-group-item")
cat_items = [[cat_item.get_text().strip(), cat_item.a.get('href')] for cat_item in cat_items]

# cat_items is now a list with elements like this:
# ['General practice', 'http://www.medicalexpo.com/cat/general-practice-K.html']
# to access the next level, we loop:

for category, category_link in cat_items[:1]:
    print("[*] Extracting data for category: {}".format(category))

    r = requests.get(category_link)  # fetch this category's page
    soup = BeautifulSoup(r.text, "lxml")
    # data of all sub_categories are located in an element with the id 'category-group'
    cat_group = soup.find('div', attrs={'id': 'category-group'})

    # the data lie in 'li'-tags
    li_elements = cat_group.find_all('li')
    sub_links = [[li.a.get('href'), li.get_text().strip()] for li in li_elements]

    # sub_links is now a list of elements like this:
    # ['http://www.medicalexpo.com/medical-manufacturer/stethoscope-2.html', 'Stethoscopes']

    # to access the last level we need to dig further in with a loop
    for sub_category_link, sub_category in sub_links:
        print(" [-] Extracting data for sub_category: {}".format(sub_category))
        local_count = 0
        load_page = True
        item_url = sub_category_link
        while load_page:
            print("  [-] Extracting data for item_url: {}".format(item_url))
            r = requests.get(item_url)
            soup = BeautifulSoup(r.text, "lxml")
            item_links = soup.find_all('div', class_="inset-caption price-container")[2:]
            for item in item_links:
                item_name = item.a.get_text().strip().split('\n')[0]
                item_link = item.a.get('href')
                try:
                    item_description = item.a.get_text().strip().split('\n')[1]
                except:
                    item_description = None
                item_dict = {
                    "category": category,
                    "subcategory": sub_category,
                    "item_name": item_name,
                    "item_link": item_link,
                    "item_description": item_description
                }
                items_list.append(item_dict)
                local_count += 1
            # all item pages have a pagination element
            # if there are more pages to load, it will have a "next" class
            # if we are on the last page, there will be no "next" class and "next_link" will be None
            pagination = soup.find(class_="pagination-wrapper")
            try:
                next_link = pagination.find(class_="next").get('href', None)
            except:
                next_link = None
            # consider inserting a sleep(1) right about here...
            # if next_link exists it means that there are more pages to load
            # we'll then set item_url = next_link and the while loop will continue
            if next_link is not None:
                item_url = next_link
            else:
                load_page = False
        print(" [-] a total of {} item_links extracted for this sub_category".format(local_count))

# this will yield a list of dicts like this one:

# {'category': 'General practice',
#  'item_description': 'Flac duo',
#  'item_link': 'http://www.medicalexpo.com/prod/boso-bosch-sohn/product-67891-821119.html',
#  'item_name': 'single-head stethoscope',
#  'subcategory': 'Stethoscopes'}

# If you need to export to something like Excel, use pandas: create a DataFrame and simply load it with the list.
# pandas can then export the data to Excel easily (see the sketch below).
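Following up on the pandas note in the closing comments, here is a minimal export sketch; it assumes pandas (plus an Excel writer such as openpyxl) is installed, and the output filename is arbitrary:

import pandas as pd

# items_list is the list of dicts built by the scraper above
df = pd.DataFrame(items_list)
df.to_excel("medicalexpo_items.xlsx", index=False)
# df.to_csv("medicalexpo_items.tsv", sep="\t", index=False) also works if TSV output is preferred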

Regarding "python - Web scraping issue while scraping pagination links from a website", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/51083617/
