
python - XPath selector cannot select an HTML block

Reposted. Author: 太空宇宙. Updated: 2023-11-03 20:51:20

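I am trying to extract some data from alibaba.com, and I am using Scrapy to do it. Most of it works, but the selector does not seem to pick up the block of markup from the company profile. Can anyone help me solve this?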

# -*- coding: utf-8 -*-
import scrapy
import csv
import os
import numpy as np


class AlibabaCrawlerSpider(scrapy.Spider):
    name = 'alibaba_crawler'
    allowed_domains = ['alibaba.com']
    start_urls = ['http://alibaba.com/']
    delimiter = '|'

    def start_requests(self):
        """Read keywords from the keywords file and construct the search URL"""
        with open(os.path.join(os.path.dirname(__file__), "../resources/keywords.csv")) as search_keywords:
            for keyword in csv.DictReader(search_keywords):
                search_text = keyword["keyword"]
                url = "https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText={0}&viewtype=G".format(
                    search_text)
                # The meta is used to send our search text into the parser as metadata
                yield scrapy.Request(url, callback=self.parse, meta={"search_text": search_text})

    def parse(self, response):
        """Function to process alibaba search results page"""
        search_keyword = response.meta["search_text"]
        products = response.xpath("//div[@class='item-main']")

        # Defining the XPaths
        XPATH_PRODUCT_LINK = ".//div[@class='item-info']//h2/a/@href"

        # Iterating over search results
        for product in products:
            raw_product_link = product.xpath(XPATH_PRODUCT_LINK).extract()
            print(raw_product_link)

            # Skip results without a link; scrapy.Request cannot be built from a None URL
            if not raw_product_link:
                continue
            product_link = "https:" + raw_product_link[0]
            yield scrapy.Request(product_link, callback=self.parse_product)

            # Stop after the first product link
            break

    def parse_product(self, response):
        product = response.xpath("//div[@class='content-body']")

        # Defining the XPaths
        # Deeper levels tried after this one: //div[@class='alisite'], td[@class='field-title']/text()
        XPATH_COMPANY_FIELD = ".//div[@class='tab-body']//div[contains(@class,'ls-company')]"

        raw_company_field = product.xpath(XPATH_COMPANY_FIELD)  # .extract()
        print(raw_company_field)

I am trying to print raw_company_field, and up to this point it works. But when I go down to the next levels, for example alisite and the others, I get an empty list.

Best answer

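XPath does not test classes that way.

A selector such as //div[@class='tab-body'] only matches elements whose class attribute is exactly tab-body, with no other classes. To select every element that carries the class, whatever other classes it has, you can write: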

//div[contains(concat(' ',normalize-space(@class),' '),' tab-body ')]
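The difference between the two kinds of match can be seen in a small, self-contained sketch. It uses the standard library's xml.etree.ElementTree rather than Scrapy, because ElementTree supports the exact-match predicate but not concat() or normalize-space(), so the token test is written in Python instead. The markup and the has_class helper are hypothetical and only loosely modelled on the question; they are not the real alibaba.com page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one tab-body div has an extra class, the other does not.
SAMPLE = """
<div class="content-body">
  <div class="tab-body active">
    <div class="ls-company">Company profile</div>
  </div>
  <div class="tab-body">
    <div class="ls-company">Other tab</div>
  </div>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

# Exact attribute comparison: matches only class="tab-body", not "tab-body active".
exact = root.findall(".//div[@class='tab-body']")


def has_class(element, name):
    """Return True if `name` is one of the whitespace-separated class tokens."""
    return name in element.get("class", "").split()


# Token comparison: the same idea as the concat/normalize-space expression above.
by_token = [el for el in root.iter("div") if has_class(el, "tab-body")]

print(len(exact))     # 1
print(len(by_token))  # 2
```

The exact comparison misses the div with class="tab-body active", which is the kind of mismatch that would make a deeper selector return an empty list.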

Or use a CSS selector instead:

div.tab-body
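If the XPath form is preferred, retyping the long expression for every class is error-prone. A minimal sketch of one way around that is a small helper that builds the predicate as a string; class_predicate is a hypothetical name, not part of Scrapy, and the rebuilt selector simply mirrors XPATH_COMPANY_FIELD from the question with both class tests made token-based.

```python
def class_predicate(name):
    """Build an XPath 1.0 predicate that matches `name` as a whole class token."""
    return "contains(concat(' ', normalize-space(@class), ' '), ' {0} ')".format(name)


# The question's selector, rebuilt so that extra classes on either div are tolerated.
XPATH_COMPANY_FIELD = ".//div[{0}]//div[{1}]".format(
    class_predicate("tab-body"),
    class_predicate("ls-company"),
)

print(XPATH_COMPANY_FIELD)
```

The resulting string can be passed to product.xpath(...) in parse_product in place of the original expression. Whether it then returns the deeper nodes still depends on the markup the server actually sends.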

Regarding python - XPath selector cannot select an HTML block, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56298911/
