python - Error when scraping the inner tag of an HTML element with Python

Reposted. Author: 行者123. Updated: 2023-12-01 09:27:11

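I recently worked on an exercise in which I fetched the full source of a web page. I am interested in the <area> tags, and specifically in their onclick attribute. How can I extract the onclick attribute from a specific element? The data I extracted looks like this: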

<area class="borderimage" coords="21.32,14.4,933.96,180.56" href="javascript:void(0);" onclick="return show_pop('78545','51022929357','1')" onmouseover="borderit(this,'black','<b>इंदौर, गुरुवार, 10 मई , 2018  <b><br><bआप पढ़ रहे हैं देश का सबसे व...')" onmouseout="borderit(this,'white')" alt="<b>इंदौर, गुरुवार, 10 मई , 2018  <b><br><bआप पढ़ रहे हैं देश का सबसे व..." shape="rect">

The onclick attribute is what I need. This is the code I have written so far, but it does not produce anything useful:

import requests
from lxml import html

paper_url = 'http://epaper.bhaskar.com/indore/129/10052018/mpcg/1/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}

# Total number of pages available in this edition
page = requests.get(paper_url, headers=headers)
page_response = page.text
parser = html.fromstring(page_response)
XPATH_Total_Pages = '//div[contains(@class,"fs12 fta w100 co_wh pdt5")]//text()'
raw_total_pages = parser.xpath(XPATH_Total_Pages)
lastpage=raw_total_pages[-1]
print(int(lastpage))
finallastpage=int(lastpage)
reviews_list = []
XPATH_PRODUCT_NAME = '//map[contains(@name,"Mapl")]'

#XPATH_PRODUCT_PRICE = '//span[@id="priceblock_ourprice"]/text()'

#raw_product_price = parser.xpath(XPATH_PRODUCT_PRICE)
#product_price = raw_product_price
raw_product_name = parser.xpath(XPATH_PRODUCT_NAME)

XPATH_REVIEW_SECTION_2 = '//area[@class="borderimage"]'
reviews = parser.xpath(XPATH_REVIEW_SECTION_2)

product_name =raw_product_name
#result = product_name.find(',')
#finalproductname = slice[0:product_name]
print(product_name)
print(reviews)

for review in reviews:
    #soup = BeautifulSoup(str(review), "html.parser")
    #parser2.feed(str(review))
    #allattr = [tag.attrs for tag in review.findAll('onclick')]
    #print(allattr)

    XPATH_RATING = './/area[@data-hook="onclick"]'

    raw_review_rating = review.xpath(XPATH_RATING)
    #cleaning data
    print(raw_review_rating)

Best answer

If I understand correctly, you need the onclick attribute of every <area> tag on the page.

Try something like this:

import requests
from bs4 import BeautifulSoup

TAG_NAME = 'area'
ATTR_NAME = 'onclick'

url = 'http://epaper.bhaskar.com/indore/129/10052018/mpcg/1/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

# the page has 3 <area> tags; collect the onclick value of each one that has the attribute
area_onclick_attrs = [x[ATTR_NAME] for x in soup.find_all(TAG_NAME, attrs={ATTR_NAME: True})]
print(area_onclick_attrs)

Output:

[
"return show_pophead('78545','51022929357','1')",
"return show_pop('78545','51022928950','4')",
"return show_pop('78545','51022929357','1')",
]
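
Since the question uses lxml rather than BeautifulSoup, the same result can probably be reached without switching libraries. The sketch below is an illustration, not the original answer's method: it runs against a small inline sample (SAMPLE_HTML, a shortened, made-up stand-in for the real page, so no network request is needed) and shows two ways to read the attribute. It also shows why the XPath in the question likely returned empty lists: `.//area[@data-hook="onclick"]` searches for nested <area> elements that carry a data-hook attribute, whereas onclick is an attribute of the <area> element already selected.

```python
from lxml import html

# Shortened, hypothetical sample modelled on the <area> element in the question.
SAMPLE_HTML = """
<map name="Mapl">
  <area class="borderimage" href="javascript:void(0);"
        onclick="return show_pop('78545','51022929357','1')" shape="rect">
  <area class="borderimage" href="javascript:void(0);"
        onclick="return show_pop('78545','51022928950','4')" shape="rect">
  <area class="borderimage" href="javascript:void(0);" shape="rect">
</map>
"""

tree = html.fromstring(SAMPLE_HTML)

# Option 1: select the attribute values directly with XPath.
# Elements without an onclick attribute are simply skipped.
onclick_values = tree.xpath('//area[@class="borderimage"]/@onclick')

# Option 2: select the elements, then read the attribute from each one.
# .get() returns None when the attribute is missing.
areas = tree.xpath('//area[@class="borderimage"]')
onclick_per_area = [area.get('onclick') for area in areas]

# The XPath from the question, applied to each <area>: it looks for
# descendant <area> elements with a data-hook attribute, so nothing matches.
nested_matches = [area.xpath('.//area[@data-hook="onclick"]') for area in areas]

print(onclick_values)
print(onclick_per_area)
print(nested_matches)
```

Option 1 is the shorter form when only the attribute values are needed; Option 2 keeps the element at hand in case other attributes such as coords are wanted as well.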

Regarding "python - Error when scraping the inner tag of an HTML element with Python", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/50287283/
