python - Write a Python script that finds XPaths in HTML based on any class

Reposted. Author: 行者123. Updated: 2023-12-01 09:20:16

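In Python, I want the user to enter a URL at a console prompt (read the input and store it in a variable). For example, suppose the web page contains the following HTML: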

<html>
<head>
</head>
<body>
<div>
<h1 class="class_one">First heading</h1>
<p>Some text</p>
<div class="class_two">
<div class="class_three">
<div class="class_one">
<center class="class_two">
<h3 class="class_three">
</h3>
</center>
<center>
<h3 class="find_first_class">
Some text
</h3>
</center>
</div>
</div>
</div>
<div class="class_two">
<div class="class_three">
<div class="class_one">
<center class="class_two">
<h2 class="find_second_class">
</h2>
</center>
</div>
</div>
</div>
</div>
</body>
</html>

The CSV should then contain one row for each class found in the page's HTML (a class can appear more than once, so any given class may have several rows).
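A minimal sketch of that layout, assuming a two-column file (class name, XPath) with a header row. The column names and the helper `rows_to_csv` are illustrative and do not come from the original post. The three XPaths are the positions at which `class_one` occurs in the HTML above.

```python
import csv
import io

# Illustrative rows: class_one occurs three times in the sample HTML,
# so it gets three rows, one per element.
rows = [
    ("class_one", "/html/body/div/h1"),
    ("class_one", "/html/body/div/div[1]/div/div"),
    ("class_one", "/html/body/div/div[2]/div/div"),
]


def rows_to_csv(rows):
    """Serialize (class, xpath) pairs to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["class", "xpath"])  # assumed column names
    writer.writerows(rows)
    return buffer.getvalue()


print(rows_to_csv(rows))
```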

Now I want to generate an XPath for every class that appears on the page. This is what I have written so far:

# Python 2 code, as posted in the question.
import urllib2
from bs4 import BeautifulSoup

result = {}
user_url_list = raw_input("Please enter your urls separated by spaces : \n")
url_list = map(str, user_url_list.split())
for url in url_list:
    try:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page, 'html.parser')
        user_class_list = raw_input("Please enter the classes to parse for " + url + " separated by spaces : \n")
        class_list = map(str, user_class_list.split())
        for find_class in class_list:
            try:
                name_box = soup.find(attrs={'class': find_class})
                # Note: xpath_soup is not defined anywhere in this snippet.
                print(xpath_soup(name_box))
                break
            except:
                print("There was some error getting the xpath of class : " + find_class + " for url : " + url + "\n..trying next class now \n")
                continue
    except:
        print(url + " is not valid, please enter correct full url \n")
        continue
print(result)
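The outer `except` in the code above reports every failure as an invalid URL, including parsing errors unrelated to the URL. A minimal sketch of checking the shape of each URL before any fetch, assuming Python 3 and only the standard library. The helper name `split_valid_urls` is hypothetical, and accepting only `http`/`https` with a host is an assumption about what "correct full url" means.

```python
from urllib.parse import urlsplit


def split_valid_urls(raw):
    """Split whitespace-separated input into (valid, rejected) URL lists."""
    valid, rejected = [], []
    for candidate in raw.split():
        try:
            parts = urlsplit(candidate)
        except ValueError:
            # urlsplit raises ValueError for some malformed inputs.
            rejected.append(candidate)
            continue
        # Assumption: a "full" URL has an http(s) scheme and a host part.
        if parts.scheme in ("http", "https") and parts.netloc:
            valid.append(candidate)
        else:
            rejected.append(candidate)
    return valid, rejected


valid, rejected = split_valid_urls("https://example.com/a example.com http://b.org")
print(valid)
print(rejected)
```

This check only looks at the URL's syntax. A request to a well-formed URL can still fail, so the fetch itself still needs its own error handling.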

Best answer

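This is the try/except logic that Orhan mentioned. lxml parses the document passed to it, lets you address elements with XPath, and lets you extract their classes. After that, it only takes a simple check to see whether each class is among the ones you want. lxml can also reconstruct an element's original XPath through an ElementTree.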

import csv
import requests
from lxml import etree

target_url = input('Which url is to be scraped?')

page = '''
<html>
<head>
</head>
<body>
<div>
<h1 class="class_one">First heading</h1>
<p>Some text</p>
<div class="class_two">
<div class="class_three">
<div class="class_one">
<center class="class_two">
<h3 class="class_three">
</h3>
</center>
<center>
<h3 class="find_first_class">
Some text
</h3>
</center>
</div>
</div>
</div>
<div class="class_two">
<div class="class_three">
<div class="class_one">
<center class="class_two">
<h2 class="find_second_class">
</h2>
</center>
</div>
</div>
</div>
</div>
</body>
</html>
'''

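# To scrape the live page instead of the sample string above:
#response = requests.get(target_url)
#document = etree.HTML(response.content)
classes_list = ['find_first_class', 'find_second_class']
expressions = []

document = etree.fromstring(page)
# Build the tree from the document root so getpath() returns the full path.
# Wrapping each element in its own ElementTree would make that element the
# root and yield only '/h3' or '/h2'.
tree = etree.ElementTree(document)

for element in document.xpath('//*'):
    try:
        # Raises IndexError when the element has no class attribute.
        ele_class = element.xpath("@class")[0]
        print(ele_class)
        if ele_class in classes_list:
            expressions.append((ele_class, tree.getpath(element)))
    except IndexError:
        print("No class in this element.")
        continue

# newline='' keeps the csv module from writing blank lines between rows on Windows.
with open('test.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(expressions)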
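The question asks for an XPath for every class on the page, not only for a fixed list. The following is a minimal sketch of that variant. It swaps lxml for the standard library's `xml.etree.ElementTree`, so it only works on well-formed markup such as the sample above. For real-world HTML, lxml's HTML parser remains the safer choice. The function name `class_xpaths` is hypothetical. Splitting a `class` attribute on whitespace, so that `class="a b"` yields two rows, is an assumption about the desired output.

```python
import xml.etree.ElementTree as ET


def class_xpaths(markup):
    """Return (class_name, xpath) rows for every class token in the markup."""
    root = ET.fromstring(markup)
    # ElementTree elements carry no parent pointer, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}

    def xpath_of(element):
        steps = []
        while element is not None:
            parent = parents.get(element)
            step = element.tag
            if parent is not None:
                same_tag = [sib for sib in parent if sib.tag == element.tag]
                if len(same_tag) > 1:
                    # XPath positions are 1-based and count same-tag siblings only.
                    step += "[%d]" % (same_tag.index(element) + 1)
            steps.append(step)
            element = parent
        return "/" + "/".join(reversed(steps))

    rows = []
    for element in root.iter():  # document order
        for name in element.get("class", "").split():
            rows.append((name, xpath_of(element)))
    return rows


sample = "<html><body><div><p class='a b'>x</p><p class='a'>y</p></div></body></html>"
for row in class_xpaths(sample):
    print(row)
# ('a', '/html/body/div/p[1]')
# ('b', '/html/body/div/p[1]')
# ('a', '/html/body/div/p[2]')
```

The returned rows can be passed directly to `csv.writer(...).writerows`, as in the answer's final step.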

For more on python - Write a Python script that finds XPaths in HTML based on any class, see this similar question on Stack Overflow: https://stackoverflow.com/questions/50846570/
