
python - Selenium - Scraping multiple URLs with the same content but slightly different XPaths

Reposted. Author: 太空宇宙. Updated: 2023-11-03 20:38:45

I am using Selenium to scrape the same table from multiple URLs, but the table's XPath differs slightly from page to page.

Here is my code:

from selenium import webdriver

my_urls = ["https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001548760",
           "https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001366010",
           "https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001164390"]

driver = webdriver.Chrome()
for url in my_urls:
    driver.get(url)
    # The XPath is left empty here: which expression to use is the question.
    export_table = driver.find_elements_by_xpath('')[0]
    export_table.text

xpath1:/html/body/div/table[1]/tbody/tr[2]/td/table/tbody/tr[3]/td/table/tbody

xpath2:/html/body/div/table[1]/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody

How can I use a single XPath to extract the content from all of these URLs, and export all of the results to a dictionary?

Thanks for your help!

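Best answer

If you want the text from each XPath, try the following. If you want one path per URL instead, use a dictionary to map each URL to its XPath; you can then iterate over that dictionary to do whatever you need.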

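import json

from selenium import webdriver
from selenium.webdriver.common.by import By

my_urls = ["https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001548760",
           "https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001366010",
           "https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001164390"]

xpath1 = "/html/body/div/table[1]/tbody/tr[2]/td/table/tbody/tr[3]/td/table/tbody"
xpath2 = "/html/body/div/table[1]/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody"


def getpath(element):
    # Return the text of the first matched element, or None if nothing matched.
    try:
        return element[0].text
    except IndexError:
        return None


export_table = {}

# Selenium 4 style: lookups go through find_elements(By.XPATH, ...), which
# replaces the removed find_elements_by_xpath helper. Recent Selenium 4
# releases can also locate the driver binary themselves.
driver = webdriver.Chrome()
for url in my_urls:
    driver.get(url)
    # Try both XPaths on every page; a path that does not match yields None.
    export_table[url] = {path: getpath(driver.find_elements(By.XPATH, path))
                         for path in [xpath1, xpath2]}

# quit() ends the whole browser session, not just the current window.
driver.quit()

print(json.dumps(export_table, indent=4))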

Output

{
    "https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001548760": {
        "/html/body/div/table[1]/tbody/tr[2]/td/table/tbody/tr[3]/td/table/tbody": "Issuer Filings Transaction Date Type of Owner\\nFacebook Inc 0001326801 2019-04-26 director, 10 percent owner, officer: COB and CEO",
        "/html/body/div/table[1]/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody": "Mailing Address\\nC/O FACEBOOK, INC.\\n1601 WILLOW ROAD\\nMENLO PARK CA 94025"
    },
    "https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001366010": {
        "/html/body/div/table[1]/tbody/tr[2]/td/table/tbody/tr[3]/td/table/tbody": "Issuer Filings Transaction Date Type of Owner\\nFacebook Inc 0001326801 2019-07-08 director, officer: Chief Operating Officer\\nSVMK Inc. 0001739936 2019-02-21 director\\nWALT DISNEY CO/\\nCurrent Name:TWDC Enterprises 18 Corp. 0001001039 2017-11-22 director\\nSTARBUCKS CORP 0000829224 2011-11-14 director\\neHealth, Inc. 0001333493 2008-06-10 director",
        "/html/body/div/table[1]/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody": "Mailing Address\\n1 FACEBOOK WAY\\nMENLO PARK CA 94025"
    },
    "https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001164390": {
        "/html/body/div/table[1]/tbody/tr[2]/td/table/tbody/tr[3]/td/table/tbody": null,
        "/html/body/div/table[1]/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody": "Issuer Filings Transaction Date Type of Owner\\nACE LTD\\nCurrent Name:Chubb Ltd 0000896159 2019-06-06 officer: Executive Vice President*"
    }
}
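
The output above shows that the "Issuer" table sits under the tr[3] path on two pages and under the tr[2] path on the third, so each URL still carries one unwanted or empty entry. One possible way to reduce this to a single table text per URL is to try the candidate XPaths in order and keep the first match whose text starts with the expected header. The sketch below is an illustration only: pick_table_text, find_texts, and sample_pages are hypothetical names that do not appear in the original answer, and the sample data is copied from the output above so the logic can run without a browser.

```python
from typing import Callable, Dict, Iterable, List, Optional

XPATH1 = "/html/body/div/table[1]/tbody/tr[2]/td/table/tbody/tr[3]/td/table/tbody"
XPATH2 = "/html/body/div/table[1]/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody"

URL_A = "https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001548760"
URL_C = "https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001164390"


def pick_table_text(
    find_texts: Callable[[str], List[str]],
    candidate_xpaths: Iterable[str],
    expected_header: str = "Issuer",
) -> Optional[str]:
    """Return the first matched text that starts with expected_header.

    find_texts is a hypothetical callback that maps an XPath to the texts of
    the elements it matches on the current page (an empty list if none).
    """
    for xpath in candidate_xpaths:
        for text in find_texts(xpath):
            if text.startswith(expected_header):
                return text
    return None


# Stand-in for live pages (assumption: texts copied from the output above).
sample_pages: Dict[str, Dict[str, List[str]]] = {
    URL_A: {
        XPATH1: ["Issuer Filings Transaction Date Type of Owner\n"
                 "Facebook Inc 0001326801 2019-04-26 director, "
                 "10 percent owner, officer: COB and CEO"],
        XPATH2: ["Mailing Address\nC/O FACEBOOK, INC.\n"
                 "1601 WILLOW ROAD\nMENLO PARK CA 94025"],
    },
    URL_C: {
        XPATH1: [],  # this path matched nothing on the third page
        XPATH2: ["Issuer Filings Transaction Date Type of Owner\nACE LTD\n"
                 "Current Name:Chubb Ltd 0000896159 2019-06-06 "
                 "officer: Executive Vice President*"],
    },
}

# One entry per URL: the text of whichever candidate held the Issuer table.
results = {
    url: pick_table_text(lambda xp, page=page: page.get(xp, []), [XPATH1, XPATH2])
    for url, page in sample_pages.items()
}
```

With a live driver, find_texts could plausibly be written as lambda xp: [e.text for e in driver.find_elements(By.XPATH, xp)], called after driver.get(url). Matching on the header text rather than on position is a design choice that assumes the header wording is stable across pages.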

Regarding python - Selenium - Scraping multiple URLs with the same content but slightly different XPaths, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56994059/
