selenium - Problems getting specific items and formatting them with Selenium?

Reposted. Author: 行者123. Updated: 2023-12-03 15:55:17

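I am scraping a website that contains some tables. Specifically, I want to extract the first column (presentation) from every table (if present), together with the company name (located at this XPath: .//*[@id='accordion']//h3), like this (two-dimensional format):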

['Mission Pharmacal (Reverified 01/21/2015)' , '250 mg (NDC 01780-500-01)']
['Hospira, Inc. (Reverified 11/07/2016)', '5 mEq/mL; 20 mL vial (NDC 0409-6043-01)']
['Shire US Inc. (Reverified 07/01/2016)', 'AGRYLIN® (anagrelide hydrochloride) Dosage Form: 0.5 mg capsules for oral administration (NDC 54092-063-01)']
['Teva Pharmaceuticals (Reverified 11/01/2016)', '1mg 100 (NDC 00172-5240-60)']
['Teva Pharmaceuticals (Reverified 11/01/2016)', '0.5 mg 10 (NDC 00172-5241-60)']
['Jazz Pharmaceuticals, Inc. (Revised 11/14/2016)', 'ERWINAZE 10,000 IU lyophilized powder supplied in a clear 3 mL glass vial 5 vial carton (NDC 57902-249-05)']
[' Jazz Pharmaceuticals, Inc. (Revised 11/14/2016)', 'ERWINAZE 10,000 IU lyophilized powder supplied in a clear 3 mL glass vial 1 vial (NDC 57902-249-01)']

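So far I have tried the following. However, I don't know how to reshape the lists, and I don't understand why I am not capturing some of the hidden items in the accordion.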

In:

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://www.accessdata.fda.gov/scripts/drugshortages/default.cfm')
links = driver.find_elements_by_xpath('''.//*[@id='tabs-1']//tbody//td[1]//a[2]''')
links = [x.get_attribute('href') for x in links]

lis = list()
for x in links:
    driver.get(x)
    #.//*[@id='accordion']//div//table

    xpath_list = ['.//*[@id="accordion"]//div//tr//td[1]', ".//*[@id='accordion']//h3//a"]
    full_content = [[x.text for x in driver.find_elements_by_xpath(xpath)] for xpath in xpath_list]
    lis.append(full_content)

lis

Out:

[[['250 mg (NDC 01780-500-01)'], []],
[['5 mEq/mL; 20 mL vial (NDC 0409-6043-01)'], []],
[['AGRYLIN® (anagrelide hydrochloride) Dosage Form: 0.5 mg capsules for oral administration (NDC 54092-063-01)',
'',
''],
['Shire US Inc. (Reverified 07/01/2016)',
'Teva Pharmaceuticals (Reverified 11/01/2016)']],
[['ERWINAZE 10,000 IU lyophilized powder supplied in a clear 3 mL glass vial 5 vial carton (NDC 57902-249-05)',
'ERWINAZE 10,000 IU lyophilized powder supplied in a clear 3 mL glass vial 1 vial (NDC 57902-249-01)'],
['Jazz Pharmaceuticals, Inc. (Revised 11/14/2016)']],
[['0.4 mg/mL, 1 mL single-dose vial, package of 25 (NDC 00517-0401-25)',
'1 mg/mL, 1 mL single-dose vial, package of 25 (NDC 00517-1010-25)',
'',
'',
'',
'',
'',
''],......

Best answer

import requests
from lxml.html import fromstring

r = requests.get('http://www.accessdata.fda.gov/scripts/drugshortages/dsp_ActiveIngredientDetails.cfm?AI=Atropine%20Sulfate%20Injection&st=c&tab=tabs-1')
html = fromstring(r.text)

In:

[i.text_content().strip() for i in html.xpath('//div[@id="accordion"]//h3')]

Out:

['American Regent/Luitpold (Reverified 11/10/2016)',
'Amphastar Pharmaceuticals, Inc./IMS (Reverified 08/18/2016)',
'Hospira, Inc. (Revised 11/07/2016)',
'West-Ward Pharmaceuticals (Revised 05/02/2016)']

In:

[i.xpath('.//td[1]//text()') for i in html.xpath('//div[@id="accordion"]//tbody')]

Out:

[['0.4 mg/mL, 1 mL single-dose vial, package of 25\r\n(NDC 00517-0401-25)',
'1 mg/mL, 1 mL single-dose vial, package of 25 (NDC 00517-1010-25)'],
['0.1 mg/mL; 10 mL Luer-Jet Prefilled Syringe\r\n(NDC 76329-3339-1, Old NDC 0548-3339-00) \r\n'],
['0.1 mg/mL; 10 mL Ansyr syringe\r\n(NDC 0409-1630-10)',
'0.05 mg/mL; 5 mL Ansyr syringe\r\n(NDC 0409-9630-05)',
'0.1 mg/mL; 5 mL Lifeshield syringe\r\n(NDC 0409-4910-34)',
'0.1 mg/mL; 10 mL Lifeshield syringe\r\n(NDC 0409-4911-34)'],
['0.4 mg/mL, 20 mL vial (NDC 0641-6006-10)\r\n']]

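I used lxml's XPath; I hope this helps. By the way, nested list comprehensions are really hard to read. Maybe you could build the lists separately and then zip them together.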
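One way to combine the two results above into the two-dimensional format the question asks for is sketched below. This is only an illustration, not part of the original answer: the helper name `pair_rows` is hypothetical, and it assumes the company headings and the table bodies come back in the same order with one table per heading, which holds for the sample output shown but may not hold for every page. It also collapses the `\r\n` sequences and drops empty strings. The empty strings in the Selenium output are probably there because `WebElement.text` returns only visible text, so collapsed accordion panels come back empty, whereas lxml reads the markup directly.

```python
def pair_rows(companies, presentations_per_company):
    """Build [company, presentation] pairs, one per table row.

    Hypothetical helper: assumes companies[i] is the heading that belongs
    to presentations_per_company[i] (same order, same length).
    """
    rows = []
    for company, presentations in zip(companies, presentations_per_company):
        for text in presentations:
            # Collapse "\r\n" and runs of spaces into single spaces.
            cleaned = " ".join(text.split())
            if cleaned:  # skip empty cells
                rows.append([company.strip(), cleaned])
    return rows


# Sample values copied from the outputs shown above.
companies = [
    'American Regent/Luitpold (Reverified 11/10/2016)',
    'West-Ward Pharmaceuticals (Revised 05/02/2016)',
]
presentations = [
    ['0.4 mg/mL, 1 mL single-dose vial, package of 25\r\n(NDC 00517-0401-25)',
     '1 mg/mL, 1 mL single-dose vial, package of 25 (NDC 00517-1010-25)'],
    ['0.4 mg/mL, 20 mL vial (NDC 0641-6006-10)\r\n'],
]

rows = pair_rows(companies, presentations)
for row in rows:
    print(row)
```

With the real page, the two arguments would be the results of the two XPath queries from the answer. If the number of headings and tables ever differs, `zip` silently truncates to the shorter list, so checking the lengths first would be a sensible precaution.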

Regarding "selenium - Problems getting specific items and formatting them with Selenium?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/40602450/
