
python - Web scraping with BeautifulSoup and collecting table text values

Reposted. Author: 行者123. Updated: 2023-12-01 07:26:01

My code is below; it collects data from the NSE website. Basically, I want to collect two pieces of information:

  1. What the Announcement Subject is.
  2. Whether a PDF file is available and, if so, print its link.

I am able to get the PDF link, but I am unable to read the Announcement Subject, which is:

MIC Electronics Limited has informed the Exchange regarding 'Resolution Plan of M/s. Cosyn Consortium in the matter of M/s. MIC Electronics Limited has been approved by Hon'ble NCLT, Hyderabad Bench

Any help would be appreciated.

import requests
import json
import bs4

base_url = 'https://www.nseindia.com'
url = 'https://www.nseindia.com/corporates/directLink/latestAnnouncementsCorpHome.jsp'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

response = requests.get(url, headers=headers)
jsonStr = response.text.strip()
keys_needing_quotes = ['company:','date:','desc:','link:','symbol:']

for key in keys_needing_quotes:
    jsonStr = jsonStr.replace(key, '"%s":' %(key[:-1]))

data = json.loads(jsonStr)
data = data['rows']
# print(data)

symbol_list = ['MIC']
for x in range(0, len(data)):
    if data[x]['symbol'] in symbol_list:
        response = requests.get(base_url + data[x]['link'], headers=headers)
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        print(soup)

        try:
            pdf_file = base_url + soup.find_all('a', href=True)[0]['href']
            print("File_Link:", pdf_file)
        except:
            print('PDF not found')

Best answer

Alternatively, you can use:

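# Find the header cell whose text mentions "Announcement"
for s in soup.find_all('td', 'tablehead'):
    if 'Announcement' in s.text:
        break

# The subject text is in the cell right after that header cell
print(s.find_next_sibling().text)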
# output:
# MIC Electronics Limited has informed the Exchange regarding 'Resolution Plan of M/s. Cosyn Consortium in the matter of M/s. MIC Electronics Limited has been approved by Hon'ble NCLT, Hyderabad Bench
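
The loop above leaves `s` pointing at the last header cell if no cell matches, so it may print the wrong value. A slightly more defensive version could look like the sketch below. The sample markup, the helper names `extract_subject` and `extract_pdf_link`, and the idea of filtering links by a `.pdf` suffix are all assumptions made for illustration; only the `tablehead` class and the sibling lookup come from the answer, and the real NSE page may be laid out differently.

```python
import bs4

# Hypothetical sample markup; the real NSE page layout is an assumption.
SAMPLE_HTML = (
    '<table>'
    '<tr><td class="tablehead">Symbol</td><td>MIC</td></tr>'
    '<tr><td class="tablehead">Announcement Subject</td>'
    '<td> Resolution plan approved </td></tr>'
    '<tr><td class="tablehead">Attachment</td>'
    '<td><a href="/corporate/sample.pdf">PDF</a></td></tr>'
    '</table>'
)


def extract_subject(soup, label='Announcement'):
    """Return the text of the cell following the header cell containing `label`."""
    for cell in soup.find_all('td', class_='tablehead'):
        if label in cell.get_text():
            value_cell = cell.find_next_sibling('td')
            # Return None rather than failing if the header has no value cell
            return value_cell.get_text(strip=True) if value_cell else None
    return None  # no matching header cell found


def extract_pdf_link(soup, base_url=''):
    """Return the first link whose href ends in .pdf, or None if there is none."""
    for anchor in soup.find_all('a', href=True):
        if anchor['href'].lower().endswith('.pdf'):
            return base_url + anchor['href']
    return None


soup = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')
subject = extract_subject(soup)
pdf_link = extract_pdf_link(soup, 'https://www.nseindia.com')
print('Subject:', subject)
print('File_Link:', pdf_link if pdf_link else 'PDF not found')
```

Returning `None` instead of raising lets the caller decide how to report a missing subject or attachment, which avoids the bare `except` in the question's code.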

Regarding "python - Web scraping with BeautifulSoup and collecting table text values", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57448563/
