
python - Error getting table data from a website


I am trying to scrape some stock-related data from the web for my project, and I have run into a couple of problems.
Problem 1:
I am trying to grab the table from this site: http://sharesansar.com/c/today-share-price.html
It works, but the columns are not scraped in order. For example, the 'Company Name' column ends up holding the 'Open price' values. How can I fix this?
Problem 2:
I am also trying to grab company-specific data from http://merolagani.com/CompanyDetail.aspx?symbol=ADBL, under the 'Price History' tab.
This time I get an error while scraping the table data. The error I get is:

self.data[key].append(cols[index].get_text())

IndexError: list index out of range

The code looks like this:

import logging
import requests
from bs4 import BeautifulSoup
import pandas


module_logger = logging.getLogger('mainApp.dataGrabber')


class DataGrabberTable:
    ''' Grabs the table data from a certain url. '''

    def __init__(self, url, csvfilename, columnName=[], tableclass=None):
        module_logger.info("Inside 'DataGrabberTable' constructor.")
        self.pgurl = url
        self.tableclass = tableclass
        self.csvfile = csvfilename
        self.columnName = columnName

        self.tableattrs = {'class': tableclass}  # to be passed in find()

        module_logger.info("Done.")

    def run(self):
        '''Call this to run the datagrabber. Returns 1 if error occurs.'''

        module_logger.info("Inside 'DataGrabberTable.run()'.")

        try:
            self.rawpgdata = (requests.get(self.pgurl, timeout=5)).text
        except Exception as e:
            module_logger.warning('Error occured: {0}'.format(e))
            return 1

        #module_logger.info('Headers from the server:\n {0}'.format(self.rawpgdata.headers))

        soup = BeautifulSoup(self.rawpgdata, 'lxml')

        module_logger.info('Connected and parsed the data.')

        table = soup.find('table', attrs=self.tableattrs)
        rows = table.find_all('tr')[1:]

        # initializing a dict in the format below
        # data = {'col1' : [...], 'col2' : [...], }
        # col1 and col2 are from the columnName list
        self.data = {}
        self.data = dict(zip(self.columnName, [list() for i in range(len(self.columnName))]))

        module_logger.info('Inside for loop.')
        for row in rows:
            cols = row.find_all('td')
            index = 0
            for key in self.data:
                if index > len(cols): break
                self.data[key].append(cols[index].get_text())
                index += 1
        module_logger.info('Completed the for loop.')

        self.dataframe = pandas.DataFrame(self.data)  # make pandas dataframe

        module_logger.info('writing to file {0}'.format(self.csvfile))
        self.dataframe.to_csv(self.csvfile)
        module_logger.info('written to file {0}'.format(self.csvfile))

        module_logger.info("Done.")
        return 0

    def getData(self):
        """"Returns 'data' dictionary."""
        return self.data


# Usage example

def main():
    url = "http://sharesansar.com/c/today-share-price.html"
    classname = "table"
    fname = "data/sharesansardata.csv"
    cols = [str(i) for i in range(18)]  # make a list of columns

    '''cols = [
        'S.No', 'Company Name', 'Symbol', 'Open price', 'Max price',
        'Min price', 'Closing price', 'Volume', 'Previous closing',
        'Turnover', 'Difference',
        'Diff percent', 'Range', 'Range percent', '90 days', '180 days',
        '360 days', '52 weeks high', '52 weeks low']'''

    d = DataGrabberTable(url, fname, cols, classname)
    if d.run() is 1:
        print('Data grabbing failed!')
    else:
        print('Data grabbing done.')


if __name__ == '__main__':
    main()

Any suggestions would be helpful. Thanks!

Best Answer

Your cols list is one element short: there are 19 columns, not 18:

>>> len([str(i) for i in range(18)])
18
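There is also an off-by-one in run(): the check if index > len(cols) still lets index == len(cols) through, and cols[index] then raises exactly the IndexError reported in Problem 2 (on the merolagani page each row keeps its label in a th cell, so find_all('td') returns fewer cells than the dict has keys). A minimal patch to the original script, as a sketch:

# 19 placeholder column names, not 18
cols = [str(i) for i in range(19)]

# ...and in DataGrabberTable.run(), stop before index reaches len(cols):
for key in self.data:
    if index >= len(cols):  # '>' let index == len(cols) slip through
        break
    self.data[key].append(cols[index].get_text())
    index += 1

Note too that for key in self.data walks the dict in whatever key order the interpreter keeps; dicts only preserve insertion order from Python 3.7 on, so on older interpreters that alone can scramble the columns as described in Problem 1.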

Also, you seem to be overcomplicating things. The following should do (it reads the column names straight from the table's header row, so they always line up with the data):

import requests
from bs4 import BeautifulSoup
import pandas as pd

price_response = requests.get('http://sharesansar.com/c/today-share-price.html')
price_table = BeautifulSoup(price_response.text, 'lxml').find('table', {'class': 'table'})
price_rows = [[cell.text for cell in row.find_all(['th', 'td'])] for row in price_table.find_all('tr')]
price_df = pd.DataFrame(price_rows[1:], columns=price_rows[0])

com_df = None
for symbol in price_df['Symbol']:
    comp_response = requests.get('http://merolagani.com/CompanyDetail.aspx?symbol=%s' % symbol)
    comp_table = BeautifulSoup(comp_response.text, 'lxml').find('table', {'class': 'table'})
    com_header, com_value = list(), list()
    for tbody in comp_table.find_all('tbody'):
        comp_row = tbody.find('tr')
        com_header.append(comp_row.find('th').text.strip().replace('\n', ' ').replace('\r', ' '))
        com_value.append(comp_row.find('td').text.strip().replace('\n', ' ').replace('\r', ' '))
    df = pd.DataFrame([com_value], columns=com_header)
    com_df = df if com_df is None else pd.concat([com_df, df])

print(price_df)
print(com_df)
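To mirror the CSV output of the original script, both frames can then be written out with DataFrame.to_csv; the second file path here is a placeholder, not anything mandated by the question:

price_df.to_csv('data/sharesansardata.csv', index=False)  # same path the asker used
com_df.to_csv('data/company_details.csv', index=False)    # hypothetical path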

Regarding "python - Error getting table data from a website", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45485290/
