
python - How do I scrape data from a text table using Python?

Reposted. Author: 太空狗. Updated: 2023-10-30 02:34:53

I have the following text, and I want to scrape the data items and save them in Excel. Is there a way to do this in Python?

text = """
ANNUAL COMPENSATION LONG-TERM COMPENSATION
--------------------------------------- -------------------------------------
AWARDS PAYOUTS
-------------------------- ----------
SECURITIES
OTHER RESTRICTED UNDERLYING ALL OTHER
NAME AND PRINCIPAL ANNUAL STOCK OPTIONS/ LTIP COPMPENSA-
POSITION YEAR SALARY ($) BONUS ($) COMPENSATION ($) AWARD(S) ($)(1) SAR'S (#) PAYOUTS($) TION($)(3)
------------------ ---- ---------- --------- ---------------- --------------- ---------- ---------- ----------
JOHN W. WOODS 1993 $595,000 $327,250 There is no $203,190.63 18,000 $ 29,295
Chairman, President, & 1992 $545,000 $245,250 compensation 166,287.50 18,825 (2) Not $ 29,123
Chief Executive Officer 1991 $515,000 $283,251 required to be 45,000 Applicable
of AmSouth & AmSouth disclosed in
Bank N.A. this column.
C. STANLEY BAILEY 1993 $266,667(4) $133,333 117,012.50 4,500 $ 11,648
Vice Chairman, AmSouth 1992 $210,000 $ 84,000 42,400.00 4,800 $ 12,400
& AmSouth Bank N.A. 1991 $186,750 $ 82,170 161,280.00 9,750
C. DOWD RITTER 1993 $266,667(4) $133,333 117,012.50 4,500 $ 13,566
Vice Chairman, AmSouth 1992 $210,000 $ 84,000 42,400.00 4,800 $ 12,920
& AmSouth Bank N.A. 1991 $188,625 $ 82,995 161,280.00 9,750
WILLIAM A. POWELL, JR. 1993 $211,335 $ 95,101 11,000 $124,548
President, AmSouth 1992 $330,000 $132,000 98,050.00 11,100 $ 22,225
and Vice Chairman, 1991 $308,000 $169,401 24,000
AmSouth Bank N.A.
Retired in 1993
A. FOX DEFUNIAK, III 1993 $217,000 $ 75,950 52,971.88 4,500 $ 11,122
Senior Executive Vice 1992 $200,000 $ 62,000 42,400.00 4,800 $ 11,240
President, Birmingham 1991 $177,500 $ 78,100 161,280.00 9,750
Banking Group,
AmSouth Bank N.A.
E. W. STEPHENSON, JR. 1993 $177,833 $ 71,133 52,971.88 3,400 $ 9,256
Senior Executive Vice 1992 $150,000 $ 45,000 27,825.00 3,150 $ 8,560
President, AmSouth 1991 $140,000 $ 52,488 107,520.00 6,750
and Chairman & Chief
Executive Officer,
AmSouth Bank of Florida
"""

For now, I am just trying to get it into a CSV-style format with "|" separating the data items, and then extracting the data into Excel by hand:

import re

# Write the text to a temporary file so it can be read back line by line
tmp = open('tmp.txt', 'w')
tmp.write(text)
tmp.close()

data1 = []

for line in open('tmp.txt'):
    line = line.lower()
    # The line is lowercase at this point, so compare against lowercase text
    if 'salary' in line:
        line = line.replace(' ', '|')
    # Strip rule lines, footnote markers, and unit markers
    line = line.replace('--', '')
    line = line.replace('- -', '')
    line = line.replace('- -', '')
    line = line.replace('(1)', '')
    line = line.replace('(2)', '')
    line = line.replace('(3)', '')
    line = line.replace('(4)', '')
    line = line.replace('(5)', '')
    line = line.replace('(6)', '')
    line = line.replace('(7)', '')
    line = line.replace('(8)', '')
    line = line.replace('(9)', '')
    line = line.replace('(10)', '')
    line = line.replace('(11)', '')
    line = line.replace('(s)', '')
    line = line.replace('($)', '')
    line = line.replace('(#)', '')
    line = line.replace('$', '')
    line = line.replace('-0-', '0')
    # Treat parenthesized values as negative numbers
    line = line.replace(')', '|')
    line = line.replace('(', '|-')
    # Put a separator in front of every number that follows whitespace
    line = re.sub(r'\s(\d)', '|\\1', line)
    line = line.replace(' ', '')
    line = line.replace('||', '|')
    data1.append(line)
data = ''.join(data1)

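The problem is that I have to do this thousands of times, and going through every table to save the items I need would take a very long time. Is there a way to create a dictionary that keeps track of the year, salary, bonus, other annual compensation, and so on for each person listed in the leftmost column?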

Best answer

Here is some code to get you started:

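text = """JOHN ...""" # text without the header

# Column start offsets; these can be inferred if necessary
cols = [0, 24, 29, 39, 43, 52, 71, 84, 95, 109, 117]

db = []
row = []
for line in text.strip().split("\n"):
    # Slice the fixed-width line into one string per column
    data = [line[cols[i]:cols[i + 1]] for i in range(len(cols) - 1)]
    if data[0][0] != " ":
        # A non-blank first character starts a new person
        if row:
            db.append(row)
        row = [[x] for x in data]
    else:
        # Continuation line: append each cell to its column
        for i, c in enumerate(data):
            row[i].append(c)
if row:
    db.append(row)  # keep the last person as well
print(db)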
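
The `cols` offsets in the code above are hard-coded. As a rough sketch of how they might be inferred, assuming the table still has its original spacing and a rule line of dash runs under the column headers, the start of each dash run can be taken as a column boundary. The helper name `infer_cols` and the shortened rule line are made up for illustration. Right-aligned values can begin to the left of their dashes, so the resulting offsets may need manual adjustment.

```python
import re


def infer_cols(rule_line):
    """Return the start offset of each dash run, plus the line length."""
    rule_line = rule_line.rstrip()
    starts = [m.start() for m in re.finditer(r"-+", rule_line)]
    return starts + [len(rule_line)]


# Hypothetical, shortened rule line built explicitly so the widths are clear:
# 18 dashes, 4 dashes, 10 dashes, 9 dashes, separated by two spaces each.
rule = "-" * 18 + "  " + "-" * 4 + "  " + "-" * 10 + "  " + "-" * 9
cols = infer_cols(rule)
print(cols)  # [0, 20, 26, 38, 47]
```

The final entry is the line length, so consecutive pairs in the result can be used as slice bounds in the same way as the hard-coded list.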

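The parsing loop builds `db`, a list with one element per person. Each element is a list of all the columns, and each column is in turn a list holding all of that person's rows. That makes it easy to access the different years, or to do things like join the lines of a person's title: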

for person in db:
    print("Name:", person[0][0])
    # The remaining rows of the first column hold the title
    print(" ".join(s.strip() for s in person[0][1:]))
    print()

This produces:

Name: JOHN W. WOODS           
Chairman, President, & Chief Executive Officer of AmSouth & AmSouth Bank N.A.

Name: C. STANLEY ...
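
The question asks for a dictionary per person and for a way to get the data into Excel. One possible sketch, not part of the original answer, turns a `db`-shaped list into a nested dictionary keyed by name and year, then writes it as CSV, which Excel can open. The sample `db`, the `FIELDS` mapping, and the function names `build_records` and `write_csv` are assumptions for illustration. The column positions in particular have to be matched to the real `cols` layout.

```python
import csv
import io

# Hypothetical sample in the same shape as `db`: one entry per person,
# each entry a list of columns, each column a list of row strings.
db = [
    [
        ["JOHN W. WOODS   ", "Chairman, President, &", "Chief Executive Officer"],
        ["1993", "1992", "1991"],
        ["$595,000", "$545,000", "$515,000"],
    ],
]

# Assumed column positions; adjust these to the actual table layout.
FIELDS = {"year": 1, "salary": 2}


def build_records(db, fields=FIELDS):
    """Map each person's name to {year: {field: value}}."""
    people = {}
    for person in db:
        name = person[0][0].strip()
        years = {}
        for i, year in enumerate(person[fields["year"]]):
            year = year.strip()
            if not year:
                continue  # continuation row that holds only title text
            years[year] = {
                key: person[idx][i].strip()
                for key, idx in fields.items()
                if key != "year"
            }
        people[name] = years
    return people


def write_csv(records, out):
    """Write one CSV row per person and year to the file-like object `out`."""
    value_keys = [k for k in FIELDS if k != "year"]
    writer = csv.writer(out)
    writer.writerow(["name", "year"] + value_keys)
    for name, years in records.items():
        for year, values in years.items():
            writer.writerow([name, year] + [values[k] for k in value_keys])


records = build_records(db)
print(records["JOHN W. WOODS"]["1993"]["salary"])  # $595,000

buf = io.StringIO()  # stands in for open("out.csv", "w", newline="")
write_csv(records, buf)
print(buf.getvalue())
```

Values are kept as the original strings, such as `$595,000`. Converting them to numbers is left out because the table mixes footnote markers and free text into the same columns.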

Regarding "python - How do I scrape data from a text table using Python?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/5873969/
