
python - Reading a table from a PDF as strings using Tabula

Reposted. Author: 行者123. Updated: 2023-12-05 02:56:31

I am using tabula-py 2.0.4 and pandas 1.17.4 on Python 3.7. I am trying to read a table from a PDF into a DataFrame with tabula.read_pdf:

from tabula import read_pdf
fn = "file.pdf"
print(read_pdf(fn, pages='all', multiple_tables=True)[0])

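The problem is that the values are read as floats rather than strings.

I need them read as strings, so that when a value is 20.0000 I know it is precise to four decimal places. Right now it returns 20.0 instead of 20.0000.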
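
To make the loss concrete, here is a minimal sketch using only the Python standard library (tabula is not involved). The value "20.0000" is taken from the question; the variable names are illustrative. It shows that parsing the text as a float discards the trailing zeros, while keeping the text (or parsing it as a Decimal) preserves them.

```python
from decimal import Decimal

raw = "20.0000"  # the cell text as it appears in the PDF

as_float = float(raw)      # trailing zeros are not representable in a float
as_decimal = Decimal(raw)  # Decimal remembers the written precision

# Number of digits after the decimal point, recovered from the Decimal exponent
places = -as_decimal.as_tuple().exponent

print(as_float)    # 20.0
print(as_decimal)  # 20.0000
print(places)      # 4
```

Once the value has become a float, the original number of decimal places cannot be recovered, which is why the conversion has to be prevented at read time.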

The input data in the PDF looks like this:

[image not available]

The output of the code above is:

[image not available]

Best answer

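You need to pass a few extra options to tabula.read_pdf. Here is an example that parses a PDF file and interprets the columns it finds in two different ways: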

import tabula

print(tabula.environment_info())

fname = ("https://github.com/chezou/tabula-py/raw/master/tests/resources/"
         "data.pdf")

# Columns interpreted as str
col2str = {'dtype': str}
kwargs = {'output_format': 'dataframe',
          'pandas_options': col2str,
          'stream': True}
df1 = tabula.read_pdf(fname, **kwargs)

print(df1[0].dtypes)
print(df1[0].head())

# Guessing column type
col2val = {'dtype': None}
kwargs = {'output_format': 'dataframe',
          'pandas_options': col2val,
          'stream': True}
df2 = tabula.read_pdf(fname, **kwargs)

print(df2[0].dtypes)
print(df2[0].head())
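
Applied to the call from the question, the same option might look like the sketch below. The helper name read_tables_as_str and its reader parameter are hypothetical and exist only so the wrapper can be tried without a PDF. The only tabula-py features assumed are read_pdf and the pages, multiple_tables and pandas_options keyword arguments already shown above.

```python
def read_tables_as_str(path, reader=None):
    """Read every table in a PDF, keeping all cell values as strings.

    `reader` is a hypothetical injection point; by default tabula.read_pdf is used.
    """
    if reader is None:
        # Imported lazily: needs tabula-py and a Java runtime to be installed.
        from tabula import read_pdf as reader
    return reader(
        path,
        pages="all",
        multiple_tables=True,
        pandas_options={"dtype": str},  # do not let pandas guess numeric types
    )
```

With tabula-py installed, read_tables_as_str("file.pdf")[0].dtypes should then report object for every column, as in the first half of the output below.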

The output is as follows:

Python version:
3.7.6 (default, Jan 8 2020, 13:42:34)
[Clang 4.0.1 (tags/RELEASE_401/final)]
Java version:
openjdk version "13.0.2" 2020-01-14
OpenJDK Runtime Environment (build 13.0.2+8)
OpenJDK 64-Bit Server VM (build 13.0.2+8, mixed mode, sharing)
tabula-py version: 2.0.4
platform: Darwin-19.3.0-x86_64-i386-64bit
uname:
uname_result(system='Darwin', node='MacBook-Pro-10.local', release='19.3.0', version='Darwin Kernel Version 19.3.0: Thu Jan 9 20:58:23 PST 2020; root:xnu-6153.81.5~1/RELEASE_X86_64', machine='x86_64', processor='i386')
linux_distribution: ('Darwin', '19.3.0', '')
mac_ver: ('10.15.3', ('', '', ''), 'x86_64')

None
'pages' argument isn't specified.Will extract only from page 1 by default.
Unnamed: 0    object
mpg           object
cyl           object
disp          object
hp            object
drat          object
wt            object
qsec          object
vs            object
am            object
gear          object
carb          object
dtype: object
          Unnamed: 0   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
'pages' argument isn't specified.Will extract only from page 1 by default.
Unnamed: 0     object
mpg           float64
cyl             int64
disp          float64
hp              int64
drat          float64
wt            float64
qsec          float64
vs              int64
am              int64
gear            int64
carb            int64
dtype: object
          Unnamed: 0   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
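
Once every column holds strings, the written precision can be inspected, and individual columns can still be converted to numbers when needed. The following is a rough sketch that uses pandas only. The small DataFrame is a hand-built stand-in for df1[0] with a few values copied from the output above, and decimal_places is a hypothetical helper.

```python
import pandas as pd

# Hand-built stand-in for df1[0]: every cell is a string, as with dtype=str.
df = pd.DataFrame({
    "name": ["Mazda RX4", "Datsun 710"],
    "drat": ["3.90", "3.85"],
    "wt": ["2.620", "2.320"],
})


def decimal_places(text):
    """Count the digits after the decimal point in a numeric string."""
    _, _, fraction = text.partition(".")
    return len(fraction)


# Precision as written in the source table
wt_places = df["wt"].map(decimal_places)

# Convert a single column to numbers only where arithmetic is required
wt_numeric = pd.to_numeric(df["wt"])

print(wt_places.tolist())   # [3, 3]
print(wt_numeric.tolist())  # [2.62, 2.32]
```

Keeping the string column alongside the numeric one is one way to do arithmetic without losing the original formatting.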


Regarding "python - Reading a table from a PDF as strings using Tabula", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60448160/
