python - Pandas - read_hdf or store.select returning incorrect results for a query

Reposted by: 行者123 Updated: 2023-11-28 19:21:00

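I have a large dataset (4 million rows, 50 columns) stored via pandas store.append. When I use store.select or read_hdf to query on two columns being greater than certain values (i.e. "(a > 10) & (b > 1)"), I get back roughly 15,000 rows.

When I read in the whole table as, say, df and run df[(df.a > 10) & (df.b > 1)], I get 30,000 rows. I have narrowed the problem down: when I read the whole table and run df.query("(a > 10) & (b > 1)"), I get the same 15,000 rows, but when I set the engine to python ---> df.query("(a > 10) & (b > 1)", engine = 'python'), I get 30,000 rows.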
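The comparison described above can be sketched as a small check that evaluates the same filter three ways and reports the row counts. This is only a diagnostic sketch: the frame `toy` and the helper `count_matches` are hypothetical and do not reproduce the original data or the mismatch itself, and the NaN entries are an assumption about what the real columns might contain.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the NaNs are an assumption standing in for
# missing values that a real table might contain.
toy = pd.DataFrame({
    "a": [5.0, 11.0, np.nan, 34.0, 121.5],
    "b": [2.0, 1.3, 1.5, np.nan, 1.04],
})

EXPR = "(a > 10) & (b > 1)"


def count_matches(df):
    """Row counts for the same filter evaluated three different ways."""
    return {
        # Plain boolean indexing, evaluated by pandas/NumPy.
        "boolean_mask": len(df[(df.a > 10) & (df.b > 1)]),
        # DataFrame.query with the default engine (numexpr when installed).
        "query_default": len(df.query(EXPR)),
        # DataFrame.query forced onto the pure-Python engine.
        "query_python": len(df.query(EXPR, engine="python")),
    }


counts = count_matches(toy)
print(counts)
```

If the three counts differ on the real table, the rows that differ are the ones worth inspecting for missing values or unusual dtypes.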

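I suspect it has something to do with the eval/numexpr approach used by the HDF query and by the query method.

The a and b columns are float64, and the problem persists even when I query with floats (i.e. 1. instead of 1).

I would appreciate any feedback, or hearing from anyone else who has the same problem; this needs to be resolved.

Regards, Neil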

========================

Details follow:

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 32
OS: Darwin
OS-release: 13.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.14.1
nose: 1.3.3
Cython: None
numpy: 1.8.0
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 1.2.1
sphinx: 1.2.2
patsy: 0.2.0
scikits.timeseries: 0.91.3
dateutil: 2.2
pytz: 2013.8
bottleneck: 0.7.0
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.3.1
openpyxl: 2.0.3
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: None
html5lib: 0.95-dev
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None

df.info() ---> on the roughly 15,000 rows selected

Int64Index: 15533 entries, 67302 to 142465

Data columns (total 47 columns):

date 15533 non-null datetime64[ns]
text 15533 non-null object
date2 1090 non-null datetime64[ns]
x1 15533 non-null float64
x2 15533 non-null float64
x3 15533 non-null float64
x4 15533 non-null float64
x5 15533 non-null float64
x6 15533 non-null float64
x7 15533 non-null float64
x8 15533 non-null float64
x9 15533 non-null float64
x10 15533 non-null float64
x11 15533 non-null float64
x12 15533 non-null float64
x13 15533 non-null float64
x14 15533 non-null float64
x15 15533 non-null float64
x16 15533 non-null float64
x17 15533 non-null float64
x18 15533 non-null float64
a 15533 non-null float64
x19 15533 non-null float64
x20 15533 non-null float64
x21 15533 non-null float64
x22 15533 non-null float64
x23 15533 non-null float64
x24 15533 non-null float64
b 15533 non-null float64
x25 15533 non-null float64
x26 15533 non-null float64
x27 15533 non-null float64
x28 15533 non-null float64
x29 15533 non-null float64
x30 15533 non-null float64
x31 15497 non-null float64
x32 15497 non-null float64
x33 15497 non-null float64
x34 15497 non-null float64
x35 15533 non-null int64
x36 15533 non-null int64
x37 15533 non-null int64
x38 15533 non-null int64
x39 15533 non-null int64
x40 15533 non-null int64
x41 15533 non-null int64
x42 15533 non-null int64
dtypes: datetime64[ns](2), float64(36), int64(8), object(1)

ptdump -av file

/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/MKT (Group) ''
/MKT._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['date', 'text', 'a', 'x20', 'x23', 'x24', 'b', 'x25', 'x26', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41', 'x42'],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
levels := 1,
nan_rep := 'nan',
non_index_axes := [(1, ['date', 'text', 'date2', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'a', 'x19', 'x20', 'x21', 'x22', 'x23', 'x24', 'b', 'x25', 'x26', 'x27', 'x28', 'x29', 'x30', 'x31', 'x32', 'x33', 'x34', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41', 'x42'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_frame',
values_cols := ['values_block_0', 'values_block_1', 'date', 'text', 'a', 'x20', 'x23', 'x24', 'b', 'x25', 'x26', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41', 'x42']]
/MKT/table (Table(3637597,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Int64Col(shape=(1,), dflt=0, pos=1),
"values_block_1": Float64Col(shape=(29,), dflt=0.0, pos=2),
"date": Int64Col(shape=(), dflt=0, pos=3),
"text": StringCol(itemsize=30, shape=(), dflt='', pos=4),
"a": Float64Col(shape=(), dflt=0.0, pos=5),
"x20": Float64Col(shape=(), dflt=0.0, pos=6),
"x23": Float64Col(shape=(), dflt=0.0, pos=7),
"x24": Float64Col(shape=(), dflt=0.0, pos=8),
"b": Float64Col(shape=(), dflt=0.0, pos=9),
"x25": Float64Col(shape=(), dflt=0.0, pos=10),
"x26": Float64Col(shape=(), dflt=0.0, pos=11),
"x35": Int64Col(shape=(), dflt=0, pos=12),
"x36": Int64Col(shape=(), dflt=0, pos=13),
"x37": Int64Col(shape=(), dflt=0, pos=14),
"x38": Int64Col(shape=(), dflt=0, pos=15),
"x39": Int64Col(shape=(), dflt=0, pos=16),
"x40": Int64Col(shape=(), dflt=0, pos=17),
"x41": Int64Col(shape=(), dflt=0, pos=18),
"x42": Int64Col(shape=(), dflt=0, pos=19)}
byteorder := 'little'
chunkshape := (322,)
autoindex := True
colindexes := {
"x41": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x20": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x37": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x42": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x26": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x38": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x40": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"date": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x36": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"text": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x23": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x39": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x25": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x24": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"a": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x35": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"b": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/MKT/table._v_attrs (AttributeSet), 83 attributes:
[CLASS := 'TABLE',
x23_dtype := 'float64',
x23_kind := ['x23'],
x20_dtype := 'float64',
x20_kind := ['x20'],
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_10_FILL := 0.0,
FIELD_10_NAME := 'x25',
FIELD_11_FILL := 0.0,
FIELD_11_NAME := 'x26',
FIELD_12_FILL := 0,
FIELD_12_NAME := 'x35',
FIELD_13_FILL := 0,
FIELD_13_NAME := 'x36',
FIELD_14_FILL := 0,
FIELD_14_NAME := 'x37',
FIELD_15_FILL := 0,
FIELD_15_NAME := 'x38',
FIELD_16_FILL := 0,
FIELD_16_NAME := 'x39',
FIELD_17_FILL := 0,
FIELD_17_NAME := 'x40',
FIELD_18_FILL := 0,
FIELD_18_NAME := 'x41',
FIELD_19_FILL := 0,
FIELD_19_NAME := 'x42',
FIELD_1_FILL := 0,
FIELD_1_NAME := 'values_block_0',
FIELD_2_FILL := 0.0,
FIELD_2_NAME := 'values_block_1',
FIELD_3_FILL := 0,
FIELD_3_NAME := 'date',
FIELD_4_FILL := '',
FIELD_4_NAME := 'text',
FIELD_5_FILL := 0.0,
FIELD_5_NAME := 'a',
FIELD_6_FILL := 0.0,
FIELD_6_NAME := 'x20',
FIELD_7_FILL := 0.0,
FIELD_7_NAME := 'x23',
FIELD_8_FILL := 0.0,
FIELD_8_NAME := 'x24',
FIELD_9_FILL := 0.0,
FIELD_9_NAME := 'b',
a_dtype := 'float64',
a_kind := ['a'],
NROWS := 3637597,
TITLE := '',
VERSION := '2.7',
x24_dtype := 'float64',
x24_kind := ['x24'],
b_dtype := 'float64',
b_kind := ['b'],
x25_dtype := 'float64',
x25_kind := ['x25'],
x26_dtype := 'float64',
x26_kind := ['x26'],
date_dtype := 'datetime64',
date_kind := ['date'],
x39_dtype := 'int64',
x39_kind := ['x39'],
x37_dtype := 'int64',
x37_kind := ['x37'],
x41_dtype := 'int64',
x41_kind := ['x41'],
x35_dtype := 'int64',
x35_kind := ['x35'],
x40_dtype := 'int64',
x40_kind := ['x40'],
x38_dtype := 'int64',
x38_kind := ['x38'],
x42_dtype := 'int64',
x42_kind := ['x42'],
x36_dtype := 'int64',
x36_kind := ['x36'],
index_kind := 'integer',
text_dtype := 'string240',
text_kind := ['text'],
values_block_0_dtype := 'datetime64',
values_block_0_kind := ['date2'],
values_block_1_dtype := 'float64',
values_block_1_kind := ['x22', 'x18', 'x21', 'x16', 'x19', 'x17', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x29', 'x30', 'x28', 'x2', 'x1', 'x3', 'x10', 'x27', 'x11', 'x12', 'x13', 'x14', 'x15', 'x33', 'x32', 'x34', 'x31']]

This is how I read from the table:

df = DataFrame()
store = pd.HDFStore('/Users/neil/MKT.h5')
df = store.select('MKT', "(a > 10) & (b > 1)")
store.close()
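Since filtering in memory with a boolean mask gave the expected count, one possible workaround is to read the table in chunks and apply the mask in pandas rather than passing a `where` clause to the store. The sketch below assumes PyTables is installed and that the table was written in table format, as here; the helper names `filter_chunks` and `select_in_memory` are hypothetical.

```python
import pandas as pd


def filter_chunks(chunks):
    """Filter each chunk with a plain boolean mask and combine the results."""
    kept = [c[(c["a"] > 10) & (c["b"] > 1)] for c in chunks]
    return pd.concat(kept) if kept else pd.DataFrame()


def select_in_memory(path, key, chunksize=500000):
    """Read the table chunk by chunk (needs PyTables) and filter in pandas."""
    with pd.HDFStore(path, mode="r") as store:
        # With chunksize set, select returns an iterator of DataFrames,
        # so the whole table is never held in memory at once.
        return filter_chunks(store.select(key, chunksize=chunksize))


# Usage with the path and key from the question (not executed here):
# df = select_in_memory('/Users/neil/MKT.h5', 'MKT')
```

This gives up the on-disk index lookup, so it will be slower than a `where` query, but the filtering no longer depends on how the store evaluates the expression.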

This is how I wrote/populated the table:

store = pd.HDFStore('/Users/neil/MKT.h5')

listofsearchablevars = ['date', 'text', 'a', 'x20', 'x23', 'x24', 'b', 'x25', 'x26', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41', 'x42']

df = .....

store.append('MKT', df, data_columns = listofsearchablevars, nan_rep = 'nan', chunksize=500000, min_itemsize = {'values': 30})

store.close()

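Edit: in response to the request for some sample data....

Data

For clarity, let's call the 15,000-row result "INCORRECT", the 30,000-row result "CORRECT", and the rows that are in CORRECT but not in INCORRECT "ONLY IN CORRECT".

I have confirmed that every row in INCORRECT is also found in CORRECT.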
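A check of this kind can be sketched with index set operations. The helper `compare_results` and the toy frames are hypothetical, and the sketch assumes the index labels are unique, which may not hold for a table built from several appends.

```python
import pandas as pd


def compare_results(correct, incorrect):
    """Compare two query results by index label.

    Returns (labels in `incorrect` missing from `correct`,
             labels found only in `correct`).
    """
    missing_from_correct = incorrect.index.difference(correct.index)
    only_in_correct = correct.index.difference(incorrect.index)
    return missing_from_correct, only_in_correct


# Hypothetical stand-ins for the two results.
correct = pd.DataFrame({"a": [34.0, 34.3, 121.6, 124.7]}, index=[10, 20, 30, 40])
incorrect = correct.loc[[20, 40]]

missing, only_in_correct = compare_results(correct, incorrect)
print(len(missing), list(only_in_correct))

# Rows to inspect for whatever the smaller result is dropping.
suspect_rows = correct.loc[only_in_correct]
```

An empty first result confirms that the smaller set is a subset of the larger one; the second result selects the rows worth examining.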

Here are a few rows of data (just the 10,000th and 10,001st rows of each):

ONLY IN CORRECT:

                    9869                 9870
date   2001-08-10 00:00:00  2001-08-17 00:00:00
text                   DCR                  DCR
date2                  NaN                  NaN
x19                    1.9               1.8396
x18                   1.98                  1.9
x20                    1.8                  1.8
x9                    2.54                 2.54
x10                   5.25                5.125
x11                  9.625                9.625
x12                   1.61                  1.7
x13                   1.05                 1.05
x14                   1.05                 1.05
x21                  75700                64800
x23               140992.7             116948.9
x24           0.0008284454         0.0007097211
x25            0.002580505          0.002630241
x26            0.001540047          0.001440302
x27            0.001850877          0.001832468
x5                  17.915               17.915
x8                  17.915               17.915
x2                 34.0379              32.9563
a                  34.0385             32.95643
x6               -42.80079            -42.80079
x7               -8.762288            -9.844354
x4                       0                    0
x1           -0.0003349149        -0.0003349149
x3           -0.0003349149        -0.0003349149
x28              1.579e+07            1.579e+07
b                 1.261029             1.302433
x29               1.284075             1.326236
x30               1.488814             1.537697
x22             -0.2891579           -0.3205045
x17                   0.31                 0.31
x15                   0.84                 0.84
x16                 2.5937               2.5937
x34                  6.895                7.105
x32               -1.29055             -1.35055
x31                  -0.77                -0.63
x33                 -0.665                -0.49
x38                      1                    1
x42                      0                    0
x36                      0                    0
x40                      0                    0
x35                      0                    0
x39                      0                    0
x37                      0                    0
x41                      0                    0

INCORRECT:

                    153641               153642
date   2008-08-22 00:00:00  2008-08-29 00:00:00
text                   PRL                  PRL
date2                  NaN                  NaN
x19                    1.9                 1.88
x18                   1.95                 1.94
x20                   1.85                 1.87
x9                    2.07                 2.07
x10                   2.23                 2.23
x11                   2.94                 2.94
x12                   1.75                 1.75
x13                   1.71                 1.71
x14                   1.69                 1.69
x21                 133549                73525
x23               254119.1             140764.5
x24            0.001485416         0.0008315729
x25            0.001227271          0.001204803
x26            0.001006876          0.001048327
x27           0.0009764919         0.0009638125
x5                  18.008               18.008
x8                  18.058               18.058
x2                 34.2152               33.855
a                  34.3102             33.94904
x6               -35.07229            -35.07229
x7              -0.7620911            -1.123251
x4                       0                    0
x1               0.0111308            0.0111308
x3               0.0111308            0.0111308
x28             1.5488e+08           1.5488e+08
b                 1.251983             1.265302
x29               1.272828             1.286369
x30               1.247996             1.261273
x22              0.1368421            0.1489362
x17                   0.16                 0.16
x15                    0.2                  0.2
x16                   0.47                 0.47
x34                   2.25                 2.34
x32                  1.395                1.365
x31                   1.25                 1.31
x33                  1.175                 1.25
x38                      1                    1
x42                      0                    0
x36                      0                    0
x40                      0                    0
x35                      0                    0
x39                      0                    0
x37                      0                    0
x41                      0                    0

CORRECT:

                    99723                99725
date   2009-11-27 00:00:00  2009-12-11 00:00:00
text                   ACL                  ACL
date2                  NaN                  NaN
x19                   1.17                  1.2
x18                   1.22                 1.39
x20                   1.11                 1.14
x9                    1.76                 1.76
x10                   1.76                 1.76
x11                   1.76                 1.76
x12                   0.63                 0.74
x13                   0.36                 0.36
x14                   0.17                 0.17
x21                 285474               709374
x23               333678.1             868999.7
x24           0.0005489386          0.001393863
x25            0.002350057          0.002279827
x26            0.002160912          0.002111369
x27            0.002428953          0.002244943
x5                 103.908              103.908
x8                 103.908              103.908
x2                121.5721             124.6894
a                 121.5724             124.6896
x6                92.16074             92.16074
x7                213.7331             216.8503
x4                       0                    0
x1            -0.008266928         -0.008266928
x3            -0.008266928         -0.008266928
x28             0.02743141           0.02703708
b                 1.037747             1.011804
x29               1.421532             1.385994
x30                1.52714             1.488961
x22               1.213675                  1.7
x17                   0.47                 0.47
x15                   0.48                 0.48
x16                   0.48                 0.48
x34                   0.32                 0.32
x32                   1.04                 1.04
x31                   -0.6                 -0.6
x33                -0.5901               -0.479
x38                      0                    0
x42                      0                    0
x36                      0                    0
x40                      0                    0
x35                      0                    0
x39                      0                    0
x37                      0                    0
x41                      0                    0

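Best answer

Got it to work!!!! I filled all the NaNs in the data, and now read_hdf returns the correct 30,000 rows. Column a had NaNs (it is one of the data_columns in the query, a > 10). Man, that was painful. FYI - out of paranoia, and to get rid of anything that might recur in the future, I filled the entire table completely (with 0), because I cannot risk drawing conclusions from this analysis based on incorrect or incomplete queries against the table. It was definitely a NaN problem.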
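The fix described in the answer can be sketched as two small steps: count the NaNs in the columns being queried, then fill them before the frame is appended to the store. The helper names `nan_report` and `fill_for_store` and the toy frame are hypothetical; the fill value 0 is the one the answer mentions.

```python
import numpy as np
import pandas as pd


def nan_report(df, columns):
    """Number of NaN values in each of the given columns."""
    return df[columns].isna().sum()


def fill_for_store(df, fill_value=0):
    """Return a copy with every NaN replaced by fill_value."""
    return df.fillna(fill_value)


# Hypothetical toy frame with NaNs in the queried columns.
toy = pd.DataFrame({"a": [34.0, np.nan, 121.5], "b": [1.26, 1.25, np.nan]})

print(nan_report(toy, ["a", "b"]))
clean = fill_for_store(toy)

# The cleaned frame would then be written as in the question, e.g.
# store.append('MKT', clean, data_columns=listofsearchablevars, ...)
```

Filling with 0 changes the data, so it only suits columns where 0 is an acceptable stand-in; for a mostly empty datetime column such as date2, restricting the fill to the numeric columns may be the safer choice.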

Regarding python - Pandas - read_hdf or store.select returning incorrect results for a query, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25069622/
