gpt4 book ai didi

java - Hadoop 和 Hive 中的位级查询

转载 作者:可可西里 更新时间:2023-11-01 15:10:12 25 4
gpt4 key购买 nike

我们有一个在 Hadoop 中进行位级查询的用例。它是这样的:

Given a set of variable-length records containing a date/time stamp and one or more sixteen bit data words, return a list of date/time stamps where some combination of arbitrary bits from one or more arbitrary data words is set to the values specified in the query.

示例...给定以下数据:

Timestamp             Word 1 bits                Word 2 bits
------------------ ---------------------- ---------------------
2017-06-16 08:15:05 0010 1101 1111 0000 1011 0010 1111 0010
2017-06-16 08:15:06 0010 1110 1111 0000 ...
2017-06-16 08:15:07 0010 1101 1111 0000 ...
2017-06-16 08:15:08 0010 1110 1111 0000
2017-06-16 08:15:09 0010 1101 1111 0000
2017-06-16 08:15:10 0010 1110 1111 0000

如果查询是“返回第 1 位 0 为 0 且第 1 位 1 为 1 的所有时间戳”,结果将是

Timestamp             Word 1 bits
------------------ ----------------------
2017-06-16 08:15:06 0010 1110 1111 0000
2017-06-16 08:15:08 0010 1110 1111 0000
2017-06-16 08:15:10 0010 1110 1111 0000
^^

数据以制表符分隔的形式作为十六进制值提供:

Timestamp             Word1  Word2  Word3  Word4  
------------------ ---- ---- ---- ----
2017-06-16 08:15:05 2DF0 ... a varying number of 16 bit data words continues out here.
2017-06-16 08:15:06 2EF0
2017-06-16 08:15:07 2DF0
2017-06-16 08:15:08 2EF0
2017-06-16 08:15:09 2DF0
2017-06-16 08:15:10 2EF0
...

我们一直在考虑如何在 Hadoop 配置单元中表示这些数据并对其进行查询。将每个数据字的每一位放入其自己的整数字段中似乎效率极低,但具有可直接由 Hadoop 查询的优点,假设 Hadoop 服务器可以容纳每条记录中可变数量的列。

为了解决这个问题,我提出了这样的建议,我们将这些数据作为一流的时间戳和16位无符号整数导入到hive中,然后为每个查询构造一个MapReduce作业,使用位提取Java函数构造一个具有时间戳字段和每个感兴趣的的临时表在其自己的一流整数中。从临时文件中获取最终结果所需的 Hadoop 查询可以说是微不足道的。

然而,目前提出的想法是将十六进制文本直接保存到数据湖中。我们的数据科学家似乎认为这样的安排将允许直接查询;也就是说,不需要临时表,并且十六进制格式提供了相当高效的存储。

这是如何运作的?是否有某种方法可以索引此类文本,然后对其进行某种位级文本搜索,从而屏蔽掉不感兴趣的位?

(我将接受有关如何以更好的方式解决此问题的建议。)

最佳答案

演示

数据.tsv

2017-06-16 08:15:05 2DF0
2017-06-16 08:15:06 2EF0 0000
2017-06-16 08:15:07 2DF0 AAAA BBBB CCCC
2017-06-16 08:15:08 2EF0 1111 2222
2017-06-16 08:15:09 2DF0
2017-06-16 08:15:10 2EF0 DDDD EEEE

create external table mytable
(
ts timestamp
,words string
)
row format delimited
fields terminated by '\t'
stored as textfile
tblproperties ('serialization.last.column.takes.rest'='true')
;

select  *

from mytable
;

+----------------------------+---------------------------+
| ts | words |
+----------------------------+---------------------------+
| 2017-06-16 08:15:05.000000 | 2DF0 |
| 2017-06-16 08:15:06.000000 | 2EF0 0000 |
| 2017-06-16 08:15:07.000000 | 2DF0 AAAA BBBB CCCC |
| 2017-06-16 08:15:08.000000 | 2EF0 1111 2222 |
| 2017-06-16 08:15:09.000000 | 2DF0 |
| 2017-06-16 08:15:10.000000 | 2EF0 DDDD EEEE |
+----------------------------+---------------------------+

select  ts
,split(words,'\\t') as words

from mytable
;

+----------------------------+-------------------------------+
| ts | words |
+----------------------------+-------------------------------+
| 2017-06-16 08:15:05.000000 | ["2DF0"] |
| 2017-06-16 08:15:06.000000 | ["2EF0","0000"] |
| 2017-06-16 08:15:07.000000 | ["2DF0","AAAA","BBBB","CCCC"] |
| 2017-06-16 08:15:08.000000 | ["2EF0","1111","2222"] |
| 2017-06-16 08:15:09.000000 | ["2DF0",""] |
| 2017-06-16 08:15:10.000000 | ["2EF0","DDDD","EEEE"] |
+----------------------------+-------------------------------+

select  ts
,lpad(conv(split(words,'\\t')[0],16,2),16,'0') as word1_bits

from mytable
;

+----------------------------+------------------+
| ts | word1_bits |
+----------------------------+------------------+
| 2017-06-16 08:15:05.000000 | 0010110111110000 |
| 2017-06-16 08:15:06.000000 | 0010111011110000 |
| 2017-06-16 08:15:07.000000 | 0010110111110000 |
| 2017-06-16 08:15:08.000000 | 0010111011110000 |
| 2017-06-16 08:15:09.000000 | 0010110111110000 |
| 2017-06-16 08:15:10.000000 | 0010111011110000 |
+----------------------------+------------------+

select  ts

from mytable

where substr(lpad(conv(split(words,'\\t')[0],16,2),16,'0'),7,2) = '10'
;

+----------------------------+
| ts |
+----------------------------+
| 2017-06-16 08:15:06.000000 |
| 2017-06-16 08:15:08.000000 |
| 2017-06-16 08:15:10.000000 |
+----------------------------+

替代数据结构

create external table mytable
(
ts timestamp
,word1 string
,word2 string
,word3 string
,word4 string
,word5 string
,word6 string
,word7 string
,word8 string
,word9 string
)
row format delimited
fields terminated by '\t'
stored as textfile
;

select * from mytable
;

+----------------------------+-------+--------+--------+--------+--------+--------+--------+--------+--------+
| ts | word1 | word2 | word3 | word4 | word5 | word6 | word7 | word8 | word9 |
+----------------------------+-------+--------+--------+--------+--------+--------+--------+--------+--------+
| 2017-06-16 08:15:05.000000 | 2DF0 | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) |
| 2017-06-16 08:15:06.000000 | 2EF0 | 0000 | (null) | (null) | (null) | (null) | (null) | (null) | (null) |
| 2017-06-16 08:15:07.000000 | 2DF0 | AAAA | BBBB | CCCC | (null) | (null) | (null) | (null) | (null) |
| 2017-06-16 08:15:08.000000 | 2EF0 | 1111 | 2222 | (null) | (null) | (null) | (null) | (null) | (null) |
| 2017-06-16 08:15:09.000000 | 2DF0 | | (null) | (null) | (null) | (null) | (null) | (null) | (null) |
| 2017-06-16 08:15:10.000000 | 2EF0 | DDDD | EEEE | (null) | (null) | (null) | (null) | (null) | (null) |
+----------------------------+-------+--------+--------+--------+--------+--------+--------+--------+--------+

关于java - Hadoop 和 Hive 中的位级查询,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/44594600/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com