
java - Can't insert records into Impala at 5k/sec?

Reposted · Author: 行者123 · Updated: 2023-12-02 09:46:43

I am running a POC to evaluate Impala, but I am not seeing any notable performance. I cannot insert 5,000 records per second; at best I manage about 200 records per second. By any database's standards, that is really slow.

I tried two different approaches, and both were slow:

  1. Using Cloudera

    First, I installed Cloudera on my system and added the latest CDH 6.2 cluster. I wrote a Java client that inserts data through the ImpalaJDBC41 driver. I can insert records, but the speed is terrible. I tried tuning Impala by increasing the Impala Daemon memory limit and the system RAM, but it didn't help. Eventually I suspected something was wrong with my installation, so I switched to another approach.

  2. Using the Cloudera VM

    Cloudera also provides a ready-made virtual machine for testing purposes. I gave it a try to see whether it would deliver better performance, but there was no big improvement. I still cannot insert data at 5k/sec.

I don't know where I need to improve. I have pasted my code below in case any improvement is possible.

What is the ideal Impala configuration to achieve this speed (5k–10k records/sec)? Even that rate is far below what Impala should be capable of.

private static Connection connectViaDS() throws Exception {
    // Load the Cloudera Impala JDBC 4.1 driver and open a connection.
    Class.forName("com.cloudera.impala.jdbc41.Driver");
    return DriverManager.getConnection(CONNECTION_URL);
}

private static void writeInABatchWithCompiledQuery(int records) {
    int protocol_no = 233, s_port = 20, d_port = 34, packet = 46, volume = 58, duration = 39,
        pps = 76, bps = 65, bpp = 89, i_vol = 465, e_vol = 345, i_pkt = 5, e_pkt = 54,
        s_i_ix = 654, d_i_ix = 444, _time = 1000, flow = 989;

    String s_city = "Mumbai", s_country = "India", s_latt = "12.165.34c", s_long = "39.56.32d",
        s_host = "motadata", d_latt = "29.25.43c", d_long = "49.15.26c", d_city = "Damouli",
        d_country = "Nepal";

    long e_date = 1275822966, e_time = 1370517366;

    PreparedStatement preparedStatement;

    int total = 1000 * 1000;
    int counter = 0;

    Connection connection = null;
    try {
        connection = connectViaDS();
        preparedStatement = connection.prepareStatement(sqlCompiledQuery);

        Timestamp ed = new Timestamp(e_date);
        Timestamp et = new Timestamp(e_time);

        while (counter < total) {
            // Accumulate 5,000 rows, then flush them as one JDBC batch.
            for (int index = 1; index <= 5000; index++) {
                counter++;

                preparedStatement.setString(1, "s_ip" + index);
                preparedStatement.setString(2, "d_ip" + index);
                preparedStatement.setInt(3, protocol_no + index);
                preparedStatement.setInt(4, s_port + index);
                preparedStatement.setInt(5, d_port + index);
                preparedStatement.setInt(6, packet + index);
                preparedStatement.setInt(7, volume + index);
                preparedStatement.setInt(8, duration + index);
                preparedStatement.setInt(9, pps + index);
                preparedStatement.setInt(10, bps + index);
                preparedStatement.setInt(11, bpp + index);
                preparedStatement.setString(12, s_latt + index);
                preparedStatement.setString(13, s_long + index);
                preparedStatement.setString(14, s_city + index);
                preparedStatement.setString(15, s_country + index);
                preparedStatement.setString(16, d_latt + index);
                preparedStatement.setString(17, d_long + index);
                preparedStatement.setString(18, d_city + index);
                preparedStatement.setString(19, d_country + index);
                preparedStatement.setInt(20, i_vol + index);
                preparedStatement.setInt(21, e_vol + index);
                preparedStatement.setInt(22, i_pkt + index);
                preparedStatement.setInt(23, e_pkt + index);
                preparedStatement.setInt(24, s_i_ix + index);
                preparedStatement.setInt(25, d_i_ix + index);
                preparedStatement.setString(26, s_host + index);
                preparedStatement.setTimestamp(27, ed);
                preparedStatement.setTimestamp(28, et);
                preparedStatement.setInt(29, _time);
                preparedStatement.setInt(30, flow + index);
                preparedStatement.addBatch();
            }
            preparedStatement.executeBatch();
            preparedStatement.clearBatch();
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            connection.close();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}

The data is being written at a snail's pace. I tried increasing the batch size, but that made it even slower. I don't know whether my code is wrong or whether I need to tune Impala for better performance. Please advise.
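One thing worth noting about Impala: as far as I understand, each INSERT statement produces its own small data file in HDFS, so row-at-a-time or JDBC-batch inserts are considered an anti-pattern; the documentation steers you toward bulk loading (LOAD DATA, or INSERT ... SELECT from a staging table) instead. Purely as an illustration of reducing per-statement round trips, here is a sketch that collapses many rows into a single multi-row INSERT ... VALUES statement (the table name `flow_data` and its two columns are hypothetical):

```java
import java.util.StringJoiner;

public class MultiRowInsertSketch {
    // Build one multi-row INSERT statement instead of issuing one INSERT per row.
    // Table and column names are hypothetical placeholders.
    static String buildInsert(int rows) {
        StringJoiner values = new StringJoiner(", ");
        for (int i = 1; i <= rows; i++) {
            values.add("('s_ip" + i + "', " + (233 + i) + ")");
        }
        return "INSERT INTO flow_data (s_ip, protocol_no) VALUES " + values;
    }

    public static void main(String[] args) {
        System.out.println(buildInsert(3));
    }
}
```

In real code the values should still be parameterized (or, better, staged as files and loaded in bulk); this only shows how to cut the number of statements Impala has to plan and execute.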

I am using the VM for testing; other details are as follows:

System:

OS - Ubuntu 16
RAM - 12 GB
Cloudera - CDH 6.2
Impala Daemon memory limit - 2 GB
Impala Daemon Java heap size - 500 MB
HDFS NameNode Java heap size - 500 MB

Let me know if more details are required.

Best Answer

You cannot run a benchmark on a 12 GB virtual machine. Look at Impala's hardware requirements and you will see that at least 128 GB of memory is recommended.

  • Memory

128 GB or more recommended, ideally 256 GB or more. If the intermediate results during query processing on a particular node exceed the amount of memory available to Impala on that node, the query writes temporary work data to disk, which can lead to long query times. Note that because the work is parallelized, and intermediate results for aggregate queries are typically smaller than the original data, Impala can query and join tables that are much larger than the memory available on an individual node.

Moreover, the VM is meant for getting familiar with the tool set; it is not powerful enough to serve even as a development environment.

References

Regarding "java - Can't insert records into Impala at 5k/sec?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/56593400/
