java - DynamoDB Scan, Query and BatchGet

Reposted. Author: 行者123. Updated: 2023-11-30 03:28:15

We have a DynamoDB table whose primary key consists of a Hash key and a Range key.

Hash = date.random_number
Range = timestamp

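How can we fetch the items between timestamps X and Y? Because the hash key has a random_number appended, the query has to be fired multiple times. Is it possible to supply multiple hash values with a single RangeKeyCondition?

What is the most efficient approach in terms of cost and time?

The random number ranges from 1 to 10.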

Best answer

If I understand correctly, you have a table with the following primary key definition:

Hash Key  : date.random_number 
Range Key : timestamp

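One thing to keep in mind is that, whether you use GetItem or Query, your application must be able to compute the Hash Key in order to retrieve one or more items from the table.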

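Using a random number as part of the hash key makes sense, because it spreads your records evenly across DynamoDB partitions. However, you have to do it in a way that still lets your application compute those numbers when it needs to retrieve the records.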
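Since the suffix in the question only ranges from 1 to 10, the application can enumerate every possible hash key for the dates it wants to read. The following is a minimal sketch of that idea, not part of the original answer. The class and method names are hypothetical, and it assumes the date part of the key is an ISO yyyy-MM-dd string (as in the 2014-07-09.N example from the AWS documentation).

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists every "date.N" hash key for a range of days.
class ShardKeys {
    // Assumption: suffixes run from 1 to shardCount (1 to 10 in the question).
    static List<String> forDateRange(LocalDate from, LocalDate to, int shardCount) {
        List<String> keys = new ArrayList<>();
        for (LocalDate day = from; !day.isAfter(to); day = day.plusDays(1)) {
            for (int n = 1; n <= shardCount; n++) {
                // LocalDate.toString() yields the ISO form, e.g. 2014-07-09
                keys.add(day + "." + n);
            }
        }
        return keys;
    }
}
```

If timestamps X and Y fall on the same day, this produces 10 keys, which means 10 Query calls, each with the same range key condition.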

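With that in mind, let's build the query needed for the stated requirement. The native AWS DynamoDB operations you can use to fetch multiple items from a table are: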

Query, BatchGetItem and Scan
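  • To use BatchGetItem, you need to know the entire primary key (hash key and range key) in advance, which is not the case here.

  • The Scan operation literally goes through every record in the table, which I believe is unnecessary for your requirement.

  • Finally, the Query operation retrieves one or more items from a table by applying the EQ (equality) operator to the hash key, together with a number of other operators on the range key that you can use when you don't have the entire Range Key or want to match more than one item.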

The operator options for the Range Key condition are: EQ | LE | LT | GE | GT | BEGINS_WITH | BETWEEN

In my view, the BETWEEN operator best fits your requirement. That said, let's see how to build the query with the SDK of your choice:

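import com.amazonaws.services.dynamodbv2.document.ItemCollection;
import com.amazonaws.services.dynamodbv2.document.QueryOutcome;
import com.amazonaws.services.dynamodbv2.document.RangeKeyCondition;
import com.amazonaws.services.dynamodbv2.document.Table;

// dynamoDB is an existing com.amazonaws.services.dynamodbv2.document.DynamoDB instance
Table table = dynamoDB.getTable(tableName);

String hashKey = "<YOUR_COMPUTED_HASH_KEY>";
String timestampX = "<YOUR_TIMESTAMP_X_VALUE>";
String timestampY = "<YOUR_TIMESTAMP_Y_VALUE>";

// Match range keys between timestampX and timestampY
RangeKeyCondition rangeKeyCondition =
        new RangeKeyCondition("RangeKeyAttributeName").between(timestampX, timestampY);

// One Query per hash key: EQ on the hash key, BETWEEN on the range key
ItemCollection<QueryOutcome> items = table.query("HashKeyAttributeName", hashKey,
        rangeKeyCondition,
        null,  // FilterExpression - not used in this example
        null,  // ProjectionExpression - not used in this example
        null,  // ExpressionAttributeNames - not used in this example
        null); // ExpressionAttributeValues - not used in this example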

You may want to look at the following post for more information on DynamoDB primary keys: DynamoDB: When to use what PK type?

Question: my concern is the multiple queries caused by the appended random_number. Is there a way to combine these queries and hit DynamoDB only once?

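Your concern is entirely understandable. However, the only way to fetch all the records via BatchGetItem is to know the entire primary key (HASH + RANGE) of every record you want to fetch. Although minimizing HTTP round trips to the server may look like the best solution at first glance, the documentation actually recommends doing exactly what you are doing, in order to avoid hot partitions and uneven use of provisioned throughput: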

Design For Uniform Data Access Across Items In Your Tables

"Because you are randomizing the hash key, the writes to the table on each day are spread evenly across all of the hash key values; this will yield better parallelism and higher overall throughput. [...] To read all of the items for a given day, you would still need to Query each of the 2014-07-09.N keys (where N is 1 to 200), and your application would need to merge all of the results. However, you will avoid having a single "hot" hash key taking all of the workload."

Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
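The "Query each key and merge the results" step from the quoted guidance could be sketched roughly as follows. This is an illustration under assumptions rather than the answer's own code: the Row type, the class name and the queryOneKey function are hypothetical, and queryOneKey stands in for whatever issues the per-hash-key Query with the BETWEEN condition. Running the per-key queries concurrently keeps the wall-clock time close to that of the slowest single query, although the number of requests stays the same.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical fan-out helper: one query per hash key, merged on the client.
class ShardedQuery {
    // Hypothetical item type: range key (timestamp) plus some payload.
    static final class Row {
        final long timestamp;
        final String payload;
        Row(long timestamp, String payload) {
            this.timestamp = timestamp;
            this.payload = payload;
        }
    }

    // queryOneKey is an assumption: it should run the Query for one hash key.
    static List<Row> queryAll(List<String> hashKeys,
                              Function<String, List<Row>> queryOneKey,
                              int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Start one asynchronous query per hash key.
            List<CompletableFuture<List<Row>>> futures = new ArrayList<>();
            for (String key : hashKeys) {
                futures.add(CompletableFuture.supplyAsync(() -> queryOneKey.apply(key), pool));
            }
            // Wait for every query and collect the results.
            List<Row> merged = new ArrayList<>();
            for (CompletableFuture<List<Row>> future : futures) {
                merged.addAll(future.join());
            }
            // Each shard is sorted on its own; re-sort the combined list by timestamp.
            merged.sort(Comparator.comparingLong((Row r) -> r.timestamp));
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}
```

A real queryOneKey would also need to handle throttling and result paging, which this sketch leaves out.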

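There is another interesting point here, which recommends moderating reads within a single partition. If you removed the random number from the hash key so that you could fetch all the records in one go, you would most likely run into this problem, regardless of whether you use Scan, Query or BatchGetItem: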

Guidelines for Query and Scan - Avoid Sudden Bursts of Read Activity

"Note that it is not just the burst of capacity units the Scan uses that is a problem. It is also because the scan is likely to consume all of its capacity units from the same partition because the scan requests read items that are next to each other on the partition. This means that the request is hitting the same partition, causing all of its capacity units to be consumed, and throttling other requests to that partition. If the request to read data had been spread across multiple partitions, then the operation would not have throttled a specific partition."

Finally, since you are working with time series data, it may help to look into some of the best practices the documentation recommends:

Understand Access Patterns for Time Series Data

For each table that you create, you specify the throughput requirements. DynamoDB allocates and reserves resources to handle your throughput requirements with sustained low latency. When you design your application and tables, you should consider your application's access pattern to make the most efficient use of your table's resources.

Suppose you design a table to track customer behavior on your site, such as URLs that they click. You might design the table with hash and range type primary key with Customer ID as the hash attribute and date/time as the range attribute. In this application, customer data grows indefinitely over time; however, the applications might show uneven access pattern across all the items in the table where the latest customer data is more relevant and your application might access the latest items more frequently and as time passes these items are less accessed, eventually the older items are rarely accessed. If this is a known access pattern, you could take it into consideration when designing your table schema. Instead of storing all items in a single table, you could use multiple tables to store these items. For example, you could create tables to store monthly or weekly data. For the table storing data from the latest month or week, where data access rate is high, request higher throughput and for tables storing older data, you could dial down the throughput and save on resources.

You can save on resources by storing "hot" items in one table with higher throughput settings, and "cold" items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally backup these tables to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations.

Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
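If you adopt the one-table-per-month layout described in the quoted guidance, the application needs some way to route a timestamp to the right table. Here is a minimal sketch. The naming scheme (base name plus _yyyy_MM), the class name, and the assumption that timestamps are epoch milliseconds interpreted in UTC are all illustrative choices, not something the documentation or the question prescribes.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical helper: maps a timestamp to the monthly table that holds it.
class TimeSeriesTables {
    // Assumption: epochMillis is milliseconds since the epoch, bucketed by UTC month.
    static String tableFor(String baseName, long epochMillis) {
        ZonedDateTime utc = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        // Illustrative naming scheme, e.g. clicks_2014_07
        return String.format("%s_%04d_%02d", baseName, utc.getYear(), utc.getMonthValue());
    }
}
```

A query that spans X and Y in different months would then have to be issued against each month's table and merged on the client.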

Regarding java - DynamoDB Scan, Query and BatchGet, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29674951/
