
elasticsearch - How to configure the number of shards per cluster in Elasticsearch

Reposted · Author: 行者123 · Updated: 2023-11-29 02:47:57

I don't quite understand how shards are configured in ES, and I have a few questions about them:

  1. The number of primary shards is configured via the index.number_of_shards parameter, right?

    So that means the number of shards is configured per index. If so, and I have 2 indexes, will I then have 10 shards on the node?

  2. Suppose I have one node (node 1) configured with 3 shards and 1 replica, and I then create a new node (node 2) in the same cluster with 2 shards. I assume I would then only have replicas of two of the shards, right?

    Also, what happens if node 1 goes down? How does the cluster "know" that it should have 3 shards rather than 2? Since node 2 only holds 2 shards, does that mean I have lost the data of one of the shards from node 1?

Best Answer

First of all, I would start by reading up on indexes, primary shards, replica shards, and nodes, to understand the differences between them:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/glossary.html

Here is a good description:

2.3 Index Basics

The largest single unit of data in elasticsearch is an index. Indexes are logical and physical partitions of documents within elasticsearch. Documents and document types are unique per-index. Indexes have no knowledge of data contained in other indexes. From an operational standpoint, many performance and durability related options are set only at the per-index level. From a query perspective, while elasticsearch supports cross-index searches, in practice it usually makes more organizational sense to design for searches against individual indexes.

Elasticsearch indexes are most similar to the ‘database’ abstraction in the relational world. An elasticsearch index is a fully partitioned universe within a single running server instance. Documents and type mappings are scoped per index, making it safe to re-use names and ids across indexes. Indexes also have their own settings for cluster replication, sharding, custom text analysis, and many other concerns.

Indexes in elasticsearch are not 1:1 mappings to Lucene indexes, they are in fact sharded across a configurable number of Lucene indexes, 5 by default, with 1 replica per shard. A single machine may have a greater or lesser number of shards for a given index than other machines in the cluster. Elasticsearch tries to keep the total data across all indexes about equal on all machines, even if that means that certain indexes may be disproportionately represented on a given machine. Each shard has a configurable number of full replicas, which are always stored on unique instances. If the cluster is not big enough to support the specified number of replicas the cluster’s health will be reported as a degraded ‘yellow’ state. The basic dev setup for elasticsearch, consequently, always thinks that it’s operating in a degraded state given that, with default index settings, a single running instance has no peers to replicate its data to. Note that this has no practical effect on its operation for development purposes. It is, however, recommended that elasticsearch always run on multiple servers in production environments. As a clustered database, many of its data guarantees hinge on multiple nodes being available.

From here: http://exploringelasticsearch.com/modeling_data.html#sec-modeling-index-basics

When you create an index, you tell it how many primary and replica shards to use: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-create-index.html . ES defaults to 5 primary shards and 1 replica per primary, for a total of 10 shards.
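For concreteness, here is a minimal sketch of the settings body you might send with the create-index request linked above (the index name my_index is hypothetical; see the docs for the full set of options):

```json
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}
```

Note that number_of_replicas counts replica copies per primary, so these defaults yield 5 primaries + 5 replicas = 10 shards in total.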

Those shards are balanced across however many nodes the cluster has, with the constraint that a primary and its replica can never sit on the same node. So if you start with 2 nodes and the default of 5 primary shards plus 1 replica per primary, each node holds 5 shards. Add more nodes and the number of shards per node drops; add more indexes and the number of shards per node grows.
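The arithmetic above can be sketched as follows (a simplified model of the ideal balance, not what ES's allocator literally computes):

```python
def total_shards(primaries: int, replicas_per_primary: int) -> int:
    """Primary shards plus all replica copies."""
    return primaries * (1 + replicas_per_primary)


def shards_per_node(primaries: int, replicas_per_primary: int, nodes: int) -> float:
    """The ideal even split; ES's allocator approximates this."""
    return total_shards(primaries, replicas_per_primary) / nodes


# Defaults from the text: 5 primaries, 1 replica each = 10 shards.
print(total_shards(5, 1))        # 10
print(shards_per_node(5, 1, 2))  # 5.0 -> 5 shards on each of 2 nodes
print(shards_per_node(5, 1, 5))  # 2.0 -> adding nodes lowers shards per node
```

Adding a second index with the same settings doubles the shard totals, which is why shards per node rises as you add indexes.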

In every case, the number of nodes must be at least 1 greater than the configured number of replicas. So with 1 replica you should run 2 nodes, so the primary can live on one node and the single replica on the other; otherwise the replicas will not be assigned, and your cluster status will be Yellow. With 2 replicas configured (i.e. 1 primary shard and 2 replica shards) you need at least 3 nodes to keep them all apart, and so on.

Your questions cannot be answered directly, because they rest on incorrect assumptions about how ES works. You don't add a node together with shards: you add a node, and ES then rebalances the existing shards across the whole cluster. You can exercise some control over that process if you want to, but I would not do so until you are much more familiar with ES operations. I would read here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-allocation.html , here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-reroute.html , and here: http://exploringelasticsearch.com/advanced_techniques.html#advanced-routing
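As a rough sketch of what that manual control looks like (index and node names here are hypothetical), the cluster-reroute API from the second link accepts explicit commands such as moving a single shard between nodes:

```json
POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "my_index",
        "shard": 0,
        "from_node": "node1",
        "to_node": "node2"
      }
    }
  ]
}
```

Again, this is an expert-level tool; in normal operation you let the allocator place shards on its own.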

From that last link:

8.1.1 How Elasticsearch Routing Works

Understanding routing is important in large elasticsearch clusters. By exercising fine-grained control over routing the quantity of cluster resources used can be severely reduced, often by orders of magnitude.

The primary mechanism through which elasticsearch scales is sharding. Sharding is a common technique for splitting data and computation across multiple servers, where a property of a document has a function returning a consistent value applied to it in order to determine which server it will be stored on. The value used for this in elasticsearch is the document’s _id field by default. The algorithm used to convert a value to a shard id is what’s known as a consistent hashing algorithm.
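The hash-then-modulo idea can be sketched in a few lines. This is not ES's actual routing function (ES uses a Murmur3-based hash internally); any stable hash illustrates the mechanism:

```python
import zlib


def shard_for(doc_id: str, num_primary_shards: int) -> int:
    """Map a document id to a primary shard id.

    Stand-in for ES's routing function: hash the routing value
    (by default the _id) and take it modulo the shard count.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


# The same id always routes to the same shard:
assert shard_for("doc-42", 5) == shard_for("doc-42", 5)

# Many distinct ids spread across the whole shard range 0..4:
shards = {shard_for(f"doc-{i}", 5) for i in range(1000)}
print(sorted(shards))  # all five shard ids appear
```

This is also why the number of primary shards is fixed at index creation time: changing the modulus would re-map every existing document to a different shard.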

Maintaining good cluster performance is contingent upon even shard balancing. If data is unevenly distributed across a cluster some machines will be over-utilized while others will remain mostly idle. To avoid this, we want as even a distribution of numbers coming out of our consistent hashing algorithm as possible. Document ids hash well generally because they are evenly distributed if they are either UUIDs or monotonically increasing ids (1,2,3,4 …).

This is the default approach, and it generally works well as it solves the problem of evening out data across the cluster. It also means that fetches for a single document only need to be routed to the shard that document hashes to. But what about routing queries? If, for instance, we are storing user history in elasticsearch, and are using UUIDs for each piece of user history data, user data will be stored evenly across the cluster. There’s some waste here, however, in that this means that our searches for that user’s data have poor data locality. Queries must be run on all shards within the index, and run against all possible data. Assuming that we have many users we can likely improve query performance by consistently routing all of a given user’s data to a single shard. Once the user’s data has been so-segmented, we’ll only need to execute across a single shard when performing operations on that user’s data.
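The locality argument above can be demonstrated with the same stand-in hash (again, ES's real routing uses Murmur3; the user and event ids are made up for illustration):

```python
import zlib


def shard_for(routing_value: str, num_primary_shards: int) -> int:
    """Stand-in for ES routing: hash the routing value, modulo shard count."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


NUM_SHARDS = 5

# Default routing: each history entry is hashed by its own unique id,
# so one user's 100 documents scatter across several shards.
default_shards = {shard_for(f"event-{i}", NUM_SHARDS) for i in range(100)}

# Custom routing: hash by the user id instead, so every document for
# "user-7" lands on one shard, and queries for that user touch only it.
routed_shards = {shard_for("user-7", NUM_SHARDS) for _ in range(100)}

print(len(default_shards) > 1, len(routed_shards))  # True 1
```

In real ES this corresponds to supplying a routing value (e.g. the user id) on index and search requests, so a per-user query executes on one shard instead of all of them.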

Regarding elasticsearch - how to configure the number of shards per cluster in Elasticsearch, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/23926644/
