Elastic Search Index- What is that?
Answer will not be as simple as sound. In layman language-
An index is a data structure for storing the mapping of fields to the corresponding documents. The objective is to allow faster searches, often at the expense of increased memory usage and preprocessing time.
Till now developers have worked on RDBMS . They know all about database, table, row/columns etc. I can try to relate with that
- Oracle => Databases => Tables => Columns/Rows
- ElasticSearch => Indices => Types => Documents with Properties
In ElasticSearch cluster can contain multiple Indices (databases), which in turn contain multiple Types (tables). These types hold multiple Documents (rows), and each document has Properties(columns).
So in your car manufacturing scenario, you may have a BMWFactory index. Within this index, you have three different types:
Each type then contains documents that correspond to that type (e.g. a X5 doc lives inside of the Cars type. This doc contains all the details about that particular car).
Searching and querying takes the format of: http://localhost:9200/[index]/[type]/[operation]
So to retrieve the Subaru document, I may do this:
$ curl -XGET localhost:9200/BMWFactory/Cars/X5
Now we clear with , what is Index in elasticSearch. Now If you have to index huge data, then sometime, it is very
time consuming process. Can take hours. Or If you have situation of having nightly batch operation for indexing,
then situation gets more worse.
How can we make performance better
- Make some master nodes, separate from Data nodes as it will reduce load on all your cluster.
- Disable OS swapping, ES takes care of that and Check your heap size on all your machinesHeap Sizing
- Check your documents are of similar size always, you can make use of bulk indexing and tweak you settings in there like chunk_size in number of records or in memory size
- If you are using script try to optimize that as they make the indexing slow, you can store the scripted value if possible as preprocessing, as ES is not designed to handle scripting.
- Check number of shards per node and try to balance that out across nodes using Routing
- Always use the bulk api, which indexes multiple documents in one request, and experiment with the right number of documents to send with each bulk request. The optimal size depends on many factors, but try to err in the direction of too few rather than too many documents. Use concurrent bulk requests with client-side threads or separate asynchronous requests.
- If your node is doing only heavy indexing, be sure indices.memory.index_buffer_size is large enough to give at most ~512 MB indexing buffer per active shard (beyond that indexing performance does not typically improve). Elasticsearch takes that setting (a percentage of the java heap or an absolute byte-size), and divides it equally .
Use modern solid-state disks (SSDs): they are far faster than even the fastest spinning disks. Not only do they have lower latency for random access and higher sequential IO, they are also better at the highly concurrent IO that is required for simultaneous indexing, merging and searching.
By default, Elasticsearch stores the original data in a special _source field. If you do not need it, disable it.
By default, Elasticsearch analyzes the input data of all fields in a special _all field. If you do not need it, disable it.
If you are using the _source field, there is no additional value in setting any other field to _stored.
If you are not using the _source field, only set those fields to _stored that you need to. Note, however, that using _source brings certain advantages, such as the ability to use the update API.
If your client speaks Java, consider using the NodeClient. A NodeClient joins the cluster and knows which nodes to address for certain requests, possibly saving one hop when compared to other clients. If you cannot use the NodeClient, e.g., due to security restrictions, see if you can use TransportClient before considering something else.
When the index manager send a node an index request to process, the node updates its own mapping and then sends that mapping to the master. While the master processes it, that node receives a state that includes an older version of the mapping. If there’s a conflict, it’s not bad (i.e. the cluster state will eventually have the correct mapping), but we send a refresh just in case from that node to the master. In order to make the index request more efficient, we have set this property on our data nodes.
The cluster.routing.allocation.cluster_concurrent_rebalance property determines the number of shards allowed for concurrent rebalance. This property needs to be set appropriately depending on the hardware being used, for example the number of CPUs, IO capacity, etc. If this property is not set appropriately, it can impact the ElasticSearch performance with indexing.
ElasticSearch node has several thread pools in order to improve how threads are managed within a node. At Loggly, we use bulk request extensively, and we have found that setting the right value for bulk thread pool using threadpool.bulk.queue_size property is crucial in order to avoid data loss or _bulk retries
- ElasticSearch node has several thread pools in order to improve how threads are managed within a node. At Loggly, we use bulk request extensively, and we have found that setting the right value for bulk thread pool using threadpool.bulk.queue_size property is crucial in order to avoid data loss or _bulk retries
Apart from there are many other ES configuration for better performance.The depth of configuration properties available in ElasticSearch as been a huge benefit to Loggly since our use cases take ElasticSearch to the edge of its design parameters.
Happy Indexing with Vinay in techartifact.
Some of information gathered from Internet.