Querying data in elastic search

Hi All,

A common issue in elastic search as querying data from ES server. As typical RDBMS server, we can write like

“select * from tablename where columnA is null”

Similarly, if we have to find this data in ES in kibana , then how we can write it

It will be like

GET IndexName/TypeName/_search
{
"query": {
"bool": {
"must_not": {
"exists": {
"field": "object_desc"
}
}
}
}
}

Similarly we can write this in java API ES 2.2 as below

QueryBuilder query = QueryBuilders.boolQuery()
.mustNot(QueryBuilders.existsQuery("object_desc"));

Client client = getClient();
SearchResponse response = client.prepareSearch(indexName)
.setTypes(type)
.setSearchType(SearchType.QUERY_AND_FETCH)
.setQuery(query)
.setFrom(0)
.setSize(10000)
.setExplain(false)
.execute()
.actionGet();
SearchHit[] searchHits = response.getHits().getHits();

ES query API is quite good and you can test in kibana console.

Happy searching with techartifact…

Shards and Replicas in Elasticsearch

Shards in Elastic Search- When we have a large number of documents, we may come to a point where a single node may not be enough—for example, because of RAM limitations, hard disk capacity, insufficient processing power, and inability to respond to client requests fast enough. In such a case, data can be divided into smaller parts called shards (where each shard is a separate Apache Lucene index). Each shard can be placed on a different server, and thus, your data can be spread among the cluster nodes. When you query an index that is built from multiple shards, Elasticsearch sends the query to each relevant shard and merges the result in such a way that your application doesn’t know about the shards. In addition to this, having multiple shards can speed up the indexing.

clustering allows us to store information volumes that exceed abilities of a single server. To achieve this requirement, ElasticSearch spread data to several physical Lucene indices. Those Lucene indices are called shards and the process of this spreading is called sharding. ElasticSearch can do this automatically and all parts of the index (shards) are visible to the user as one-big index. Note that besides this automation, it is crucial to tune this mechanism for particular use case because the number of shard index is built or is configured during index creation and cannot be changed later, at least currently.

So if you have an index with 100 documents and a cluster with 2 nodes, each node will hold 50 documents if the shard_number is 2. (Ignoring replicas of course)
That’s a little of the “infinite scaling magic ” because each machine in your cluster only have to deal with some pieces of your data.

 

Replica

In order to increase query throughput or achieve high availability, shard replicas can be used. A replica is just an exact copy of the shard, and each shard can have zero or more replicas. In other words, Elasticsearch can have many identical shards and one of them is automatically chosen as a place where the operations that change the index are directed. This special shard is called a primary shard, and the others are called replica shards. When the primary shard is lost (for example, a server holding the shard data is unavailable), the cluster will promote the replica to be the new primary shard.

Sharing allows us to push more data into ElasticSearch that is possible for a single node to handle. Replicas can help where load increases and a single node is not able to handle all the requests. The idea is simple: create additional copy of a shard, which can be used for queries just as original, primary shard. Note that we get safety for free. If the server with the shard is gone, ElasticSearch can use replica and no data is lost. Replicas can be added and removed at any time, so you can adjust their numbers when needed..

Replicas can be added or removed at runtime—primaries can’t You can change the number of replicas per shard at any time because replicas can always be created or removed. This doesn’t apply to the number of primary shards an index is divided into; you have to decide on the number of shards before creating the index. Keep in mind that too few shards limit how much you can scale, but too many shards impact performance. The default setting of five is typically a good start

 

A node is an instance of Elasticsearch. When you start Elasticsearch on your server, you have a node. If you start Elasticsearch on another server, it’s another node. You can even have more nodes on the same server by starting multiple Elasticsearch processes. Multiple nodes can join the same cluster. As we’ll discuss later in this chapter, starting nodes with the same cluster name and otherwise default settings is enough to make a cluster. With a cluster of multiple nodes, the same data can be spread across multiple servers. This helps performance because Elasticsearch has more resources to work with. It also helps reliability: if you have at least one replica per shard, any node can disappear and Elasticsearch will still serve you all the data. For an application that’s using Elasticsearch, having one or more nodes in a cluster is transparent. By default, you can connect to any node from the cluster and work with the whole data just as if you had a single node. Although clustering is good for performance and availability, it has its disadvantages: you have to make sure nodes can communicate with each other quickly enough and that you won’t have a split brain (two parts of the cluster that can’t communicate and think the other part dropped out). To address such issues,

WHAT HAPPENS WHEN YOU SEARCH AN INDEX?

 

When you search an index, Elasticsearch has to look in a complete set of shards for that index Those shards can be either primary or replicas because primary and replica shards typically contain the same documents. Elasticsearch distributes the search load between the primary and replica shards of the index you’re searching, making replicas useful for both search performance and fault tolerance. Next we’ll look at the details of what primary and replica shards are and how they’re allocated in an Elasticsearch cluster.

shards

 

Happy Sharding in elastic Search with Vinay…..  🙂

Elastic Search Index and Performance tuning tips- Part 2

Elastic Search Index- What is that?

Answer will not be as simple as sound. In layman language-

An index is a data structure for storing the mapping of fields to the corresponding documents. The objective is to allow faster searches, often at the expense of increased memory usage and preprocessing time.

Till now developers have worked on RDBMS . They know all about database, table, row/columns etc. I can try to relate with that

  • Oracle => Databases => Tables => Columns/Rows
  • ElasticSearch => Indices => Types => Documents with Properties

In ElasticSearch cluster can contain multiple Indices (databases), which in turn contain multiple Types (tables). These types hold multiple Documents (rows), and each document has Properties(columns).

So in your car manufacturing scenario, you may have a BMWFactory index. Within this index, you have three different types:

  • Employee
  • Cars
  • Spare_Parts

Each type then contains documents that correspond to that type (e.g. a X5 doc lives inside of the Cars type. This doc contains all the details about that particular car).

Searching and querying takes the format of: http://localhost:9200/[index]/[type]/[operation]

So to retrieve the Subaru document, I may do this:

$ curl -XGET localhost:9200/BMWFactory/Cars/X5


Now we clear with , what is Index in elasticSearch. Now If you have to index huge data, then sometime, it is very
time consuming process. Can take hours. Or If you have situation of having nightly batch operation for indexing,
then situation gets more worse.


How can we make performance better

  1. Make some master nodes, separate from Data nodes as it will reduce load on all your cluster.
  2. Disable OS swapping, ES takes care of that and Check your heap size on all your machinesHeap Sizing
  3. Check your documents are of similar size always, you can make use of bulk indexing and tweak you settings in there like chunk_size in number of records or in memory size
  4. If you are using script try to optimize that as they make the indexing slow, you can store the scripted value if possible as preprocessing, as ES is not designed to handle scripting.
  5. Check number of shards per node and try to balance that out across nodes using Routing
  6. Always use the bulk api, which indexes multiple documents in one request, and experiment with the right number of documents to send with each bulk request. The optimal size depends on many factors, but try to err in the direction of too few rather than too many documents. Use concurrent bulk requests with client-side threads or separate asynchronous requests.
  7. If your node is doing only heavy indexing, be sure indices.memory.index_buffer_size is large enough to give at most ~512 MB indexing buffer per active shard (beyond that indexing performance does not typically improve). Elasticsearch takes that setting (a percentage of the java heap or an absolute byte-size), and divides it equally .
  8. Use modern solid-state disks (SSDs): they are far faster than even the fastest spinning disks. Not only do they have lower latency for random access and higher sequential IO, they are also better at the highly concurrent IO that is required for simultaneous indexing, merging and searching.
  9. Do not place the index on a remotely mounted filesystem (e.g. NFS or SMB/CIFS); use storage local to the machine instead.
  10. By default, Elasticsearch stores the original data in a special _source field. If you do not need it, disable it.
  11. By default, Elasticsearch analyzes the input data of all fields in a special _all field. If you do not need it, disable it.
  12. If you are using the _source field, there is no additional value in setting any other field to _stored.
  13. If you are not using the _source field, only set those fields to _stored that you need to. Note, however, that using _source brings certain advantages, such as the ability to use the update API.
  14. If your client speaks Java, consider using the NodeClient. A NodeClient joins the cluster and knows which nodes to address for certain requests, possibly saving one hop when compared to other clients. If you cannot use the NodeClient, e.g., due to security restrictions, see if you can use TransportClient before considering something else.
  15. When the index manager send a node an index request to process, the node updates its own mapping and then sends that mapping to the master. While the master processes it, that node receives a state that includes an older version of the mapping. If there’s a conflict, it’s not bad (i.e. the cluster state will eventually have the correct mapping), but we send a refresh just in case from that node to the master. In order to make the index request more efficient, we have set this property on our data nodes.

    indices.cluster.send_refresh_mapping: false

  16. The cluster.routing.allocation.cluster_concurrent_rebalance property determines the number of shards allowed for concurrent rebalance. This property needs to be set appropriately depending on the hardware being used, for example the number of CPUs, IO capacity, etc. If this property is not set appropriately, it can impact the ElasticSearch performance with indexing.

    cluster.routing.allocation.cluster_concurrent_rebalance:2

  17. ElasticSearch node has several thread pools in order to improve how threads are managed within a node. At Loggly, we use bulk request extensively, and we have found that  setting the right value for bulk thread pool using threadpool.bulk.queue_size property is crucial in order to avoid data loss or _bulk retries

    threadpool.bulk.queue_size: 3000

  18. ElasticSearch node has several thread pools in order to improve how threads are managed within a node. At Loggly, we use bulk request extensively, and we have found that  setting the right value for bulk thread pool using threadpool.bulk.queue_size property is crucial in order to avoid data loss or _bulk retries


threadpool.bulk.queue_size: 3000

Apart from there are many other ES configuration for better performance.The depth of configuration properties available in ElasticSearch as been a huge benefit to Loggly since our use cases take ElasticSearch to the edge of its design parameters.

Happy Indexing with Vinay in techartifact.

Some of information gathered from Internet.

Ref- https://www.elastic.co/guide/en/elasticsearch/guide/current/index-doc.html

https://www.elastic.co/blog/performance-considerations-elasticsearch-indexing