Shards and Replicas in Elasticsearch

Shards in Elastic Search- When we have a large number of documents, we may come to a point where a single node may not be enough—for example, because of RAM limitations, hard disk capacity, insufficient processing power, and inability to respond to client requests fast enough. In such a case, data can be divided into smaller parts called shards (where each shard is a separate Apache Lucene index). Each shard can be placed on a different server, and thus, your data can be spread among the cluster nodes. When you query an index that is built from multiple shards, Elasticsearch sends the query to each relevant shard and merges the result in such a way that your application doesn’t know about the shards. In addition to this, having multiple shards can speed up the indexing.

clustering allows us to store information volumes that exceed abilities of a single server. To achieve this requirement, ElasticSearch spread data to several physical Lucene indices. Those Lucene indices are called shards and the process of this spreading is called sharding. ElasticSearch can do this automatically and all parts of the index (shards) are visible to the user as one-big index. Note that besides this automation, it is crucial to tune this mechanism for particular use case because the number of shard index is built or is configured during index creation and cannot be changed later, at least currently.

So if you have an index with 100 documents and a cluster with 2 nodes, each node will hold 50 documents if the shard_number is 2. (Ignoring replicas of course)
That’s a little of the “infinite scaling magic ” because each machine in your cluster only have to deal with some pieces of your data.

 

Replica

In order to increase query throughput or achieve high availability, shard replicas can be used. A replica is just an exact copy of the shard, and each shard can have zero or more replicas. In other words, Elasticsearch can have many identical shards and one of them is automatically chosen as a place where the operations that change the index are directed. This special shard is called a primary shard, and the others are called replica shards. When the primary shard is lost (for example, a server holding the shard data is unavailable), the cluster will promote the replica to be the new primary shard.

Sharing allows us to push more data into ElasticSearch that is possible for a single node to handle. Replicas can help where load increases and a single node is not able to handle all the requests. The idea is simple: create additional copy of a shard, which can be used for queries just as original, primary shard. Note that we get safety for free. If the server with the shard is gone, ElasticSearch can use replica and no data is lost. Replicas can be added and removed at any time, so you can adjust their numbers when needed..

Replicas can be added or removed at runtime—primaries can’t You can change the number of replicas per shard at any time because replicas can always be created or removed. This doesn’t apply to the number of primary shards an index is divided into; you have to decide on the number of shards before creating the index. Keep in mind that too few shards limit how much you can scale, but too many shards impact performance. The default setting of five is typically a good start

 

A node is an instance of Elasticsearch. When you start Elasticsearch on your server, you have a node. If you start Elasticsearch on another server, it’s another node. You can even have more nodes on the same server by starting multiple Elasticsearch processes. Multiple nodes can join the same cluster. As we’ll discuss later in this chapter, starting nodes with the same cluster name and otherwise default settings is enough to make a cluster. With a cluster of multiple nodes, the same data can be spread across multiple servers. This helps performance because Elasticsearch has more resources to work with. It also helps reliability: if you have at least one replica per shard, any node can disappear and Elasticsearch will still serve you all the data. For an application that’s using Elasticsearch, having one or more nodes in a cluster is transparent. By default, you can connect to any node from the cluster and work with the whole data just as if you had a single node. Although clustering is good for performance and availability, it has its disadvantages: you have to make sure nodes can communicate with each other quickly enough and that you won’t have a split brain (two parts of the cluster that can’t communicate and think the other part dropped out). To address such issues,

WHAT HAPPENS WHEN YOU SEARCH AN INDEX?

 

When you search an index, Elasticsearch has to look in a complete set of shards for that index Those shards can be either primary or replicas because primary and replica shards typically contain the same documents. Elasticsearch distributes the search load between the primary and replica shards of the index you’re searching, making replicas useful for both search performance and fault tolerance. Next we’ll look at the details of what primary and replica shards are and how they’re allocated in an Elasticsearch cluster.

shards

 

Happy Sharding in elastic Search with Vinay…..  🙂

Elastic Search-Beginner Tutorial for Oracle FMW users

ElasticSearch is an Open Source (Apache 2), Distributed Search Engine built on top of Apache Lucene.

Elasticsearch is a NOSQL, distributed full text database. Which means that this database is document based instead of using tables or schema, we use documents. Elasticsearch is much more than just Lucene and much more than “just” full text search. It is also:A distributed real-time document store where every field is indexed and searchable. A distributed search engine with real-time analytics. Capable of scaling to hundreds of servers and petabytes(Figure -1) of structured and unstructured data .

History – The project was started in 2010 by Shay Banon. Shay wanted to create a storage and search engine that would be easy to operate. Elasticsearch is based on the Lucene engine on top of which Shay added an http rest interface which resulted in a distributed search engine that is incredibly easy to scale and returns results at lightning speed

Need of Elasticsearch As a developer or business guy who is used to traditional relational databases, we often face challenges to find information in millions of record in rdbms table. Suppose developer had to search in millions of record in table with 100’s of column. Think about the search time. I am sure, who tried, and they frustrated to build a fast system. Situation gets worst, when it needed to search tables that had millions of records, resulting in overly complex database views/stored-procedures and adding full text search on relational database fields. Something which I personally dislike, as it made the database twice the size and the speed was not optimal either. Relational databases are simply not built for such operations.

In normal RDBMS table, we try searching like searchParam

Select * from tableName WHERE columnName LIKE ‘%searchParam %’;

I am sure, by like this, you can’t search everything and it’s not performance optimal.
Now with Elasticsearch we can achieve the speed we would like, as it lets us index millions of documents. Now definitely, we need a system of something, in which we can make search faster.

Real usecase-

Elasticsearch can be used for various usage, for example it can be used as a blog storage engine in case you would like your blog to be searchable. Traditional SQL doesn’t readily give you the means to do that.
How about Analytics tools? Most software generates tons of data that is worth analyzing, Elasticsearch comes with Logstash and Kibanato give you a full analytics system.
Finally, I like to see Elasticsearch as Data ware house, where you have documents with many different attributes and non-predictable schemas. Since Elasticsearch is schemaless, it won’t matter that you store various documents there, you will still be able to search them easily and quickly.On the other hand having a powerful tool like Kibana would allow you to have a custom dashboard that gives the opportunity for non-technical managers to view and analyze this data.

For me real use case to build search engine with over 5 erp for my organization. A search engine, where we can search almost everything within company different ERP system. In my organization,we have 5 different ERP system, which caters different use case for different business unit. So there are 5 data source, in which I need to make search. User will have one search field, on which they can search anything without mentioning what they want to search it.Portal should display all result based on search param to user. Sounds interesting? J Believe me it’s more challenging than, it sounds interesting 😛

ElasticSearch can be good fit here. We can get all information from 5 different ERP and do data indexing and after that search will be very faster and awesome , same as google.

How Elasticsearch saves data?
Elasticsearch does not have tables, and a schema is not required. Elasticsearch stores data documents that consist of JSON strings inside an index.

The field is like the columns of the SQL database and the value represents the data in the row cells.

When you save a document in Elasticsearch, you save it in an index. An

ElasticSearch is a great open source search engine built on top of Apache Lucene. Its features and upgrades allow it to basically function just like a schema-less JSON datastore that can be accessed using both search queries and regular database CRUD commands.

Here are the main “disadvantages” I see:

  • Transactions – There is no support for transactions or processing on data manipulation.
  • Data Availability – ES makes data available in “near real-time” which may require additional considerations in your application (ie: comments page where a user adds new comment, refreshing the page might not actually show the new post because the index is still updating).
  • Durability – ES is distributed and fairly stable but backups and durability are not as high priority as in other data stores. ElasticSearch has come a long way in the past few years since this original answer and now has better features, backup methods and even realtime indexing. Please review the official site for more information.

If you can deal with these issues then there’s certainly no reason why you can’t use ElasticSearch as your primary data store. It can actually lower complexity and improve performance by not having to duplicate your data but again this depends on your specific use case.

index is like a database in relational database. An index is saved across multiple shards and shards are then stored in one or more servers which are called nodes, multiple nodes form a cluster.


You can download the latest version of Elasticsearch from elasticsearch.org/download.

curl -L -O http://download.elasticsearch.org/PATH/TO/VERSION.zip
unzip elasticsearch-$VERSION.zip
cd  elasticsearch-$VERSION

Elasticsearch is now ready to run. You can start it up in the foreground with:

./bin/elasticsearch

Add -d if you want to run it in the background as a daemon.

Test it out by opening another terminal window and running:

curl ‘http://localhost:9200/?pretty’

You should see a response like this:

{

   “status”: 200,

   “name”: “Shrunken Bones”,

   “version”: {

      “number”: “1.4.0”,

      “lucene_version”: “4.10”

   },

   “tagline”: “You Know, for Search”

}

This means that your Elasticsearch cluster is up and running, and we can start experimenting with it.

Clusters and nodes

A node is a running instance of Elasticsearch. A cluster is a group of nodes with the same cluster.name that are working together to share data and to provide failover and scale, although a single node can form a cluster all by itself.


You should change the default cluster.name to something appropriate to you, like your own name, to stop your nodes from trying to join another cluster on the same network with the same name!

You can do this by editing the elasticsearch.yml file in the config/ directory, then restarting Elasticsearch. When Elasticsearch is running in the foreground, you can stop it by pressing Ctrl-C, otherwise you can shut it down with the api

curl –XPOST ‘http://localhost:9200/_shutdown’

Some of common terminologies in elastic search-

Cluster

A cluster is a collection of one or more nodes (servers) that together holds your entire data and provides federated indexing and search capabilities across all nodes. A cluster is identified by a unique name which by default is “elasticsearch”. This name is important because a node can only be part of a cluster if the node is set up to join the cluster by its name.

Node

A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities. Just like a cluster, a node is identified by a name which by default is a random Marvel character name that is assigned to the node at startup. You can define any node name you want if you do not want the default. This name is important for administration purposes where you want to identify which servers in your network correspond to which nodes in your Elasticsearch cluster.

Index

An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data. You will provide different index name for different data. An index is identified by a name and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.

Shards & Replicas

An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.

To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent “index” that can be hosted on any node in the cluster.

Now talking about how elastic search can help in Oracle ADF/WebCenter or fusion middleware technologies.

I will be publishing series of tutorial for elasticSearch. Will try to show, how we can use with ADF/Webcenter portal as well.Following architecture can be used in ADF/WebCenter Portal Application.

elasticsearchadf


Elastic Search is very fast in comparison to ADF Search. In ADF, you can search using query panel or custom search box. With this you can combine all data on backend and search with one inputText to all tables schema or all backend data. Search response time is in ms. In WebCenter Portal Oracle Secure Enterprise Search is also going to replace with Elastic Search. So its good time to get into this.

Elastic Search can also be used in WebCenter content to searching documents.  See one of great demo by Team Informatics in this youtube channel.

We can also use Apache Kafka with elastic search for real time data ingestion and searching. I will cover in coming post.

Till then, Happy searching by elasticSearch with Vinay in techartifact. Data source can be webservices, data base or live streaming data. In this, we have schedular to bring data or ingest data in elastic search server. ADF/WebCenter Portal application can consume data using querying into Elastic Search Server.


Ref – https://www.elastic.co/guide/en/elasticsearch/reference/2.2/_basic_concepts.html  http://joelabrahamsson.com/elasticsearch-101/