Elastic Search Index and Performance tuning tips- Part 2

Elastic Search Index- What is that?

Answer will not be as simple as sound. In layman language-

An index is a data structure for storing the mapping of fields to the corresponding documents. The objective is to allow faster searches, often at the expense of increased memory usage and preprocessing time.

Till now developers have worked on RDBMS . They know all about database, table, row/columns etc. I can try to relate with that

  • Oracle => Databases => Tables => Columns/Rows
  • ElasticSearch => Indices => Types => Documents with Properties

In ElasticSearch cluster can contain multiple Indices (databases), which in turn contain multiple Types (tables). These types hold multiple Documents (rows), and each document has Properties(columns).

So in your car manufacturing scenario, you may have a BMWFactory index. Within this index, you have three different types:

  • Employee
  • Cars
  • Spare_Parts

Each type then contains documents that correspond to that type (e.g. a X5 doc lives inside of the Cars type. This doc contains all the details about that particular car).

Searching and querying takes the format of: http://localhost:9200/[index]/[type]/[operation]

So to retrieve the Subaru document, I may do this:

$ curl -XGET localhost:9200/BMWFactory/Cars/X5


Now we clear with , what is Index in elasticSearch. Now If you have to index huge data, then sometime, it is very
time consuming process. Can take hours. Or If you have situation of having nightly batch operation for indexing,
then situation gets more worse.


How can we make performance better

  1. Make some master nodes, separate from Data nodes as it will reduce load on all your cluster.
  2. Disable OS swapping, ES takes care of that and Check your heap size on all your machinesHeap Sizing
  3. Check your documents are of similar size always, you can make use of bulk indexing and tweak you settings in there like chunk_size in number of records or in memory size
  4. If you are using script try to optimize that as they make the indexing slow, you can store the scripted value if possible as preprocessing, as ES is not designed to handle scripting.
  5. Check number of shards per node and try to balance that out across nodes using Routing
  6. Always use the bulk api, which indexes multiple documents in one request, and experiment with the right number of documents to send with each bulk request. The optimal size depends on many factors, but try to err in the direction of too few rather than too many documents. Use concurrent bulk requests with client-side threads or separate asynchronous requests.
  7. If your node is doing only heavy indexing, be sure indices.memory.index_buffer_size is large enough to give at most ~512 MB indexing buffer per active shard (beyond that indexing performance does not typically improve). Elasticsearch takes that setting (a percentage of the java heap or an absolute byte-size), and divides it equally .
  8. Use modern solid-state disks (SSDs): they are far faster than even the fastest spinning disks. Not only do they have lower latency for random access and higher sequential IO, they are also better at the highly concurrent IO that is required for simultaneous indexing, merging and searching.
  9. Do not place the index on a remotely mounted filesystem (e.g. NFS or SMB/CIFS); use storage local to the machine instead.
  10. By default, Elasticsearch stores the original data in a special _source field. If you do not need it, disable it.
  11. By default, Elasticsearch analyzes the input data of all fields in a special _all field. If you do not need it, disable it.
  12. If you are using the _source field, there is no additional value in setting any other field to _stored.
  13. If you are not using the _source field, only set those fields to _stored that you need to. Note, however, that using _source brings certain advantages, such as the ability to use the update API.
  14. If your client speaks Java, consider using the NodeClient. A NodeClient joins the cluster and knows which nodes to address for certain requests, possibly saving one hop when compared to other clients. If you cannot use the NodeClient, e.g., due to security restrictions, see if you can use TransportClient before considering something else.
  15. When the index manager send a node an index request to process, the node updates its own mapping and then sends that mapping to the master. While the master processes it, that node receives a state that includes an older version of the mapping. If there’s a conflict, it’s not bad (i.e. the cluster state will eventually have the correct mapping), but we send a refresh just in case from that node to the master. In order to make the index request more efficient, we have set this property on our data nodes.

    indices.cluster.send_refresh_mapping: false

  16. The cluster.routing.allocation.cluster_concurrent_rebalance property determines the number of shards allowed for concurrent rebalance. This property needs to be set appropriately depending on the hardware being used, for example the number of CPUs, IO capacity, etc. If this property is not set appropriately, it can impact the ElasticSearch performance with indexing.

    cluster.routing.allocation.cluster_concurrent_rebalance:2

  17. ElasticSearch node has several thread pools in order to improve how threads are managed within a node. At Loggly, we use bulk request extensively, and we have found that  setting the right value for bulk thread pool using threadpool.bulk.queue_size property is crucial in order to avoid data loss or _bulk retries

    threadpool.bulk.queue_size: 3000

  18. ElasticSearch node has several thread pools in order to improve how threads are managed within a node. At Loggly, we use bulk request extensively, and we have found that  setting the right value for bulk thread pool using threadpool.bulk.queue_size property is crucial in order to avoid data loss or _bulk retries


threadpool.bulk.queue_size: 3000

Apart from there are many other ES configuration for better performance.The depth of configuration properties available in ElasticSearch as been a huge benefit to Loggly since our use cases take ElasticSearch to the edge of its design parameters.

Happy Indexing with Vinay in techartifact.

Some of information gathered from Internet.

Ref- https://www.elastic.co/guide/en/elasticsearch/guide/current/index-doc.html

https://www.elastic.co/blog/performance-considerations-elasticsearch-indexing


Elastic Search-Beginner Tutorial for Oracle FMW users

ElasticSearch is an Open Source (Apache 2), Distributed Search Engine built on top of Apache Lucene.

Elasticsearch is a NOSQL, distributed full text database. Which means that this database is document based instead of using tables or schema, we use documents. Elasticsearch is much more than just Lucene and much more than “just” full text search. It is also:A distributed real-time document store where every field is indexed and searchable. A distributed search engine with real-time analytics. Capable of scaling to hundreds of servers and petabytes(Figure -1) of structured and unstructured data .

History – The project was started in 2010 by Shay Banon. Shay wanted to create a storage and search engine that would be easy to operate. Elasticsearch is based on the Lucene engine on top of which Shay added an http rest interface which resulted in a distributed search engine that is incredibly easy to scale and returns results at lightning speed

Need of Elasticsearch As a developer or business guy who is used to traditional relational databases, we often face challenges to find information in millions of record in rdbms table. Suppose developer had to search in millions of record in table with 100’s of column. Think about the search time. I am sure, who tried, and they frustrated to build a fast system. Situation gets worst, when it needed to search tables that had millions of records, resulting in overly complex database views/stored-procedures and adding full text search on relational database fields. Something which I personally dislike, as it made the database twice the size and the speed was not optimal either. Relational databases are simply not built for such operations.

In normal RDBMS table, we try searching like searchParam

Select * from tableName WHERE columnName LIKE ‘%searchParam %’;

I am sure, by like this, you can’t search everything and it’s not performance optimal.
Now with Elasticsearch we can achieve the speed we would like, as it lets us index millions of documents. Now definitely, we need a system of something, in which we can make search faster.

Real usecase-

Elasticsearch can be used for various usage, for example it can be used as a blog storage engine in case you would like your blog to be searchable. Traditional SQL doesn’t readily give you the means to do that.
How about Analytics tools? Most software generates tons of data that is worth analyzing, Elasticsearch comes with Logstash and Kibanato give you a full analytics system.
Finally, I like to see Elasticsearch as Data ware house, where you have documents with many different attributes and non-predictable schemas. Since Elasticsearch is schemaless, it won’t matter that you store various documents there, you will still be able to search them easily and quickly.On the other hand having a powerful tool like Kibana would allow you to have a custom dashboard that gives the opportunity for non-technical managers to view and analyze this data.

For me real use case to build search engine with over 5 erp for my organization. A search engine, where we can search almost everything within company different ERP system. In my organization,we have 5 different ERP system, which caters different use case for different business unit. So there are 5 data source, in which I need to make search. User will have one search field, on which they can search anything without mentioning what they want to search it.Portal should display all result based on search param to user. Sounds interesting? J Believe me it’s more challenging than, it sounds interesting 😛

ElasticSearch can be good fit here. We can get all information from 5 different ERP and do data indexing and after that search will be very faster and awesome , same as google.

How Elasticsearch saves data?
Elasticsearch does not have tables, and a schema is not required. Elasticsearch stores data documents that consist of JSON strings inside an index.

The field is like the columns of the SQL database and the value represents the data in the row cells.

When you save a document in Elasticsearch, you save it in an index. An

ElasticSearch is a great open source search engine built on top of Apache Lucene. Its features and upgrades allow it to basically function just like a schema-less JSON datastore that can be accessed using both search queries and regular database CRUD commands.

Here are the main “disadvantages” I see:

  • Transactions – There is no support for transactions or processing on data manipulation.
  • Data Availability – ES makes data available in “near real-time” which may require additional considerations in your application (ie: comments page where a user adds new comment, refreshing the page might not actually show the new post because the index is still updating).
  • Durability – ES is distributed and fairly stable but backups and durability are not as high priority as in other data stores. ElasticSearch has come a long way in the past few years since this original answer and now has better features, backup methods and even realtime indexing. Please review the official site for more information.

If you can deal with these issues then there’s certainly no reason why you can’t use ElasticSearch as your primary data store. It can actually lower complexity and improve performance by not having to duplicate your data but again this depends on your specific use case.

index is like a database in relational database. An index is saved across multiple shards and shards are then stored in one or more servers which are called nodes, multiple nodes form a cluster.


You can download the latest version of Elasticsearch from elasticsearch.org/download.

curl -L -O http://download.elasticsearch.org/PATH/TO/VERSION.zip
unzip elasticsearch-$VERSION.zip
cd  elasticsearch-$VERSION

Elasticsearch is now ready to run. You can start it up in the foreground with:

./bin/elasticsearch

Add -d if you want to run it in the background as a daemon.

Test it out by opening another terminal window and running:

curl ‘http://localhost:9200/?pretty’

You should see a response like this:

{

   “status”: 200,

   “name”: “Shrunken Bones”,

   “version”: {

      “number”: “1.4.0”,

      “lucene_version”: “4.10”

   },

   “tagline”: “You Know, for Search”

}

This means that your Elasticsearch cluster is up and running, and we can start experimenting with it.

Clusters and nodes

A node is a running instance of Elasticsearch. A cluster is a group of nodes with the same cluster.name that are working together to share data and to provide failover and scale, although a single node can form a cluster all by itself.


You should change the default cluster.name to something appropriate to you, like your own name, to stop your nodes from trying to join another cluster on the same network with the same name!

You can do this by editing the elasticsearch.yml file in the config/ directory, then restarting Elasticsearch. When Elasticsearch is running in the foreground, you can stop it by pressing Ctrl-C, otherwise you can shut it down with the api

curl –XPOST ‘http://localhost:9200/_shutdown’

Some of common terminologies in elastic search-

Cluster

A cluster is a collection of one or more nodes (servers) that together holds your entire data and provides federated indexing and search capabilities across all nodes. A cluster is identified by a unique name which by default is “elasticsearch”. This name is important because a node can only be part of a cluster if the node is set up to join the cluster by its name.

Node

A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities. Just like a cluster, a node is identified by a name which by default is a random Marvel character name that is assigned to the node at startup. You can define any node name you want if you do not want the default. This name is important for administration purposes where you want to identify which servers in your network correspond to which nodes in your Elasticsearch cluster.

Index

An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data. You will provide different index name for different data. An index is identified by a name and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.

Shards & Replicas

An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.

To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent “index” that can be hosted on any node in the cluster.

Now talking about how elastic search can help in Oracle ADF/WebCenter or fusion middleware technologies.

I will be publishing series of tutorial for elasticSearch. Will try to show, how we can use with ADF/Webcenter portal as well.Following architecture can be used in ADF/WebCenter Portal Application.

elasticsearchadf


Elastic Search is very fast in comparison to ADF Search. In ADF, you can search using query panel or custom search box. With this you can combine all data on backend and search with one inputText to all tables schema or all backend data. Search response time is in ms. In WebCenter Portal Oracle Secure Enterprise Search is also going to replace with Elastic Search. So its good time to get into this.

Elastic Search can also be used in WebCenter content to searching documents.  See one of great demo by Team Informatics in this youtube channel.

We can also use Apache Kafka with elastic search for real time data ingestion and searching. I will cover in coming post.

Till then, Happy searching by elasticSearch with Vinay in techartifact. Data source can be webservices, data base or live streaming data. In this, we have schedular to bring data or ingest data in elastic search server. ADF/WebCenter Portal application can consume data using querying into Elastic Search Server.


Ref – https://www.elastic.co/guide/en/elasticsearch/reference/2.2/_basic_concepts.html  http://joelabrahamsson.com/elasticsearch-101/