Elasticsearch Monitoring

July 27, 2019
By MYSQLGYAN

MONITORING ELASTICSEARCH

Elasticsearch exposes many metrics that can help you detect signs of trouble and take action when you face problems such as unreliable nodes, out-of-memory errors, and long garbage collection times.
A few key areas to monitor are:

1) Cluster Health: Shards and Nodes

An Elasticsearch cluster can consist of one or more nodes.
A node is a member of the cluster, hosted on an individual server.
Adding nodes is what allows us to scale the cluster horizontally.
Indexes organize the data within the cluster. An index is a collection of documents that share similar characteristics.

With large datasets, the size of an index can exceed the storage capacity of a single node.
We also want redundant copies of our index in case something happens to a node. Elasticsearch handles both needs by dividing an index into a defined number of shards and keeping replica copies of those shards.

Elasticsearch distributes the shards across the nodes in the cluster. In versions before 7.0, a new index defaults to five primary shards with one replica each; from 7.0 onward the default is one primary shard with one replica. With the older default, the index is divided into five primary shards, and each primary has a single replica that is always allocated to a different node than the primary.
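To make the shard arithmetic concrete, here is a minimal Python sketch. The function names are illustrative assumptions and not part of any Elasticsearch API; the only rule it relies on is that a replica is never allocated to the same node as its primary.

```python
def total_shard_copies(primaries: int, replicas: int) -> int:
    """Return how many shard copies (primaries plus replicas) an index needs."""
    if primaries < 1 or replicas < 0:
        raise ValueError("primaries must be >= 1 and replicas must be >= 0")
    # Every primary shard has `replicas` additional copies.
    return primaries * (1 + replicas)


def unassigned_on_single_node(primaries: int, replicas: int) -> int:
    """On a one-node cluster only the primaries can be allocated,
    so every replica copy stays unassigned."""
    return total_shard_copies(primaries, replicas) - primaries


if __name__ == "__main__":
    # Five primaries with one replica each.
    print(total_shard_copies(5, 1))
    # The same index on a single-node cluster.
    print(unassigned_on_single_node(5, 1))
```

This is also why a single-node cluster with replicas configured typically reports unassigned shards: there is no second node to hold the replica copies.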

When monitoring your cluster, you can query the cluster health endpoint and receive information about the status of the cluster, the number of nodes, and the counts of active shards.
You can also see counts for relocating shards, initializing shards and unassigned shards.
An example request and its response are shown below. The pretty parameter asks Elasticsearch to format the JSON for reading.

curl -X GET "localhost:9200/_cluster/health?pretty"

		
{
  "cluster_name" : "testcluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 1,
  "active_shards" : 1,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 50.0
}
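The response above can be turned into a simple alert message. The following is a minimal sketch, assuming the response has already been fetched and is available as JSON text; `summarize_health` is a hypothetical helper name, and the status meanings (green: all shards allocated, yellow: all primaries allocated but some replicas not, red: at least one primary not allocated) are the standard ones for the cluster health API.

```python
import json

# The example response shown above, reproduced as JSON text.
SAMPLE_HEALTH = json.loads("""
{
  "cluster_name": "testcluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 1,
  "active_shards": 1,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 1,
  "active_shards_percent_as_number": 50.0
}
""")


def summarize_health(health: dict) -> str:
    """Map a cluster health response to a one-line summary (hypothetical helper)."""
    status = health["status"]
    unassigned = health["unassigned_shards"]
    if status == "green":
        return "ok: all primary and replica shards are allocated"
    if status == "yellow":
        # Yellow means every primary is active, so unassigned shards are replicas.
        return f"warning: all primaries active, {unassigned} replica shard(s) unassigned"
    return f"critical: at least one primary shard is not allocated ({unassigned} unassigned)"


if __name__ == "__main__":
    print(summarize_health(SAMPLE_HEALTH))
```

For the sample above, the single node holds the only primary and has nowhere to place its replica, which is why the status is yellow and half of the shards are active.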

2) Search Performance: Request Rate and Latency

Index-level stats report on the operations happening on an index, including search counts and cumulative query time. The indices stats API returns them at index scope (most of the same stats can also be retrieved at node scope).
The following returns high-level aggregated stats and index-level stats for all indices:

curl -X GET "localhost:9200/_stats"

Stats for specific indices can be retrieved using:

curl -X GET "localhost:9200/index1,index2/_stats"
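Request rate and latency are not reported directly; they can be derived from the cumulative search counters in the stats response. The sketch below assumes the usual `_all.total.search` section with `query_total` and `query_time_in_millis`. The sample numbers are made up for illustration, and the function names are hypothetical.

```python
# Made-up excerpts of two _stats responses, taken 30 seconds apart.
SAMPLE_PREV = {"_all": {"total": {"search": {"query_total": 200, "query_time_in_millis": 500}}}}
SAMPLE_CURR = {"_all": {"total": {"search": {"query_total": 260, "query_time_in_millis": 680}}}}


def average_query_ms(stats: dict) -> float:
    """Average time per query since the counters started, in milliseconds."""
    search = stats["_all"]["total"]["search"]
    total = search["query_total"]
    return search["query_time_in_millis"] / total if total else 0.0


def query_rate(prev: dict, curr: dict, interval_seconds: float) -> float:
    """Queries per second between two samples of the cumulative counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = (curr["_all"]["total"]["search"]["query_total"]
             - prev["_all"]["total"]["search"]["query_total"])
    return delta / interval_seconds


if __name__ == "__main__":
    print(average_query_ms(SAMPLE_PREV))
    print(query_rate(SAMPLE_PREV, SAMPLE_CURR, 30))
```

Because the counters are cumulative, a monitoring job would normally poll the endpoint on a fixed interval and work with the differences between consecutive samples.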

3) Index Performance: Refresh and Merge Times

As documents are added, updated, and removed, the cluster continually updates its indexes and refreshes them across the nodes so that the changes become visible to searches.
The cluster takes care of all of this, and as a user you have limited control over the process beyond configuring the refresh interval.
The nodes stats API lets you retrieve statistics for one, several, or all of the nodes in the cluster.

curl -X GET "localhost:9200/_nodes/stats"
curl -X GET "localhost:9200/_nodes/nodeId1,nodeId2/stats"

The first command retrieves stats for all nodes in the cluster. The second retrieves stats only for nodeId1 and nodeId2.

Important Metrics for Index Performance:

Name	Explanation
Total refreshes	Count of the total number of refreshes.
Total time spent refreshing	Aggregation of all time spent refreshing. Measured in milliseconds.
Current merges	Merges currently being processed.
Total merges	Count of the total number of merges.
Total time spent merging	Aggregation of all time spent merging segments.
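The metrics in this table can be read from the `indices.refresh` and `indices.merges` sections of each node in the nodes stats response. The sketch below assumes the field names `total`, `total_time_in_millis`, and `current` in those sections; the node ID and the numbers in the sample are invented for illustration, and `index_performance` is a hypothetical helper.

```python
# Invented excerpt of a _nodes/stats response with a single node.
SAMPLE_NODE_STATS = {
    "nodes": {
        "nodeId1": {
            "indices": {
                "refresh": {"total": 40, "total_time_in_millis": 1000},
                "merges": {"current": 1, "total": 8, "total_time_in_millis": 2000},
            }
        }
    }
}


def _avg(total_ms: int, count: int) -> float:
    """Average duration in milliseconds, or 0.0 when nothing has run yet."""
    return total_ms / count if count else 0.0


def index_performance(node_stats: dict) -> dict:
    """Collect refresh and merge metrics per node (hypothetical helper)."""
    result = {}
    for node_id, node in node_stats["nodes"].items():
        refresh = node["indices"]["refresh"]
        merges = node["indices"]["merges"]
        result[node_id] = {
            "total_refreshes": refresh["total"],
            "avg_refresh_ms": _avg(refresh["total_time_in_millis"], refresh["total"]),
            "current_merges": merges["current"],
            "total_merges": merges["total"],
            "avg_merge_ms": _avg(merges["total_time_in_millis"], merges["total"]),
        }
    return result


if __name__ == "__main__":
    print(index_performance(SAMPLE_NODE_STATS))
```

A rising average refresh or merge time is usually more informative than the raw totals, which only ever increase.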

4) Node Health: Memory, Disk, and CPU Metrics

The cat nodes API shows the cluster topology, one line per node:

curl -X GET "localhost:9200/_cat/nodes?v"

Important Metrics for Node Health:

Name	Explanation
Total disk capacity	Total disk capacity on the node’s host machine.
Total disk usage	Total disk usage on the node’s host machine.
Total available disk space	Total disk space available.
Percentage of disk used	Percentage of disk which is already used.
Current RAM usage	Current memory usage, reported with its unit of measurement.
RAM percentage	Percentage of memory being used.
Maximum RAM	Total amount of memory on the node’s host machine.
CPU	Percentage of the CPU in use.
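These metrics can also be checked against thresholds in a script. The sketch below assumes the cat nodes API is queried with `format=json` and an explicit column list such as `h=name,disk.used_percent,ram.percent,cpu`; the column names are given as I understand them and can be confirmed with `_cat/nodes?help`. The rows and node names are invented, and `nodes_over_threshold` is a hypothetical helper.

```python
# Invented rows in the shape returned by:
#   _cat/nodes?format=json&h=name,disk.used_percent,ram.percent,cpu
# Values are written as strings here; float() below accepts numbers as well.
SAMPLE_ROWS = [
    {"name": "node-1", "disk.used_percent": "91.50", "ram.percent": "97", "cpu": "12"},
    {"name": "node-2", "disk.used_percent": "40.00", "ram.percent": "63", "cpu": "85"},
]


def nodes_over_threshold(rows: list, column: str, limit: float) -> list:
    """Return the names of nodes whose value in `column` exceeds `limit`."""
    flagged = []
    for row in rows:
        value = row.get(column)
        if value is None:
            # Column not requested or not reported for this node.
            continue
        if float(value) > limit:
            flagged.append(row["name"])
    return flagged


if __name__ == "__main__":
    print(nodes_over_threshold(SAMPLE_ROWS, "disk.used_percent", 85))
    print(nodes_over_threshold(SAMPLE_ROWS, "cpu", 80))
```

The thresholds of 85 and 80 are arbitrary example values, not recommendations.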

5) JVM Health: Heap, GC, and Pool Size

As a Java-based application, Elasticsearch runs within a Java Virtual Machine (JVM).
The JVM manages its memory within its ‘heap’ allocation and evicts objects from the heap with a garbage collection process.
JVM metrics can be retrieved from the nodes stats API:

curl -X GET "localhost:9200/_nodes/stats/jvm"

Important Metrics for JVM Health:

Name	Explanation
Memory usage	Usage statistics for heap and non-heap memory and for the individual memory pools.
Threads	Current number of threads in use, and the peak number.
Garbage collection	Collection counts and total time spent in garbage collection.
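As a rough illustration, the JVM metrics in this table can be pulled out of the `jvm` section of each node in the nodes stats response. The sketch assumes the fields `mem.heap_used_percent`, `threads.count`, `threads.peak_count`, and per-collector `collection_count` and `collection_time_in_millis` under `gc.collectors`. The sample values and node ID are invented, and `jvm_summary` is a hypothetical helper.

```python
# Invented excerpt of the jvm section of a _nodes/stats response.
SAMPLE_JVM_STATS = {
    "nodes": {
        "nodeId1": {
            "jvm": {
                "mem": {"heap_used_percent": 72},
                "threads": {"count": 45, "peak_count": 60},
                "gc": {
                    "collectors": {
                        "young": {"collection_count": 100, "collection_time_in_millis": 1500},
                        "old": {"collection_count": 2, "collection_time_in_millis": 300},
                    }
                },
            }
        }
    }
}


def jvm_summary(node_stats: dict) -> dict:
    """Summarize heap, thread, and GC metrics per node (hypothetical helper)."""
    summary = {}
    for node_id, node in node_stats["nodes"].items():
        jvm = node["jvm"]
        gc_avg_ms = {}
        for name, collector in jvm["gc"]["collectors"].items():
            count = collector["collection_count"]
            # Average pause per collection; 0.0 if the collector has not run.
            gc_avg_ms[name] = collector["collection_time_in_millis"] / count if count else 0.0
        summary[node_id] = {
            "heap_used_percent": jvm["mem"]["heap_used_percent"],
            "threads": jvm["threads"]["count"],
            "peak_threads": jvm["threads"]["peak_count"],
            "gc_avg_ms": gc_avg_ms,
        }
    return summary


if __name__ == "__main__":
    print(jvm_summary(SAMPLE_JVM_STATS))
```

Long or frequent old-generation collections combined with a persistently high heap percentage are the pattern most often associated with the out-of-memory and long garbage collection problems mentioned at the start of this article.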