AWS OpenSearch¶
Collects AWS OpenSearch metrics, including connection counts, request counts, latency, and slow queries.
Configuration¶
Install Func¶
We recommend enabling Guance Integration - Extension - Hosted Func: all prerequisites are installed automatically. Then continue with the script installation.
If you deploy Func on your own, refer to Self-deployed Func.
Installation Script¶
Note: Please prepare an Amazon AK that meets the requirements in advance (for simplicity, you can directly grant the global read-only permission ReadOnlyAccess).
Hosted Edition Activation Script¶
- Log in to the Guance console
- Click on the [Integration] menu and select [Cloud Account Management]
- Click [Add Cloud Account], choose [AWS]. If cloud account information has been configured before, skip this step.
- Click [Test]. If the test succeeds, click [Save]. If the test fails, check whether the related configuration information is correct and retest.
- In the [Cloud Account Management] list, you can see the added cloud accounts. Click on the corresponding cloud account and enter the details page.
- Click the [Integration] button on the cloud account details page. Under the Not Installed list, find AWS OpenSearch, click the [Install] button, and follow the installation interface to complete the installation.
Manual Activation Script¶
- Log in to the Func console, click [Script Market], enter the official script market, and search for guance_aws_open_search
- After clicking [Install], input the corresponding parameters: AWS AK ID, AK Secret, and account name.
- Click [Deploy Startup Script]. The system automatically creates a Startup script set and configures the corresponding startup scripts.
- After activation, you can see the corresponding automatic trigger configuration in "Management / Automatic Trigger Configuration". Click [Execute] to run it once immediately without waiting for the scheduled time. After a short wait, you can view the execution task records and corresponding logs.
Verification¶
- In "Management / Automatic Trigger Configuration", confirm that the task has an automatic trigger configuration. You can also check the task records and logs for anomalies.
- In Guance, under "Infrastructure / Custom", check whether asset information exists.
- In Guance, under "Metrics", check whether the corresponding monitoring data exists.
Metrics¶
After configuring AWS OpenSearch, the default metric sets are as follows. More metrics can be collected through configuration; see AWS Cloud Monitoring Metric Details.
Cluster Metrics¶
Amazon OpenSearch Service provides the following metrics for clusters.
Metric | Description |
---|---|
ClusterStatus.green | A value of 1 indicates that all index shards have been allocated to nodes in the cluster. Related statistics: Maximum |
ClusterStatus.yellow | A value of 1 indicates that the primary shards of all indexes have been allocated to nodes in the cluster, but the replica shards of at least one index have not. For more information, see Yellow Cluster Status. Related statistics: Maximum |
ClusterStatus.red | A value of 1 indicates that the primary and replica shards of at least one index have not been allocated to nodes in the cluster. For more information, see Red Cluster Status. Related statistics: Maximum |
Shards.active | The total number of active primary and replica shards. Related statistics: Maximum, Sum |
Shards.unassigned | The number of shards not assigned to nodes in the cluster. Related statistics: Maximum, Sum |
Shards.delayedUnassigned | The number of shards whose node allocation has been delayed due to timeout settings. Related statistics: Maximum, Sum |
Shards.activePrimary | The number of active primary shards. Related statistics: Maximum, Sum |
Shards.initializing | The number of initializing shards. Related statistics: Sum |
Shards.relocating | The number of relocating shards. Related statistics: Sum |
Nodes | The number of nodes in the OpenSearch Service cluster, including dedicated master nodes and UltraWarm nodes. For more information, see Changing Configuration in Amazon OpenSearch Service. Related statistics: Maximum |
SearchableDocuments | The total number of searchable documents across all data nodes in the cluster. Related statistics: Minimum, Maximum, Average |
CPUUtilization | The percentage of CPU utilization on data nodes in the cluster. Maximum shows the node with the highest CPU utilization. Average represents all nodes in the cluster. This metric is also available for individual nodes. Related statistics: Maximum, Average |
ClusterUsedSpace | The total amount of used space in the cluster. You must wait one minute to get an accurate value. The OpenSearch Service console displays this value in GiB. The Amazon CloudWatch console displays it in MiB. Related statistics: Minimum, Maximum |
ClusterIndexWritesBlocked | Indicates whether your cluster accepts or blocks incoming write requests. A value of 0 means the cluster accepts requests. A value of 1 means it blocks requests. Common causes include low FreeStorageSpace or high JVMMemoryPressure. To resolve this issue, consider increasing disk space or scaling the cluster. Related statistics: Maximum |
FreeStorageSpace | Available storage space across data nodes in the cluster. Sum displays the total available space in the cluster, but you must wait one minute to get an accurate value. Minimum and Maximum show the nodes with the least and most available space respectively. This metric is also available for individual nodes. OpenSearch Service throws a ClusterBlockException when this metric reaches 0. To recover, you must delete indices, add larger instances, or add EBS-based storage to existing instances. For more information, see Missing Available Storage Space. The OpenSearch Service console displays this value in GiB. The Amazon CloudWatch console displays it in MiB. |
JVMMemoryPressure | The maximum percentage of Java heap used by all data nodes in the cluster. OpenSearch Service allocates half of the instance RAM to the Java heap, with a maximum heap size of 32 GiB. You can scale instances vertically up to 64 GiB of RAM, after which you can scale horizontally by adding instances. See Recommended Amazon OpenSearch Service CloudWatch Alarms for more details. Related statistics: Maximum. Note that the logic for this metric was changed in service software R20220323. For more information, see Release Notes. |
JVMGCYoungCollectionCount | The number of times "young generation" garbage collection has run. A large and constantly growing number is normal for cluster operations. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
JVMGCOldCollectionTime | The time, in milliseconds, that the cluster has spent performing "old generation" garbage collection. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
JVMGCYoungCollectionTime | The time, in milliseconds, that the cluster has spent performing "young generation" garbage collection. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
JVMGCOldCollectionCount | The number of times "old generation" garbage collection has run. In a well-resourced cluster, this number should remain small and grow infrequently. This metric is also available at the node level. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
IndexingLatency | The difference in total time (in milliseconds) taken for all indexing operations between minute N and minute (N-1). |
IndexingRate | The number of indexing operations per minute. |
SearchLatency | The difference in total time (in milliseconds) taken for all searches between minute N and minute (N-1). |
SearchRate | The total number of search requests per minute across all shards on data nodes. |
SegmentCount | The number of segments on data nodes. The more segments you have, the longer each search takes. OpenSearch sometimes merges smaller segments into larger ones. Related node statistics: Maximum, Average. Related cluster statistics: Sum, Maximum, Average |
SysMemoryUtilization | The percentage of instance memory in use. High values for this metric are normal and usually do not indicate issues with the cluster. For a better indicator of potential performance and stability issues, see the JVMMemoryPressure metric. Related node statistics: Minimum, Maximum, Average. Related cluster statistics: Minimum, Maximum, Average |
OpenSearchDashboardsConcurrentConnections | The number of active concurrent connections to OpenSearch Dashboards. If this number is consistently high, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
OpenSearchDashboardsHeapTotal | The total heap memory allocated to OpenSearch Dashboards in MiB. The exact memory allocation may differ between EC2 instance types. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
OpenSearchDashboardsHeapUsed | The absolute amount of heap memory used by OpenSearch Dashboards in MiB. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
OpenSearchDashboardsHeapUtilization | The percentage of maximum available heap memory used by OpenSearch Dashboards. If this value exceeds 80%, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Minimum, Maximum, Average |
OpenSearchDashboardsResponseTimesMaxInMillis | The maximum time (in milliseconds) it takes for OpenSearch Dashboards to respond to requests. If requests consistently take a long time to return results, consider increasing the size of your instance type. Related node statistics: Maximum. Related cluster statistics: Maximum, Average |
OpenSearchDashboardsOS1MinuteLoad | The one-minute average CPU load for OpenSearch Dashboards. Ideally, the CPU load should stay below 1.00. Temporary spikes are fine, but if this metric is consistently above 1.00, we recommend increasing the size of your instance type. Related node statistics: Average. Related cluster statistics: Average, Maximum |
OpenSearchDashboardsRequestTotal | The total number of HTTP requests issued to OpenSearch Dashboards. If your system is slow or you see a large number of dashboard requests, consider increasing the size of your instance type. Related node statistics: Sum. Related cluster statistics: Sum |
ThreadpoolForce_mergeQueue | The number of queued tasks in the force merge thread pool. If the queue size is consistently large, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
ThreadpoolForce_mergeRejected | The number of rejected tasks in the force merge thread pool. If this number continues to grow, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Sum |
ThreadpoolForce_mergeThreads | The size of the force merge thread pool. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
ThreadpoolSearchQueue | The number of queued tasks in the search thread pool. If the queue size is consistently large, consider scaling your cluster. The maximum size of the search queue is 1000. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
ThreadpoolSearchRejected | The number of rejected tasks in the search thread pool. If this number continues to grow, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Sum |
ThreadpoolSearchThreads | The size of the search thread pool. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
Threadpoolsql-workerQueue | The number of queued tasks in the SQL search thread pool. If the queue size is consistently large, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
Threadpoolsql-workerRejected | The number of rejected tasks in the SQL search thread pool. If this number continues to grow, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Sum |
Threadpoolsql-workerThreads | The size of the SQL search thread pool. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
ThreadpoolWriteQueue | The number of queued tasks in the write thread pool. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
ThreadpoolWriteRejected | The number of rejected tasks in the write thread pool. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
ThreadpoolWriteThreads | The size of the write thread pool. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
CoordinatingWriteRejected | The total number of rejections on the coordinating node due to indexing pressure since the last OpenSearch Service process started. Related node statistics: Maximum. Related cluster statistics: Average, Sum. This metric is available in version 7.1 and higher. |
ReplicaWriteRejected | The total number of rejections on replica shards due to indexing pressure since the last OpenSearch Service process started. Related node statistics: Maximum. Related cluster statistics: Average, Sum. This metric is available in version 7.1 and higher. |
PrimaryWriteRejected | The total number of rejections on primary shards due to indexing pressure since the last OpenSearch Service process started. Related node statistics: Maximum. Related cluster statistics: Average, Sum. This metric is available in version 7.1 and higher. |
ReadLatency | The latency (in seconds) of read operations on EBS volumes. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average |
ReadThroughput | The throughput (in bytes/second) of read operations on EBS volumes. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average |
ReadIOPS | The number of input and output (I/O) operations per second for read operations on EBS volumes. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average |
WriteIOPS | The number of input and output (I/O) operations per second for write operations on EBS volumes. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average |
WriteLatency | The latency (in seconds) of write operations on EBS volumes. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average |
BurstBalance | The percentage of remaining input and output (I/O) credits in the burst bucket for an EBS volume. A value of 100 indicates that the volume has accumulated the maximum number of credits. If this percentage drops below 70%, see Low EBS Burst Capacity Balance. For domains with gp3 volume types and domains with gp2 volumes larger than 1000 GiB, the burst balance remains at 0. Related statistics: Minimum, Maximum, Average |
CurrentPointInTime | The number of active PIT search contexts on the node. |
TotalPointInTime | The number of expired PIT search contexts since the node started. |
HasActivePointInTime | A value of 1 indicates that there has been an active PIT context on the node since it started. A value of 0 indicates none. |
HasUsedPointInTime | A value of 1 indicates that there has been an expired PIT context on the node since it started. A value of 0 indicates none. |
AsynchronousSearchInitializedRate | The number of asynchronous searches initialized in the past minute. |
AsynchronousSearchRunningCurrent | The number of asynchronous searches currently running. |
AsynchronousSearchCompletionRate | The number of asynchronous searches successfully completed in the past minute. |
AsynchronousSearchFailureRate | The number of asynchronous searches that completed and failed in the past minute. |
AsynchronousSearchPersistRate | The number of asynchronous searches persisted in the past minute. |
AsynchronousSearchRejected | The total number of asynchronous searches rejected since the node started. |
AsynchronousSearchCancelled | The total number of asynchronous searches cancelled since the node started. |
SQLRequestCount | The number of requests to the _SQL API. Related statistics: Sum |
SQLUnhealthy | A value of 1 indicates that, in response to specific requests, the SQL plugin is returning 5xx response codes or passing invalid query DSL to OpenSearch. Other requests continue to succeed. A value of 0 indicates no recent failures. If you see persistent values of 1, troubleshoot the requests your client sends to the plugin. Related statistics: Maximum |
SQLDefaultCursorRequestCount | Similar to SQLRequestCount, but only counts paginated requests. Related statistics: Sum |
SQLFailedRequestCountByCusErr | The number of requests to the _SQL API that failed due to client issues. For example, requests might return an HTTP status code 400 due to IndexNotFoundException. Related statistics: Sum |
SQLFailedRequestCountBySysErr | The number of requests to the _SQL API that failed due to server issues or functional limitations. For example, requests might return an HTTP status code 503 due to VerificationException. Related statistics: Sum |
OldGenJVMMemoryPressure | The maximum percentage of Java heap used for "old generation" on all data nodes in the cluster. This metric is also available at the node level. Related statistics: Maximum |
OpenSearchDashboardsHealthyNodes (previously called KibanaHealthyNodes) | Health check for OpenSearch Dashboards. If the minimum, maximum, and average are all equal to 1, Dashboards is operating normally. If you have 10 nodes, the maximum is 1, the minimum is 0, and the average is 0.7, then 7 nodes (70%) are operating normally and 3 nodes (30%) are unhealthy. Related statistics: Minimum, Maximum, Average |
InvalidHostHeaderRequests | The number of HTTP requests to the OpenSearch cluster that contain invalid (or missing) host headers. Valid requests include the domain hostname as the host header value. OpenSearch Service rejects invalid requests to public access domains without restrictive access policies. We recommend applying restrictive access policies to all domains. If you see large values for this metric, confirm that your OpenSearch client includes the domain hostname (rather than its IP address) in its requests. Related statistics: Sum |
OpenSearchRequests (previously ElasticsearchRequests) | The number of requests made to the OpenSearch cluster. Related statistics: Sum |
2xx, 3xx, 4xx, 5xx | The number of requests to the domain resulting in the specified HTTP response codes (2xx, 3xx, 4xx, 5xx). Related statistics: Sum |
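
As a rough illustration of how a few of the metrics above can be read, the following Python sketch derives one overall cluster color from the three ClusterStatus flags, estimates the number of healthy OpenSearch Dashboards nodes from the Average statistic, and computes the Java heap size from the instance RAM. The function names and the "unknown" fallback are assumptions made for this example; they are not part of the Guance script or of any AWS SDK.

```python
def cluster_color(green: float, yellow: float, red: float) -> str:
    """Map the Maximum of the three ClusterStatus.* flags to one label.

    Red is checked first because it is the most severe state.
    """
    if red >= 1:
        return "red"
    if yellow >= 1:
        return "yellow"
    if green >= 1:
        return "green"
    return "unknown"  # assumption: no flag was reported for the period


def healthy_dashboard_nodes(average: float, node_count: int) -> int:
    """Estimate healthy nodes from the Average of OpenSearchDashboardsHealthyNodes."""
    return round(average * node_count)


def jvm_heap_gib(instance_ram_gib: float) -> float:
    """Heap is half of the instance RAM, capped at 32 GiB."""
    return min(instance_ram_gib / 2, 32.0)


if __name__ == "__main__":
    print(cluster_color(green=0, yellow=1, red=0))  # yellow
    print(healthy_dashboard_nodes(0.7, 10))         # 7
    print(jvm_heap_gib(64))                         # 32.0
    print(jvm_heap_gib(16))                         # 8.0
```

The heap calculation shows why scaling a single instance beyond 64 GiB of RAM no longer increases the heap: the 32 GiB cap is already reached at that size.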
Objects¶
The collected AWS OpenSearch object data has the following structure and can be viewed under "Infrastructure - Custom".
{
  "measurement": "aws_opensearch",
  "tags": {
    "name"                  : "df-prd-es",
    "EngineVersion"         : "Elasticsearch_7.10",
    "DomainId"              : "5882XXXXX135/df-prd-es",
    "DomainName"            : "df-prd-es",
    "ClusterConfig"         : "{JSON data of instance types and instance counts in the domain}",
    "ServiceSoftwareOptions": "{JSON data of current state of service software}",
    "region"                : "cn-northwest-1",
    "RegionId"              : "cn-northwest-1"
  },
  "fields": {
    "EBSOptions": "{JSON data of elastic block storage options for the specified domain}",
    "Endpoints" : "{Mapping JSON data of domain endpoints used to submit index and search requests}",
    "message"   : "{Instance JSON data}"
  }
}
Note: Fields in tags and fields may change with subsequent updates.
Tip 1: The value of tags.name is the instance ID, used for unique identification.
Tip 2: The data field corresponding to tags.name in this script is DomainName. When using this script, ensure that there are no duplicate DomainName values across multiple AWS accounts.
Tip 3: tags.ClusterConfig, tags.Endpoint, tags.ServiceSoftwareOptions, fields.message, fields.EBSOptions, and fields.Endpoints are all serialized JSON strings.
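
Because several tags and fields arrive as serialized JSON strings (Tip 3), a consumer usually has to decode them before reading nested values. The Python sketch below shows one way to do that with the standard library. The sample record, the nested keys inside it (InstanceCount, VolumeSize), and the decode_json_members helper are hypothetical and only illustrate the shape described above.

```python
import json

# Members that hold serialized JSON strings, as listed in Tip 3.
JSON_TAG_KEYS = ("ClusterConfig", "Endpoint", "ServiceSoftwareOptions")
JSON_FIELD_KEYS = ("message", "EBSOptions", "Endpoints")


def decode_json_members(record: dict) -> dict:
    """Return a copy of the record with serialized JSON members decoded.

    Members that are missing or are not valid JSON are left unchanged.
    """
    decoded = {
        "measurement": record.get("measurement"),
        "tags": dict(record.get("tags", {})),
        "fields": dict(record.get("fields", {})),
    }
    for section, keys in (("tags", JSON_TAG_KEYS), ("fields", JSON_FIELD_KEYS)):
        for key in keys:
            value = decoded[section].get(key)
            if not isinstance(value, str):
                continue
            try:
                decoded[section][key] = json.loads(value)
            except json.JSONDecodeError:
                pass  # keep the raw string when it is not valid JSON
    return decoded


if __name__ == "__main__":
    # Hypothetical sample; real records contain more members.
    sample = {
        "measurement": "aws_opensearch",
        "tags": {
            "name": "df-prd-es",
            "DomainName": "df-prd-es",
            "ClusterConfig": '{"InstanceCount": 3}',
        },
        "fields": {"EBSOptions": '{"VolumeSize": 100}'},
    }
    result = decode_json_members(sample)
    print(result["tags"]["ClusterConfig"]["InstanceCount"])  # 3
    print(result["fields"]["EBSOptions"]["VolumeSize"])      # 100
```

The helper copies the tags and fields dictionaries before decoding, so the original record keeps its serialized strings and can still be forwarded unchanged.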