AWS OpenSearch¶
AWS OpenSearch, including number of connections, number of requests, latency, slow query, etc.
config¶
Install Func¶
Recommend opening 「Integrations - Extension - DataFlux Func (Automata)」: All preconditions are installed automatically, Please continue with the script installation
If you deploy Func yourself,Refer to Self-Deployment of Func
Installation script¶
Tip:Please prepare AWS AK that meets the requirements in advance(For simplicity's sake,,You can directly grant the global read-only permission
ReadOnlyAccess
)
To synchronize the monitoring data of ECS cloud resources, we install the corresponding collection script:「Guance Integration(AWS-OpenSearchCollect)」(ID:guance_aws_open_search
)
Click 【Install】 and enter the corresponding parameters: Aws AK, Aws account name.。
tap【Deploy startup Script】,The system automatically creates Startup
script sets,And automatically configure the corresponding startup script。
After this function is enabled, you can view the automatic triggering configuration in「Management / Crontab Config」。Click【Run】,you can immediately execute once, without waiting for a regular time。After a while, you can view task execution records and corresponding logs。
If you want to collect logs, you must enable the corresponding log collection script. If you want to collect bills, start the cloud bill collection script.
We collected some configurations by default, as described in the Metrics column Configuring Custom Cloud Object Metrics
Verify¶
- In「Management / Crontab Config」check whether the automatic triggering configuration exists for the corresponding task,In addition, you can view task records and logs to check whether exceptions exist
- On the Guance platform, click 「Infrastructure / Custom」 to check whether asset information exists
- On the Guance platform, press 「Metrics」 to check whether monitoring data exists
Metric¶
Configure AWS OpenSearch monitoring. The default metric set is as follows. You can collect more metrics by configuring them Aws Cloud Monitor Metrics Details
Metric | Metric Description |
---|---|
ClusterStatus.green |
The value of 1 indicates that all index shards are allocated to nodes in the cluster. Relevant statistical data: Maximum |
ClusterStatus.yellow |
The value of 1 indicates that all primary shards of the indexes are allocated to nodes in the cluster, but at least one replica shard of an index is not. For more information, please refer to Yellow Cluster Status{:target="_blank"}: Relevant statistical data: Maximum |
ClusterStatus.red |
The value of 1 indicates that at least one primary shard and replica shard of an index are not allocated to nodes in the cluster. For more information, please refer to Red Cluster Status{:target="_blank"}: Relevant statistical data: Maximum |
Shards.active |
The total number of active primary and replica shards. Relevant statistical data: Maximum, Total |
Shards.unassigned |
The number of shards that are not allocated to nodes in the cluster. Relevant statistical data: Maximum, Total |
Shards.delayedUnassigned |
The number of shards whose allocation has been delayed due to timeout settings on their nodes. Relevant statistical data: Maximum, Total |
Shards.activePrimary |
The number of active primary shards. Relevant statistical data: Maximum, Total |
Shards.initializing |
The number of shards that are currently initializing. Relevant statistical data: Total |
Shards.relocating |
The number of shards that are currently relocating. Relevant statistical data: Total |
Nodes |
The number of nodes in the OpenSearch service cluster, including dedicated master nodes and data nodes. For more information, please refer to Change Configuration in Amazon OpenSearch Service{:target="_blank"}: Relevant statistical data: Maximum |
SearchableDocuments |
The total number of searchable documents across all data nodes in the cross-cluster. Relevant statistical data: Minimum, Maximum, Average |
CPUUtilization |
The CPU utilization percentage of data nodes in the cluster. The maximum value represents the node with the highest CPU utilization. The average value represents all nodes in the cluster. This metric can also be used for individual nodes. Relevant statistical data: Maximum, Average |
ClusterUsedSpace |
The total amount of used space in the cluster. You must allow one minute to obtain an accurate value. The OpenSearch service console displays this value in GiB, while the Amazon CloudWatch console displays it in MiB. Relevant statistical data: Minimum, Maximum |
ClusterIndexWritesBlocked |
Indicates whether your cluster is accepting or blocking incoming write requests. A value of 0 means the cluster is accepting requests, while a value of 1 means requests are blocked. Some common factors include low FreeStorageSpace or high JVMMemoryPressure. To address this issue, consider increasing disk space or scaling the cluster. Relevant statistical data: Maximum |
FreeStorageSpace |
The available space for each data node in the cluster. The Sum value displays the total available space for the cluster, but you must allow one minute to obtain an accurate value. Minimum and Maximum values represent the nodes with the smallest and largest available space, respectively. This metric can also be used for individual nodes.When this metric reaches 0, the service throws an OpenSearchClusterBlockException. To recover, you must delete indexes, add larger instances, or add EBS-based storage to existing instances. For more information, please refer to "Lack of Available Storage Space" documentation.The OpenSearch service console displays this value in GiB, while the Amazon CloudWatch console displays it in MiB |
JVMMemoryPressure |
The maximum percentage of Java heap used for all data nodes in the cluster. The OpenSearch service allocates half of the instance's RAM for the Java heap, with a maximum heap size of 32 GiB. You can vertically scale the instance's RAM up to 64 GiB, and horizontal scaling is possible by adding instances. For recommended CloudWatch alarms for Amazon OpenSearch service, please refer to Amazon OpenSearch Service Recommended CloudWatch Alarms{:target="_blank"}.Relevant statistical data: Maximum. Note that there have been changes to the logic of this metric in service software R20220323. For more information, please see the Release Notes{:target="_blank"} |
JVMGCYoungCollectionCount |
The number of "Old Gen" garbage collection runs. In a cluster with sufficient resources, this number should remain small and not grow frequently. Relevant node statistical data: Maximum. Relevant cluster statistical data: Sum, Maximum, Average |
JVMGCOldCollectionTime |
The time taken for "Old Gen" garbage collection in the cluster, measured in milliseconds. Relevant node statistical data: Maximum. Relevant cluster statistical data: Sum, Maximum, Average |
JVMGCYoungCollectionTime |
The time taken for "Young Gen" garbage collection in the cluster, measured in milliseconds. Relevant node statistical data: Maximum. Relevant cluster statistical data: Sum, Maximum, Average |
JVMGCOldCollectionCount |
The number of "Young Gen" garbage collection runs. A large number of continuously increasing runs is normal for cluster operations. This metric is also available at the node level. Relevant node statistical data: Maximum. Relevant cluster statistical data: Sum, Maximum, Average |
IndexingLatency |
The difference in total time (measured in milliseconds) for all index operations in the node between minute N and minute (N-1) |
IndexingRate |
The number of index operations per minute |
SearchLatency |
The difference in total time (measured in milliseconds) for all search operations in the node between minute N and minute (N-1) |
SearchRate |
The total number of search requests for all shards on the data nodes per minute |
SegmentCount |
The number of shards on the data nodes. The more shards you have, the longer it takes for each search. OpenSearch sometimes merges smaller shards into larger ones. Relevant node statistical data: Maximum, Average. Relevant cluster statistical data: Sum, Maximum, Average |
SysMemoryUtilization |
The percentage of used instance memory. A higher value for this metric is normal and usually does not indicate any issues with the cluster. For better indications of potential performance and stability problems, refer to the JVMMemoryPressure metric. Relevant node statistical data: Minimum, Maximum, Average. Relevant cluster statistical data: Minimum, Maximum, Average |
OpenSearchDashboardsConcurrentConnections |
The active concurrent connection count on the OpenSearch dashboard. If this number is consistently high, consider scaling your cluster. Relevant node statistical data: Maximum. Relevant cluster statistical data: Sum, Maximum, Average |
OpenSearchDashboardsHeapTotal |
The amount of heap memory allocated to the OpenSearch dashboard, measured in MiB (Mebibytes ). The precise memory allocation may vary based on different EC2 instance types. Relevant node statistical data: Maximum. Relevant cluster statistical data: Sum, Maximum, Average |
OpenSearchDashboardsHeapUsed |
The absolute amount of heap memory used by the OpenSearch dashboard, measured in MiB (Mebibytes ). Relevant node statistical data: Maximum. Relevant cluster statistical data: Sum, Maximum, Average |
OpenSearchDashboardsHeapUtilization |
The maximum available heap memory percentage used by the OpenSearch dashboard. If this value exceeds 80%, consider scaling your cluster. Relevant node statistical data: Maximum. Relevant cluster statistical data: Minimum, Maximum, Average |
OpenSearchDashboardsResponseTimesMaxInMillis |
The maximum time (measured in milliseconds) required for the OpenSearch dashboard to respond to requests. If requests consistently take a long time to return results, consider increasing the instance type size. Relevant node statistical data: Maximum. Relevant cluster statistical data: Maximum, Average |
OpenSearchDashboardsOS1MinuteLoad |
The one-minute average CPU load on the OpenSearch dashboard. Ideally, the CPU load should be kept below 1.00. While temporary spikes are acceptable, if this metric consistently stays above 1.00, we recommend increasing the instance type size. Relevant node statistical data: Average. Relevant cluster statistical data: Average, Maximum |
OpenSearchDashboardsRequestTotal |
The total number of HTTP requests sent to the OpenSearch dashboard. If your system is slow or you notice a large number of dashboard requests, consider increasing the instance type size. Relevant node statistical data: Total. Relevant cluster statistical data: Sum |
ThreadpoolForce_mergeQueue |
The number of queued tasks in the force merge thread pool. If the queue size is consistently large, consider scaling your cluster. Relevant node statistical data: Maximum. Relevant cluster statistical data: Sum, Maximum, Average |
ThreadpoolForce_mergeRejected |
The number of rejected tasks in the force merge thread pool. If this number continues to grow, consider scaling your cluster. Relevant node statistical data: Maximum. Relevant cluster statistical data: Sum |
ThreadpoolForce_mergeThreads |
The size of the force merge thread pool. Relevant node statistical data: Maximum. Relevant cluster statistical data: Average, Sum |
ThreadpoolSearchQueue |
The number of queued tasks in the search thread pool. If the queue size is consistently large, consider scaling your cluster. The maximum size of the search queue is 1000. Relevant node statistical data: Maximum. Relevant cluster statistical data: Average, Sum |
ThreadpoolSearchRejected |
The number of rejected tasks in the search thread pool. If this number continues to grow, consider scaling your cluster. Relevant node statistical data: Maximum. Relevant cluster statistical data: Sum |
ThreadpoolSearchThreads |
The size of the search thread pool. Relevant node statistical data: Maximum. Relevant cluster statistical data: Average, Sum |
Threadpoolsql-workerQueue |
The number of queued tasks in the SQL search thread pool. If the queue size is consistently large, consider scaling your cluster. Relevant node statistical data: Maximum. Relevant cluster statistical data: Sum, Maximum, Average |
Threadpoolsql-workerRejected |
The number of rejected tasks in the SQL search thread pool. If this number continues to grow, consider scaling your cluster. Relevant node statistical data: Maximum. Relevant cluster statistical data: Sum |
Threadpoolsql-workerThreads |
The size of the SQL search thread pool. Relevant node statistical data: Maximum. Relevant cluster statistical data: Average, Sum |
ThreadpoolWriteQueue |
The number of queued tasks in the write thread pool. If the queue size is consistently large, consider scaling your cluster. Relevant node statistical data: Maximum. Relevant cluster statistical data: Average, Sum |
ThreadpoolWriteRejected |
The number of rejected tasks in the write thread pool. If this number continues to increase, consider scaling your cluster. Relevant node statistical data: Maximum. Relevant cluster statistical data: Average, Sum |
ThreadpoolWriteThreads |
The size of the write thread pool. Relevant node statistical data: Maximum. Relevant cluster statistical data: Average, Sum |
CoordinatingWriteRejected |
The total number of rejections that occurred on the coordinating node due to index pressure since the last OpenSearch service process startup. Relevant node statistical data: Maximum. Relevant cluster statistical data: Average, Sum. This metric is available in version 7.1 and higher |
ReplicaWriteRejected |
The total number of rejections that occurred on replica shards due to index pressure since the last OpenSearch service process startup. Relevant node statistical data: Maximum. Relevant cluster statistical data: Average, Sum. This metric is available in version 7.1 and higher |
PrimaryWriteRejected |
The total number of rejections that occurred on primary shards due to index pressure since the last OpenSearch service process startup. Relevant node statistical data: Maximum. Relevant cluster statistical data: Average, Sum. This metric is available in version 7.1 and higher |
ReadLatency |
The latency of read operations on EBS volumes in seconds. This metric is also available for individual nodes. Relevant statistical data: Minimum, Maximum, Average |
ReadThroughput |
The throughput of read operations on EBS volumes in bytes per second. This metric is also available for individual nodes. Relevant statistical data: Minimum, Maximum, Average |
ReadIOPS |
The number of input and output (I/O) operations per second for read operations on EBS volumes. This metric is also available for individual nodes. Relevant statistical data: Minimum, Maximum, Average |
WriteIOPS |
The number of input and output (I/O) operations per second for write operations on EBS volumes. This metric is also available for individual nodes. Relevant statistical data: Minimum, Maximum, Average |
WriteLatency |
The latency of write operations on EBS volumes, measured in seconds. This metric is also available for individual nodes. Relevant statistical data: Minimum, Maximum, Average |
BurstBalance |
The percentage of remaining burst balance of an EBS volume, which represents the accumulated I/O credits. A value of 100 indicates that the volume has reached the maximum number of credits. If this percentage is below 70%, please refer to the "EBS Burst Balance Low" issue. For domains with gp3 volume type and domains with gp2 volumes larger than 1000 GiB in size, the burst balance will stay at 0. Relevant statistical data: Minimum, Maximum, Average |
CurrentPointInTime |
The number of active PIT (Point-in-Time) search contexts in a node |
TotalPointInTime |
The number of PIT (Point-in-Time) search contexts that have expired since the node started |
HasActivePointInTime |
A value of 1 indicates that there are active PIT (Point-in-Time) search contexts on the node since it started. A value of 0 indicates that there are no active PIT search contexts |
HasUsedPointInTime |
A value of 1 indicates that there are expired PIT (Point-in-Time) search contexts on the node since it started. A value of 0 indicates that there are no expired PIT search contexts |
AsynchronousSearchInitializedRate |
The number of asynchronous searches that were initiated in the past 1 minute |
AsynchronousSearchRunningCurrent |
The number of asynchronous searches that are currently running |
AsynchronousSearchCompletionRate |
The number of asynchronous searches that were successfully completed in the past 1 minute |
AsynchronousSearchFailureRate |
The number of asynchronous searches that were completed and the number that failed in the last minute |
AsynchronousSearchPersistRate |
The number of asynchronous searches that have been continuously running in the last minute |
AsynchronousSearchRejected |
The total number of asynchronous searches that have been rejected since the node started |
AsynchronousSearchCancelled |
The total number of asynchronous searches that have been cancelled since the node started |
SQLRequestCount |
The total number of requests to the _SQL API. Relevant statistic: Sum |
SQLUnhealthy |
A value of 1 indicates that the SQL plugin returned a 5xx response code or passed an invalid query DSL to OpenSearch in response to a specific request. Other requests will continue to succeed. A value of 0 indicates no recent failures. If you see a continuous value of 1, investigate the issues with your client's requests to the plugin. Relevant statistic: Maximum |
SQLDefaultCursorRequestCount |
Similar to SQLRequestCount, but only counts paginated requests. Relevant statistic: Total |
SQLFailedRequestCountByCusErr |
The number of requests to the _SQL API that failed due to client issues. For example, requests may fail with an HTTP status code 400 if there's an IndexNotFoundException. Relevant statistic: Total |
SQLFailedRequestCountBySysErr |
The number of requests to the _SQL API that failed due to server issues or feature limitations. For example, requests may fail with an HTTP status code 503 if there's a VerificationException. Relevant statistic: Total |
OldGenJVMMemoryPressure |
The maximum percentage of the Java heap used for "old generation" on all data nodes in the cluster. This metric is also available at the node level. Relevant statistic: Maximum |
OpenSearchDashboardsHealthyNodes (以前称之为 KibanaHealthyNodes ) |
Health check of the OpenSearch dashboard. If the minimum, maximum, and average values are all equal to 1, then the dashboard is running normally. For example, if you have 10 nodes and the maximum value is 1, the minimum value is 0, and the average value is 0.7, it means that 7 nodes (70%) are running normally, and 3 nodes (30%) are not in good health. Relevant statistics: Minimum, Maximum, Average |
InvalidHostHeaderRequests |
The number of HTTP requests to the OpenSearch cluster that contain invalid (or missing) host headers. Valid requests include domain hostnames as the host header value. OpenSearch service rejects invalid requests made to public access domains without restrictive access policies. It is recommended to apply restrictive access policies to all domains. If you see a large value for this metric, please verify that your OpenSearch client includes the domain hostname in its requests (e.g., instead of its IP address). Relevant statistics: Total |
OpenSearchRequests(previously ElasticsearchRequests) |
The number of requests made to the OpenSearch cluster. Relevant Statistics: Total |
2xx, 3xx, 4xx, 5xx |
The number of requests to the domain that resulted in the specified HTTP response code (2xx, 3xx, 4xx, 5xx). Relevant Statistics: Total |
Object¶
The data structure of collected Tencent Cloud Redis object can be viewed in 'Infrastructure - Custom' where object data is available
{
"measurement": "aws_opensearch",
"tags": {
"name" : "crs-xxxx",
"BillingMode" : "0",
"Engine" : "Redis",
"InstanceId" : "crs-xxxx",
"InstanceName": "solution",
"Port" : "6379",
"ProductType" : "standalone",
"ProjectId" : "0",
"RegionId" : "ap-shanghai",
"Status" : "2",
"Type" : "6",
"WanIp" : "172.x.x.x",
"ZoneId" : "200002"
},
"fields": {
"ClientLimits" : "10000",
"Createtime" : "2022-07-14 13:54:14",
"DeadlineTime" : "0000-00-00 00:00:00",
"InstanceNodeInfo": "{Instance node info}",
"InstanceTitle" : "instance",
"Size" : 1024,
"message" : "{Instance JSON data}"
}
}