Cassandra
Cassandra metrics can be collected by using DDTrace. The flow of the collected data is as follows: Cassandra -> DDTrace -> DataKit(StatsD).
DataKit has an integrated StatsD server: DDTrace collects Cassandra metric data and reports it to DataKit using the StatsD protocol.
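To make the last hop of this flow concrete, the sketch below formats one metric as a DogStatsD-style datagram and shows how it could be sent over UDP. DDTrace builds and sends these datagrams on its own, so this is only a way to understand the wire format or smoke-test the listener. The metric name, tag, host, and port here are assumptions chosen for illustration, not values taken from DDTrace.

```python
import socket
from typing import Dict, Optional


def format_dogstatsd(name: str, value: float, metric_type: str = "g",
                     tags: Optional[Dict[str, str]] = None) -> str:
    """Build one DogStatsD line: <name>:<value>|<type>[|#key:value,...]."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in tags.items())
    return line


def send_metric(line: str, host: str = "127.0.0.1", port: int = 11002) -> None:
    """Send a single datagram. UDP is fire-and-forget, so no reply is expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


# Hypothetical sample: a gauge with one tag.
payload = format_dogstatsd("cassandra.active_tasks", 3, "g", {"scope": "ReadStage"})
print(payload)  # cassandra.active_tasks:3|g|#scope:ReadStage
# send_metric(payload)  # uncomment to send to a listener on 127.0.0.1:11002
```

Because UDP gives no delivery confirmation, a successful send only means the datagram left the sender; whether it arrived has to be checked on the DataKit side.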
Configuration
Preconditions
- Tested Cassandra versions:
    - 5.0
    - 4.1.3
    - 3.11.15
    - 3.0.24
    - 2.1.22
DDTrace Configuration
- Download dd-java-agent.jar, see here;
- DataKit configuration:
  See the configuration of StatsD.
  Restart DataKit to make the configuration take effect.
- Cassandra configuration:
  Create the file setenv.sh under /usr/local/cassandra/bin and give it execute permission, then write the following:
export CATALINA_OPTS="-javaagent:dd-java-agent.jar \
-Ddd.jmxfetch.enabled=true \
-Ddd.jmxfetch.statsd.host=${DATAKIT_HOST} \
-Ddd.jmxfetch.statsd.port=${DATAKIT_STATSD_PORT} \
-Ddd.jmxfetch.cassandra.enabled=true"
The parameters are described below:
- javaagent: Fill in the full path to dd-java-agent.jar;
- Ddd.jmxfetch.enabled: Fill in true, which means the DDTrace collection function is enabled;
- Ddd.jmxfetch.statsd.host: Fill in the network address that DataKit listens on. No port number is included;
- Ddd.jmxfetch.statsd.port: Fill in the port number that DataKit listens on. Usually 11002, as determined by the DataKit-side configuration;
- Ddd.jmxfetch.cassandra.enabled: Fill in true, which means the Cassandra collection function of DDTrace is enabled. When enabled, the metric set named cassandra will show up;
Restart DataKit to make configuration take effect.
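The parameter descriptions above can be condensed into a small helper that assembles the option string and rejects the most common mistakes. This is only a sketch: the function name, the example jar path, and the validation rules are assumptions made for illustration, and the default port follows the "usually 11002" note above.

```python
def build_agent_opts(agent_jar: str, datakit_host: str, statsd_port: int = 11002) -> str:
    """Assemble the JVM options described above into one string."""
    if not agent_jar.startswith("/"):
        raise ValueError("agent_jar should be the full path to dd-java-agent.jar")
    # Naive check: the host option must not carry a port.
    # An IPv6 literal would need different handling.
    if ":" in datakit_host:
        raise ValueError("datakit_host must not include a port number")
    if not 0 < statsd_port < 65536:
        raise ValueError("statsd_port must be a valid UDP port")
    opts = [
        f"-javaagent:{agent_jar}",
        "-Ddd.jmxfetch.enabled=true",
        f"-Ddd.jmxfetch.statsd.host={datakit_host}",
        f"-Ddd.jmxfetch.statsd.port={statsd_port}",
        "-Ddd.jmxfetch.cassandra.enabled=true",
    ]
    return " ".join(opts)


# Hypothetical jar location and DataKit address.
print(build_agent_opts("/opt/dd-java-agent.jar", "10.0.0.5"))
```

Keeping the host and port as separate arguments mirrors the two separate options, which makes it harder to paste a combined host:port value into the host option by accident.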
Collector Configuration
Go to the conf.d/db directory under the DataKit installation directory, copy cassandra.conf.sample and name it cassandra.conf. Examples are as follows:
[[inputs.statsd]]
## Collector alias.
source = "statsd/cassandra"
## Collect interval, default is 10 seconds. (optional)
# interval = '10s'
protocol = "udp"
## Address and port to host UDP listener on: (defaults to ":8125")
service_address = ":11002"
## Tag request metric. Used for distinguish feed metric name.
## eg, DD_TAGS=source_key:tomcat,host_key:cn-shanghai-sq5ei
## eg, -Ddd.tags=source_key:tomcat,host_key:cn-shanghai-sq5ei
# statsd_source_key = "source_key"
# statsd_host_key = "host_key"
## Indicate whether report tag statsd_source_key and statsd_host_key.
# save_above_key = false
delete_gauges = true
delete_counters = true
delete_sets = true
delete_timings = true
## Counter metric is float in new Datakit version, set true if want be int.
# set_counter_int = false
## Percentiles to calculate for timing & histogram stats
percentiles = [50.0, 90.0, 99.0, 99.9, 99.95, 100.0]
## separator to use between elements of a statsd metric
metric_separator = "_"
## Parses tags in the datadog statsd format
## http://docs.datadoghq.com/guides/dogstatsd/
parse_data_dog_tags = true
## Parses datadog extensions to the statsd format
datadog_extensions = true
## Parses distributions metric as specified in the datadog statsd format
## https://docs.datadoghq.com/developers/metrics/types/?tab=distribution#definition
datadog_distributions = true
## We do not need following tags(they may create tremendous of time-series under influxdb's logic)
## Examples:
## "runtime-id", "metric-type"
drop_tags = [ ]
## All metric-name prefixed with 'jvm_' are set to influxdb's measurement 'jvm'
## All metric-name prefixed with 'stats_' are set to influxdb's measurement 'stats'
## Attention: Must add these word in statsd conf file.
metric_mapping = ["cassandra_:cassandra", "jvm_:cassandra_jvm", "jmx_:cassandra_jmx", "datadog_:cassandra_datadog"]
## Number of UDP messages allowed to queue up, once filled,
## the statsd server will start dropping packets, default is 128.
# allowed_pending_messages = 128
## Number of timing/histogram values to track per-measurement in the
## calculation of percentiles. Raising this limit increases the accuracy
## of percentiles but also increases the memory usage and cpu time.
percentile_limit = 1000
## Max duration (TTL) for each metric to stay cached/reported without being updated.
# max_ttl = "1000h"
[inputs.statsd.tags]
# some_tag = "some_value"
# more_tag = "some_other_value"
Once configured, restart DataKit.
The collector can also be turned on by ConfigMap injection of the collector configuration or by configuring ENV_DATAKIT_INPUTS.
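The metric_separator and metric_mapping settings in the sample configuration decide which metric set each incoming metric lands in. The sketch below illustrates the behavior the configuration comments describe: dots in the incoming name are replaced by the separator, then the first matching prefix selects the metric set and the remainder becomes the metric name. It is an illustration under those assumptions, not DataKit's actual implementation, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple

MAPPING = [
    "cassandra_:cassandra",
    "jvm_:cassandra_jvm",
    "jmx_:cassandra_jmx",
    "datadog_:cassandra_datadog",
]


def map_metric(name: str, mappings: List[str],
               separator: str = "_") -> Tuple[Optional[str], str]:
    """Return (metric_set, metric_name) for an incoming StatsD metric name."""
    flat = name.replace(".", separator)
    for rule in mappings:
        prefix, metric_set = rule.split(":", 1)
        if flat.startswith(prefix):
            return metric_set, flat[len(prefix):]
    # No rule matched; how DataKit treats such metrics is not covered here.
    return None, flat


print(map_metric("cassandra.active_tasks", MAPPING))  # ('cassandra', 'active_tasks')
print(map_metric("jvm.heap_memory", MAPPING))         # ('cassandra_jvm', 'heap_memory')
```

Read this way, the four mapping rules correspond to the four metric sets documented below: cassandra, cassandra_jvm, cassandra_jmx, and cassandra_datadog.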
Metric
cassandra
- Tags
Tag | Description |
---|---|
columnfamily | columnfamily=batches, columnfamily=built_views, columnfamily=columns, columnfamily=paxos, columnfamily=peer |
host | Host name. |
instance | Instance name. |
jmx_domain | JMX domain. |
keyspace | keyspace=system, keyspace=system_schema |
metric_type | Metric type. |
name | Type name. |
path | path=request |
runtime-id | Runtime id. |
scope | scope=ReadStage, scope=MutationStage, scope=HintsDispatcher, scope=MemtableFlushWriter, scope=MemtablePostFlush |
service | Service name. |
table | table=IndexInfo, table=available_ranges, table=batches, table=built_views |
type | Object type. |
- Metrics
Metric | Description |
---|---|
active_tasks | The number of tasks that the thread pool is actively executing. Type: float Unit: count |
bloom_filter_false_ratio | The ratio of Bloom filter false positives to total checks. Type: float Unit: count |
bytes_flushed_count | The amount of data that was flushed since (re)start. Type: float Unit: digital,B |
cas_commit_latency_75th_percentile | The latency of the Paxos commit round - p75. Type: float Unit: time,ms |
cas_commit_latency_95th_percentile | The latency of the Paxos commit round - p95. Type: float Unit: time,ms |
cas_commit_latency_one_minute_rate | The number of Paxos commit rounds per second. Type: float Unit: throughput,reqps |
cas_prepare_latency_75th_percentile | The latency of the Paxos prepare round - p75. Type: float Unit: time,ms |
cas_prepare_latency_95th_percentile | The latency of the Paxos prepare round - p95. Type: float Unit: time,ms |
cas_prepare_latency_one_minute_rate | The number of Paxos prepare rounds per second. Type: float Unit: throughput,reqps |
cas_propose_latency_75th_percentile | The latency of the Paxos propose round - p75. Type: float Unit: time,ms |
cas_propose_latency_95th_percentile | The latency of the Paxos propose round - p95. Type: float Unit: time,ms |
cas_propose_latency_one_minute_rate | The number of Paxos propose rounds per second. Type: float Unit: throughput,reqps |
col_update_time_delta_histogram_75th_percentile | The column update time delta - p75. Type: float Unit: time,ms |
col_update_time_delta_histogram_95th_percentile | The column update time delta - p95. Type: float Unit: time,ms |
col_update_time_delta_histogram_min | The column update time delta - min. Type: float Unit: time,ms |
compaction_bytes_written_count | The amount of data that was compacted since (re)start. Type: float Unit: digital,B |
compression_ratio | The compression ratio for all SSTables. A low value means a high compression contrary to what the name suggests. Formula used is: 'size of the compressed SSTable / size of original' Type: float Unit: percent,percent |
currently_blocked_tasks | The number of currently blocked tasks for the thread pool. Type: float Unit: count |
currently_blocked_tasks_count | The number of currently blocked tasks for the thread pool. Type: float Unit: count |
db_droppable_tombstone_ratio | The estimate of the droppable tombstone ratio. Type: float Unit: percent,percent |
dropped_one_minute_rate | The tasks dropped during execution for the thread pool. Type: float Unit: count |
exceptions_count | The number of exceptions thrown from Storage metrics. Type: float Unit: count |
key_cache_hit_rate | The key cache hit rate. Type: float Unit: count |
latency_75th_percentile | The client request latency - p75. Type: float Unit: time,ms |
latency_95th_percentile | The client request latency - p95. Type: float Unit: time,ms |
latency_one_minute_rate | The number of client requests. Type: float Unit: throughput,reqps |
live_disk_space_used_count | The disk space used by live SSTables (only counts in use files). Type: float Unit: digital,B |
live_ss_table_count | Number of live (in use) SSTables. Type: float Unit: count |
load_count | The disk space used by live data on a node. Type: float Unit: digital,B |
max_partition_size | The size of the largest compacted partition. Type: float Unit: digital,B |
max_row_size | The size of the largest compacted row. Type: float Unit: digital,B |
mean_partition_size | The average size of compacted partition. Type: float Unit: digital,B |
mean_row_size | The average size of compacted rows. Type: float Unit: digital,B |
metrics_75th_percentile | Metrics - p75. Type: float Unit: count |
metrics_95th_percentile | Metrics - p95. Type: float Unit: count |
metrics_count | Metrics count. Type: float Unit: count |
metrics_one_minute_rate | The number of metrics. Type: float Unit: count |
metrics_value | Metrics value. Type: float Unit: count |
net_down_endpoint_count | The number of unhealthy nodes in the cluster. They represent each individual node's view of the cluster and thus should not be summed across reporting nodes. Type: float Unit: count |
net_up_endpoint_count | The number of healthy nodes in the cluster. They represent each individual node's view of the cluster and thus should not be summed across reporting nodes. Type: float Unit: count |
nodetool_status_load | Amount of file system data under the Cassandra data directory without snapshot content. Type: float Unit: digital,B |
nodetool_status_owns | Percentage of the data owned by the node per data center times the replication factor. Type: float Unit: percent,percent |
nodetool_status_replication_availability | Percentage of data available per keyspace times replication factor. Type: float Unit: percent,percent |
nodetool_status_replication_factor | Replication factor per keyspace. Type: float Unit: count |
nodetool_status_status | Node status: up (1) or down (0). Type: float Unit: bool |
pending_compactions | The number of pending compactions. Type: float Unit: count |
pending_flushes_count | The number of pending flushes. Type: float Unit: count |
pending_tasks | The number of pending tasks for the thread pool. Type: float Unit: count |
range_latency_75th_percentile | The local range request latency - p75. Type: float Unit: time,ms |
range_latency_95th_percentile | The local range request latency - p95. Type: float Unit: time,ms |
range_latency_one_minute_rate | The number of local range requests. Type: float Unit: throughput,reqps |
read_latency_75th_percentile | The local read latency - p75. Type: float Unit: time,ms |
read_latency_95th_percentile | The local read latency - p95. Type: float Unit: time,ms |
read_latency_99th_percentile | The local read latency - p99. Type: float Unit: time,ms |
read_latency_one_minute_rate | The number of local read requests. Type: float Unit: throughput,reqps |
row_cache_hit_count | The number of row cache hits. Type: float Unit: count |
row_cache_hit_out_of_range_count | The number of row cache hits that do not satisfy the query filter and went to disk. Type: float Unit: count |
row_cache_miss_count | The number of table row cache misses. Type: float Unit: count |
snapshots_size | The disk space truly used by snapshots. Type: float Unit: digital,B |
ss_tables_per_read_histogram_75th_percentile | The number of SSTable data files accessed per read - p75. Type: float Unit: count |
ss_tables_per_read_histogram_95th_percentile | The number of SSTable data files accessed per read - p95. Type: float Unit: count |
timeouts_count | Count of requests not acknowledged within configurable timeout window. Type: float Unit: count |
timeouts_one_minute_rate | Recent timeout rate, as an exponentially weighted moving average over a one-minute interval. Type: float Unit: count |
tombstone_scanned_histogram_75th_percentile | Number of tombstones scanned per read - p75. Type: float Unit: count |
tombstone_scanned_histogram_95th_percentile | Number of tombstones scanned per read - p95. Type: float Unit: count |
total_blocked_tasks | Total blocked tasks. Type: float Unit: count |
total_blocked_tasks_count | Total count of blocked tasks. Type: float Unit: count |
total_commit_log_size | The size used on disk by commit logs. Type: float Unit: digital,B |
total_disk_space_used_count | Total disk space used by SSTables, including obsolete ones waiting to be garbage collected. Type: float Unit: digital,B |
view_lock_acquire_time_75th_percentile | The time taken acquiring a partition lock for materialized view updates - p75. Type: float Unit: time,ms |
view_lock_acquire_time_95th_percentile | The time taken acquiring a partition lock for materialized view updates - p95. Type: float Unit: time,ms |
view_lock_acquire_time_one_minute_rate | The number of requests to acquire a partition lock for materialized view updates. Type: float Unit: count |
view_read_time_75th_percentile | The time taken during the local read of a materialized view update - p75. Type: float Unit: time,ms |
view_read_time_95th_percentile | The time taken during the local read of a materialized view update - p95. Type: float Unit: time,ms |
view_read_time_one_minute_rate | The number of local reads for materialized view updates. Type: float Unit: count |
waiting_on_free_memtable_space_75th_percentile | The time spent waiting for free mem table space either on- or off-heap - p75. Type: float Unit: time,ms |
waiting_on_free_memtable_space_95th_percentile | The time spent waiting for free mem table space either on- or off-heap - p95. Type: float Unit: time,ms |
write_latency_75th_percentile | The local write latency - p75. Type: float Unit: time,ms |
write_latency_95th_percentile | The local write latency - p95. Type: float Unit: time,ms |
write_latency_99th_percentile | The local write latency - p99. Type: float Unit: time,ms |
write_latency_one_minute_rate | The number of local write requests. Type: float Unit: throughput,reqps |
cassandra_jvm
- Tags
Tag | Description |
---|---|
host | Host name. |
instance | Instance name. |
jmx_domain | JMX domain. |
metric_type | Metric type. |
name | Type name. |
runtime-id | Runtime id. |
service | Service name. |
type | Object type. |
- Metrics
Metric | Description |
---|---|
buffer_pool_direct_capacity | Measure of total memory capacity of direct buffers. Type: float Unit: digital,B |
buffer_pool_direct_count | Number of direct buffers in the pool. Type: float Unit: count |
buffer_pool_direct_used | Measure of memory used by direct buffers. Type: float Unit: digital,B |
buffer_pool_mapped_capacity | Measure of total memory capacity of mapped buffers. Type: float Unit: digital,B |
buffer_pool_mapped_count | Number of mapped buffers in the pool. Type: float Unit: count |
buffer_pool_mapped_used | Measure of memory used by mapped buffers. Type: float Unit: digital,B |
cpu_load_process | Recent CPU utilization for the process. Type: float Unit: percent,percent |
cpu_load_system | Recent CPU utilization for the whole system. Type: float Unit: percent,percent |
daemon_code_cache_used | The number of daemon threads. Type: float Unit: count |
daemon_thread_count | Daemon thread count. Type: float Unit: count |
gc_cms_count | The total number of garbage collections that have occurred. Type: float Unit: count |
gc_code_cache_used | GC code cache used. Type: float Unit: count |
gc_eden_size | The eden size in garbage collection. Type: float Unit: digital,B |
gc_major_collection_count | The rate of major garbage collections. Set new_gc_metrics: true to receive this metric. Type: float Unit: count |
gc_major_collection_time | The fraction of time spent in major garbage collection. Set new_gc_metrics: true to receive this metric. Type: float Unit: PPM |
gc_metaspace_size | The metaspace size in garbage collection. Type: float Unit: digital,B |
gc_minor_collection_count | The rate of minor garbage collections. Set new_gc_metrics: true to receive this metric. Type: float Unit: count |
gc_minor_collection_time | The fraction of time spent in minor garbage collection. Set new_gc_metrics: true to receive this metric. Type: float Unit: PPM |
gc_old_gen_size | The old gen size in garbage collection. Type: float Unit: digital,B |
gc_parnew_time | The approximate accumulated garbage collection time elapsed. Type: float Unit: time,ms |
gc_survivor_size | The survivor size in garbage collection. Type: float Unit: digital,B |
heap_memory | The total Java heap memory used. Type: float Unit: digital,B |
heap_memory_committed | The total Java heap memory committed to be used. Type: float Unit: digital,B |
heap_memory_init | The initial Java heap memory allocated. Type: float Unit: digital,B |
heap_memory_max | The maximum Java heap memory available. Type: float Unit: digital,B |
loaded_classes | Number of classes currently loaded. Type: float Unit: count |
non_heap_memory | The total Java non-heap memory used. Non-heap memory is: Metaspace + CompressedClassSpace + CodeCache. Type: float Unit: digital,B |
non_heap_memory_committed | The total Java non-heap memory committed to be used. Type: float Unit: digital,B |
non_heap_memory_init | The initial Java non-heap memory allocated. Type: float Unit: digital,B |
non_heap_memory_max | The maximum Java non-heap memory available. Type: float Unit: digital,B |
os_open_file_descriptors | The number of file descriptors used by this process (only available for processes run as the dd-agent user). Type: float Unit: count |
peak_thread_count | The peak number of live threads. Type: float Unit: count |
thread_count | The number of live threads. Type: float Unit: count |
total_thread_count | The number of total threads. Type: float Unit: count |
cassandra_jmx
- Tags
Tag | Description |
---|---|
host | Host name. |
instance | Instance name. |
jmx_domain | JMX domain. |
metric_type | Metric type. |
name | Type name. |
runtime-id | Runtime id. |
service | Service name. |
type | Object type. |
- Metrics
Metric | Description |
---|---|
gc_cms.count | The total number of garbage collections that have occurred. Type: float Unit: count |
gc_major_collection_count | The rate of major garbage collections. Set new_gc_metrics: true to receive this metric. Type: float Unit: count |
gc_major_collection_time | The fraction of time spent in major garbage collection. Set new_gc_metrics: true to receive this metric. Type: float Unit: PPM |
gc_minor_collection_count | The rate of minor garbage collections. Set new_gc_metrics: true to receive this metric. Type: float Unit: count |
gc_minor_collection_time | The fraction of time spent in minor garbage collection. Set new_gc_metrics: true to receive this metric. Type: float Unit: PPM |
gc_parnew.time | The approximate accumulated garbage collection time elapsed. Type: float Unit: time,ms |
heap_memory | The total Java heap memory used. Type: float Unit: digital,B |
heap_memory_committed | The total Java heap memory committed to be used. Type: float Unit: digital,B |
heap_memory_init | The initial Java heap memory allocated. Type: float Unit: digital,B |
heap_memory_max | The maximum Java heap memory available. Type: float Unit: digital,B |
non_heap_memory | The total Java non-heap memory used. Non-heap memory is calculated as follows: Metaspace + CompressedClassSpace + CodeCache. Type: float Unit: digital,B |
non_heap_memory_committed | The total Java non-heap memory committed to be used. Type: float Unit: digital,B |
non_heap_memory_init | The initial Java non-heap memory allocated. Type: float Unit: digital,B |
non_heap_memory_max | The maximum Java non-heap memory available. Type: float Unit: digital,B |
thread_count | The number of live threads. Type: float Unit: count |
cassandra_datadog
- Tags
Tag | Description |
---|---|
endpoint | Endpoint. |
host | Host name. |
lang | Lang type. |
lang_interpreter | Lang interpreter. |
lang_interpreter_vendor | Lang interpreter vendor. |
lang_version | Lang version. |
metric_type | Metric type. |
priority | Priority. |
service | Service name. |
stat | Stat. |
tracer_version | Tracer version. |
- Metrics
Metric | Description |
---|---|
tracer_agent_discovery_time | Tracer agent discovery time. Type: float Unit: time,ms |
tracer_api_errors_total | Tracer api errors total. Type: float Unit: count |
tracer_api_requests_total | Tracer api requests total. Type: float Unit: count |
tracer_flush_bytes_total | Tracer flush bytes total. Type: float Unit: count |
tracer_flush_traces_total | Tracer flush traces total. Type: float Unit: count |
tracer_queue_enqueued_bytes | Tracer queue enqueued bytes. Type: float Unit: count |
tracer_queue_enqueued_spans | Tracer queue enqueued spans. Type: float Unit: count |
tracer_queue_enqueued_traces | Tracer queue enqueued traces. Type: float Unit: count |
tracer_queue_max_length | Tracer queue max length. Type: float Unit: count |
tracer_scope_activate_count | Tracer scope activate count. Type: float Unit: count |
tracer_scope_close_count | Tracer scope close count. Type: float Unit: count |
tracer_span_pending_created | Tracer span pending created. Type: float Unit: count |
tracer_span_pending_finished | Tracer span pending finished. Type: float Unit: count |
tracer_trace_agent_discovery_time | Tracer trace agent discovery time. Type: float Unit: count |
tracer_trace_agent_send_time | Tracer trace agent send time. Type: float Unit: count |
tracer_trace_pending_created | Tracer trace pending created. Type: float Unit: count |
tracer_tracer_trace_buffer_fill_time | Tracer trace buffer fill time. Type: float Unit: count |