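Dianping CAT
Dianping-cat, or CAT for short, is an open-source distributed real-time monitoring system, used mainly to monitor system performance, capacity, and business metrics. It was developed by Meituan-Dianping and has since been open-sourced and widely adopted.
CAT collects system metrics such as CPU, memory, network, and disk usage, and monitors and analyzes them in real time to help developers locate and resolve problems quickly. It also provides common monitoring features such as alerting, statistics, and log analysis.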
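Data Types
Data transfer protocols:
- plaintext: plain-text mode; not currently supported by DataKit.
- native: a text format delimited by specific symbols; currently supported by DataKit.
Data categories:
| Type code | Type | Description | Ingested by the current DataKit version | Corresponding data type in Guance |
|---|---|---|---|---|
| t | transaction start | Transaction start | true | trace |
| T | transaction end | Transaction end | true | trace |
| E | event | Event | false | - |
| M | metric | Custom metric | false | - |
| L | trace | Trace | false | - |
| H | heartbeat | Heartbeat packet | true | metric |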
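The table can be read as a lookup from the one-letter type code to what DataKit does with the message. A minimal sketch in Python follows; `CAT_MESSAGE_TYPES` and `guance_type` are hypothetical names used only for illustration and are not part of DataKit or the CAT client.

```python
# Illustrative lookup built from the data-category table; these names are
# hypothetical and do not exist in DataKit or the CAT client.
# Each entry: code -> (type name, ingested by DataKit, data type in Guance)
CAT_MESSAGE_TYPES = {
    "t": ("transaction start", True, "trace"),
    "T": ("transaction end", True, "trace"),
    "E": ("event", False, None),
    "M": ("metric", False, None),
    "L": ("trace", False, None),
    "H": ("heartbeat", True, "metric"),
}


def guance_type(code):
    """Return the Guance data type for a CAT type code, or None if it is not ingested."""
    _name, supported, target = CAT_MESSAGE_TYPES[code]
    return target if supported else None
```

The codes are case-sensitive: `t` and `T` mark the start and end of a transaction respectively.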
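Client Startup Modes
- Start a CAT server
  - All data goes to DataKit, so the CAT web pages have no data and there is little point in starting the server. The page also reports an error: 出问题 CAT 的服务端[xxx.xxx] (roughly, "problem with CAT server [xxx.xxx]").
  - Client behavior can be configured when the client starts.
  - The CAT server also forwards transaction data to DataKit, which produces a large amount of junk data on the Guance pages.
- Do not start a CAT server: configure the following settings in DataKit
  - startTransactionTypes: defines custom transaction types. The listed transaction types are created automatically by CAT. Separate multiple types with semicolons.
  - block: a threshold, in milliseconds, for blocking monitoring. When a transaction runs longer than this threshold, CAT records it as blocked.
  - routers: the addresses and ports of the CAT servers. Separate multiple address:port pairs with semicolons. CAT sends data to these servers automatically, which provides reliability and failover.
  - sample: the sampling rate, meaning that only part of the data is sent to the CAT server. The value ranges from 0 to 1, where 1 sends all data and 0 sends none.
  - matchTransactionTypes: matching rules for custom transaction types, typically used in API service monitoring to specify which interfaces to monitor for performance.
Therefore, starting a cat_home (CAT server) service is not recommended. The corresponding settings can be configured in client.xml; see below.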
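To make the value formats of these settings concrete, here is a minimal Python sketch that parses a semicolon-separated list, a `routers` string, and a `sample` rate. The helper names (`split_list`, `parse_routers`, `parse_sample`) are hypothetical; this is not how DataKit or the CAT client parses the values internally, only an illustration of the formats described above.

```python
def split_list(value):
    """Split a semicolon-separated setting, dropping empty entries."""
    return [item.strip() for item in value.split(";") if item.strip()]


def parse_routers(value):
    """Turn 'host:port;host:port;' into a list of (host, port) tuples."""
    routers = []
    for entry in split_list(value):
        # Split on the last colon so the port is always the final component.
        host, _, port = entry.rpartition(":")
        routers.append((host, int(port)))
    return routers


def parse_sample(value):
    """Parse the sampling rate and check that it lies in the range 0 to 1."""
    rate = float(value)
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"sample must be between 0 and 1, got {rate}")
    return rate
```

For example, `parse_routers("127.0.0.1:2280;")` yields one `(host, port)` pair, and the trailing semicolon is ignored.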
Configuration
Client Configuration
<?xml version="1.0" encoding="utf-8"?>
<config mode="client">
<servers>
<!-- datakit ip, cat port , http port -->
<server ip="10.200.6.16" port="2280" http-port="9529"/>
</servers>
</config>
Note: in the configuration above, 9529 is DataKit's HTTP port, and 2280 is the port opened by the CAT collector.
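As a quick sanity check before deploying a client, the server entries can be read back from client.xml with the Python standard library. This is a minimal sketch; `read_servers` is a hypothetical helper, and the embedded XML simply repeats the example above.

```python
import xml.etree.ElementTree as ET

# Same content as the client configuration example above.
CLIENT_XML = """<?xml version="1.0" encoding="utf-8"?>
<config mode="client">
    <servers>
        <server ip="10.200.6.16" port="2280" http-port="9529"/>
    </servers>
</config>"""


def read_servers(xml_text):
    """Return the <server> entries as dicts with ip, tcp_port and http_port."""
    # Parse from bytes so the encoding declaration in the prolog is honored.
    root = ET.fromstring(xml_text.encode("utf-8"))
    servers = []
    for node in root.findall("./servers/server"):
        servers.append({
            "ip": node.get("ip"),
            "tcp_port": int(node.get("port")),
            "http_port": int(node.get("http-port")),
        })
    return servers
```

Calling `read_servers(CLIENT_XML)` returns one entry whose `tcp_port` should match the CAT collector port and whose `http_port` should match DataKit's HTTP port.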
Collector Configuration
Go to the conf.d/samples directory under the DataKit installation directory, copy cat.conf.sample, and rename the copy to cat.conf. An example follows:
[[inputs.cat]]
## tcp port
tcp_port = "2280"
##native or plaintext, datakit only support native(NT1) !!!
decode = "native"
## This is default cat-client Kvs configs.
startTransactionTypes = "Cache.;Squirrel."
MatchTransactionTypes = "SQL"
block = "false"
routers = "127.0.0.1:2280;"
sample = "1.0"
## global tags.
# [inputs.cat.tags]
# key1 = "value1"
# key2 = "value2"
# ...
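After configuring, restart DataKit.
Currently, the collector can also be enabled by injecting its configuration through a ConfigMap.
Notes on the configuration file:
- `startTransactionTypes`, `MatchTransactionTypes`, `block`, `routers`, and `sample` are the values returned to the client.
- `routers` is the IP address or domain name of DataKit.
- `tcp_port` corresponds to the `port` of the `server` entry (the DataKit address) in the client configuration.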
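These notes amount to a consistency requirement between cat.conf and the client configuration. The sketch below checks it in Python under two assumptions that are not stated explicitly in this page: that every client `server` entry should use the collector's `tcp_port`, and that every `routers` entry should point at that same port. `check_consistency` is a hypothetical helper, not a DataKit command.

```python
def check_consistency(tcp_port, routers, client_servers):
    """Return a list of problems; an empty list means the settings agree.

    tcp_port: value of tcp_port in cat.conf (a string, as in the sample)
    routers: value of routers in cat.conf, e.g. "10.200.6.16:2280;"
    client_servers: (ip, port) pairs taken from the client's servers block
    """
    problems = []
    port = int(tcp_port)
    # Assumption: clients must connect to the port the collector listens on.
    for ip, client_port in client_servers:
        if client_port != port:
            problems.append(f"client server {ip}:{client_port} does not use tcp_port {port}")
    # Assumption: routers are handed back to clients, so they should use the same port.
    for entry in (r.strip() for r in routers.split(";")):
        if entry and not entry.endswith(f":{port}"):
            problems.append(f"router {entry} does not use tcp_port {port}")
    return problems
```

An empty result for `check_consistency("2280", "10.200.6.16:2280;", [("10.200.6.16", 2280)])` indicates that the three values line up.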
Tracing
cat
Tracing data carries the following tags and fields:
| Tags & Fields | Description |
|---|---|
| base_service (tag) | Span base service name. |
| container_host (tag) | Container hostname. Available in OpenTelemetry. Optional. |
| db_host (tag) | DB host name: IP or domain name. Optional. |
| db_name (tag) | Database name. Optional. |
| db_system (tag) | Database system name: mysql, oracle, ... Optional. |
| dk_fingerprint (tag) | DataKit fingerprint (always DataKit's hostname). |
| endpoint (tag) | Endpoint info. Available in SkyWalking, Zipkin. Optional. |
| env (tag) | Application environment info. Available in Jaeger. Optional. |
| host (tag) | Hostname. |
| http_method (tag) | HTTP request method name. Available in DDTrace, OpenTelemetry. Optional. |
| http_route (tag) | HTTP route. Optional. |
| http_status_code (tag) | HTTP response code. Available in DDTrace, OpenTelemetry. Optional. |
| http_url (tag) | HTTP URL. Optional. |
| operation (tag) | Span name. |
| out_host (tag) | The database host, equivalent to db_host; DDTrace-go only. Optional. |
| project (tag) | Project name. Available in Jaeger. Optional. |
| service (tag) | Service name. Optional. |
| source_type (tag) | Tracing source type. |
| span_type (tag) | Span type. |
| status (tag) | Span status. |
| version (tag) | Application version info. Available in Jaeger. Optional. |
| duration | Duration of the span. Type: int (gauge). Unit: time, μs |
| message | Original content of the span. Type: string. Unit: N/A |
| parent_id | Parent span ID of the current span. Type: string. Unit: N/A |
| resource | Name of the resource that produced the current span. Type: string. Unit: N/A |
| span_id | Span ID. Type: string. Unit: N/A |
| start | Start time of the span. Type: int (gauge). Unit: timeStamp, usec |
| trace_id | Trace ID. Type: string. Unit: N/A |
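Both `start` and `duration` are expressed in microseconds, which is easy to get wrong when converting to wall-clock time. The following Python sketch derives a span's start and end instants, assuming `start` is a Unix-epoch timestamp in microseconds (as the unit in the table suggests); `span_interval` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def span_interval(start_us, duration_us):
    """Return (begin, end) as UTC datetimes for a span.

    start_us: the `start` field, assumed to be microseconds since the Unix epoch
    duration_us: the `duration` field, in microseconds
    """
    begin = EPOCH + timedelta(microseconds=start_us)
    return begin, begin + timedelta(microseconds=duration_us)
```

Using `timedelta` rather than dividing by one million keeps the arithmetic in integers and avoids floating-point rounding on large timestamps.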
Metric
Metric cat
| Tags & Fields | Description |
|---|---|
| domain (tag) | IP address. |
| hostName (tag) | Host name. |
| os_arch (tag) | CPU architecture: AMD/ARM. |
| os_name (tag) | OS name: Windows, Linux, Mac, etc. |
| os_version (tag) | The kernel version of the OS. |
| runtime_java-version (tag) | Java version. |
| runtime_user-dir (tag) | The path of the jar. |
| runtime_user-name (tag) | User name. |
| disk_free | Free disk size. Type: float (gauge). Unit: digital, B |
| disk_total | Total disk size of data nodes. Type: float (gauge). Unit: digital, B |
| disk_usable | Usable disk size. Type: float (gauge). Unit: digital, B |
| memory_free | Free memory size. Type: float (gauge). Unit: count |
| memory_heap-usage | The usage of heap memory. Type: float (gauge). Unit: count |
| memory_max | Max memory usage. Type: float (gauge). Unit: count |
| memory_non-heap-usage | The usage of non-heap memory. Type: float (gauge). Unit: count |
| memory_total | Total memory size. Type: float (gauge). Unit: count |
| os_available-processors | The number of available processors in the host. Type: float (gauge). Unit: count |
| os_committed-virtual-memory | Committed virtual memory size. Type: float (gauge). Unit: digital, B |
| os_free-physical-memory | Free physical memory size. Type: float (gauge). Unit: digital, B |
| os_free-swap-space | Free swap space size. Type: float (gauge). Unit: digital, B |
| os_system-load-average | Average system load. Type: float (gauge). Unit: percent, percent |
| os_total-physical-memory | Total physical memory size. Type: float (gauge). Unit: digital, B |
| os_total-swap-space | Total swap space size. Type: float (gauge). Unit: digital, B |
| runtime_start-time | Start time. Type: int (gauge). Unit: time, s |
| runtime_up-time | Uptime. Type: int (gauge). Unit: time, ms |
| thread_cat_thread_count | The number of threads used by CAT. Type: float (gauge). Unit: count |
| thread_count | Total number of threads. Type: float (gauge). Unit: count |
| thread_daemon_count | The number of daemon threads. Type: float (gauge). Unit: count |
| thread_http_thread_count | The number of HTTP threads. Type: float (gauge). Unit: count |
| thread_peek_count | Peak thread count. Type: float (gauge). Unit: count |
| thread_pigeon_thread_count | The number of pigeon threads. Type: float (gauge). Unit: count |
| thread_total_started_count | Total number of started threads. Type: float (gauge). Unit: count |
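The heartbeat reports free and total sizes rather than usage percentages, so a usage ratio has to be derived. A minimal Python sketch follows; `used_ratio` is a hypothetical helper and the sample values are made up for illustration, with only the field names taken from the table.

```python
def used_ratio(free, total):
    """Fraction in use, given free and total values in the same unit."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - free) / total


# Hypothetical heartbeat point: field names come from the metric table,
# the values are invented for this example.
point = {
    "os_free-physical-memory": 4 * 1024**3,
    "os_total-physical-memory": 16 * 1024**3,
    "os_free-swap-space": 1 * 1024**3,
    "os_total-swap-space": 2 * 1024**3,
}

memory_used = used_ratio(point["os_free-physical-memory"], point["os_total-physical-memory"])
swap_used = used_ratio(point["os_free-swap-space"], point["os_total-swap-space"])
```

The guard on `total` matters in practice because a host without swap may report a total of zero.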