
Dataway Sink


Dataway 1.14.0 or later is required to use the sinker functionality described here.


Dataway Sinker Introduction

In daily data collection, we may need to upload different data to different workspaces. For example, in a shared Kubernetes cluster, the collected data may belong to different teams or business departments; by routing data with specific attributes to different workspaces, we can achieve fine-grained collection on common infrastructure.

The basic network topology is as follows:

flowchart TD
dw(Dataway);
dk(Datakit);
etcd[(etcd)];
sinker(Sinker);
wksp1(Workspace 1);
wksp2(Workspace 2);
rules(Sinker Rules);
check_token{Token ok?};
drop(Drop Data/Request);

subgraph "Datakit cluster"
dk
end

dk -.-> |HTTP: X-Global-Tags/Secret-Token|dw

subgraph "Dataway Cluster (Nginx)"
%%direction LR
rules -->  dw
dw --> check_token -->|No| drop
check_token -->|OK| sinker

sinker --> |Rule 1 match|wksp1;
sinker --> |Rule 2 match|wksp2;
sinker --> |Rules do not match|drop;
end

subgraph "Workspace Changes"
direction BT
etcd -.-> |Key change notice|rules
end

Dataway Cascade Mode

For SaaS users, you can deploy a Dataway on your own premises (e.g. a Kubernetes cluster) dedicated to sinking, and then forward the data to Openway:

Warning

In cascaded mode, the Dataway in the cluster needs to enable the cascaded option. See Environment Variable Description in the installation documentation.

flowchart LR;
dk1(Datakit-A);
dk2(Datakit-B);
dk3(Datakit-C);
sink_dw(Dataway);
openway(Openway);
etcd(etcd);
%%

subgraph "K8s Cluster"
dk1 ---> sink_dw
dk2 ---> sink_dw
dk3 ---> sink_dw
etcd-.-> | triage rules | sink_dw
end

subgraph "SaaS"
sink_dw --> |Sink|openway;
end

Dataway installation

See here

Dataway Settings

In addition to the general Dataway settings, several additional configurations need to be set up (located in the /usr/local/cloudcare/dataflux/dataway/ directory):

# Set the address to be uploaded by Dataway here, usually Kodo, but it can also be another Dataway
remote_host: https://kodo.guance.com

# If the upload address is Dataway, set to true here to indicate that Dataway cascade
cascaded: false

# This token is a random token set on the Dataway side; it must also be filled
# into Datakit's datakit.conf. A certain length and format must be maintained (see Token Specs below).
secret_token: tkn_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# sinker rule settings
sinker:
  etcd: # supports etcd
    urls:
    - http://localhost:2379
    dial_timeout: 30s
    key_space: /dw_sinker
    username: "dataway"
    password: "<PASSWORD>"

  #file: # also supports local file mode, which is often used for debugging
  #  path: /path/to/sinker.json
Warning

If secret_token is not set, any request sent by Datakit is let through, which does not cause data problems. However, if Dataway is deployed on the public network, setting secret_token is recommended.

If etcd has no username/password configured, set both username and password to "" here.
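
On the Datakit side, the same secret_token is carried as the token in the Dataway URL in datakit.conf. A minimal sketch (the hostname is hypothetical):

[dataway]
  urls = ["https://my-dataway.example.com:9528?token=tkn_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"]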

Config sinker rules

Dataway sinker rules are a JSON string, which can come from two sources:

  • A local disk file (such as /path/to/sinker.json). Every time the JSON file is updated, Dataway must be restarted to reload it.
  • A key in etcd. When the sinker JSON is hosted on etcd, Dataway does not need to be restarted after the JSON is refreshed: it is notified whenever the sinker rules are updated.

The JSON is the same string whether it comes from a local file or etcd, so the following sections only cover managing sinker rules hosted on etcd.
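
When the rules are hosted on etcd, the change notification can be observed directly with etcdctl's watch command (authentication flags, covered below, are omitted here):

# Any later `put` on /dw_sinker is pushed to watchers such as Dataway
$ etcdctl watch /dw_sinker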

etcd settings

All commands below are for Linux.

Dataway acts as an etcd client; you can set up the following username and role in etcd (etcd 3.5+), see here

Create a 'dataway' account and corresponding role:

# Add a username, where you will be prompted for a password
$ etcdctl user add dataway

# Add the role of sinker
$ etcdctl role add sinker

# Add Dataway to the role
$ etcdctl user grant-role dataway sinker

# Restrict the key permissions of the role (where /dw_sinker and /ping are the two keys used by default)
$ etcdctl role grant-permission sinker readwrite /dw_sinker
$ etcdctl role grant-permission sinker readwrite /ping # is used to detect connectivity

See here.

Why create a role?

Roles control a user's permissions on specific keys. Since we may be reusing an existing etcd service, it is necessary to restrict the data permissions of the dataway user.
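
Before moving on, you can verify the grants with plain etcdctl queries:

# Show the role's key permissions
$ etcdctl role get sinker

# Show the user and its roles
$ etcdctl user get dataway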

Warning

If etcd has authentication enabled, pass the corresponding username and password when executing etcdctl commands:

$ etcdctl --user name:password ...

Write sinker rules

Since Dataway 1.3.6, there are convenient commands to manage sinker rules hosted on etcd.

Suppose the sinker.json rule is defined as follows:

{
    "strict":true,
    "rules": [
        {
            "rules": [
                "{ host = 'my-host'}"
            ],
            "url": "https://kodo.guance.com?token=tkn_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
        },
        {
            "rules": [
                "{ host = 'my-host' OR cluster = 'cluster-A' }"
            ],
            "url": "https://kodo.guance.com?token=tkn_yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy"
        }
     ]
}

You can write the sinker rule configuration with the following command:

$ etcdctl --user dataway:PASSWORD put /dw_sinker "$(<sinker.json)"
OK
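
To double-check what Dataway will read, fetch the key back; the --print-value-only flag strips the key name from the output:

$ etcdctl --user dataway:PASSWORD get /dw_sinker --print-value-only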
Comment URL Token Info

Because comments cannot be added to the JSON file sinker.json, we can add an extra field for commenting:

{
    "rules": [
        "{ host = 'my-host' OR cluster = 'cluster-A' }"
    ],
    "info": "This is yyy workspace",
    "url": "https://kodo.guance.com?token=tkn_yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy"
}

Token Specs

Since Datakit checks the token it uses for Dataway, the token (including secret_token) set here must meet the following conditions:

it must start with token_ or tkn_, followed by exactly 32 characters.

If a token does not meet these conditions, Datakit installation fails.
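
As a quick sanity check, the implied pattern can be tested with a regular expression (a sketch only, not the exact check Datakit performs):

$ echo "tkn_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" | grep -qE '^(token_|tkn_).{32}$' && echo "format ok"
format ok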

Datakit settings

For Datakit, some configuration is required to enable the sinker:

  • Configure custom global keys: Datakit searches all uploaded data for these keys (string-type fields only) and groups the upload payload by identical key:value pairs

See here

See here

  • Configure global host tags and global election tags

In all data uploaded by Datakit, these configured global tags (both tag key and tag value) are carried along as the basis for grouped sending.

See here

See here

Setup global custom keys

To enable the sinker for Datakit, set the following in datakit.conf:

[dataway]
  global_customer_keys = [
    # Do not add too many keys here; 2 ~ 3 keys are recommended.
    # Here we add category and class.
    "category",
    "class",
  ]

  # enable Sinker feature
  enable_sinker = true

Apart from dial-test data, sinking applies to all general data categories as well as binary data such as Session Replay and Profiling. Do not configure non-string-type keys in global_customer_keys; they are simply ignored.

Impact of Global Tags on Sink

In addition to global_customer_keys, the global tags configured in Datakit (both global election tags and global host tags) also influence the sinking markers. If a data point contains fields whose keys appear in the global tags (and whose values are strings), those fields are taken into account for sinking. Assume the global election tag is configured as follows:

# datakit.conf
[election.tags]
    cluster = "my-cluster"

For the following data point:

pi,cluster=cluster_A,app=math,other_tag=other_value value=3.14 1712796013000000000

Since the global election tag includes cluster (regardless of the value configured for this tag), and the data point itself also has a cluster tag, the final X-Global-Tags will include the key-value pair cluster=cluster_A:

X-Global-Tags: cluster=cluster_A

If global_customer_keys also includes the app key, then the final sinking header will be (the order of the key-value pairs does not matter):

X-Global-Tags: cluster=cluster_A,app=math
Note

The example here intentionally makes the value of cluster in datakit.conf differ from the value of the cluster field in the data point, to emphasize that only the tag key matters here. In effect, once a data point contains a qualifying global tag key, the result is the same as if that key had been added to global_customer_keys.

Dataway sink command

Dataway supports managing the sinker configuration through the command line since version 1.3.6. The usage is as follows:

$ ./dataway sink --help

Usage of sink:
  -add string
        single rule json file
  -cfg-file string
        configure file (default "/usr/local/cloudcare/dataflux/dataway/dataway.yaml")
  -file string
        file path of the rule json, only used for command put and get
  -get
        get the rule json
  -list
        list rules
  -log string
        log file path (default "/dev/null")
  -put
        save the rule json
  -token string
        rules filtered by token, eg: xx,yy

Specify configuration file

When the command is executed, the configuration file loaded by default is /usr/local/cloudcare/dataflux/dataway/dataway.yaml. To load a different configuration, specify it with the --cfg-file option.

$ ./dataway sink --cfg-file dataway.yaml [--list...]

Command log setting

The command log is disabled by default. To view it, set the --log parameter.

# output log to stdout
$ ./dataway sink --list --log stdout

# output log to file
$ ./dataway sink --list --log /tmp/log

View sinker rules

# list all rules 
$ ./dataway sink --list

# list all rules filtered by token 
$ ./dataway sink --list --token=token1,token2

CreateRevision: 2
ModRevision: 41
Version: 40
Rules: 
[
    {
        "rules": [
            "{ workspace = 'zhengb-test'}"
        ],
        "url": "https://openway.guance.com?token=token1"
    }
]

Add sinker rules

Create file rule.json and add the following content:

[
  {
    "rules": [
      "{ host = 'HOST1'}"
    ],
    "url": "https://openway.guance.com?token=tkn_xxxxxxxxxxxxx"
  },
  {
    "rules": [
      "{ host = 'HOST2'}"
    ],
    "url": "https://openway.guance.com?token=tkn_yyyyyyyyyyyyy"
  }
]

Add the rules.

$ ./dataway sink --add rule.json

add 2 rules ok!

Export sinker configuration

Export the sinker configuration to a local file.

$ ./dataway sink --get --file sink-get.json

rules json was saved to sink-get.json!

Import sinker configuration

Import the sinker configuration from a local file.

Create file sink-put.json and add the following content:

{
    "rules": [
        {
            "rules": [
                "{ workspace = 'test'}"
            ],
            "url": "https://openway.guance.com?token=tkn_xxxxxxxxxxxxxx"
        }
    ],
    "strict": true
}

Import the file.

$ dataway sink --put --file sink-put.json
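
A common round trip with these commands is export, edit, import:

$ ./dataway sink --get --file sink-current.json   # export current rules
$ vi sink-current.json                            # adjust the rules locally
$ ./dataway sink --put --file sink-current.json   # import them back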

Config examples

dataway.yaml in Kubernetes

We can set up a ConfigMap in the Dataway Pod YAML:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: deployment-utils-dataway
  name: dataway
  namespace: utils
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deployment-utils-dataway
  template:
    metadata:
      labels:
        app: deployment-utils-dataway
      annotations:
        datakit/logs: |
          [{"disable": true}]
        datakit/prom.instances: |
          [[inputs.prom]]
            url = "http://$IP:9090/metrics"
            source = "dataway"
            measurement_name = "dw"
            interval = "10s"

            [inputs.prom.tags]
              namespace = "$NAMESPACE"
              pod_name = "$PODNAME"
              node_name = "$NODENAME"
    spec:
      affinity:
        podAffinity: {}
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - deployment-utils-dataway
              topologyKey: kubernetes.io/hostname

      containers:
      - image: registry.jiagouyun.com/dataway/dataway:1.3.6 # select version here 
        #imagePullPolicy: IfNotPresent
        imagePullPolicy: Always
        name: dataway
        env:
        - name: DW_REMOTE_HOST
          value: "http://kodo.forethought-kodo:9527" # setup kodo server or next cascaded Dataway
        - name: DW_BIND
          value: "0.0.0.0:9528"
        - name: DW_UUID
          value: "agnt_xxxxx" # setup Dataway UUID
        - name: DW_TOKEN
          value: "tkn_oooooooooooooooooooooooooooooooo" # setup system workspace Dataway token
        - name: DW_PROM_LISTEN
          value: "0.0.0.0:9090"
        - name: DW_SECRET_TOKEN
          value: "tkn_zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
        - name: DW_SINKER_FILE_PATH
          value: "/usr/local/cloudcare/dataflux/dataway/sinker.json"
        ports:
        - containerPort: 9528
          name: 9528tcp01
          protocol: TCP
        volumeMounts:
          - mountPath: /usr/local/cloudcare/dataflux/dataway/cache
            name: dataway-cache
          - mountPath: /usr/local/cloudcare/dataflux/dataway/sinker.json
            name: sinker
            subPath: sinker.json
        resources:
          limits:
            cpu: '4'
            memory: 4Gi
          requests:
            cpu: 100m
            memory: 512Mi
      # nodeSelector:
      #   key: string
      imagePullSecrets:
      - name: registry-key
      restartPolicy: Always
      volumes:
      - hostPath:
          path: /root/dataway_cache
        name: dataway-cache
      - configMap:
          name: sinker
        name: sinker
---

apiVersion: v1
kind: Service
metadata:
  name: dataway
  namespace: utils
spec:
  ports:
  - name: 9528tcp02
    port: 9528
    protocol: TCP
    targetPort: 9528
    nodePort: 30928
  selector:
    app: deployment-utils-dataway
  type: NodePort

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: sinker
  namespace: utils
data:
  sinker.json: |
    {
        "strict":true,
        "rules": [
            {
                "rules": [
                    "{ project = 'xxxxx'}"
                ],
                "url": "http://kodo.forethought-kodo:9527?token=tkn_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
            },
            {
                "rules": [
                    "{ project = 'xxxxx'}"
                ],
                "url": "http://kodo.forethought-kodo:9527?token=tkn_yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy"
            }
         ]
    }
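
When rules are managed through this file-based ConfigMap instead of etcd, updating them means regenerating the ConfigMap and restarting Dataway, since file mode has no change notification. A sketch, assuming the namespace and names above:

# Regenerate the ConfigMap from a local sinker.json and apply it
$ kubectl -n utils create configmap sinker --from-file=sinker.json \
    --dry-run=client -o yaml | kubectl apply -f -

# File mode is not watched, so restart Dataway to reload the rules
$ kubectl -n utils rollout restart deployment/dataway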
Ingress for Dataway
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dataway-sinker
  namespace: utils
spec:
  ingressClassName: nginx
  rules:
  - host: datawaysinker-xxxx.com
    http:
      paths:
      - backend:
          service:
            name: dataway
            port:
              number: 9528
        path: /
        pathType: ImplementationSpecific

FAQ

Datakit Error 403

If the sinker on Dataway is misconfigured, all Datakit requests end up carrying the secret_token upstream; the center (Kodo) does not recognize this token, and a 403 error kodo.tokenNotFound is reported.

A possible cause is a wrong etcd username or password: Dataway fails to obtain the sinker configuration, considers the current sinker invalid, and passes all data straight through to the center.

etcd permission configuration issues

If the following error is reported in the Dataway log, there may be a problem with the permission setting:

sinker ping: etcdserver: permission denied, retrying(97th)

If the permissions are not configured properly, you can delete all existing Dataway-related permissions and reconfigure them, see here

Datakit Key Priority

When configuring the global custom key list, if the global host tags or global election tags contain a key with the same name, the key-value pair from the collected data is used.

For example, suppose the configured global custom key list contains key1,key2,key3, and these keys are also configured in the global host tags or global election tags with values such as key1=value-1. If a collected data point also carries a field key1=value-from-data, then the final grouping uses key1=value-from-data from the data, ignoring the value configured in the global host tags and global election tags.

If the global host tags and global election tags contain a key with the same name, the key in the global election tags takes precedence. In summary, the value source priority of a key is as follows (decreasing):

  • Collected data
  • Global election tags
  • Global host tags
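
As a concrete illustration (hypothetical key and values), suppose datakit.conf configures a global host tag:

[global_host_tags]
  key1 = "value-1"

and a collected point carries its own key1:

metric,key1=value-from-data field=1 1712796013000000000

Then the grouping header uses the value from the data, not the configured one:

X-Global-Tags: key1=value-from-data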

Built-in "global custom key"

Datakit has several built-in custom keys. They are not typically present in the collected data, but Datakit can use them to group data. If you need to split data along these dimensions, add them to the global custom key list (none of them are configured by default). The built-in custom keys below can be used to achieve data offloading.

Warning

Adding a global custom key causes data to be split into more groups when it is sent; if the granularity is too fine, Datakit upload efficiency drops quickly. In general, no more than 3 global custom keys are recommended.

  • class is for object data. When enabled, data is divided according to object class. For example, if a pod object's class is kubelet_pod, you can write a triage rule for pods:
{
    "strict": true,
    "rules": [
        {
            "rules": [
                "{ class = 'kubelet_pod' AND other_conditon = 'some-value' }",
            ],
            "url": "https://openway.guance.com?token=<YOUR-TOKEN>",
        },
        {
            ... # other rules
        }
    ]
}
  • measurement is for metric data; we can route a specific measurement to a specific workspace. For example, the disk collector's measurement is named disk, so we can write the rule like this:
{
    "strict": true,
    "rules": [
        {
           "rules": [
               "{ measurement = 'disk' AND other_conditon = 'some-value' }",
           ],
           "url": "https://openway.guance.com?token=<YOUR-TOKEN>",
        },
        {
            ... # other rules
        }
    ]
}
  • source for logs (L), eBPF network metrics (N), events (E), and RUM data
  • service for Tracing, Scheck, and Profiling
  • category applies to all general data categories; its value is the "name" of the corresponding data category (e.g. time series is metric, object is object, etc.). Taking logs as an example, we can write a separate triage rule for logs as follows:
{
    "strict": true,
    "rules": [
        {
            "rules": [
                "{ category = 'logging' AND other_conditon = 'some-value' }",
            ],
            "url": "https://openway.guance.com?token=<YOUR-TOKEN>",
        },
        {
            ... # other rules
        }
    ]
}

Special Sink Behavior

Some requests initiated by Datakit pull resources from the center or perform self-identification. These behaviors are atomic and indivisible, so such requests cannot be distributed to multiple workspaces (Datakit needs to process the responses of these API requests to decide its subsequent actions). Therefore, these APIs can be diverted to at most one workspace.

If multiple diversion rules match, these APIs are diverted only to the workspace pointed to by the first matching rule.

Here is an example of such a diversion rule:

We recommend adding the following rule to the sinker rules to ensure that these API requests from Datakit land in a specific workspace.

{
    "strict": true,
    "info": "Some Special workspace only used for pulling APIs",
    "rules": [
        {
            "rules": [
                "{ __dataway_api in ['/v1/datakit/pull', '/v1/election', '/v1/election/heartbeat', '/v1/query/raw', '/v1/workspace', '/v1/object/labels', '/v1/check/token'] }",
            ],
            "url": "https://openway.guance.com?token=<SOME-SPECIAL-WORKSPACE-TOKEN>"
        }
    ]
}
Info

The descriptions of these API URLs are as follows:

  • /v1/election: Election request
  • /v1/election/heartbeat: Election heartbeat request
  • /v1/datakit/pull: Pull central configuration of Pipeline and blacklist
  • /v1/query/raw: DQL query
  • /v1/workspace: Get workspace information
  • /v1/object/labels: Update/delete object data
  • /v1/check/token: Check workspace Token information

The key __dataway_api does not need to be configured in global_customer_keys in datakit.conf. Dataway uses it as a diversion key by default, taking the current request's API route as its value. That is to say, for a certain API:

POST /v1/write/metrics
X-Global-Tags: cluster=cluster_A,app=math

The final diversion effect it participates in is the same as the following:

POST /v1/write/metrics
X-Global-Tags: cluster=cluster_A,app=math,__dataway_api=/v1/write/metrics

So we can directly use the __dataway_api key-value pair in sink rules for matching. This is also a reminder not to include other important data upload APIs, such as the /v1/write/... routes, in this special rule; otherwise, which workspace the data ultimately lands in is undefined.
