
Guance Log Collection and Analysis Best Practices


Log Collection

First, what are logs? Logs are text data generated by programs in a certain format, usually including a timestamp.
Logs are typically produced by servers and written to different files, such as system logs, application logs, and security logs, and these files are scattered across many machines. When a failure occurs, engineers have to log in to each server and search the logs with Linux tools like grep / sed / awk to find the cause. Without a log system, you first have to locate the server that handled the request; if that service is deployed with multiple instances, you then have to find the log files in the log directory of each instance. Each instance also has its own log rotation policy (such as one file per day, or a new file whenever the current file reaches a given size), as well as compression and archiving policies.
This whole process makes it very difficult to troubleshoot and find the root cause of a failure in time. If these logs are instead managed centrally and a centralized search capability is provided, diagnostic efficiency improves and you gain an overall view of the system, avoiding the passive firefighting that follows an incident.
Thus, log data plays a very important role in several aspects:

  • Data Search: by searching the logs, the corresponding bug can be located and a solution found;
  • Service Diagnosis: by statistically analyzing the logs, you can understand server load and service operation status;
  • Data Analysis: the log data can be used for further analysis.

Collecting File Logs

This article uses Nginx log collection as an example. Enter the conf.d/log directory under the DataKit installation directory, copy logging.conf.sample, and rename it to logging.conf. The example is as follows:

[[inputs.logging]]
  # List of log files, absolute paths can be specified, supports using glob rules for batch designation
  # It is recommended to use absolute paths
  logfiles = [
    "/var/log/nginx/access.log",                         
    "/var/log/nginx/error.log",                      
  ]

  # File path filtering, using glob rules, any file matching any filter condition will not be collected
  ignore = [""]

  # Data source, if empty, 'default' will be used by default
  source = ""

  # Add tags, if empty, $source will be used by default
  service = ""

  # Pipeline script path, if empty, $source.p will be used; if $source.p does not exist, pipeline will not be used
  pipeline = "nginx.p"

  # Filter corresponding statuses:
  #   `emerg`,`alert`,`critical`,`error`,`warning`,`info`,`debug`,`OK`
  ignore_status = []

  # Select encoding, incorrect encoding may prevent data from being viewed. Default is empty:
  #    `utf-8`, `utf-16le`, `gbk`, `gb18030` or ""
  character_encoding = ""

  ## Set regular expression, e.g., ^\d{4}-\d{2}-\d{2} matches YYYY-MM-DD time format at the beginning of the line
  ## Data matching this regular expression will be considered valid; otherwise, it will be appended to the previous valid data
  ## Use three single quotes '''this-regexp''' to avoid escaping
  ## Regular expression link: https://golang.org/pkg/regexp/syntax/#hdr-Syntax
  # multiline_match = '''^\S'''

  ## Whether to remove ANSI escape codes, such as text color in standard output
  remove_ansi_escape_codes = false

  # Custom tags
  [inputs.logging.tags]
   app = "oa"

In the custom tags section you can configure any number of key-value pairs:

  • After configuration, all data collected by this input will carry the tag app = "oa", which enables quick queries.
  • Related documentation: <DataFlux Tag Application Best Practices>

Restart DataKit:

systemctl restart datakit
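
After the restart you can check, for example with DQL, that newly collected logs carry the tag. A sketch, assuming source was left empty above so the logs land under the default source:

dql > L::default { app = 'oa' } LIMIT 1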

Collecting Multi-line Logs

A multi-line log entry is recognized by the characteristics of its first line: if a line matches those characteristics, it starts a new log entry; if not, it is treated as a continuation of the previous entry. For example, logs are usually written flush left, but some log text is not, such as the stack trace printed when a program crashes; such text forms a multi-line log. In DataKit, a regular expression is used to describe the first-line characteristics: a line that matches the regular expression is treated as the start of a new log entry, and every following line that does not match is appended to that entry, until another line matches. To enable multi-line log collection, modify the following configuration in logging.conf:

multiline_match = '''Fill in the specific regular expression here''' # Note: it is recommended to wrap the regular expression in three single quotes

For the regular expression syntax used by the log collector, refer to the Golang regexp documentation: https://golang.org/pkg/regexp/syntax/#hdr-Syntax

Here is an example of a Python log:

2020-10-23 06:41:56,688 INFO demo.py 1.0
2020-10-23 06:54:20,164 ERROR /usr/local/lib/python3.6/dist-packages/flask/app.py Exception on /0 [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
ZeroDivisionError: division by zero
2020-10-23 06:41:56,688 INFO demo.py 5.0

Here multiline_match is configured as ^\d{4}-\d{2}-\d{2}.* (i.e., a new log entry starts with a date such as 2020-10-23); the Traceback lines that follow do not match and are appended to the previous entry.
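
A minimal logging.conf excerpt for this case (a sketch reusing the multiline_match option shown in the sample configuration above):

  ## A line starting with a date such as 2020-10-23 begins a new log entry;
  ## the Traceback / File / ZeroDivisionError lines do not match and are appended to the previous entry
  multiline_match = '''^\d{4}-\d{2}-\d{2}'''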

Filtering Special Byte Codes in Logs

Logs may contain unreadable byte codes (like terminal output colors), which can be filtered out by setting remove_ansi_escape_codes to true in logging.conf.

Note: enabling this option slightly increases processing time.
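
For example, in logging.conf (excerpt):

  ## Remove ANSI escape codes, such as text color in standard output
  remove_ansi_escape_codes = true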

Collecting Remote Log Files

On Linux, you can use NFS to mount the log directory of the host where the logs reside onto the host where DataKit is installed, then configure the Logging collector to point at the mounted path to complete the collection.
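
A sketch, assuming the remote host 192.168.1.10 exports /var/log/nginx over NFS (the IP address, export, and mount point are illustrative):

# On the DataKit host: mount the remote log directory, then add the mounted path to logfiles in logging.conf
mount -t nfs 192.168.1.10:/var/log/nginx /mnt/remote-nginx-logs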

Collecting Streaming Logs

This article uses collecting Fluentd logs as an example.

Example Fluentd version: td-agent-4.2.x, configurations may vary across different versions.

DataKit Configuration

When collecting streaming logs, DataKit starts an HTTP Server to receive log text data and report it to Guance. The HTTP URL is fixed as /v1/write/logstreaming, i.e., http://Datakit_IP:PORT/v1/write/logstreaming

Note: If DataKit is deployed as a daemonset in Kubernetes, you can access it via Service, the address is http://datakit-service.datakit:9529

Enter the conf.d/log directory under the DataKit installation directory, copy logstreaming.conf.sample and rename it to logstreaming.conf. The example is as follows:

[inputs.logstreaming]
  ignore_url_tags = true

Restart DataKit

systemctl restart datakit
Parameter Support

Logstreaming supports adding parameters to the HTTP URL to control how the log data is handled (see the curl sketch after this list). The parameter list is as follows:

  • type: data format, currently only influxdb is supported.
      • When type is influxdb (/v1/write/logstreaming?type=influxdb), the data itself is in line-protocol format (default precision is s); only built-in tags are added and no further processing is done.
      • When this value is empty, the data is processed line by line and run through the Pipeline.
  • source: identifies the data source, i.e., the measurement in the line protocol, for example nginx or redis (/v1/write/logstreaming?source=nginx).
      • This value is invalid when type is influxdb.
      • Defaults to default.
  • service: adds a service tag field, for example (/v1/write/logstreaming?service=nginx_service).
      • Defaults to the value of the source parameter.
  • pipeline: specifies the name of the Pipeline to be used by the data, for example nginx.p (/v1/write/logstreaming?pipeline=nginx.p).
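
For example, you can push a line of text to this interface with curl to verify that collection works (a sketch; the log line and parameter values are illustrative):

curl -X POST \
  "http://127.0.0.1:9529/v1/write/logstreaming?source=nginx&service=nginx_service&pipeline=nginx.p" \
  -d '172.17.0.1 - - [06/Jan/2017:16:16:37 +0000] "GET / HTTP/1.1" 200 612'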

Fluentd Configuration

Take Fluentd collecting Nginx logs and forwarding them to an upstream server as an example. Rather than sending the logs directly to that server for processing, we want to process them locally and send them to DataKit, which reports them to the Guance platform for analysis. The original forwarding configuration is as follows:

##PC end log collection
<source>
  @type tail
  format ltsv
  path /var/log/nginx/access.log
  pos_file /var/log/buffer/posfile/access.log.pos
  tag nginx
  time_key time
  time_format %d/%b/%Y:%H:%M:%S %z
</source>

##Collect data forwarded to multiple servers' port 49875 via TCP protocol

<match nginx>
 @type forward
  <server>
   name es01
   host es01
   port 49875
   weight 60
  </server>
  <server>
   name es02
   host es02
   port 49875
   weight 60
  </server>
</match>
Modify the match output to the http type and point the endpoint at the DataKit address with logstreaming enabled to complete the collection:

##PC end log collection
<source>
  @type tail
  format ltsv
  path /var/log/nginx/access.log
  pos_file /var/log/buffer/posfile/access.log.pos
  tag nginx
  time_key time
  time_format %d/%b/%Y:%H:%M:%S %z
</source>

##Collect data forwarded to local DataKit via HTTP protocol

## nginx output

<match nginx>
  @type http
  endpoint http://127.0.0.1:9529/v1/write/logstreaming?source=nginx_td&pipeline=nginx.p
  open_timeout 2
  <format>
    @type json
  </format>
</match>

After modifying the configuration, restart td-agent to complete data reporting.
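
For example, assuming td-agent is managed by systemd:

systemctl restart td-agent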


You can verify the reported data using DQL:

dql > L::nginx_td LIMIT 1
-----------------[ r1.nginx_td.s1 ]-----------------
    __docid 'L_c6et7vk5jjqulpr6osa0'
create_time 1637733374609
    date_ns 96184
       host 'df-solution-ecs-018'
    message '{"120.253.192.179 - - [24/Nov/2021":"13:55:10 +0800] \"GET / HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36\" \"-\""}'
     source 'nginx_td'
       time 2021-11-24 13:56:06 +0800 CST
---------
1 rows, 1 series, cost 2ms

Log Parsing (Pipeline)

Logs generated by systems or services are generally long strings with fields separated by spaces, and they are usually collected as whole strings. If a log is split into fields and the meaning of each field analyzed, the resulting data is much clearer and easier to visualize.
Pipeline is an important Guance component for text data processing. Its main function is to convert text-format strings into structured data, working together with Grok regular expressions.
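
As a minimal sketch (the example log line and field names are illustrative), a Pipeline script of just a couple of lines is enough to structure a simple log such as 2021-01-25 18:37:22 ERROR disk full:

# Extract the timestamp, log level and message, then use the extracted timestamp as the log time
grok(_, "%{TIMESTAMP_ISO8601:time} %{LOGLEVEL:status} %{GREEDYDATA:msg}")
default_time(time)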

Grok Pattern Classification

In DataKit, Grok patterns can be divided into two categories: global patterns and local patterns. Patterns in the pattern directory are global patterns, available to all Pipeline scripts. Patterns added through the add_pattern() function within Pipeline scripts are local patterns, effective only for the current Pipeline script.
When DataKit's built-in patterns cannot meet the requirements, users can add pattern files under the Pipeline directory to extend them. For a global custom pattern, create a new file in the pattern directory and add the pattern there; do not add it to or modify the existing built-in pattern files, because they are overwritten when DataKit starts.

Adding Local Patterns

Grok essentially predefines a set of regular expressions for text matching and extraction, and names them so they can be referenced and nested easily, allowing countless new patterns to be derived from them. For example, DataKit has three built-in patterns as follows:

_second (?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)    # Matches seconds, _second is the pattern name
_minute (?:[0-5][0-9])                            # Matches minutes, _minute is the pattern name
_hour (?:2[0123]|[01]?[0-9])                      # Matches hours, _hour is the pattern name

Based on the above three built-in patterns, you can extend your own pattern named time:

# Add time to a file in the pattern directory; this pattern is global and can be referenced anywhere
time ([^0-9]?)%{_hour:hour}:%{_minute:minute}(?::%{_second:second})([^0-9]?)

# It can also be added in the pipeline file with add_pattern(), which makes it a local pattern usable only by the current pipeline script
add_pattern("time", "([^0-9]?)%{_hour:hour}:%{_minute:minute}(?::%{_second:second})([^0-9]?)")

# Extract the time field from the input using grok. Assuming the input is 12:30:59, the extracted result is {"hour": 12, "minute": 30, "second": 59}
grok(_, "%{time}")

Note:

  • If the same pattern name exists at both levels, the script-level one takes precedence (i.e., local patterns override global patterns).
  • In pipeline scripts, add_pattern() must be called before grok(), otherwise the first data extraction will fail.

Configuring Nginx Log Parsing

Writing Pipeline File

Write the Pipeline file in the <DataKit installation directory>/pipeline directory, named nginx.p.

add_pattern("date2", "%{YEAR}[./]%{MONTHNUM}[./]%{MONTHDAY} %{TIME}")

grok(_, "%{IPORHOST:client_ip} %{NOTSPACE:http_ident} %{NOTSPACE:http_auth} \\[%{HTTPDATE:time}\\] \"%{DATA:http_method} %{GREEDYDATA:http_url} HTTP/%{NUMBER:http_version}\" %{INT:status_code} %{INT:bytes}")

# Access log
add_pattern("access_common", "%{IPORHOST:client_ip} %{NOTSPACE:http_ident} %{NOTSPACE:http_auth} \\[%{HTTPDATE:time}\\] \"%{DATA:http_method} %{GREEDYDATA:http_url} HTTP/%{NUMBER:http_version}\" %{INT:status_code} %{INT:bytes}")
grok(_, '%{access_common} "%{NOTSPACE:referrer}" "%{GREEDYDATA:agent}')
user_agent(agent)

# Error log
grok(_, "%{date2:time} \\[%{LOGLEVEL:status}\\] %{GREEDYDATA:msg}, client: %{IPORHOST:client_ip}, server: %{IPORHOST:server}, request: \"%{DATA:http_method} %{GREEDYDATA:http_url} HTTP/%{NUMBER:http_version}\", (upstream: \"%{GREEDYDATA:upstream}\", )?host: \"%{IPORHOST:ip_or_host}\"")
grok(_, "%{date2:time} \\[%{LOGLEVEL:status}\\] %{GREEDYDATA:msg}, client: %{IPORHOST:client_ip}, server: %{IPORHOST:server}, request: \"%{GREEDYDATA:http_method} %{GREEDYDATA:http_url} HTTP/%{NUMBER:http_version}\", host: \"%{IPORHOST:ip_or_host}\"")
grok(_,"%{date2:time} \\[%{LOGLEVEL:status}\\] %{GREEDYDATA:msg}")

group_in(status, ["warn", "notice"], "warning")
group_in(status, ["error", "crit", "alert", "emerg"], "error")

cast(status_code, "int")
cast(bytes, "int")

group_between(status_code, [200,299], "OK", status)
group_between(status_code, [300,399], "notice", status)
group_between(status_code, [400,499], "warning", status)
group_between(status_code, [500,599], "error", status)


nullif(http_ident, "-")
nullif(http_auth, "-")
nullif(upstream, "")
default_time(time)

Note: when splitting fields, avoid names that may conflict with tag keys (see Pipeline Field Naming Precautions).

Debugging Pipeline File

Since Grok Patterns are numerous, manual matching can be cumbersome. DataKit provides an interactive command-line tool grokq (Grok Query):

datakit --grokq
grokq > Mon Jan 25 19:41:17 CST 2021   # Enter the text you wish to match here
        2 %{DATESTAMP_OTHER: ?}        # The tool suggests matching patterns; the number on the left is the weight, and a higher weight means a more precise match
        0 %{GREEDYDATA: ?}

grokq > 2021-01-25T18:37:22.016+0800
        4 %{TIMESTAMP_ISO8601: ?}      # Here, ? means you should name the matched text with a field
        0 %{NOTSPACE: ?}
        0 %{PROG: ?}
        0 %{SYSLOGPROG: ?}
        0 %{GREEDYDATA: ?}             # Patterns like GREEDYDATA have low weights due to their broad range
                                       # Higher weight means higher precision

grokq > Q                              # Q or exit to quit
Bye!

After writing the Pipeline file, test it with the datakit command line: specify the Pipeline script name with --pl (the script must be placed in the <DataKit installation directory>/pipeline directory) and a segment of text with --txt to check whether the extraction succeeds.

# Successful extraction example
datakit --pl nginx.p --txt '172.17.0.1 - - [06/Jan/2017:16:16:37 +0000] "GET /datadoghq/company?test=var1%20Pl HTTP/1.1" 401 612 "http://www.perdu.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" "-"'
Extracted data(cost: 5.279203ms):  # Indicates successful parsing
{
  "agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36\" \"-\"",
  "browser": "Chrome",
  "browserVer": "55.0.2883.87",
  "bytes": 612,
  "client_ip": "172.17.0.1",
  "engine": "AppleWebKit",
  "engineVer": "537.36",
  "http_method": "GET",
  "http_url": "/datadoghq/company?test=var1%20Pl",
  "http_version": "1.1",
  "isBot": false,
  "isMobile": false,
  "message": "172.17.0.1 - - [06/Jan/2017:16:16:37 +0000] \"GET /datadoghq/company?test=var1%20Pl HTTP/1.1\" 401 612 \"http://www.perdu.com/\" \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36\" \"-\"",
  "os": "Linux x86_64",
  "referrer": "http://www.perdu.com/",
  "status": "warning",
  "status_code": 401,
  "time": 1483719397000000000,
  "ua": "X11"
}

# Failed extraction example
datakit --pl nginx.p --txt '172.17.0.1 - - [06/Jan/2017:16:16:37 +0000] "GET /datadoghq/company?test=var1%20Pl HTTP/1.1" 401 612 "http://www.perdu.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" "-"'
No data extracted from pipeline

Configuring Collector to Apply Pipeline Script

Text Log Collection Pipeline Configuration

Taking Nginx log collection as an example, configure the pipeline field in the Logging collector. Note that what is configured here is the Pipeline script name, not its path; all referenced Pipeline scripts must be stored in the <DataKit installation directory>/pipeline directory:

[[inputs.logging]]
  # List of log files, can specify absolute paths, supports using glob rules for batch designation
  # Recommended to use absolute paths
  logfiles = [
    "/var/log/nginx/access.log",                         
    "/var/log/nginx/error.log",                      
  ]

  # File path filtering, using glob rules, any file matching any filter condition will not be collected
  ignore = [""]

  # Data source, if empty, 'default' will be used by default
  source = ""

  # Add tags, if empty, $source will be used by default
  service = ""

  # Pipeline script path, if empty will use $source.p, if $source.p does not exist, no pipeline will be used
  pipeline = "nginx.p"

  # Filter corresponding statuses:
  #   `emerg`,`alert`,`critical`,`error`,`warning`,`info`,`debug`,`OK`
  ignore_status = []

  # Select encoding, incorrect encoding may prevent data from being viewed. Default is empty:
  #    `utf-8`, `utf-16le`, `gbk`, `gb18030` or ""
  character_encoding = ""

  ## Set regular expression, e.g., ^\d{4}-\d{2}-\d{2} matches YYYY-MM-DD time format at the beginning of the line
  ## Data matching this regular expression will be considered valid; otherwise, it will be appended to the previous valid data
  ## Use three single quotes '''this-regexp''' to avoid escaping
  ## Regular expression link: https://golang.org/pkg/regexp/syntax/#hdr-Syntax
  # multiline_match = '''^\S'''

  ## Whether to remove ANSI escape codes, such as text color in standard output
  remove_ansi_escape_codes = false

  # Custom tags
  [inputs.logging.tags]
   app = "oa"

Restart Datakit to parse the corresponding logs.

systemctl restart datakit
Streaming Log Collection Pipeline Configuration

Taking Fluentd log collection as an example, modify the match output to the http type, point the endpoint at the DataKit address with logstreaming enabled, and add the Pipeline script name to the URL to complete the collection:

##PC end log collection
<source>
  @type tail
  format ltsv
  path /var/log/nginx/access.log
  pos_file /var/log/buffer/posfile/access.log.pos
  tag nginx
  time_key time
  time_format %d/%b/%Y:%H:%M:%S %z
</source>

##Collect data forwarded to local DataKit via HTTP protocol
## nginx output
<match nginx>
  @type http
  endpoint http://127.0.0.1:9529/v1/write/logstreaming?source=nginx_td&pipeline=nginx.p
  open_timeout 2
  <format>
    @type json
  </format>
</match>
After modifying the configuration, restart td-agent to complete data reporting.

Log Collection Performance Optimization

Why is My Pipeline Running Very Slowly?

Performance issues come up frequently. Users often find that Grok expressions slow down the speed at which the Pipeline processes logs. Grok patterns are implemented on top of regular expressions, so slowness usually comes from Grok variables that cover too many possible inputs, or from matching every line against several full-line patterns in sequence.

Be Aware of Expressions That Match Twice

We have seen many Grok patterns encounter problems when processing various application logs from the same gateway, such as Syslog. Imagine a scenario where we use the "common_header: payload" log format to record three types of application logs:

Application 1: '8.8.8.8 process-name[666]: a b 1 2 a lot of text at the end'
Application 2: '8.8.8.8 process-name[667]: a 1 2 3 a lot of text near the end;4'
Application 3: '8.8.8.8 process-name[421]: a completely different format | 1111'

Usually, we handle all three types of logs in one Pipeline:

grok(_ , "%{IPORHOST:clientip} %{DATA:process_name}\[%{NUMBER:process_id}\]: %{WORD:word_1} %{WORD:word_2} %{NUMBER:number_1} %{NUMBER:number_2} %{DATA:data}")
grok(_ , "%{IPORHOST:clientip} %{DATA:process_name}\[%{NUMBER:process_id}\]: %{WORD:word_1} %{NUMBER:number_1} %{NUMBER:number_2} %{NUMBER:number_3} %{DATA:data};%{NUMBER:number_4}")
grok(_ , "%{IPORHOST:clientip} %{DATA:process_name}\[%{NUMBER:process_id}\]: %{DATA:data} | %{NUMBER:number}")

However, even when every log line eventually matches, Grok still tries the patterns one by one and only stops at the first pattern that matches. The order of the grok() calls therefore matters: logs in the formats listed later are first tested, and fail, against all of the earlier patterns. A common optimization is hierarchical matching, extracting the shared header once and then matching only the remaining payload:

add_pattern("message", "%{IPORHOST:clientip} %{DATA:process_name}\[%{NUMBER:process_id}\]: %{GREEDYDATA:message}")

grok(_, "%{message} %{WORD:word_1} %{WORD:word_2} %{NUMBER:number_1} %{NUMBER:number_2} %{GREEDYDATA:data}")
grok(_, "%{message} %{WORD:word_1} %{NUMBER:number_1} %{NUMBER:number_2} %{NUMBER:number_3} %{DATA:data};%{NUMBER:number_4}")
grok(_, "%{message} %{DATA:data} | %{NUMBER:number}")

Be Aware of High-Cost Grok Expressions

Let's look at the following Nginx log:

172.17.0.1 - - [06/Jan/2017:16:16:37 +0000] "GET /datadoghq/company?test=var1%20Pl HTTP/1.1" 401 612 "http://www.perdu.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" "-"

We usually process it using flexible Grok expressions in a Pipeline:

grok(_, "%{IPORHOST:client_ip} %{NOTSPACE:http_ident} %{NOTSPACE:http_auth} \\[%{HTTPDATE:time}\\] \"%{DATA:http_method} %{GREEDYDATA:http_url} HTTP/%{NUMBER:http_version}\" %{INT:status_code} %{INT:bytes}")

cast(status_code, "int")
cast(bytes, "int")

Here, matching 172.17.0.1 with %{IPORHOST:client_ip} is expensive, because Grok is compiled into regular expressions under the hood, and the more cases a Grok expression covers, the worse its performance may be. Let's look at the complex regular expressions behind %{IPORHOST:client_ip}:

IPORHOST (?:%{IP}|%{HOSTNAME})
HOSTNAME \b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)
IP (?:%{IPV6}|%{IPV4})
IPV6 ((([0-9A-Fa-f]{1,4}:){7}([0-9A-Fa-f]{1,4}|:))|(([0-9A-Fa-f]{1,4}:){6}(:[0-9A-Fa-f]{1,4}|((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){5}(((:[0-9A-Fa-f]{1,4}){1,2})|:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){4}(((:[0-9A-Fa-f]{1,4}){1,3})|((:[0-9A-Fa-f]{1,4})?:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){3}(((:[0-9A-Fa-f]{1,4}){1,4})|((:[0-9A-Fa-f]{1,4}){0,2}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){2}(((:[0-9A-Fa-f]{1,4}){1,5})|((:[0-9A-Fa-f]{1,4}){0,3}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){1}(((:[0-9A-Fa-f]{1,4}){1,6})|((:[0-9A-Fa-f]{1,4}){0,4}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(:(((:[0-9A-Fa-f]{1,4}){1,7})|((:[0-9A-Fa-f]{1,4}){0,5}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:)))(%.+)?
IPV4 (?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])

We can see that a short Grok expression can contain so many complex regular expressions. When processing a large volume of logs, using such complex Grok expressions significantly impacts performance. How do we optimize it?

grok(_, "%{NOTSPACE:client_ip} %{NOTSPACE:http_ident} %{NOTSPACE:http_auth} \\[%{HTTPDATE:time}\\] \"%{DATA:http_method} %{GREEDYDATA:http_url} HTTP/%{NUMBER:http_version}\" %{INT:status_code} %{INT:bytes}")

cast(status_code, "int")
cast(bytes, "int")

default_time(time)

When performance matters, prefer %{NOTSPACE:}. Grok expressions are compiled into regular expressions under the hood, and the more cases an expression covers, the worse its performance may be; by contrast, a simple variable such as %{NOTSPACE:} (any run of non-space characters) performs very well. So when tokenizing, if you are sure a field contains no spaces and is delimited by whitespace, choose %{NOTSPACE:} to improve Pipeline performance.

Better Utilization of Tools for Writing Pipelines

DataKit - Interactive Command-Line Tool grokq

Due to the numerous Grok Patterns, manual matching can be cumbersome. DataKit provides an interactive command-line tool grokq (Grok Query):

datakit --grokq
grokq > Mon Jan 25 19:41:17 CST 2021   # Enter the text you wish to match here
        2 %{DATESTAMP_OTHER: ?}        # The tool suggests matching patterns; the number on the left is the weight, and a higher weight means a more precise match
        0 %{GREEDYDATA: ?}

grokq > 2021-01-25T18:37:22.016+0800
        4 %{TIMESTAMP_ISO8601: ?}      # Here, ? means you should name the matched text with a field
        0 %{NOTSPACE: ?}
        0 %{PROG: ?}
        0 %{SYSLOGPROG: ?}
        0 %{GREEDYDATA: ?}             # Patterns like GREEDYDATA have low weights due to their broad range
                                       # Higher weight means higher precision

DataKit - Pipeline Script Testing

After writing the Pipeline file, test it with the datakit command line: specify the Pipeline script name with --pl (the script must be placed in the <DataKit installation directory>/pipeline directory) and a segment of text with --txt to check whether the extraction succeeds.

# Successful extraction example
datakit --pl nginx.p --txt '172.17.0.1 - - [06/Jan/2017:16:16:37 +0000] "GET /datadoghq/company?test=var1%20Pl HTTP/1.1" 401 612 "http://www.perdu.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" "-"'
Extracted data(cost: 5.279203ms):  # Indicates successful parsing
{
  "agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36\" \"-\"",
  "browser": "Chrome",
  "browserVer": "55.0.2883.87",
  "bytes": 612,
  "client_ip": "172.17.0.1",
  "engine": "AppleWebKit",
  "engineVer": "537.36",
  "http_method": "GET",
  "http_url": "/datadoghq/company?test=var1%20Pl",
  "http_version": "1.1",
  "isBot": false,
  "isMobile": false,
  "message": "172.17.0.1 - - [06/Jan/2017:16:16:37 +0000] \"GET /datadoghq/company?test=var1%20Pl HTTP/1.1\" 401 612 \"http://www.perdu.com/\" \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36\" \"-\"",
  "os": "Linux x86_64",
  "referrer": "http://www.perdu.com/",
  "status": "warning",
  "status_code": 401,
  "time": 1483719397000000000,
  "ua": "X11"
}

# Failed extraction example
datakit --pl nginx.p --txt '172.17.0.1 - - [06/Jan/2017:16:16:37 +0000] "GET /datadoghq/company?test=var1%20Pl HTTP/1.1" 401 612 "http://www.perdu.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" "-"'
No data extracted from pipeline

Online Grok Debugging

Use the GrokDebug website for Grok debugging


Log Collection Cost Optimization

Cost Optimization Through Guance Product Features

Guance supports filtering out logs that match certain criteria by setting log blacklists. After configuring the log blacklist, matching log data will no longer be reported to the Guance workspace, helping users save on log storage costs.

Note: This configuration does not get pushed down to DataKit. It takes effect when DataKit actively fetches the central configuration file and executes the filtering actions locally.

Creating a New Log Blacklist

In the Guance workspace, click on 「Logs」-「Blacklist」-「Create New Blacklist」, select 「Log Source」, add one or more log filtering rules, and click Confirm to enable the log filtering rule by default. You can view all log filtering rules through 「Log Blacklist」.
Note: the log filtering conditions are combined with AND, i.e., a log is dropped (not reported to the workspace) only when it meets all of the filtering conditions.

Pre-cost Optimization for Streaming Log Collection

When collecting Fluentd logs, you can aggregate logs within <match> ... </match> to compress them, or filter events so that only error or alert logs are reported to Guance, reducing usage costs.
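
A minimal sketch, assuming the parsed Nginx access log carries the HTTP status in a field named code (the tag and field name are illustrative); only 4xx/5xx events are kept and forwarded to DataKit by the <match nginx> block that follows it:

##Keep only error-class events before forwarding to DataKit
<filter nginx>
  @type grep
  <regexp>
    key code
    pattern /^[45]\d\d$/
  </regexp>
</filter>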

More Content

Text Data Processing (Pipeline)

Debugging Pipeline

Logs

Third-party Log Integration
