Guance Log Collection and Analysis Best Practices¶
Log Collection¶
First, what are logs? Logs are text data generated by programs, following a certain format (usually including timestamps).
Logs are usually generated by servers and written to different files, typically system logs, application logs, and security logs, stored across different machines. When a system fails, engineers have to log in to each server and dig through the logs with Linux tools such as grep / sed / awk to find the cause. Without a log system, the first step is to locate the server that handled the request; if several application instances are deployed on that server, you then have to search the log directory of each instance. On top of that, every instance applies its own log rotation policy (e.g., a new file every day, or whenever the file reaches a certain size) as well as compression and archiving policies.
This whole process makes it cumbersome to troubleshoot failures and find the root cause in time. If these logs can be managed centrally and searched from one place, diagnostic efficiency improves and you also gain a comprehensive view of the system, instead of passively putting out fires after the fact.
So, log data plays a very important role in the following aspects:
- Data search: Locate the corresponding Bug by searching log information and find a solution;
- Service diagnosis: Understand server load and service running status by analyzing log information;
- Data analysis: Perform further data analysis.
Collecting File Logs¶
This article takes Nginx log collection as an example. Enter the conf.d/log directory under the DataKit installation directory, copy logging.conf.sample and name it logging.conf. The example is as follows:
[[inputs.logging]]
# Log file list, absolute paths can be specified, and batch specification using glob rules is supported
# It is recommended to use absolute paths
logfiles = [
"/var/log/nginx/access.log",
"/var/log/nginx/error.log",
]
# File path filtering, using glob rules, any file that meets any filtering condition will not be collected
ignore = [""]
# Data source, if empty, 'default' is used by default
source = ""
# Add a tag, if empty, $source is used by default
service = ""
# Pipeline script path, if empty, $source.p will be used, if $source.p does not exist, no pipeline will be used
pipeline = "nginx.p"
# Filter corresponding status:
# `emerg`,`alert`,`critical`,`error`,`warning`,`info`,`debug`,`OK`
ignore_status = []
# Select encoding, incorrect encoding will cause data to be unviewable. Default is empty:
# `utf-8`, `utf-16le`, `gbk`, `gb18030` or ""
character_encoding = ""
## Set regular expression, e.g., ^\d{4}-\d{2}-\d{2} matches YYYY-MM-DD time format at the beginning of the line
## Data matching this regular expression will be considered valid, otherwise it will be appended to the end of the previous valid data
## Use three single quotes '''this-regexp''' to avoid escaping
## Regular expression link: https://golang.org/pkg/regexp/syntax/#hdr-Syntax
# multiline_match = '''^\S'''
## Whether to remove ANSI escape codes, such as text colors in standard output
remove_ansi_escape_codes = false
# Custom tags
[inputs.logging.tags]
app = "oa"
The [inputs.logging.tags] section defines custom tags; any key-value pairs can be added. With the configuration above, all data collected from these files carries the tag app = oa, which can be used for quick filtering in queries. See the related document <DataFlux Tag Application Best Practices>. After completing the configuration, restart DataKit.
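On a typical Linux host installation, DataKit runs as a system service, so the restart can be done through the service manager, for example (assuming a systemd-based host):
systemctl restart datakit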
Collecting Multi-line Logs¶
By recognizing the characteristic first line of a multi-line log, DataKit can decide whether a line starts a new log entry; a line that does not match this characteristic is treated as an append to the previous entry.
For example, log entries are normally written starting at the beginning of a line, but some text, such as the call stack printed when a program crashes, is not; such text forms a multi-line log. In DataKit, a regular expression identifies the first-line characteristic: a line that matches the expression starts a new entry, and all subsequent non-matching lines are appended to that entry, until the next line that matches the expression is encountered.
To enable multi-line log collection, modify the following configuration in logging.conf:
multiline_match = '''Fill in the specific regular expression here''' # Note: it is recommended to wrap the regular expression in three single quotes to avoid escaping
For the regular expression syntax supported by the log collector, refer to https://golang.org/pkg/regexp/syntax/#hdr-Syntax.
Here is a Python log as an example:
2020-10-23 06:41:56,688 INFO demo.py 1.0
2020-10-23 06:54:20,164 ERROR /usr/local/lib/python3.6/dist-packages/flask/app.py Exception on /0 [GET]
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2447, in wsgi_app
response = self.full_dispatch_request()
ZeroDivisionError: division by zero
2020-10-23 06:41:56,688 INFO demo.py 5.0
The corresponding match configuration is ^\d{4}-\d{2}-\d{2}.* (i.e., a new entry starts with a date such as 2020-10-23); the traceback lines that do not match are appended to the preceding entry.
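For this example, the multi-line setting in logging.conf would look like the following sketch (adjust the regular expression to the timestamp format of your own logs):
# inside the [[inputs.logging]] block of logging.conf
multiline_match = '''^\d{4}-\d{2}-\d{2}'''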
Filtering Special Byte Codes in Logs¶
Logs may contain some unreadable byte codes (such as colors in terminal output), which can be deleted and filtered by setting remove_ansi_escape_codes to true in logging.conf.
Enabling this feature will slightly increase processing time
Collecting Remote Log Files¶
On Linux, you can mount the log directory of the remote host onto the host where DataKit is installed via NFS, then configure the mounted path in the Logging collector to complete the collection.
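A minimal sketch of this approach, assuming the log host exports /var/log/nginx over NFS (the host name and mount point below are illustrative):
# on the host running DataKit: mount the remote Nginx log directory
mount -t nfs log-host:/var/log/nginx /mnt/remote-nginx-logs
# then point logging.conf at the mounted path, e.g.
# logfiles = ["/mnt/remote-nginx-logs/access.log"]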
Collecting Streaming Logs¶
This article takes collecting Fluentd logs as an example
Example Fluentd version: td-agent-4.2.x, configurations may vary for different versions.
DataKit Configuration¶
When collecting streaming logs, DataKit will start an HTTP Server to receive log text data and report it to Guance. The HTTP URL is fixed as: /v1/write/logstreaming, i.e., http://Datakit_IP:PORT/v1/write/logstreaming
Note: If DataKit is deployed in Kubernetes as a daemonset, you can use the Service method to access, the address is
http://datakit-service.datakit:9529
Enter the conf.d/log directory under the DataKit installation directory, copy logstreaming.conf.sample and name it logstreaming.conf. The example is as follows:
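The sample configuration is minimal: enabling the input is enough to start the /v1/write/logstreaming endpoint. A sketch of the resulting logstreaming.conf (options may vary by DataKit version, check your local logstreaming.conf.sample):
[inputs.logstreaming]
  # no additional settings are required for basic collection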
Restart DataKit
Parameter Support¶
Logstreaming supports adding parameters to the HTTP URL to control how the log data is handled (an example request follows the list):
- type: data format. Currently only influxdb is supported.
    - When type is influxdb (/v1/write/logstreaming?type=influxdb), the data itself is expected to be in line protocol format (default precision is s); only built-in tags are added and no other processing is performed.
    - When this value is empty, the data is processed line by line and through the Pipeline.
- source: identifies the data source, i.e., the measurement of the line protocol, e.g., nginx or redis (/v1/write/logstreaming?source=nginx).
    - When type is influxdb, this value does not take effect.
    - Defaults to default.
- service: adds the service tag field, e.g., /v1/write/logstreaming?service=nginx_service.
    - Defaults to the value of the source parameter.
- pipeline: specifies the Pipeline script to apply to the data, e.g., nginx.p (/v1/write/logstreaming?pipeline=nginx.p).
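As a quick way to exercise these parameters, you can push a test line to the endpoint with curl (a sketch; replace the DataKit address, source, and pipeline with your own values):
curl -X POST \
  "http://127.0.0.1:9529/v1/write/logstreaming?source=nginx_td&pipeline=nginx.p" \
  -d 'this is a test log line'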
Fluentd Configuration¶
Take as the starting point an existing Fluentd setup that collects Nginx logs and forwards them to upstream servers via the forward output plugin. Instead of sending the logs on to those servers for processing, we want to process them locally and send them to DataKit, which reports them to the Guance platform for analysis.
##PC-side log collection
<source>
@type tail
format ltsv
path /var/log/nginx/access.log
pos_file /var/log/buffer/posfile/access.log.pos
tag nginx
time_key time
time_format %d/%b/%Y:%H:%M:%S %z
</source>
## The collected data is forwarded to multiple servers' 49875 port via TCP protocol
<match nginx>
@type forward
<server>
name es01
host es01
port 49875
weight 60
</server>
<server>
name es02
host es02
port 49875
weight 60
</server>
</match>
##PC-side log collection
<source>
@type tail
format ltsv
path /var/log/nginx/access.log
pos_file /var/log/buffer/posfile/access.log.pos
tag nginx
time_key time
time_format %d/%b/%Y:%H:%M:%S %z
</source>
## The collected data is forwarded to the local DataKit via HTTP protocol
## nginx output
<match nginx>
@type http
endpoint http://127.0.0.1:9529/v1/write/logstreaming?source=nginx_td&pipeline=nginx.p
open_timeout 2
<format>
@type json
</format>
</match>
After modifying the configuration, restart td-agent to complete data reporting
You can use DQL to verify the reported data:
dql > L::nginx_td LIMIT 1
-----------------[ r1.nginx_td.s1 ]-----------------
__docid 'L_c6et7vk5jjqulpr6osa0'
create_time 1637733374609
date_ns 96184
host 'df-solution-ecs-018'
message '{"120.253.192.179 - - [24/Nov/2021":"13:55:10 +0800] \"GET / HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36\" \"-\""}'
source 'nginx_td'
time 2021-11-24 13:56:06 +0800 CST
---------
1 rows, 1 series, cost 2ms
Log Parsing (Pipeline)¶
Logs generated by systems or services are usually long strings with fields separated by spaces, and collection normally yields the whole string. If each field is split out with its own meaning, the resulting data is much clearer and easier to visualize and analyze.
Pipeline is an important component of Guance for text data processing. Its main function is to convert text format strings into specific structured data, used in conjunction with Grok regular expressions.
Grok Pattern Classification¶
Grok patterns in DataKit can be divided into two categories: global patterns and local patterns. Pattern files in the pattern directory are global patterns and can be used by all Pipeline scripts, while patterns added through the add_pattern() function in Pipeline scripts are local patterns and are only valid for the current Pipeline script.
When the built-in patterns of DataKit cannot meet all user needs, users can add pattern files in the Pipeline directory to expand. If the custom pattern is global, you need to create a new file in the pattern directory and add the pattern to it. Do not add or modify in the existing built-in pattern files, because the built-in pattern files will be overwritten during the DataKit startup process.
Adding Local Patterns¶
Grok essentially predefines some regular expressions for text matching and extraction, and names the predefined regular expressions for easy use and nested reference to expand countless new patterns. For example, DataKit has the following three built-in patterns:
_second (?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?) #Match seconds, _second is the pattern name
_minute (?:[0-5][0-9]) #Match minutes, _minute is the pattern name
_hour (?:2[0123]|[01]?[0-9]) #Match hours, _hour is the pattern name
Based on the above three built-in patterns, you can expand your own built-in pattern and name it time:
# Added to a file in the pattern directory; this is a global pattern and can be referenced anywhere as time
time ([^0-9]?)%{_hour:hour}:%{_minute:minute}(?::%{_second:second})([^0-9]?)
# It can also be added in a pipeline file via add_pattern(); it then becomes a local pattern that only the current pipeline script can use as time
add_pattern("time", "([^0-9]?)%{_hour:hour}:%{_minute:minute}(?::%{_second:second})([^0-9]?)")
# Extract the time field from the original input with grok. Assuming the input is 12:30:59, {"hour": 12, "minute": 30, "second": 59} will be extracted
grok(_, "%{time}")
Note:
- For the same pattern name, the script-level definition takes precedence (i.e., local patterns override global patterns).
- In a Pipeline script, add_pattern() must be called before the grok() function, otherwise the first data extraction will fail.
Configuring Nginx Log Parsing¶
Writing Pipeline File¶
Write the Pipeline file in the <datakit installation directory>/pipeline directory and name it nginx.p.
add_pattern("date2", "%{YEAR}[./]%{MONTHNUM}[./]%{MONTHDAY} %{TIME}")
grok(_, "%{IPORHOST:client_ip} %{NOTSPACE:http_ident} %{NOTSPACE:http_auth} \\[%{HTTPDATE:time}\\] \"%{DATA:http_method} %{GREEDYDATA:http_url} HTTP/%{NUMBER:http_version}\" %{INT:status_code} %{INT:bytes}")
# access log
add_pattern("access_common", "%{IPORHOST:client_ip} %{NOTSPACE:http_ident} %{NOTSPACE:http_auth} \\[%{HTTPDATE:time}\\] \"%{DATA:http_method} %{GREEDYDATA:http_url} HTTP/%{NUMBER:http_version}\" %{INT:status_code} %{INT:bytes}")
grok(_, '%{access_common} "%{NOTSPACE:referrer}" "%{GREEDYDATA:agent}')
user_agent(agent)
# error log
grok(_, "%{date2:time} \\[%{LOGLEVEL:status}\\] %{GREEDYDATA:msg}, client: %{IPORHOST:client_ip}, server: %{IPORHOST:server}, request: \"%{DATA:http_method} %{GREEDYDATA:http_url} HTTP/%{NUMBER:http_version}\", (upstream: \"%{GREEDYDATA:upstream}\", )?host: \"%{IPORHOST:ip_or_host}\"")
grok(_, "%{date2:time} \\[%{LOGLEVEL:status}\\] %{GREEDYDATA:msg}, client: %{IPORHOST:client_ip}, server: %{IPORHOST:server}, request: \"%{GREEDYDATA:http_method} %{GREEDYDATA:http_url} HTTP/%{NUMBER:http_version}\", host: \"%{IPORHOST:ip_or_host}\"")
grok(_,"%{date2:time} \\[%{LOGLEVEL:status}\\] %{GREEDYDATA:msg}")
group_in(status, ["warn", "notice"], "warning")
group_in(status, ["error", "crit", "alert", "emerg"], "error")
cast(status_code, "int")
cast(bytes, "int")
group_between(status_code, [200,299], "OK", status)
group_between(status_code, [300,399], "notice", status)
group_between(status_code, [400,499], "warning", status)
group_between(status_code, [500,599], "error", status)
nullif(http_ident, "-")
nullif(http_auth, "-")
nullif(upstream, "")
default_time(time)
Note: when splitting out fields, avoid names that collide with existing tag keys (see the Pipeline field naming considerations).
Debugging Pipeline File¶
Due to the large number of Grok Patterns, manual matching is quite troublesome. DataKit provides an interactive command-line tool grokq (Grok Query):
datakit --grokq
grokq > Mon Jan 25 19:41:17 CST 2021 # Enter the text you want to match here
2 %{DATESTAMP_OTHER: ?} # The tool suggests matching patterns; the leading number is the weight, and a higher weight means a more accurate match
0 %{GREEDYDATA: ?}
grokq > 2021-01-25T18:37:22.016+0800
4 %{TIMESTAMP_ISO8601: ?} # The ? here means you need to use a field to name the matched text
0 %{NOTSPACE: ?}
0 %{PROG: ?}
0 %{SYSLOGPROG: ?}
0 %{GREEDYDATA: ?} # Patterns like GREEDYDATA with a wide range have lower weights
# The higher the weight, the more accurate the match
grokq > Q # Q or exit to exit
Bye!
After writing the Pipeline file with the help of the grokq command-line tool provided by DataKit, you can test it directly: specify the Pipeline script name with --pl (the script must be placed in the <datakit installation directory>/pipeline directory) and the sample text with --txt, and DataKit reports whether extraction succeeded:
#Successful extraction example
datakit --pl nginx.p --txt '172.17.0.1 - - [06/Jan/2017:16:16:37 +0000] "GET /datadoghq/company?test=var1%20Pl HTTP/1.1" 401 612 "http://www.perdu.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" "-"'
Extracted data(cost: 5.279203ms): # Indicates successful cutting
{
"agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36\" \"-\"",
"browser": "Chrome",
"browserVer": "55.0.2883.87",
"bytes": 612,
"client_ip": "172.17.0.1",
"engine": "AppleWebKit",
"engineVer": "537.36",
"http_method": "GET",
"http_url": "/datadoghq/company?test=var1%20Pl",
"http_version": "1.1",
"isBot": false,
"isMobile": false,
"message": "172.17.0.1 - - [06/Jan/2017:16:16:37 +0000] \"GET /datadoghq/company?test=var1%20Pl HTTP/1.1\" 401 612 \"http://www.perdu.com/\" \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36\" \"-\"",
"os": "Linux x86_64",
"referrer": "http://www.perdu.com/",
"status": "warning",
"status_code": 401,
"time": 1483719397000000000,
"ua": "X11"
}
# Failed extraction example
datakit --pl nginx.p --txt '172.17.0.1 - - [06/Jan/2017:16:16:37 +0000] "GET /datadoghq/company?test=var1%20Pl HTTP/1.1" 401 612 "http://www.perdu.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" "-"'
No data extracted from pipeline
Configuring the Collector to Apply the Pipeline Script¶
Configuring the Pipeline Script for Text Log Collection¶
Take collecting Nginx logs as an example. In the Logging collector, configure the Pipeline field; note that this is the Pipeline script name, not a path. All Pipeline scripts referenced here must be stored in the <datakit installation directory>/pipeline directory.
[[inputs.logging]]
# Log file list, absolute paths can be specified, and batch specification using glob rules is supported
# It is recommended to use absolute paths
logfiles = [
"/var/log/nginx/access.log",
"/var/log/nginx/error.log",
]
# File path filtering, using glob rules, any file that meets any filtering condition will not be collected
ignore = [""]
# Data source, if empty, 'default' is used by default
source = ""
# Add a tag, if empty, $source is used by default
service = ""
# Pipeline script path, if empty, $source.p will be used, if $source.p does not exist, no pipeline will be used
pipeline = "nginx.p"
# Filter corresponding status:
# `emerg`,`alert`,`critical`,`error`,`warning`,`info`,`debug`,`OK`
ignore_status = []
# Select encoding, incorrect encoding will cause data to be unviewable. Default is empty:
# `utf-8`, `utf-16le`, `gbk`, `gb18030` or ""
character_encoding = ""
## Set regular expression, e.g., ^\d{4}-\d{2}-\d{2} matches YYYY-MM-DD time format at the beginning of the line
## Data matching this regular expression will be considered valid, otherwise it will be appended to the end of the previous valid data
## Use three single quotes '''this-regexp''' to avoid escaping
## Regular expression link: https://golang.org/pkg/regexp/syntax/#hdr-Syntax
# multiline_match = '''^\S'''
## Whether to remove ANSI escape codes, such as text colors in standard output
remove_ansi_escape_codes = false
# Custom tags
[inputs.logging.tags]
app = "oa"
Restart DataKit, and the corresponding logs will be parsed into fields.
Configuring the Pipeline Script for Streaming Log Collection¶
Take collecting Fluentd logs as an example: change the match output to the http type, point the endpoint at the DataKit address with logstreaming enabled, and specify the Pipeline script name in the URL to complete the collection.
##PC-side log collection
<source>
@type tail
format ltsv
path /var/log/nginx/access.log
pos_file /var/log/buffer/posfile/access.log.pos
tag nginx
time_key time
time_format %d/%b/%Y:%H:%M:%S %z
</source>
## The collected data is forwarded to the local DataKit via HTTP protocol
## nginx output
<match nginx>
@type http
endpoint http://127.0.0.1:9529/v1/write/logstreaming?source=nginx_td&pipeline=nginx.p
open_timeout 2
<format>
@type json
</format>
</match>
Log Collection Performance Optimization¶
Why Is My Pipeline Running So Slowly?¶
Performance issues come up frequently: users often find that after introducing Grok expressions, Pipeline log processing becomes very slow. Because Grok patterns are implemented with regular expressions, the usual causes are Grok variables that cover too many cases, or matching the full line again and again, line by line.
Watch Out for Expressions That Match Twice¶
We have seen many Grok patterns that handle logs from multiple applications arriving through the same gateway, such as syslog. Imagine a scenario where the logs share a "common_header: payload" structure and three applications write the following lines:
Application 1: '8.8.8.8 process-name[666]: a b 1 2 a lot of text at the end'
Application 2: '8.8.8.8 process-name[667]: a 1 2 3 a lot of text near the end;4'
Application 3: '8.8.8.8 process-name[421]: a completely different format | 1111'
grok(_ , "%{IPORHOST:clientip} %{DATA:process_name}\[%{NUMBER:process_id}\]: %{WORD:word_1} %{WORD:word_2} %{NUMBER:number_1} %{NUMBER:number_2} %{DATA:data}")
grok(_ , "%{IPORHOST:clientip} %{DATA:process_name}\[%{NUMBER:process_id}\]: %{WORD:word_1} %{NUMBER:number_1} %{NUMBER:number_2} %{NUMBER:number_3} %{DATA:data};%{NUMBER:number_4}")
grok(_ , "%{IPORHOST:clientip} %{DATA:process_name}\[%{NUMBER:process_id}\]: %{DATA:data} | %{NUMBER:number}")
add_pattern("message", "%{IPORHOST:clientip} %{DATA:process_name}\[%{NUMBER:process_id}\]: %{GREEDYDATA:message}")
grok(_, "%{message} %{WORD:word_1} %{WORD:word_2} %{NUMBER:number_1} %{NUMBER:number_2} %{GREEDYDATA:data}")
grok(_, "%{message} %{WORD:word_1} %{NUMBER:number_1} %{NUMBER:number_2} %{NUMBER:number_3} %{DATA:data};%{NUMBER:number_4}")
grok(_, "%{message} %{DATA:data} | %{NUMBER:number}")
Watch Out for Grok Expressions with High Performance Overhead¶
Let's look at the following Nginx log
172.17.0.1 - - [06/Jan/2017:16:16:37 +0000] "GET /datadoghq/company?test=var1%20Pl HTTP/1.1" 401 612 "http://www.perdu.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" "-"
grok(_, "%{IPORHOST:client_ip} %{NOTSPACE:http_ident} %{NOTSPACE:http_auth} \\[%{HTTPDATE:time}\\] \"%{DATA:http_method} %{GREEDYDATA:http_url} HTTP/%{NUMBER:http_version}\" %{INT:status_code} %{INT:bytes}")
cast(status_code, "int")
cast(bytes, "int")
Here, matching 172.17.0.1 with %{IPORHOST:client_ip} carries a high performance cost. Grok is compiled down to regular expressions, and the more cases a Grok expression has to cover, the worse its performance can be. Let's look at the regular expressions behind %{IPORHOST:client_ip}:
IPORHOST (?:%{IP}|%{HOSTNAME})
HOSTNAME \b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)
IP (?:%{IPV6}|%{IPV4})
IPV6 ((([0-9A-Fa-f]{1,4}:){7}([0-9A-Fa-f]{1,4}|:))|(([0-9A-Fa-f]{1,4}:){6}(:[0-9A-Fa-f]{1,4}|((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){5}(((:[0-9A-Fa-f]{1,4}){1,2})|:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){4}(((:[0-9A-Fa-f]{1,4}){1,3})|((:[0-9A-Fa-f]{1,4})?:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){3}(((:[0-9A-Fa-f]{1,4}){1,4})|((:[0-9A-Fa-f]{1,4}){0,2}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){2}(((:[0-9A-Fa-f]{1,4}){1,5})|((:[0-9A-Fa-f]{1,4}){0,3}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){1}(((:[0-9A-Fa-f]{1,4}){1,6})|((:[0-9A-Fa-f]{1,4}){0,4}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(:(((:[0-9A-Fa-f]{1,4}){1,7})|((:[0-9A-Fa-f]{1,4}){0,5}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:)))(%.+)?
IPV4 (?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])
A single short Grok expression expands into this much regular-expression complexity. When a very large volume of logs has to be processed, such expressions significantly hurt performance. How can this be optimized? For example:
grok(_, "%{NOTSPACE:client_ip} %{NOTSPACE:http_ident} %{NOTSPACE:http_auth} \\[%{HTTPDATE:time}\\] \"%{DATA:http_method} %{GREEDYDATA:http_url} HTTP/%{NUMBER:http_version}\" %{INT:status_code} %{INT:bytes}")
cast(status_code, "int")
cast(bytes, "int")
default_time(time)
For performance, prefer %{NOTSPACE:} where possible. Since Grok is compiled down to regular expressions, the more cases an expression covers, the slower it may run, whereas an extremely simple pattern such as %{NOTSPACE:} (non-whitespace) is very cheap to match. So when splitting fields, if you can be sure a value contains no whitespace and is delimited by whitespace, choose %{NOTSPACE:} to improve Pipeline performance.
Making Better Use of Tools to Write Pipelines¶
DataKit - Interactive Command Line Tool grokq¶
Due to the large number of Grok Patterns, manual matching is quite troublesome. DataKit provides an interactive command-line tool grokq (Grok Query):
datakit --grokq
grokq > Mon Jan 25 19:41:17 CST 2021 # Enter the text you want to match here
2 %{DATESTAMP_OTHER: ?} # The tool suggests matching patterns; the leading number is the weight, and a higher weight means a more accurate match
0 %{GREEDYDATA: ?}
grokq > 2021-01-25T18:37:22.016+0800
4 %{TIMESTAMP_ISO8601: ?} # The ? here means you need to use a field to name the matched text
0 %{NOTSPACE: ?}
0 %{PROG: ?}
0 %{SYSLOGPROG: ?}
0 %{GREEDYDATA: ?} # Patterns like GREEDYDATA with a wide range have lower weights
# The higher the weight, the more accurate the match
grokq > Q # Q or exit to exit
Bye!
DataKit - Pipeline Script Test¶
After writing the Pipeline file with the help of the grokq command-line tool provided by DataKit, you can test it directly: specify the Pipeline script name with --pl (the script must be placed in the <datakit installation directory>/pipeline directory) and the sample text with --txt, and DataKit reports whether extraction succeeded:
#Successful extraction example
datakit --pl nginx.p --txt '172.17.0.1 - - [06/Jan/2017:16:16:37 +0000] "GET /datadoghq/company?test=var1%20Pl HTTP/1.1" 401 612 "http://www.perdu.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" "-"'
Extracted data(cost: 5.279203ms): # Indicates successful cutting
{
"agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36\" \"-\"",
"browser": "Chrome",
"browserVer": "55.0.2883.87",
"bytes": 612,
"client_ip": "172.17.0.1",
"engine": "AppleWebKit",
"engineVer": "537.36",
"http_method": "GET",
"http_url": "/datadoghq/company?test=var1%20Pl",
"http_version": "1.1",
"isBot": false,
"isMobile": false,
"message": "172.17.0.1 - - [06/Jan/2017:16:16:37 +0000] \"GET /datadoghq/company?test=var1%20Pl HTTP/1.1\" 401 612 \"http://www.perdu.com/\" \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36\" \"-\"",
"os": "Linux x86_64",
"referrer": "http://www.perdu.com/",
"status": "warning",
"status_code": 401,
"time": 1483719397000000000,
"ua": "X11"
}
# Failed extraction example
datakit --pl nginx.p --txt '172.17.0.1 - - [06/Jan/2017:16:16:37 +0000] "GET /datadoghq/company?test=var1%20Pl HTTP/1.1" 401 612 "http://www.perdu.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" "-"'
No data extracted from pipeline
Online Grok Debug¶
Use the GrokDebug website for Grok debugging
Log Collection Cost Optimization¶
Cost Optimization on the Guance Product Side¶
"Guance" supports filtering logs that meet certain conditions by setting a log blacklist, i.e., after configuring the log blacklist, log data that meets the conditions will no longer be reported to the "Guance" workspace, helping users save log data storage costs.
Note: this configuration is not pushed to DataKit from the center. It takes effect when DataKit actively sends a GET request to fetch the configuration from the center, reads it, and then performs the filtering locally.
Creating a Log Blacklist¶
In the "Guance" workspace, click on 「Logs」-「Blacklist」-「Create Blacklist」, select 「Log Source」, add one or more log filtering rules, and click OK to enable the log filtering rule by default. You can view all log filtering rules through 「Log Blacklist」.

Note: the filtering conditions of a rule are combined with AND logic, i.e., only log data that satisfies all of the conditions is filtered out and not reported to the workspace.
Upfront Cost Optimization When Collecting Streaming Logs¶
Take collecting Fluentd logs as an example: you can aggregate logs inside <match> </match> to compress them, or filter events so that only error or alert logs are reported to Guance, reducing usage costs.
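As one illustration, Fluentd's grep filter plugin can drop events before they reach the <match> block; the sketch below assumes the parsed ltsv record has a code field holding the HTTP status and forwards only 4xx/5xx records:
<filter nginx>
  @type grep
  <regexp>
    key code
    pattern /^[45]\d\d$/
  </regexp>
</filter>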

