Guance VS ELK, EFK¶
Overview of ELK, EFK, and Guance¶
As the complexity of software systems continues to increase, logs are typically generated by servers and written to different files, generally including system logs, application logs, and security logs. These logs end up scattered across different machines. When a system fails, engineers usually have to log into each server and use Linux tools such as grep / sed / awk to search the logs for the cause of the failure. Without a logging system, they first have to work out which server handled the request; if that service has multiple instances deployed, they then have to look through the log directory of every application instance. Each instance also has its own log rotation policy (for example, one file per day) and log compression and archiving policy. All of this makes it difficult to troubleshoot faults and locate root causes promptly.
When deploying in the cloud, logging into individual nodes to check the logs of different modules is largely impractical. Not only is it inefficient, it is sometimes impossible because security constraints prevent engineers from accessing physical nodes directly. In addition, large-scale software systems are now mostly deployed as clusters, meaning that each service is backed by multiple identical Pods. Every container produces its own logs, and it is almost impossible to tell from the log content alone which Pod produced it, which makes distributed log viewing even harder.
Therefore, if we can centrally manage these logs and provide centralized retrieval functions, we can not only improve diagnostic efficiency but also gain a comprehensive understanding of the system, avoiding firefighting after incidents.
ELK¶
So, what exactly is ELK? "ELK" is the acronym for three open-source projects: Elasticsearch, Logstash, and Kibana. Elasticsearch is a search and analysis engine. Logstash is a server-side data processing pipeline that can simultaneously collect data from multiple sources, transform the data, and then send it to "repositories" such as Elasticsearch. Kibana allows users to visualize data in Elasticsearch using graphs and charts.
Elasticsearch¶
Elasticsearch is a JSON-based distributed search and analytics engine. It can be accessed via RESTful Web service interfaces and uses schema-less JSON (JavaScript Object Notation) documents to store data. It is based on the Java programming language, enabling Elasticsearch to run on different platforms. This allows users to search very large volumes of data at very high speeds.
Key Features¶
- Distributed real-time file storage with every field indexed and searchable
- Distributed real-time analytical search engine
- Can scale to hundreds of servers, handling petabytes of structured or unstructured data
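As a small illustration of the RESTful, JSON-based access described above, documents can be indexed and searched over plain HTTP; the index name app-logs and the document contents here are illustrative only:

```shell
# index a JSON document into a (hypothetical) app-logs index on a local node
curl -X PUT "localhost:9200/app-logs/_doc/1" -H 'Content-Type: application/json' \
  -d '{"level": "error", "message": "connection refused", "host": "web-01"}'

# full-text search across the indexed documents
curl -X GET "localhost:9200/app-logs/_search?q=message:refused"
```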
Logstash¶
Logstash is an open-source data stream processing engine that can be used to set up data pipelines within minutes. It is horizontally scalable and resilient, with adaptive buffering, and offers more than 200 integration and processor plugins. Deployments can be monitored and managed with the Elastic Stack.
Key Features¶
- Accesses nearly any kind of data
- Integrates with multiple external applications
- Supports elastic scaling
Logstash Components¶
- Inputs: define how data is received, for example collecting file contents
- Filters: filter and transform the data in transit, for example using grok rules to parse it
- Outputs: define how the processed data is formatted and where it is sent, for example to Elasticsearch
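A minimal Logstash pipeline illustrating these three stages; the log path, grok pattern, and index name below are illustrative rather than tied to any particular deployment:

```
input {
  file {
    path => "/var/log/app/*.log"        # illustrative log path
  }
}

filter {
  grok {
    # parse a combined-format access log line into structured fields
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]  # illustrative Elasticsearch address
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
```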
Kibana¶
Kibana is an open-source data analysis and visualization platform. It is part of the Elastic Stack and is designed to work with Elasticsearch. You can use Kibana to search, view, and interactively operate data in Elasticsearch indices. You can easily use charts, tables, and maps to perform diversified analysis and presentations of data.
Kibana makes big data understandable and accessible. Its browser-based interface simplifies the creation and sharing of dynamic dashboards to track real-time changes in Elasticsearch data.
EFK¶
EFK is not a single piece of software but a solution set. EFK is the abbreviation of three open-source components: Elasticsearch, Fluentd, and Kibana (or Elasticsearch, Filebeat, and Kibana). Elasticsearch handles log analysis and storage, Fluentd or Filebeat handles log collection, and Kibana provides the user interface. They work well together, efficiently cover many application scenarios, and form one of the mainstream log analysis solutions today.
Fluentd¶
Fluentd is an open-source data collector specifically designed to handle data flows and uses JSON as its data format. It adopts a plugin-based architecture, featuring high scalability and high availability, along with reliable message forwarding. In use, we can send various types of information to Fluentd first, then Fluentd forwards the information to different destinations according to the configuration through different plugins, such as files, SaaS Platforms, databases, or even another Fluentd.
Key Features¶
- Easy installation
- Low resource consumption
- Semi-structured data logging
- Flexible plugin mechanism
- Reliable buffering
- Log forwarding
Fluentd Components¶
Fluentd's Input/Buffer/Output is very similar to Flume's Source/Channel/Sink.
- Input: Input is responsible for receiving data or actively fetching data. It supports syslog, http, file tail, etc.
- Buffer: Buffer ensures the performance and reliability of data acquisition, with different types of Buffers such as file or memory configurable.
- Output: Output is responsible for sending data to the destination, such as files, AWS S3, or other Fluentds.
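A minimal Fluentd sketch that wires these three components together; the file paths and the tag below are illustrative:

```
<source>
  @type tail                       # Input: tail a log file
  path /var/log/app/app.log
  pos_file /var/log/td-agent/app.log.pos
  tag app.access
  <parse>
    @type none                     # forward raw lines without parsing
  </parse>
</source>

<match app.**>
  @type file                       # Output: write matching events to files
  path /data/fluentd/app
  <buffer>
    @type file                     # Buffer: spool to disk between input and output
    path /data/fluentd/buffer
    flush_interval 10s
  </buffer>
</match>
```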
Filebeat¶
Filebeat is a lightweight log collector implemented in Golang and is also a member of the Elastic Stack. Essentially, it is an agent installed on a node that reads logs from the configured locations and ships them to a designated destination.
Filebeat is highly reliable and can ensure logs are reported at least once. It also considers various issues in log collection, such as log breakpoint resumption, filename changes, and truncated logs.
Filebeat does not depend on Elasticsearch and can exist independently. We can use Filebeat alone for log reporting and collection. Filebeat includes common Output components such as Kafka, Elasticsearch, Redis. For debugging purposes, it can also output to console and file. We can utilize existing Output components to report logs. Of course, we can also customize Output components to forward logs to our desired destinations.
Filebeat is in fact one member of the elastic/beats family; besides Filebeat there are also Heartbeat and Packetbeat. All of these Beats are built on the libbeat framework.
Filebeat Components¶
- Harvester: The main responsibility of the harvester is to read the content of a single file. It reads each file and sends the content to the output. A harvester is started for each file and manages opening and closing the file, which means the file descriptor remains open while it runs. If a file is deleted or renamed while it is being read, Filebeat continues reading it.
- Prospector: The main responsibility of the prospector is to manage harvesters and find all the files to be read. If the input type is log, the prospector looks for all files matching the configured paths and starts a harvester for each one. Each prospector runs in its own goroutine.
Note: Filebeat Prospector can only read local files and does not have the ability to connect to remote hosts to read stored files or logs. Because the application scope of Filebeat is quite limited, it will not be extensively compared in this article.
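For reference, a minimal filebeat.yml wiring a log input (a prospector, in older releases) to an Elasticsearch output might look like the following; the paths and hosts are illustrative:

```yaml
filebeat.inputs:              # called filebeat.prospectors in older Filebeat releases
  - type: log
    paths:
      - /var/log/app/*.log    # one harvester is started per matching file

output.elasticsearch:
  hosts: ["localhost:9200"]   # illustrative Elasticsearch address
```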
Guance¶
DataKit¶
DataKit is a fundamental data collection tool that runs on user's local machines. It mainly collects various metrics and logs of system operations and aggregates them to Guance. In Guance, users can view and analyze their various metrics and logs. DataKit is a crucial data collection component in Guance, and all data in Guance originates from DataKit.
- DataKit collects data periodically, gathering various metrics and then sending the data to DataWay at regular intervals and in fixed batches over HTTP(S). Each DataKit is configured with a token that identifies the user.
- After DataWay receives the data, it forwards it to Guance; the forwarded data carries an API signature.
- Guance receives the legitimate data and writes it into different storage backends according to the data type.
For collected data, a degree of data loss is generally tolerable (the data is sampled at intervals anyway, so anything that happens between two samples is already a form of loss). Currently, the following protections against data loss are implemented along the transmission chain:
- If DataKit fails to send data to DataWay because of network problems, DataKit caches up to one thousand data points. Once the cached data exceeds this amount, the cache is cleared.
- If DataWay fails to send data to Guance for some reason, or if traffic is too high to forward in time, DataWay persists the data to disk. When traffic drops or the network recovers, the data is sent to Guance. Delayed data does not lose timeliness, because timestamps are attached to the cached data.
On DataWay, to protect the disk, the maximum disk usage is also configurable to avoid overloading the storage of the node. Data exceeding this limit is discarded. However, this capacity is usually set relatively large.
DataKit Components¶
From top to bottom, DataKit is mainly divided into three layers internally:
- Top Layer: Includes the program entry module and some common modules
- Configuration Loading Module: Apart from the main configuration (i.e., conf.d/datakit.conf), the configuration of each DataKit collector is kept in a separate file; if everything were placed in one file, it would become very large and inconvenient to edit.
- Service Management Module: Mainly responsible for overall DataKit service management.
- Toolchain Module: Besides data collection, DataKit provides many peripheral functions, all implemented in the toolchain module, such as viewing documentation, restarting services, updating, etc.
- Pipeline Module: In log processing, Pipeline scripts (Grok syntax) are used to split logs, converting unstructured log data into structured data. Similar data processing can also be done for non-log data.
- Election Module: When deploying a large number of DataKits, users can make all DataKit configurations the same and distribute them through automated batch deployment. The significance of the election module lies in ensuring that, in a cluster, certain data collections (such as Kubernetes cluster metrics) are performed by only one DataKit (otherwise, data duplication occurs, and pressure is put on the data source). Under the condition where all DataKit configurations in the cluster are the same, the election module can ensure that at any given time, at most one DataKit performs the collection.
- Documentation Module: DataKit's documentation is installed alongside the program, and users can access the documentation list via the http://localhost:9529/man page or browse the documentation in the command line.
- Transport Layer: Responsible for almost all data input and output
- HTTP Service Module: DataKit supports third-party data integration, such as Telegraf/Prometheus. Currently, these data are accessed via HTTP.
- IO Module: After each data collection by the plugins, the data is sent to the IO module. The IO module encapsulates unified data construction, processing, and sending interfaces, facilitating the integration of data collected by various plugins. Additionally, the IO module sends data to DataWay at a certain rhythm (periodically and quantitatively) via HTTP(s).
- Collection Layer: Responsible for collecting various types of data. According to the collection type, they are divided into two categories:
- Active Collection Type: These collectors collect data at a fixed frequency as configured, such as CPU, network card traffic, cloud testing, etc.
- Passive Collection Type: These collectors typically implement collection through external data inputs, such as RUM, Tracing, etc. They generally run outside DataKit and can upload data through the DataKit's open data upload API after standardizing the data.
Guance Platform¶
Based on powerful data collection capabilities, “Guance” builds full-chain observability from infrastructure, containers, middleware, databases, message queues, application links, frontend visits, system security, and network visit performance. Based on Guance's standard products, users can quickly build complete observability for their projects after correctly configuring Datakit collection. Meanwhile, based on line protocol (Line Protocol) and Guance's scene-building capabilities, users can customize and integrate the required observation indicators conveniently to achieve further observability.
Guance as a whole is a complete product oriented toward observability, and it spans many technical areas that each carry a learning threshold. Compared with the various open-source solutions, Guance has emphasized from the beginning how to effectively lower the learning cost of the product and improve usability. From the installation and deployment of DataKit through every configurable capability, Guance aims to reduce configuration difficulty in a way that fits the habits of most programmers and operations engineers, while improving the usability and professionalism of the UI so that users can quickly understand the product and its value.
Running Platform Comparison¶
One of Logstash's original advantages was that, being written in JRuby, it could run on Windows.
Fluentd, by contrast, did not support Windows until recently because it depended on a Linux-centric event library. Fluentd now supports Windows, and the in_windows_eventlog plugin can be used to track Windows event logs.
DataKit is the official data collector provided by the Guance product. It comes with built-in collection scripts for many data sources, supports multiple data integrations, and runs on Windows, Linux, and macOS, on both ARM and x86, providing full-platform log collection.
Logstash¶
Linux and Windows
Fluentd¶
Linux and Windows
DataKit¶
Full platform support, plus client-side visual configuration management, which significantly reduces the learning cost of installation, deployment, and complex configuration.
Event Routing Comparison¶
In terms of event routing configuration, Fluentd's approach is more declarative, while Logstash's is procedural, so developers with a procedural programming background may find Logstash's configuration easier to learn; Fluentd's tag-based routing, on the other hand, expresses complex routes clearly. Guance, by contrast, relies on its own product logic and components rather than on other products for event alerting and data browsing, which keeps data secure, avoids complex configuration, and delivers a better user experience.
Logstash Event Routing¶
Logstash routes all data into one stream and then sends them to the desired destination using if-then statements. Below is an example of sending error events in production to PagerDuty:
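The sketch below assumes hypothetical loglevel and deployment fields and uses the logstash-output-pagerduty plugin; the service_key value is a placeholder:

```
output {
  if [loglevel] == "ERROR" and [deployment] == "production" {
    pagerduty {
      service_key => "YOUR_PAGERDUTY_SERVICE_KEY"   # illustrative placeholder
      description => "%{message}"
    }
  }
}
```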
Fluentd Event Routing¶
Fluentd relies on tags to route events: each Fluentd event carries a tag that tells Fluentd where to route it. For example, the following configuration receives events over the forward input, uses record_transformer to add the hostname to records whose tag matches app.**, and routes the matching events to a file output:
<source>
  @type forward
</source>

<filter app.**>
  @type record_transformer
  <record>
    hostname "#{Socket.gethostname}"
  </record>
</filter>

<match app.**>
  @type file
  # ...
</match>
DataKit Proxy¶
As part of Guance's product suite, DataKit reports data directly to the cloud-hosted Guance platform for observation and analysis; it does not need event routing in the style of Logstash or Fluentd to hand data off to other tools for analysis or caching. To keep user data secure, and to handle the case where DataKit is deployed on an internal network without Internet access and must reach the Internet through a proxy server, DataKit's proxy configuration is simple: just enable the proxy option. With this small amount of configuration, the full set of product features is available.
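As a rough illustration only, the proxy setting lives alongside the DataWay configuration in datakit.conf; the section and key names below are assumptions and should be checked against the official DataKit documentation for the version in use:

```toml
# datakit.conf (sketch; key names may differ between DataKit versions)
[dataway]
  urls = ["https://openway.guance.com?token=<YOUR-TOKEN>"]
  # route DataKit traffic through an internal proxy host (illustrative address)
  http_proxy = "http://172.16.0.2:9530"
```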
Plugin Ecosystem Comparison¶
Logstash, Fluentd, and DataKit all have rich plugin ecosystems covering many input systems (files and TCP/UDP, etc.) and filters (field splitting and filtering).
Logstash Plugins¶
Logstash manages all its plugins in its GitHub repo, with over 200 input, filter, and output plugins maintained by users, lacking official maintenance and hosting.
Fluentd Plugins¶
Fluentd includes 8 types of plugins—input, parsers, filters, outputs, formatters, storages, service discovery, and buffers, totaling over 500 plugins. However, only 10 plugins are officially hosted, with the rest maintained by users, lacking official maintenance, hosting, and technical stack support.
DataKit Plugins¶
DataKit includes powerful built-in functionalities—dynamic Grok syntax query debugging, fast data querying based on proprietary DQL grammar, real-time input collection runtime monitoring, edge computing capabilities, and visual client configuration and deployment of data sources. It supports over 200 officially maintained data source integrations and technical stack support, compatible with external data integrations such as Telegraf, Beats, Logstash, Fluentd, etc. More friendly to users, it supports visual plugin and agent management via the client and allows real-time viewing of data collection status.
Queue Comparison¶
Logstash lacks persistent internal message queues: Currently, Logstash has an in-memory queue that can hold 20 events (fixed size) and relies on external queues like Redis for persistence during restarts. Fluentd has a configurable buffer system that can be in-memory or on-disk, but configuring its reliability can be complex. DataKit has a built-in caching mechanism that adjusts parameters based on server configuration to achieve data caching effects.
Logstash Queue¶
Since Logstash lacks a built-in persistent message queue, its internal queue model is very simple and requires external Redis queues for persistence.
Fluentd Queue¶
Compared to Logstash, Fluentd has built-in reliability, but its configuration is more complex, increasing the learning cost for users.
DataKit Queue¶
DataKit has a built-in caching mechanism. When the server running DataKit fails to send data to DataWay due to network issues, DataKit will cache up to a maximum of one thousand data points to prevent data loss. This cache limit can be controlled by modifying the DataKit configuration file. The configuration is simple, with almost zero learning cost.
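As an assumption-laden sketch (the section and key names below may differ between DataKit versions and should be verified against the official documentation), the cache limit is the kind of setting adjusted in datakit.conf:

```toml
# datakit.conf (sketch; names are assumptions, check the DataKit docs)
[io]
  max_cache_count = 1000   # roughly: how many points DataKit buffers when DataWay is unreachable
```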
Log Parsing Comparison¶
Log analysis is a fundamental core technology within enterprises, applied not only by security teams but also by IT development and business teams. From a security perspective, security teams extract and analyze logs mainly to discover unknown security events, trace known security events, and comply with national regulatory requirements. From an IT development standpoint, internal non-security technical teams conduct log analysis primarily to identify issues and analyze known problems, focusing on system monitoring and APM (which encompasses all monitoring items of concern to development teams). From a business perspective, the demand for log analysis in business teams focuses more on risk control, operational promotion, user profiling, and website profiling. Logs that merely sit on disk therefore have no value; realizing the value of log data through log analysis technology is, indirectly, a reflection of a company's technical strength.
Commonly used Logstash parsing features include Grok for parsing and structuring arbitrary text, mutate for applying transformations to event fields, drop for discarding events entirely, clone for making copies of events, and geoip for adding geographical information about IP addresses, among other common parsers. Fluentd log parsing typically involves filtering events by matching one or more field values, enriching events by adding new fields, and protecting privacy and compliance by deleting or masking certain fields; however, its plugins are comparatively few, essentially record_transformer, filter_stdout, filter_grep, parser, and filter_geoip. DataKit log parsing offers a Pipeline for splitting unstructured text or extracting information from structured text (such as JSON), glob rules for specifying log files more conveniently, automatic file discovery and filtering, an easy-to-use interactive Grok matching tool that lowers the barrier to using Grok, and support for a large number of script functions that make data handling more flexible.
Logstash Log Parsing¶
Grok is currently the best way in Logstash to parse unstructured log data into structured, queryable content. Logstash currently has 120 Grok parsing templates, maintained by users in its GitHub repo without official technical support; various business needs, including Grok template performance optimization, are left for users to explore on their own.
Fluentd Log Parsing¶
Fluentd's log parsing style is similar to Logstash but offers more flexible configuration methods. However, it does not provide corresponding Grok parsing templates but only some configuration examples, requiring users to configure the parsing functions themselves according to the documentation examples. This relatively increases the threshold and lacks appropriate technical support for configuration issues, often requiring users to Google solutions themselves.
In the configured server environment, Nginx access logs are taken as the example; the sample log line is 365 bytes long and is parsed into 14 fields.
In the upcoming tests, the log will be repeatedly written into the file under different pressures. The time field of each log takes the current system time, while the other 13 fields are the same.
Compared with real-world scenarios, log parsing behaves no differently in the simulated scenario; the one difference is that the highly repetitive data compresses better, which reduces network write traffic.
Logstash¶
Logstash version 7.1.0, using Grok to parse logs and write to Kafka (built-in plugin, with gzip compression enabled).
Log parsing configuration:
grok {
  patterns_dir => "/home/admin/workspace/survey/logstash/patterns"
  match => { "message" => "%{IPORHOST:ip} %{USERNAME:rt} - \"%{WORD:method} %{DATA:url}\" %{NUMBER:status} %{NUMBER:size} \"%{DATA:ref}\" \"%{DATA:agent}\" \"%{DATA:cookie_unb}\" \"%{DATA:cookie_cookie2}\" \"%{DATA:monitor_traceid}\" %{WORD:cell} %{WORD:ups} %{BASE10NUM:remote_port}" }
  remove_field => [ "message" ]
}
Write TPS | Write Traffic (KB/s) | CPU Usage (%) | Memory Usage (MB) |
---|---|---|---|
500 | 178.89 | 25.3 | 432 |
1000 | 346.65 | 46.9 | 476 |
5000 | 1882.23 | 231.1 | 489 |
10000 | 3564.45 | 511.2 | 512 |
Fluentd¶
td-agent version 4.1.0, using regular expressions to parse logs and write to Kafka (third-party plugin fluent-plugin-kafka, with gzip compression enabled).
Log parsing configuration:
<source>
  type tail
  format /^(?<ip>\S+)\s(?<rt>\d+)\s-\s\[(?<time>[^\]]*)\]\s"(?<url>[^\"]+)"\s(?<status>\d+)\s(?<size>\d+)\s"(?<ref>[^\"]+)"\s"(?<agent>[^\"]+)"\s"(?<cookie_unb>\d+)"\s"(?<cookie_cookie2>\w+)"\s"(?<monitor_traceid>\w+)"\s(?<cell>\w+)\s(?<ups>\w+)\s(?<remote_port>\d+).*$/
  time_format %d/%b/%Y:%H:%M:%S %z
  path /home/admin/workspace/temp/mock_log/access.log
  pos_file /home/admin/workspace/temp/mock_log/nginx_access.pos
  tag nginx.access
</source>
Write TPS | Write Traffic (KB/s) | CPU Usage (%) | Memory Usage (MB) |
---|---|---|---|
500 | 174.272 | 13.8 | 58 |
1000 | 336.85 | 24.4 | 61 |
5000 | 1771.43 | 95.3 | 103 |
10000 | 3522.45 | 140.2 | 140 |
DataKit¶
DataKit-1.1.8-rc3, using Pipeline to split unstructured text data.
# access log
grok(_, "%{NOTSPACE:ip} %{NOTSPACE:rt} - \"%{NOTSPACE:method} %{NOTSPACE:url}\" %{NOTSPACE:status} %{NOTSPACE:size} \"%{NOTSPACE:ref}\" \"%{NOTSPACE:agent}\" \"%{NOTSPACE:cookie_unb}\" \"%{NOTSPACE:cookie_cookie2}\" \"%{NOTSPACE:monitor_traceid}\" %{NOTSPACE:cell} %{NOTSPACE:ups} %{NOTSPACE:remote_port}")
cast(status, "int")
cast(size, "int")
default_time(time)
Write TPS | Write Traffic (KB/s) | CPU Usage (%) | Memory Usage (MB) |
---|---|---|---|
500 | 178.24 | 8.5 | 41 |
1000 | 356.45 | 13.8 | 45 |
5000 | 1782.23 | 71.1 | 76 |
10000 | 3522.45 | 101.2 | 88 |
Log Collection Architecture Comparison¶
ELK Solution¶
Solution One¶
This is the simplest ELK architecture. Its advantage is that it is easy to set up and get started. The downside is that Logstash consumes significant resources, with high CPU and memory usage. Additionally, without a message queue cache, there is a risk of data loss. Users must be proficient in Logstash, ElasticSearch, and Kibana to skillfully use and solve various complex business problems. Maintenance personnel for LogStash clusters and ElasticSearch clusters must also be skilled in cluster performance optimization and resource management to ensure smooth business operations.
This architecture involves Logstash distributed across various nodes collecting relevant logs and data, analyzing and filtering them, and then sending them to a remote server's Elasticsearch for storage. Elasticsearch compresses and stores the data in shards and provides multiple APIs for user queries and operations. Users can also more intuitively configure Kibana Web to conveniently query logs and generate reports based on the data.
Solution Two¶
This is a relatively mature ELK architecture. Its advantage is that introducing Kafka ensures that data is temporarily stored even if the remote Logstash cluster stops due to a fault, preventing data loss. The downside is that it is complex to set up, with a complex tech stack that is harder to master quickly. Logstash consumes significant resources, with high CPU and memory usage. An additional Kafka cluster must be maintained, and in large scenarios, a Zookeeper cluster might also be needed. Users must be proficient in Logstash, ElasticSearch, Kafka, and Kibana to skillfully use and solve various complex business problems. Maintenance personnel for LogStash clusters, Kafka clusters, and ElasticSearch clusters must also be skilled in cluster performance optimization and resource management to ensure smooth business operations.
This architecture introduces a message queue mechanism. Logstash Agents located on various nodes first pass data/logs to Kafka (or Redis) and then indirectly pass messages or data to Logstash. Logstash filters and analyzes the data and then passes it to Elasticsearch for storage. Finally, Kibana presents the logs and data to users. Since Kafka (or Redis) is introduced, even if the remote Logstash server stops due to a fault, the data will be temporarily stored, preventing data loss.
EFK Solution¶
Solution One¶
This is a more flexible EFK architecture. Its advantage is greater flexibility, less resource consumption compared to Logstash, and stronger scalability. The downside is that reporting logs to the LogStash cluster for centralized processing requires a large LogStash cluster to provide computational support, and users must be proficient in Logstash, ElasticSearch, and Kibana to skillfully use and solve various complex business problems. Maintenance personnel for LogStash clusters and ElasticSearch clusters must also be skilled in cluster performance optimization and resource management to ensure smooth business operations.
This architecture replaces the collection end Logstash with Filebeats and can configure Logstash and Elasticsearch clusters to support large-cluster system operation log monitoring and queries.
Solution Two¶
On top of ELK, Filebeat is used for log collection. The advantage is that, unlike the ELK architecture where Logstash is the collection end and every server therefore needs a Java environment (Logstash requires Java to run), Filebeat has no such dependency: install it, edit the configuration file, and start the service. The downside is that the tech stack is complex and harder to master quickly, Logstash still consumes significant resources with high CPU and memory usage, and an additional Kafka cluster must be maintained (in large deployments possibly a ZooKeeper cluster as well). Users must be proficient in Filebeat, Logstash, ElasticSearch, Kafka, and Kibana to handle various complex business problems, and the people maintaining the Logstash, Kafka, and ElasticSearch clusters must be skilled in cluster performance optimization and resource management to keep the business running smoothly.
In this architecture, when the collection end gathers log files, in the Filebeat input, we define a field in Filebeat called log_topic to classify log files under specified paths. In the Output, we specify the output to Kafka. Kafka acts as a message queue, receiving all logs collected by the Filebeat client and forwarding them according to different log types (e.g., nginx, php, system). In Kafka, we create different topics based on the self-defined log types in the input. Logstash receives messages from the Kafka message queue and writes logs into Elasticsearch based on different Kafka topics; Kibana matches Elasticsearch indices to analyze, search, and display log content (of course, designing the charts is up to the user).
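A minimal sketch of the Filebeat piece of this architecture, assuming the custom log_topic field described above (paths and broker addresses are illustrative):

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/nginx/*.log
    fields:
      log_topic: nginx              # custom field used to pick the Kafka topic

output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092"]
  topic: '%{[fields.log_topic]}'    # route each event to the topic named by its log_topic field
```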
Solution Three¶
Using Fluentd for log collection. The advantage is that Fluentd consumes far fewer resources than LogStash clusters, making the architecture simpler and more flexible. The downside is that Fluentd configuration is relatively complex, with a higher usage threshold, making it harder to get started quickly, and the configuration file is relatively complex and cumbersome to modify. Users must be proficient in Fluentd, ElasticSearch, and Kibana to skillfully use and solve various complex business problems. Maintenance personnel for ElasticSearch clusters must also be skilled in cluster performance optimization and resource management to ensure smooth business operations.
This architecture uses Fluentd to collect program logs, then stores the logs in the Elasticsearch cluster, and finally associates Elasticsearch in Kibana to enable log querying.
Guance Architecture¶
The data collection tool Datakit is mainly used to collect various metrics and logs of system operations and aggregates them via Dataway to Guance. In Guance, users can view and analyze their various metrics and logs. DataKit is a critical data collection component in Guance, and all data in Guance originates from DataKit.
DataKit deployment and configuration are extremely simple and clear, and a visual client helps users manage DataKit. DataKit collects not only log data but also APM data, infrastructure, container, middleware, and network performance data, among others. DataKit does not need to rely on components such as ElasticSearch or Kafka to round out its functionality; Guance removes the need for users to think about these issues at all, letting them genuinely focus on their business. DataKit does not require users to master a complex tech stack or pay a high learning cost: simple configuration, paired with Guance, is enough to handle a wide range of complex business problems. Overall, ELK and EFK carry enormous operational costs; just maintaining the ElasticSearch cluster is a significant expense, and juggling hot and cold data to save costs is another headache. With Guance these concerns disappear, and users can concentrate solely on their business.
Hardware Cost Comparison¶
Price is a factor everyone cares about. We will compare the costs between ELK, EFK, and Guance using cloud services.
ELK Cost¶
The basic components of Elastic are open-source, with the main costs originating from hardware. We will calculate the cost of collecting logs from 10 servers, with each server producing 1GB of logs daily, using different architectures.
LogStash Cluster + Kafka Cluster + ElasticSearch Cluster + Kibana
- LogStash Cluster
Billing Item | Value | Unit Price | Cost (Yuan) |
---|---|---|---|
Server | 1 x 2-core 4GB | Monthly subscription fee: 216.7 Yuan/Month | 216.7 |
Storage | 50GB | ESSD: 0.5 Yuan/GB | 25 |
Total | | | 241.7 |
- Kafka Cluster
Billing Item | Value | Unit Price | Cost (Yuan) |
---|---|---|---|
Server | 3 x 4-core 16GB | Monthly subscription fee: 788 Yuan/Month | 2364 |
Storage | 200GB | ESSD: 0.5 Yuan/GB | 300 |
Total | | | 2664 |
- ElasticSearch Cluster
Billing Item | Value | Unit Price | Cost (Yuan) |
---|---|---|---|
Server | 3 x 2-core 8GB | Monthly subscription fee: 383 Yuan/Month | 1149 |
Storage | 500GB | ESSD: 0.5 Yuan/GB | 750 |
Total | | | 1899 |
- Kibana Node
Billing Item | Value | Unit Price | Cost (Yuan) |
---|---|---|---|
Server | 1 x 1-core 2GB | Monthly subscription fee: 104 Yuan/Month | 104 |
Storage | 50 GB | ESSD: 0.5 Yuan/GB | 25 |
Total | | | 129 |
With storage and server configurations sized for this small business scale, the total monthly cost for LogStash + Kafka + ElasticSearch + Kibana is 5175.4 Yuan.
Without using the Kafka cluster in a simple architecture, the total monthly cost for LogStash + ElasticSearch + Kibana is 2511.4 Yuan.
EFK Cost¶
The basic components of Elastic are open-source, with the main costs originating from hardware. We will calculate the cost of collecting logs from 10 servers, with each server producing 1GB of logs daily, using different architectures.
Fluentd + ElasticSearch Cluster + Kibana
- Fluentd
Fluentd does not need to be deployed as a separate cluster, so Fluentd costs are not considered, only ElasticSearch + Kibana are calculated.
- ElasticSearch Cluster
Billing Item | Value | Unit Price | Cost (Yuan) |
---|---|---|---|
Server | 3 x 2-core 8GB | Monthly subscription fee: 383 Yuan/Month | 1149 |
Storage | 500GB | ESSD: 0.5 Yuan/GB | 750 |
Total | | | 1899 |
- Kibana Node
Billing Item | Value | Unit Price | Cost (Yuan) |
---|---|---|---|
Server | 1 x 1-core 2GB | Monthly subscription fee: 104 Yuan/Month | 104 |
Storage | 50 GB | ESSD: 0.5 Yuan/GB | 25 |
Total | | | 129 |
Similarly, calculating storage and server configurations based on the same business scale, the total monthly cost for Fluentd + ElasticSearch + Kibana is 2028 Yuan.
Guance Cost¶
Guance does not charge product fees but charges according to storage usage, considering factors such as the number of DataKit collectors, the amount of log data, backup log data quantity, daily task scheduling times, single DataKit timelines, daily user access monitoring session counts, and application performance monitoring trace counts for pricing. Similarly, calculating costs based on collecting logs from 10 servers with each producing 1GB of logs daily:
Billing Item / Version | Free Plan | Agile Plan |
---|---|---|
Number of DataKits | Unlimited | 5 Yuan/day |
Timelines | Total timelines < 500 | Single DataKit timeline < 500, DataKit cost = DataKit quantity × base price Single DataKit timeline > 500, then DataKit quantity is calculated using the following formula: - DataKit quantity = Current workspace's total timeline quantity / 500 (final result rounded up) - DataKit cost = DataKit quantity × base price |
Log Data Quantity | 2 million entries | 0.5 Yuan/day (per 1 million entries) |
Backup Log Data Quantity | None | 0.2 Yuan/day (per 1 million entries) |
Trace Quantity | 10,000 traces | 1 Yuan/day (per 1 million entries) |
Session Quantity / PV Quantity | 100 Sessions | 1 Yuan/day (per 100 Sessions or per 1,000 PVs) Note: The lower of the two dimensions' actual costs is taken as the final cost |
Cloud Synthetic Testing API Task Count | 5 | 1 Yuan/day (per 1,000 tasks) Note: Statistics do not include API testing data generated by self-built nodes |
Cloud Synthetic Testing Browser Task Count | | 15 Yuan/day (per 1,000 tasks) Note: Statistics do not include Browser testing data generated by self-built nodes |
Task Scheduling Count | 5,000 times | 1 Yuan/day (per 10,000 times) |
SMS Sending Count | None | 0.1 Yuan/day (per message) |
Installing DataKit on 10 servers, with each server producing 1GB of logs per day and an average log entry size of 4KB:
Billing Item | Value | Unit Price | Cost (Yuan) |
---|---|---|---|
Servers | 10 DataKits | Monthly subscription fee: 150 Yuan/Month | 1500 |
Storage | 1GB per day | 0.5 Yuan/day (per 1 million entries) | 325 |
Total | | | 1825 |
Calculated for the same business scale, Guance's total monthly cost is 1825 Yuan.
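As a rough sanity check of these figures (assuming a 30-day month and 4KB per log entry):

- 10 servers × 1 GB/day ÷ 4 KB per entry ≈ 2.6 million log entries per day, which exceeds the Free Plan's listed 2 million entry allowance, so the Agile Plan applies.
- DataKit cost: 5 Yuan/day × 30 days × 10 DataKits = 1500 Yuan/month, matching the table above.
- Adding the 325 Yuan/month of log storage from the table gives the 1825 Yuan/month total.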
Operations Cost Comparison¶
Speaking of operations costs, we all know that ensuring the integrity of clusters requires essential operations management. Let’s analyze the cost differences between different solutions.
ELK Operations Cost¶
Since the basic components of Elastic are open-source, users need to build their own clusters for operations management. For complex ELK architectures, this might include a LogStash cluster + Kafka cluster + ElasticSearch cluster + Kibana node composition. First, if the user's business log volume is large and the calculation logic is complex, the requirements for LogStash clusters, Kafka clusters, and ElasticSearch clusters in terms of scale and configuration would be very high. Second, the qualifications of operations personnel for clusters of different scales are also very demanding, as various issues may arise in large-scale clusters, requiring sufficient experience for performance optimization. Finally, the technical stack requirements for operations personnel are indispensable.
EFK Operations Cost¶
Similar to ELK, EFK is also based on open-source components, requiring users to build their own clusters for management. A better point compared to ELK is that EFK uses Fluentd as a collector to collect data, which can eliminate the need for a LogStash cluster and consumes fewer resources. However, it still requires dynamic scaling of the ElasticSearch cluster based on business scale, facing the same challenges of high operational personnel qualifications and optimization difficulties in large-scale clusters. Additionally, Fluentd configuration is more challenging than LogStash, relying on the operational personnel's experience for configuration. If script optimization is not done well, it could impact the original business on the server, requiring higher operational skills to ensure stable online business operations.
Guance Operations Cost¶
Guance is a SaaS-based observability platform. Users only need to deploy one DataKit on the servers they want to collect data from, enabling remote visualization configuration management through visualized management functionality. Guance provides optimal log parsing templates to help users achieve maximum performance with minimal server pressure through simple configuration, allowing them to focus more on business optimization and expansion without investing time in optimizing the collection end and log parsing clusters. Moreover, Guance offers complete observability functions online, covering infrastructure, containers, middleware, databases, message queues, application links, frontend visits, system security, and network visit performance across the entire chain. It allows users to create their own observation scenarios based on business needs without needing to research or modify immature open-source products, truly achieving zero-operation costs and focusing on business development.
Learning Cost Comparison¶
For setting up or using a log analysis system, learning costs are an indispensable part. To use effectively, one must learn first. Let's see how easy it is to get started with different solutions.
ELK Learning Cost¶
Since Elastic components are open-source, building your own cluster is necessary for analyzing logs. Therefore, understanding ELK's environment preparation and component configuration is essential when using the ELK system for log analysis processing. For instance, you need to understand Elasticsearch basics, master index operations, plan Elasticsearch clusters, and optimize open-source versions of LogStash and Elasticsearch.
EFK Learning Cost¶
EFK requires learning much of the same material as ELK. Fluentd's configuration is more flexible than LogStash's, which makes the learning curve steeper, and performance optimization depends more on the user's own experience; whereas LogStash provides over 120 Grok templates, Fluentd requires users to consult the documentation and invest more learning to use it effectively.
Guance Learning Cost¶
For users, only DataKit is deployed in the environment for data collection. The official configurations for required data collection are mostly provided with configuration references and usage guides (currently supporting over 200 technical stacks). Users wanting to perform log analysis, business observability, or trace tracking only need to learn about the corresponding modules of Guance and enable the DataKit collection items to meet their business needs. This avoids the complexity of setting up an open-source cluster for log analysis, requiring users to learn numerous technologies. It allows users to focus more on handling business problems rather than spending too much time learning various technologies to ensure the operation of open-source clusters.
Usage Experience Comparison¶
Comparing the usage experiences of products is also important. What are the differences in the concepts realized for the same functionality?
ELK Usage Experience¶
To use ELK for log analysis, you first need to set up LogStash, Kafka, ElasticSearch clusters, and Kibana display nodes. Next, to collect and parse data for certain components, you check whether there are any usable templates among the existing 120+. If not, some testing might be needed to collect and parse data. Debugging becomes tedious if performance consumption during collection is too high or parsing is too slow. Furthermore, if log increments are too large, ElasticSearch query index optimization and frequent cluster scaling become inevitable. Finally, learning Kibana's KQL for data querying and display might be necessary. To further track and alert data in real-time, you might need additional open-source components. In the process, 80% of the time from requirement proposal to resolution is spent dealing with various issues of open-source components, leaving only a small portion of time for actual business analysis and optimization.
EFK Usage Experience¶
Using EFK for log analysis similarly cannot escape setting up ElasticSearch clusters and Kibana display nodes. For Fluentd's more flexible configurations, users may need more time to configure and debug to successfully complete data collection and parsing. Using ElasticSearch clusters cannot avoid index optimization and cluster scaling. Finally, learning Kibana for data querying and display is necessary. Similarly, users spend 80% of their time from requirement proposal to resolution dealing with various issues of open-source components, leaving only a small portion of time for actual business analysis and optimization.
Guance Usage Experience¶
Using Guance for log analysis and other needs proves to be very friendly. Firstly, the data collection end DataKit installation and configuration are extremely simple, requiring just one command to complete the installation and configuration. Secondly, for user component log or data collection needs, Guance supports over 200 mainstream technology stacks, providing comprehensive support from infrastructure, containers, middleware, databases, message queues, application links, frontend visits, system security, and network visit performance. It also has a complete documentation system where all user usage needs can be met, and the DataKit collector provides a visual client method to help users reduce usage difficulties. Simultaneously, Guance's platform provides a large number of official scenario views to help users better observe the health of their own businesses.
Guance builds full-chain observability from infrastructure, containers, middleware, databases, message queues, application links, frontend visits, logs, system security, and network visit performance. Based on Guance's standard products, once users correctly configure Datakit, they can quickly realize the complete observability construction of their projects. Meanwhile, based on line protocol (Line Protocol) and Guance's scene-building capabilities, users can customize and conveniently integrate the required observability indicators to achieve further observability.
Simple DataKit Installation¶
Just one command completes the installation of DataKit.
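The exact command comes from the integration page of the user's Guance workspace; as an assumption, the Linux variant has roughly the following shape (the URL and DK_DATAWAY value below are placeholders, not the authoritative command):

```shell
# Illustrative only: copy the real command, including your workspace token, from the Guance console
DK_DATAWAY="https://openway.guance.com?token=<YOUR-TOKEN>" \
  bash -c "$(curl -L https://static.guance.com/datakit/install.sh)"
```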
Convenient Collection Item Management¶
After enabling the DataKit client access, you can modify the collection items directly in the DataKit client. It includes a large number of built-in templates; users only need to enable the corresponding configurations according to what they want to collect to complete data collection.
Rich Official Component Support¶
DataKit includes powerful built-in functionalities—dynamic Grok syntax query debugging, fast data querying based on proprietary DQL grammar, real-time input collection runtime monitoring, edge computing capabilities, and visual client configuration and deployment of data sources. It supports over 200 officially maintained data source integrations and technical stack support, compatible with external data integrations such as Telegraf, Beats, Logstash, Fluentd, etc.
More Powerful Product Capabilities¶
Guance builds full-chain observability from infrastructure, containers, middleware, databases, message queues, application links, front-end visits, logs, system security, and network visit performance. Based on Guance's standard products, once users correctly configure Datakit, they can quickly realize the complete observability construction of their projects. It also supports multi-technology-stack anomaly detection libraries, offering users more options for complex business problems.