Building Spring Cloud Service Observability from 0 to 1 with Guance¶
Overview of the Business System in This Project:¶
This case study uses a simulated enterprise internal office system and builds its observability from scratch with Guance.
The observability setup in this project is based on applications deployed as standalone JAR packages.
Project open-source address: https://gitee.com/y_project/RuoYi-Cloud
Project demo address: http://demo.ruoyi.vip/login
System Introduction:
This system is an open-source backend management system and a rapid development platform for Java EE enterprise applications, built on a classic technology stack (Spring Boot, Apache Shiro, MyBatis, Thymeleaf, Bootstrap, etc.). It ships with numerous built-in modules, including department management, role and user management, menu and button authorization, data permissions, system parameters, log management, and notifications and announcements. Its main goal is to let developers focus on business logic: reducing technical complexity, saving human resources, shortening project cycles, and improving software security and quality. The project suits all kinds of web applications, such as website management backends, member centers, CMS, CRM, and OA systems, and supports deep customization so enterprises can build more powerful systems. All frontend and backend code is encapsulated and easy to use, with a low error rate. Mobile client access is also supported.
Project Feature Modules:
- User Management: Configuration of system users.
- Department Management: Configuration of organizational structures (companies, departments, teams) with tree structure support and data permission settings.
- Position Management: Configuration of positions held by system users.
- Menu Management: Configuration of system menus, operation permissions, and button permission identifiers.
- Role Management: Allocation of menu permissions and setting role-based data scope permissions.
- Dictionary Management: Maintenance of frequently used fixed data in the system.
- Parameter Management: Dynamic configuration of common parameters.
- Notifications and Announcements: Maintenance of system information releases.
- Operation Logs: Recording and querying of normal operation logs; recording and querying of system exception logs.
- Login Logs: Querying login logs including abnormal logins.
- Online Users: Monitoring the status of active users in the system.
- Scheduled Tasks: Online operations (addition, modification, deletion) of scheduled tasks including execution result logs.
- Code Generation: Generation of front-end and back-end code (Java, HTML, XML, SQL) supporting CRUD downloads.
- System APIs: Automatic generation of API documentation based on business code.
- Service Monitoring: Monitoring CPU, memory, disk, stack, and other information of the current system.
- Cache Monitoring: Operations like querying, viewing, and clearing cache.
- Online Builder: Dragging form elements to generate corresponding HTML code.
Technologies Involved in the Office System:
Technology | Version | Guance Observability Inputs Required |
---|---|---|
SpringBoot | 2.3.7.RELEASE | ddtrace |
SpringCloud | Hoxton.SR9 | ddtrace |
SpringCloud Alibaba | 2.2.5.RELEASE | ddtrace |
Nginx | 1.16.1 | nginx |
MySQL | 5.7.17 | mysql |
Redis | 3.2.12 | redis |
Vue | 2.6.0 | rum |
Java | OpenJDK 1.8.0_292 | Statsd or jolokia (statsd used in this example) |
Office System Architecture:
- Web Pages: Hosted in Nginx
- Registration Center: Nacos
- Gateway: Spring Cloud Gateway
- Service Modules: Auth, System
- Database: MySQL
- Cache: Redis
Note: This demo deploys all service modules on a single server, accessing services via different ports.
Guance Overview:¶
Overview: [Guance Official Overview]
Guance is a cloud service platform designed to provide full-stack observability for every complete application in the era of cloud computing and cloud-native systems, fundamentally different from traditional monitoring systems.
Traditional monitoring systems are often domain-specific, similar to many isolated silos within enterprises, such as APM, RUM, logs, NPM, Zabbix, etc., each being a separate and fragmented monitoring system. These silos lead to data isolation within enterprises, causing significant challenges in cross-departmental and cross-platform issue diagnosis, consuming substantial human and material resources.
The concept of observability involves a comprehensive system that provides observability for IT infrastructure supporting business systems, encompassing metrics, logs, and trace components. It achieves unified data collection, storage, query, and presentation, and correlates all observability data (metrics, traces, logs), enabling complete observability of the IT system.
Guance is developed based on this philosophy, aiming to enhance the quality of internal IT services within enterprises and improve end-user experience.
Guance Data Flow:
Note: DQL is a specialized QL language developed by DataFlux for correlated queries of ES and InfluxDB data.
Installing Datakit:¶
- Log in to console.guance.com
- Create a new workspace
- Select Integration — Datakit — Choose the installation command suitable for your environment and copy it
- Install Datakit on the server
- Execute service datakit status (or systemctl status datakit) to check the Datakit status
After installing Datakit, it collects the following content by default, which can be viewed directly under DataFlux — Infrastructure — Hosts.
Select different integration input names to view the corresponding monitoring views. Below the monitoring view, you can also inspect other data such as logs, processes, and containers.
Collector Name | Description |
---|---|
cpu | Collects host CPU usage |
disk | Collects disk usage |
diskio | Collects host disk IO |
mem | Collects host memory usage |
swap | Collects Swap memory usage |
system | Collects host OS load |
net | Collects host network traffic |
host_process | Collects resident processes (alive for over 10 minutes) |
hostobject | Collects basic host information (OS, hardware info, etc.) |
docker | Collects container objects and container logs |
Enabling Specific Inputs:¶
Component Involved | Needs Enabling | Directory of Input File | Relevant Metrics |
---|---|---|---|
Nginx | √ | /usr/local/datakit/conf.d/nginx | Request information, logs, request duration, etc. |
MySQL | √ | /usr/local/datakit/conf.d/db | Connections, QPS, read/write stats, slow queries |
Redis | √ | /usr/local/datakit/conf.d/db | Connections, CPU usage, memory usage, hit rate, loss rate |
JVM | √ | /usr/local/datakit/conf.d/statsd | Heap memory, GC count, GC time |
APM | √ | /usr/local/datakit/conf.d/ddtrace | Response time, error count, error rate |
RUM | Enabled by default | — | UV/PV, LCP, FID, CLS, JS errors |
Note:
RUM Metric | Description | Target Value |
---|---|---|
LCP (Largest Contentful Paint) | Time to render the largest content element in the visible area of the page | Less than 2.5 s |
FID (First Input Delay) | Delay before the page responds to the user's first interaction | Less than 100 ms |
CLS (Cumulative Layout Shift) | Whether the page layout shifts due to dynamic loading; 0 means no shift | Less than 0.1 |
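As an illustration of how these targets are applied, here is a small hypothetical Python helper (not part of Guance; the function name and return shape are made up) that checks sampled Core Web Vitals values against the thresholds in the table:

```python
# Hypothetical helper: classify Core Web Vitals samples against the
# "good" thresholds listed above. Not a Guance API.
def vitals_ok(lcp_s: float, fid_ms: float, cls: float) -> dict:
    """Return a pass/fail flag per metric."""
    return {
        "LCP": lcp_s < 2.5,   # largest content painted within 2.5 s
        "FID": fid_ms < 100,  # first input handled within 100 ms
        "CLS": cls < 0.1,     # cumulative layout shift below 0.1
    }

print(vitals_ok(2.1, 80, 0.05))  # all True for a healthy page
```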
Nginx:¶
For detailed steps, refer to the document <Nginx Observability Best Practices>
Prerequisite: Check whether the http_stub_status_module module is enabled in Nginx; if it is already installed, skip step 1.
- Install the http_stub_status_module (Linux):
To enable this module, Nginx must be recompiled with the --with-http_stub_status_module option. First locate the Nginx source directory:
$ find / | grep configure | grep nginx
$ cd /usr/local/src/nginx-1.20.0/
$ ./configure --with-http_stub_status_module
- Add the nginx_status location in nginx.conf
server {
listen 80;
server_name localhost;
## Port can be customized
location /nginx_status {
stub_status on;
allow 127.0.0.1;
deny all;
}
}
- Modify the nginx inputs in Datakit
# Modify as follows
[[inputs.nginx]]
url = "http://localhost/nginx_status"
[inputs.nginx.log]
files = ["/var/log/nginx/access.log", "/var/log/nginx/error.log"]
# Save the file and restart Datakit
$ service datakit restart
Verify data: curl 127.0.0.1/nginx_status
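Before wiring the input up, it can help to sanity-check the stub_status output by hand. The following Python sketch (an illustration, not Datakit code) parses the plain-text format that curl 127.0.0.1/nginx_status returns:

```python
import re

# Illustrative parser for the plain-text response of the /nginx_status
# endpoint enabled above (stub_status format).
def parse_stub_status(text: str) -> dict:
    active = int(re.search(r"Active connections:\s*(\d+)", text).group(1))
    accepts, handled, requests = map(
        int, re.search(r"\n\s*(\d+)\s+(\d+)\s+(\d+)", text).groups())
    reading, writing, waiting = map(int, re.search(
        r"Reading:\s*(\d+)\s*Writing:\s*(\d+)\s*Waiting:\s*(\d+)", text).groups())
    return {"active": active, "accepts": accepts, "handled": handled,
            "requests": requests, "reading": reading,
            "writing": writing, "waiting": waiting}

# Example stub_status payload (numbers are illustrative)
sample = ("Active connections: 291\n"
          "server accepts handled requests\n"
          " 16630948 16630948 31070465\n"
          "Reading: 6 Writing: 179 Waiting: 106\n")
print(parse_stub_status(sample)["active"])  # 291
```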
- Create an Nginx view in the Guance platform and view data
Refer to Creating Scenarios and Views. Steps: Scenario —> New Scenario —> New Blank Scenario —> System View (Create Nginx). View example (quickly check Nginx-related metrics and logs to determine Nginx health):
MySQL:¶
For detailed steps, refer to the document <MySQL DataKit Integration>
# Log in to MySQL
$ mysql -uroot -p
# Enter password: Solution****
# Create a monitoring account
mysql> CREATE USER 'datakit'@'localhost' IDENTIFIED BY 'Datakit_1234';
# Grant monitoring account privileges
mysql> GRANT PROCESS, SELECT, REPLICATION CLIENT ON *.* TO 'datakit'@'%' IDENTIFIED BY 'Datakit_1234';
# Refresh privileges
mysql> FLUSH PRIVILEGES;
1. Modify MySQL inputs in Datakit¶
$ cd /usr/local/datakit/conf.d/db/
$ cp mysql.conf.sample mysql.conf
$ vim mysql.conf
# Modify as follows
## It's recommended to create a read-only MySQL account
[[inputs.mysql]]
user = "datakit"
pass = "Datakit_1234"
# Save the file and restart Datakit
$ service datakit restart
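For reference, the QPS figure reported for MySQL is a simple delta over the cumulative Questions counter from SHOW GLOBAL STATUS. A minimal sketch of that calculation (illustrative, not Datakit's actual code):

```python
# Illustrative QPS calculation from two samples of the cumulative
# "Questions" counter in SHOW GLOBAL STATUS.
def qps(questions_prev: int, questions_now: int, interval_s: float) -> float:
    """Queries per second between two status samples."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return (questions_now - questions_prev) / interval_s

print(qps(1000, 1600, 60))  # 10.0
```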
2. Create a MySQL view in the Guance platform and view data¶
Refer to Creating Scenarios and Views. Steps: Scenario —> New Scenario —> New Blank Scenario —> System View (Create MySQL). View example (quickly check MySQL-related metrics and logs to determine MySQL health):
Redis:¶
For detailed steps, refer to the document <Redis DataKit Integration>
1. Modify Redis inputs in Datakit¶
$ cd /usr/local/datakit/conf.d/db/
$ cp redis.conf.sample redis.conf
$ vim redis.conf
# Modify as follows
## It's recommended to create a read-only Redis account
[[inputs.redis]]
pass = "Solution******"
# Note: Uncomment the pass line before modifying
[inputs.redis.log]
files = ["/var/log/redis/redis.log"]
# Save the file and restart Datakit
$ service datakit restart
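The hit rate shown for Redis is derived from the keyspace_hits and keyspace_misses counters in INFO stats. A minimal Python sketch of the same calculation (illustrative only; sample counter values are made up):

```python
# Illustrative hit-rate calculation from the counters reported by
# `redis-cli INFO stats` (keyspace_hits / keyspace_misses).
def hit_rate(info_stats: str) -> float:
    stats = dict(line.split(":", 1) for line in info_stats.splitlines()
                 if ":" in line)
    hits = int(stats["keyspace_hits"])
    misses = int(stats["keyspace_misses"])
    total = hits + misses
    return hits / total if total else 0.0

sample = "keyspace_hits:900\nkeyspace_misses:100"
print(hit_rate(sample))  # 0.9
```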
2. Create a Redis view in the Guance platform and view data¶
Refer to Creating Scenarios and Views. Steps: Scenario —> New Scenario —> New Blank Scenario —> System View (Create Redis). View example (quickly check Redis-related metrics and logs to determine Redis health):
JVM:¶
For detailed steps, refer to the document <JVM DataKit Integration>
1. Modify JVM inputs in Datakit¶
By default, the JVM (statsd) input needs no modification; only the sample conf file needs to be copied:
$ cd /usr/local/datakit/conf.d/statsd/
$ cp statsd.conf.sample ddtrace-jvm-statsd.conf
$ vim ddtrace-jvm-statsd.conf
# No modifications needed by default
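For context, the statsd input listens on UDP port 8125 for plain-text datagrams in the StatsD/DogStatsD format, which is what the dd-java-agent's JMX fetch emits. A minimal Python sketch of that wire format (the metric name and tags below are made up for illustration):

```python
import socket

# Illustrative builder for the StatsD/DogStatsD wire format:
# "<name>:<value>|<type>[|#tag:value,...]"
def statsd_datagram(name, value, mtype="g", tags=None) -> bytes:
    payload = f"{name}:{value}|{mtype}"
    if tags:
        payload += "|#" + ",".join(f"{k}:{v}" for k, v in tags.items())
    return payload.encode()

def send(datagram: bytes, host="127.0.0.1", port=8125) -> None:
    # Fire-and-forget UDP, like a real StatsD client.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(datagram, (host, port))
    sock.close()

print(statsd_datagram("jvm.heap_memory", 128.5,
                      tags={"service": "ruoyi-gateway"}))
# b'jvm.heap_memory:128.5|g|#service:ruoyi-gateway'
```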
2. Modify the Java application startup script¶
Since JVM and APM both rely on ddtrace-agent for data collection, see APM-related content [APM]
3. Create a JVM view in the Guance platform and view data¶
Refer to Creating Scenarios and Views. Steps: Scenario —> New Scenario —> New Blank Scenario —> System View (Create JVM). View example (quickly check JVM-related metrics and logs to determine JVM health):
APM (Application Performance Monitoring):¶
For detailed steps, refer to the document Distributed Tracing (APM) Best Practices
Guance supports multiple APM tools that adhere to the OpenTracing protocol, including ddtrace, SkyWalking, Zipkin, Jaeger, etc. This example uses ddtrace for APM observability.
1. Modify APM (ddtrace) inputs in Datakit¶
By default, the ddtrace input needs no modification; only the sample conf file needs to be copied:
$ cd /usr/local/datakit/conf.d/ddtrace/
$ cp ddtrace.conf.sample ddtrace.conf
$ vim ddtrace.conf
# No modifications needed by default
2. Modify the Java application startup script¶
APM observability requires adding an agent to the Java application. This agent collects performance data during application startup through bytecode injection, capturing method calls, SQL calls, external system calls, etc., to monitor the quality of the application code.
# Original application startup scripts
$ cd /usr/local/ruoyi/
$ nohup java -Dfile.encoding=utf-8 -jar ruoyi-gateway.jar > logs/gateway.log 2>&1 &
$ nohup java -Dfile.encoding=utf-8 -jar ruoyi-auth.jar > logs/auth.log 2>&1 &
$ nohup java -Dfile.encoding=utf-8 -jar ruoyi-modules-system.jar > logs/system.log 2>&1 &
Kill existing application startup processes and add ddtrace parameters before restarting the application. See the following image for details:
# Application startup script with ddtrace-agent added
$ cd /usr/local/ruoyi/
$ nohup java -Dfile.encoding=utf-8 -javaagent:dd-java-agent-0.80.0.jar -XX:FlightRecorderOptions=stackdepth=256 -Ddd.logs.injection=true -Ddd.service.name=ruoyi-gateway -Ddd.service.mapping=redis:redis_ruoyi -Ddd.agent.port=9529 -Ddd.jmxfetch.enabled=true -Ddd.jmxfetch.check-period=1000 -Ddd.jmxfetch.statsd.port=8125 -Ddd.version=1.0 -jar ruoyi-gateway.jar > logs/gateway.log 2>&1 &
$ nohup java -Dfile.encoding=utf-8 -javaagent:dd-java-agent-0.80.0.jar -XX:FlightRecorderOptions=stackdepth=256 -Ddd.logs.injection=true -Ddd.service.name=ruoyi-auth -Ddd.service.mapping=redis:redis_ruoyi -Ddd.env=staging -Ddd.agent.port=9529 -Ddd.jmxfetch.enabled=true -Ddd.jmxfetch.check-period=1000 -Ddd.jmxfetch.statsd.port=8125 -Ddd.version=1.0 -jar ruoyi-auth.jar > logs/auth.log 2>&1 &
$ nohup java -Dfile.encoding=utf-8 -javaagent:dd-java-agent-0.80.0.jar -XX:FlightRecorderOptions=stackdepth=256 -Ddd.logs.injection=true -Ddd.service.name=ruoyi-modules-system -Ddd.service.mapping=redis:redis_ruoyi,mysql:mysql_ruoyi -Ddd.env=dev -Ddd.agent.port=9529 -Ddd.jmxfetch.enabled=true -Ddd.jmxfetch.check-period=1000 -Ddd.jmxfetch.statsd.port=8125 -Ddd.version=1.0 -jar ruoyi-modules-system.jar > logs/system.log 2>&1 &
If APM data is not visible in the Guance platform, check the Datakit logs
cat /var/log/datakit/gin.log
Normal logs:
Error logs:
Modify /usr/local/datakit/conf.d/ddtrace/ddtrace.conf to match the following image. Ensure the path in ddtrace.conf matches the path in datakit/gin.log.
Explanation of ddtrace startup parameters (environment variables):
- -Ddd.env: Custom environment type, optional.
- -Ddd.tags: Custom application tags, optional.
- -Ddd.service.name: Custom application name, required.
- -Ddd.agent.port: Data upload port (default 9529), required.
- -Ddd.version: Application version, optional.
- -Ddd.trace.sample.rate: Sampling rate (full sampling by default), optional; e.g., 0.6 keeps 60% of traces.
- -Ddd.service.mapping: Adds aliases for the Redis, MySQL, etc. instances called by the current application, to distinguish them from those called by other applications; optional. For example, projects A and B call MySQL-a and MySQL-b respectively. Without mapping, Guance shows both projects calling a database named MySQL; with mappings configured, Guance shows project A calling MySQL-a and project B calling MySQL-b.
- -Ddd.agent.host: Data transmission target IP (default localhost), optional.
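Since each service repeats the same family of -Ddd.* flags, a small hypothetical helper can keep the three startup scripts consistent (this is an illustration, not part of ddtrace):

```python
# Hypothetical helper that assembles -Ddd.* startup flags from a dict,
# so the per-service startup scripts stay consistent.
def dd_flags(options: dict) -> str:
    return " ".join(f"-Ddd.{key}={value}" for key, value in options.items())

print(dd_flags({
    "service.name": "ruoyi-auth",
    "agent.port": 9529,
    "env": "staging",
    "version": "1.0",
}))
# -Ddd.service.name=ruoyi-auth -Ddd.agent.port=9529 -Ddd.env=staging -Ddd.version=1.0
```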
3. View APM Data in the Guance Platform¶
APM (Application Performance Monitoring) is a default built-in module in Guance, requiring no scenario or view creation for viewing.
Path: Guance platform — Application Performance Monitoring
View Example: (quickly check application calls, topology, anomaly data, and other APM-related data)
RUM (Real User Monitoring):¶
For detailed steps, refer to the document [User Access (RUM) Observability Best Practices]
1. Log in to the Dataflux platform¶
2. Choose User Access Monitoring —> New Application —> Select Web Type —> Synchronous Loading¶
3. Integrate Guance RUM observability JS file into the frontend index.html page¶
$ cd /usr/local/ruoyi/dist/
# Remember to back up first
$ cp index.html index.html.bkd
# Add the RUM JS (df-js) snippet to index.html before </head>, then save the file. Example:
$ vim index.html
<script src="https://static.guance.com/browser-sdk/v2/dataflux-rum.js" type="text/javascript"></script>
<script>
window.DATAFLUX_RUM &&
window.DATAFLUX_RUM.init({
applicationId: 'xxxxxxxxxxxxxxxxxxxxxxxxxx',
datakitOrigin: 'xxx.xxx.xxx.xxx:9529',
env: 'test',
version: '1.0.0',
trackInteractions: true,
allowedTracingOrigins: ["xxx.xxx.xxx.xxx"]
})
</script></head>
# Replace xxx with actual values as needed, refer to the following for detailed changes:
Notes:
- datakitOrigin: Data transmission address. In production, if configured as a domain name, requests can be forwarded to any server running datakit-9529. If frontend traffic is high, consider adding a load balancer between the domain name and datakit servers. The load balancer must have port 9529 open and forward requests to multiple datakit-9529 servers. Multiple datakit servers handle RUM data without session interruption.
- allowedTracingOrigins: Connects frontend (RUM) and backend (APM). Fill in the domain names (production) or IPs (testing) of backend servers interacting with the frontend. Use Case: Slow frontend visits caused by backend code issues can be traced via RUM data to APM data for root cause analysis.
- env: Required, environment of the application, e.g., test or product.
- version: Required, application version number.
- trackInteractions: Tracks user interactions, e.g., button clicks, form submissions.
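To make the allowedTracingOrigins behavior concrete, here is an illustrative Python re-implementation of the matching logic: entries can be exact origins or regular expressions. This mirrors the documented behavior of the RUM SDK but is not its actual code:

```python
import re

# Illustrative re-implementation of allowedTracingOrigins matching:
# entries may be exact origin strings or compiled regular expressions.
def origin_allowed(origin: str, allowed: list) -> bool:
    for entry in allowed:
        if isinstance(entry, re.Pattern):
            if entry.match(origin):
                return True
        elif entry == origin:
            return True
    return False

allowed = ["https://api.example.com",
           re.compile(r"https://.*\.example\.com")]
print(origin_allowed("https://auth.example.com", allowed))  # True
print(origin_allowed("https://evil.com", allowed))          # False
```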
4. Save, Verify, and Publish the Page¶
Open a browser to visit the target page and use F12 Developer Tools to check if there are related RUM requests with status code 200.
Note: If the F12 Developer Tools show that data is not being reported and the connection is refused, verify port connectivity with telnet IP 9529. If it is blocked, modify /usr/local/datakit/conf.d/datakit.conf and change http_listen from localhost to 0.0.0.0.
5. View RUM Data in User Access Monitoring¶
6. Demonstration of RUM and APM Data Correlation¶
Configuration method: [Java Example]
Use Case: Frontend and backend correlation, binding frontend request data with backend method execution performance data one-to-one, facilitating cross-team and cross-departmental issue localization. For instance, slow frontend visits due to backend service anomalies can be quickly diagnosed. Example:
Security Checker:¶
Security Checker Overview: [Guance Official Overview]
Note: Currently only Linux is supported. For detailed steps, refer to the document [Security Checker Installation and Configuration]
1. Install Security Checker¶
## Install
$ bash -c "$(curl https://static.guance.com/security-checker/install.sh)"
## Or execute sudo datakit --install scheck
## Update
$ bash -c "$(curl https://static.guance.com/security-checker/install.sh) --upgrade"
## Start/Stop Commands
$ systemctl start/stop/restart/status scheck
## Or
$ service scheck start/stop/restart/status
## Installation directory /usr/local/scheck
2. Connect Security Checker to Datakit¶
Security Checker sends its data to Datakit, which forwards it to the Dataflux platform.
$ cd /usr/local/scheck/
$ vim scheck.conf
# ##(required) directory contains script
rule_dir='/usr/local/scheck/rules.d'
# ##(required) output of the check result, support local file or remote http server
# ##localfile: file:///your/file/path
# ##remote: http(s)://your.url
output='http://127.0.0.1:9529/v1/write/security'
# ##(optional)global cron, default is every 10 seconds
#cron='*/10 * * * *'
log='/usr/local/scheck/log'
log_level='info'
#disable_log=false
3. View Security Checker Data¶
Logs:¶
For detailed steps, refer to the document Log Collection
1. Standard Log Collection (Nginx, MySQL, Redis, etc.)¶
Enable various built-in inputs in DataKit to collect logs directly, such as Nginx, Redis, Containers, ES, etc.
Example: Nginx
$ cd /usr/local/datakit/conf.d/nginx/
$ cp nginx.conf.sample nginx.conf
$ vim nginx.conf
## Modify log paths to the correct Nginx paths
[inputs.nginx.log]
files = ["/usr/local/nginx/logs/access.log","/usr/local/nginx/logs/error.log"]
pipeline = "nginx.p"
## Pipeline refers to grok statements for text log parsing. DataKit has built-in pipelines for Nginx, MySQL, etc., located in /usr/local/datakit/pipeline/. There's no need to modify the pipeline path; DataKit automatically reads it.
View Display:
2. Custom Log Collection (Application Logs, Business Logs, etc.)¶
Example: Application Logs
Pipeline (log grok parsing) [Guance Official Documentation]
$ cd /usr/local/datakit/conf.d/log/
$ cp logging.conf.sample logging.conf
$ vim logging.conf
## Modify log paths to the correct application log paths
## source and service are required fields; set them to the application name to differentiate logs
[[inputs.logging]]
logfiles = [
  "/usr/local/ruoyi/logs/ruoyi-system/error.log",
  "/usr/local/ruoyi/logs/ruoyi-system/info.log",
]
source = "ruoyi-system"
service = "ruoyi-system"
# pipeline = "ruoyi_system.p"
## Pipeline refers to grok statements for text log parsing. If this line stays commented out, the Guance platform displays raw log content; if set, logs are parsed with the specified .p file, which must be written manually.
$ cd /usr/local/datakit/pipeline/
$ vim ruoyi_system.p
## Example:
# Log format
#2021-06-25 14:27:51.952 [http-nio-9201-exec-7] INFO c.r.s.c.SysUserController - [list,70] ruoyi-08-system 5430221015886118174 6503455222153372731 - Query user
## Example grok, copy the following content to ruoyi_system.p
grok(_, "%{TIMESTAMP_ISO8601:time} %{NOTSPACE:thread_name} %{LOGLEVEL:level} \\s+%{NOTSPACE:class_name} - \\[%{NOTSPACE:method_name},%{NUMBER:line}\\] %{DATA:service} %{DATA:trace_id} %{DATA:span_id} - %{GREEDYDATA:msg}")
default_time(time)
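The same extraction can be reproduced outside DataKit with an equivalent regular expression, which is a convenient way to check field names against a sample line before deploying the .p file (illustrative Python; the named groups mirror the grok fields above):

```python
import re

# Regex equivalent of the grok statement above; named groups mirror
# the grok field names (time, thread_name, level, class_name, ...).
LOG_RE = re.compile(
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<thread_name>\S+) "
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<class_name>\S+) - "
    r"\[(?P<method_name>[^,\]]+),(?P<line>\d+)\] "
    r"(?P<service>\S+) (?P<trace_id>\S+) (?P<span_id>\S+) - "
    r"(?P<msg>.*)"
)

line = ("2021-06-25 14:27:51.952 [http-nio-9201-exec-7] INFO "
        "c.r.s.c.SysUserController - [list,70] ruoyi-08-system "
        "5430221015886118174 6503455222153372731 - Query user")
fields = LOG_RE.match(line).groupdict()
print(fields["level"], fields["trace_id"], fields["msg"])
# INFO 5430221015886118174 Query user
```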
View Display:
Creating Nginx Log Anomaly Detection:¶
- Open the Guance platform -> Anomaly Detection Library -> New Detection Library -> Custom Monitoring
- Click the newly created detection library name -> New Detection Rule -> New Log Detection
- Fill in the detection rule content and save
Rule Name: Nginx Log ERROR Count Exceeds Threshold Detection
Detection Metric: as shown in the figure
Trigger Condition: Result >= 5
Event Name: Nginx Log ERROR Count Exceeds Threshold Alert
Event Content:
Level: {status}
Host: {host}
Content: Too many log errors; the error count is {{ Result }}
Recommendation: Too many log errors indicate potential application issues; check the application's health.
Detection Frequency: 1 minute
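The trigger condition above boils down to counting ERROR entries in the detection window and firing once the count reaches the threshold. A minimal sketch of that logic (illustrative, not Guance's implementation):

```python
# Illustrative version of the detection rule's logic: count ERROR
# entries in the window and alert when the count reaches the threshold.
def should_alert(log_lines: list, threshold: int = 5) -> bool:
    errors = sum(1 for line in log_lines if " ERROR " in f" {line} ")
    return errors >= threshold

window = ["GET /api ok", "ERROR upstream timed out"] + ["ERROR refused"] * 4
print(should_alert(window))  # True
```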
Verifying Anomaly Detection Mechanism:¶
- Query ruoyi-gateway-related processes on the server and kill them
- Visit the ruoyi website (refresh multiple times, at least 5 times)
- Check event-related content on the Guance platform
- Check Nginx log-related content and relevant views
Troubleshooting During Input Activation:¶
- Check input error messages
Datakit uploads input status information to the Guance platform at regular intervals; it can be viewed under Infrastructure — Specific Host.
Example: Apache service down, input shows error
- Check data reporting information
Method 1: open http://127.0.0.1:9529/monitor in a browser, or run curl 127.0.0.1:9529/monitor in a terminal
Method 2: open http://127.0.0.1:9529/stats in a browser, or run curl 127.0.0.1:9529/stats in a terminal
- Check Datakit logs
Datakit log directory: /var/log/datakit
Creating Scenarios and Views:¶
Using System View Templates (Using Nginx as an Example)¶
- Scenario —> New Scenario
- New Blank Scenario
- Input Scenario Name —> Confirm
- System View —> Nginx View (Create)
- View Nginx View
- Other
Other view creation methods are similar. For custom view content and layout requirements, you can create blank views and build them yourself.
Summary:¶
Thus, we have achieved comprehensive observability for the demo office system across traces, metrics, logs, and infrastructure.
Guance is user-friendly and convenient to manage, providing a unified view for all metrics, traces, and logs, all associated through the same tag (host). This makes it easy to achieve cascading on the platform, thus realizing overall IT system observability.
Finally, combining anomaly detection can achieve integrated system management, enhancing operational and development efficiency and IT decision-making capabilities!
The product continues to improve, with more features, better usability, and a more polished UI.
Guance aims to be the champion of observability!