Reference Table¶
The Reference Table feature allows Pipeline to import external data and use it during data processing.
Warning
This feature consumes a significant amount of memory. For example, for a dataset of 1.5 million non-repeating rows (two string columns plus one column each of int, float, and bool), which occupies roughly 200 MB of disk space as a JSON file, memory usage stays at 950 MB to 1.2 GB, and peak memory during updates reaches 2.2 GB to 2.7 GB. To store the data on disk instead, set use_sqlite = true (see Using SQLite to Save Imported Data below).
Table Structure and Column Data Types¶
A reference table is a two-dimensional table. Tables are distinguished by name, and each table must have at least one column. All elements within a column must share the same data type, which must be one of int (int64), float (float64), string, or bool.
Primary keys are not yet supported. A query can use any column as the lookup key, and the first row of the result set is returned as the query result; the lookup semantics are illustrated by the sketch after the example table below. The following is an example of a table structure:
- Table name: refer_table_abc
- Column names (col1, col2, ...), column data types (int, float, ...), and row data:

| col1: int | col2: float | col3: string | col4: bool |
| --- | --- | --- | --- |
| 1 | 1.1 | "abc" | true |
| 2 | 3 | "def" | false |
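To make the lookup behavior concrete, the following is a minimal sketch in Python of the "query by any column, return the first matching row" semantics for the table above. It only illustrates the semantics and is not DataKit's implementation; the query helper is made up for this illustration.

# Illustration only: models "query by any column, return the first matching row"
# for the refer_table_abc example above. This is not DataKit code.
rows = [
    {"col1": 1, "col2": 1.1, "col3": "abc", "col4": True},
    {"col1": 2, "col2": 3.0, "col3": "def", "col4": False},
]

def query(column, value):
    # Return the first row whose `column` equals `value`, or None if nothing matches.
    for row in rows:
        if row.get(column) == value:
            return row
    return None

print(query("col3", "def"))  # {'col1': 2, 'col2': 3.0, 'col3': 'def', 'col4': False}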
Importing Data from External Sources¶
Configure the reference table URL (refer_table_url) and the pull interval (refer_table_pull_interval, default 5 minutes) in the configuration file datakit.conf.
Warning
Currently, the HTTP response Content-Type of the address specified by refer_table_url must be Content-Type: application/json.
- The data is a list of tables; each table is a map with the following fields (one way to generate and validate such a payload is sketched after the example below):

| Field Name | Description | Data Type |
| --- | --- | --- |
| table_name | Table name | string |
| column_name | All column names | []string |
| column_type | Column data types, corresponding one-to-one with the column names; allowed values are "int", "float", "string", and "bool" | []string |
| row_data | Row data; int, float, and bool values may be given either in their native type or as a string representation; the elements of each []any must correspond one-to-one with the column names and types | [][]any |
- JSON structure:
[
    {
        "table_name": string,
        "column_name": []string{},
        "column_type": []string{},
        "row_data": [
            []any{},
            ...
        ]
    },
    ...
]
- Example:
[
    {
        "table_name": "table_abc",
        "column_name": ["col", "col2", "col3", "col4"],
        "column_type": ["string", "float", "int", "bool"],
        "row_data": [
            ["a", 123, "123", "false"],
            ["ab", "1234.", "123", true],
            ["ab", "1234.", "1235", "false"]
        ]
    },
    {
        "table_name": "table_ijk",
        "column_name": ["name", "id"],
        "column_type": ["string", "string"],
        "row_data": [
            ["a", "12"],
            ["a", "123"],
            ["ab", "1234"]
        ]
    }
]
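If the payload is generated from your own data source, a small helper script can check that row_data stays consistent with column_name and column_type before the file is published. The sketch below is illustrative only and not part of DataKit; the file name build_refer_table.py, the validate helper, and the CHECKS table are invented for this example, and the data is simply the table_abc example above.

# build_refer_table.py -- illustrative helper (not part of DataKit) that assembles a
# reference-table payload in the format described above, checks it, and writes test.json.
import json

def _parses(value, typ):
    # True if `value` can be converted to `typ` (e.g. "123" -> int, "1234." -> float).
    try:
        typ(value)
        return True
    except (TypeError, ValueError):
        return False

# Lenient per-type checks: native values or their string form are both accepted,
# mirroring the row_data description above.
CHECKS = {
    "int":    lambda v: isinstance(v, int) or _parses(v, int),
    "float":  lambda v: isinstance(v, (int, float)) or _parses(v, float),
    "string": lambda v: isinstance(v, str),
    "bool":   lambda v: isinstance(v, bool) or v in ("true", "false"),
}

def validate(table):
    cols, types = table["column_name"], table["column_type"]
    assert len(cols) == len(types), "column_name and column_type must align"
    for row in table["row_data"]:
        assert len(row) == len(cols), "each row must match the number of columns"
        for value, typ in zip(row, types):
            assert CHECKS[typ](value), f"{value!r} does not fit declared type {typ}"

tables = [
    {
        "table_name": "table_abc",
        "column_name": ["col", "col2", "col3", "col4"],
        "column_type": ["string", "float", "int", "bool"],
        "row_data": [
            ["a", 123, "123", "false"],
            ["ab", "1234.", "123", True],
            ["ab", "1234.", "1235", "false"],
        ],
    },
]

for table in tables:
    validate(table)

with open("test.json", "w") as f:
    json.dump(tables, f)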
Using SQLite to Save Imported Data¶
To save the imported data to an SQLite database, set use_sqlite to true:
[pipeline]
refer_table_url = "http[s]://host:port/path/to/resource"
refer_table_pull_interval = "5m"
use_sqlite = true
sqlite_mem_mode = false
When SQLite is used to save the data, setting sqlite_mem_mode to true enables SQLite's in-memory mode; the default is SQLite's disk mode.
Warning
This feature is currently unsupported on windows-386.
Practical Example¶
Write the JSON example above into a file named test.json. On Ubuntu 18.04+, install NGINX via apt and place the file under /var/www/html.
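If NGINX is not convenient, a quick alternative for local testing is Python's built-in http.server (assuming Python 3 is available), which on most systems serves .json files with Content-Type: application/json. The port below is illustrative; adjust the curl command and refer_table_url accordingly if it differs from 80.

# serve_refer_table.py -- minimal local file server for testing (illustrative, not part
# of DataKit). Run it from the directory containing test.json so that
# http://localhost:8080/test.json is returned with Content-Type: application/json.
from http.server import HTTPServer, SimpleHTTPRequestHandler

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), SimpleHTTPRequestHandler).serve_forever()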
Execute curl -v localhost/test.json to test whether the file can be retrieved via HTTP GET. The output should look like this:
...
< Content-Type: application/json
< Content-Length: 522
< Last-Modified: Tue, 16 Aug 2022 06:20:52 GMT
< Connection: keep-alive
< ETag: "62fb3744-20a"
< Accept-Ranges: bytes
<
[
    {
        "table_name": "table_abc",
        "column_name": ["col", "col2", "col3", "col4"],
        "column_type": ["string", "float", "int", "bool"],
        "row_data": [
...
Modify the value of refer_table_url in the configuration file datakit.conf:
[pipeline]
refer_table_url = "http://localhost/test.json"
refer_table_pull_interval = "5m"
use_sqlite = false
sqlite_mem_mode = false
Enter the Datakit pipeline/logging directory and create a test script refer_table_for_test.p with the following content:
# Extract table name, column names, and column values from input
json(_, table)
json(_, key)
json(_, value)
# Query the reference table and append the matching row's columns; by default they are added to the data as fields
query_refer_table(table, key, value)
cd /usr/local/datakit/pipeline/logging
vim refer_table_for_test.p
datakit pipeline -P refer_table_for_test.p -T '{"table": "table_abc", "key": "col2", "value": 1234.0}' --date
From the output below, we can see that the columns col, col2, col3, and col4 were successfully appended to the result:
2022-08-16T15:02:14.150+0800 DEBUG refer-table refertable/cli.go:26 performing request[method GET url http://localhost/test.json]
{
    "col": "ab",
    "col2": 1234,
    "col3": 123,
    "col4": true,
    "key": "col2",
    "message": "{\"table\": \"table_abc\", \"key\": \"col2\", \"value\": 1234.0}",
    "status": "unknown",
    "table": "table_abc",
    "time": "2022-08-16T15:02:14.158452592+08:00",
    "value": 1234
}