# Common Configs
If you have seen the examples in basics/overview, you probably already know how to add configs to connect different data sources to Nebula.
In this section, we will list all common configs and explain their usage.
## Config Structure

The whole cluster config is a YAML file which you specify for the Nebula server to start with (e.g. `--CLS_CONF cluster.yaml`). In the config, there are just a few major sections:
- `version`: version value of the config.
- `server`: the nebula cluster settings; useful sub-configs are:
  - `anode`: boolean value - treat the nebula server as one of the normal nebula nodes, defaults to "false".
  - `auth`: boolean value - indicates whether nebula query execution will enforce auth checks. If set to true, the necessary auth model needs to be set up in the same config (check basics/access_control for details), defaults to "false".
  - `meta`: specifies which metadata tracking service to use; it's either an external meta DB service or the built-in one implemented using LevelDB + cloud storage (such as S3) for data persistence. It can be used to store nebula shortened URLs and some other server running metrics, defaults to none.
```yaml
# configure meta db as none
meta:
  db: none
```

or

```yaml
# use the built-in metadb and use s3://nebula/meta as the data store.
meta:
  db: native
  store: s3://nebula/meta
```
- `discovery`: configures how the Nebula Server and Nebula Nodes in the cluster discover each other. Only two options are provided so far; check basics/discovery for details.

```yaml
# most common way - let nebula nodes connect to the nebula server and self-register
discovery:
  method: service
```

or

```yaml
# useful for a static setup with known addresses for all nebula nodes
discovery:
  method: config
```
When `method: config` is specified, all nebula nodes need to be configured in this same config file, as shown in the next config.

- `nodes`: configs to list all static nebula nodes; this is only used when discovery/method is specified as `config`. Here is an example:
```yaml
# will be provided by environment
nodes:
  - node:
      host: nebula.node.1
      port: 9199
  - node:
      host: nebula.node.2
      port: 9199
```
- `tables`: this is the most actively configured section, used to add/change/update all the data sources available to the current nebula cluster. It includes all table definitions. A table could be backed by a cloud storage location, a real-time stream, an HTTP API endpoint, or simply a data service like a Google Spreadsheet. This config also includes `data format` settings (csv, json, thrift, parquet, spreadsheet, ...) as well as the `data rolling policy`.

We will use the following documentation to describe the different ways to connect different data sources.

```yaml
tables:
  table-name-1: {configs}
  table-name-2: {configs}
  ...
```
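Putting these sections together, here is a minimal sketch of what a `cluster.yaml` could look like. Treat the exact nesting as illustrative: this sketch assumes `discovery`, `nodes`, and `tables` sit at the top level alongside `version` and `server`, and the `version` value is just a placeholder.

```yaml
# minimal illustrative cluster config - nesting and values are assumptions, not authoritative
version: 1.0
server:
  anode: false
  auth: false
  meta:
    db: none
discovery:
  method: config
nodes:
  - node:
      host: nebula.node.1
      port: 9199
tables:
  table-name-1: {configs}
```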
## Resource Constraints

For each table, we can define resource constraints. This is required for rolling tables so that Nebula only keeps the most recent data in memory; data past the constraints will either be erased or backed up for future query.
`retention` is used to specify the maximum size and maximum time window for the table; whichever is reached first, Nebula will purge the old data segments.
Let's take a look at one example:
```yaml
tables:
  table-name-1:
    retention:
      max-mb: 10000
      max-hr: 0.5
```
This config asks the nebula cluster to keep a maximum of 10GB of data for `table-name-1`, or a maximum of half an hour of data if applicable. For real-time streaming, this means Nebula serves the last 30 minutes of data.
- `max-mb`: megabytes - an integer.
- `max-hr`: hours - a double, converted to seconds internally.
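As another sketch (the numbers here are arbitrary, not recommendations), a table that should keep roughly one week of data capped at about 50GB could be configured like this:

```yaml
tables:
  table-name-2:
    retention:
      max-mb: 50000  # ~50GB cap on in-memory data for this table
      max-hr: 168    # 7 days, converted to seconds internally
```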
## Schema

A schema is needed for every single table definition, regardless of where the data comes from or what the data format is. It's defined as a string in a serialized schema format.

Here is an example:
```yaml
tables:
  table-name-1:
    schema: "ROW<id:int, event:string, tag:string, items:list<string>, flag:bool, value:tinyint>"
```
Basically, it is a collection of `name:type` pairs. Nebula is `hive` data type compatible, so here is the list of supported types. Some types have multiple aliases; whichever alias you use, it's effectively the same:
- bool, boolean
- tinyint, byte, char
- smallint, short
- int, integer
- bigint, long
- real, float
- double, decimal
- int128
- varchar, string, binary
- array, list, vector
- map, dictionary
- row, struct
Note that compound types like array/map/struct are not well tested so far.
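For instance, the two hypothetical tables below declare equivalent schemas using different aliases from the list above (`long`/`bigint`, `string`/`varchar`, `list`/`array`):

```yaml
# hypothetical tables for illustration - the alias choice does not change the resulting type
tables:
  events-a:
    schema: "ROW<id:long, name:string, tags:list<string>>"
  events-b:
    schema: "ROW<id:bigint, name:varchar, tags:array<varchar>>"
```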
## Data, Loader, Source, Backup, Format

These are common configs for each table definition; depending on the type of data, their values vary a lot.
`data`: this defines the data source type; it could be one of these supported types:
- `s3`: data from AWS S3
- `gs`: data from Google Cloud Storage
- `kafka`: data from Kafka real-time streaming
- `local`: data from a local file system (mostly used for single-node testing)
- `http`: data from an HTTP endpoint
`loader`: this config indicates how Nebula loads data into the nebula cluster; it's a high-level data ingestion policy. Here are some possible values:
- `Swap`: the table has one location to read data from; whenever it changes, Nebula swaps the new data in (a sketch follows the s3 example below).
- `Roll`: the most commonly used; it asks Nebula to compute how many different locations it could read data from under the `Resource Constraints` and generates ingestion specs to roll data in. It will evict old data past the constraints and ingest new data whenever it's available.
- `Api`: used mostly for on-demand data loading when a client uses the API to ask Nebula to load some data.
`source`: source is usually an address, path, or location template.

For example, an S3 data source could define `source` like below:
```yaml
tables:
  table-name-1:
    data: s3
    loader: Roll
    source: s3://bucket/path/{DATE}/part=abc/
    backup: s3://nebula/n107/
    format: parquet
```
This definition asks Nebula to load data from the path `s3://bucket/path/{DATE}/part=abc/`, with `DATE` as a macro used to compute the real path based on the `max-hr` definition from the resource constraints. The data is in `parquet` file format; under the folder, there are one or more parquet files.
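As referenced in the loader list above, here is a sketch of what a `Swap` table might look like; the bucket, path, and csv format are illustrative assumptions rather than values from a real deployment:

```yaml
# hypothetical Swap table - one fixed location, re-ingested whenever its content changes
tables:
  table-name-2:
    data: gs
    loader: Swap
    source: gs://bucket/path/current/
    backup: gs://nebula/backup/
    format: csv
```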
For another example, let's look at how Kafka works:
```yaml
tables:
  table-name-1:
    data: kafka
    topic: topic-1
    loader: Streaming
    source: <brokers>
    backup: s3://nebula/n116/
```
In this case, `source` is used to hold the Kafka broker list that Nebula connects to in order to fetch data from `topic-1`.
`backup`: always an accessible location where Nebula can back data up when it becomes evictable due to resource constraints. Nebula doesn't exercise this often today; it may be better supported in the future when Nebula adds support for querying historical data.

`format`: data format - the wide and growing range of data formats supported in Nebula so that we can ingest all types of data sources and query them through a single interface. As of this writing, the supported formats include:
- CSV
- JSON
- THRIFT
- PARQUET
- GOOGLE SHEET
We will expand on the format-specific settings in the post about "data format".
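To tie the pieces of this section together, here is a hypothetical consolidated sketch of a streaming table that also declares its schema, retention, format, and time (time is covered in the next section); the `json` format and the column names are assumptions for illustration:

```yaml
# hypothetical streaming table combining data, loader, source, backup, format,
# schema, retention, and time - values are illustrative only
tables:
  table-name-1:
    data: kafka
    topic: topic-1
    loader: Streaming
    source: <brokers>
    backup: s3://nebula/n116/
    format: json
    schema: "ROW<id:int, event:string, value:bigint>"
    retention:
      max-mb: 10000
      max-hr: 0.5
    time:
      type: current
```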
## Time

The time section is very important, as it impacts how Nebula organizes the data internally. Nebula has an implicit built-in time column (`_time_`) which is available for every single table.
Also, most rolling data relies on the `time` definition to materialize time-related macros and compute the real data location to read from.

Similarly, let's use examples to explain the different ways to define time.
Example 1: use a static value for time
```yaml
tables:
  table-name-1:
    time:
      type: static
      value: 1571875200
```
The value is specified as unix time in seconds; in this case, all table rows will have the same value.
Example 2: use current time
```yaml
tables:
  table-name-1:
    time:
      type: current
```
Only the type is needed; Nebula will set each row's time column to its ingestion time.
Example 3: use a column value as time column
```yaml
tables:
  table-name-1:
    time:
      type: column
      column: col2
      pattern: UNIXTIME_NANO
```
In this case, Nebula will convert an existing column into the value of the time column. For this type, both the column name and the value pattern are needed for a successful conversion.
- `column`: the column name from the "schema" definition discussed previously.
- `pattern`: the supported time value format for correct conversion; this includes:
  - "UNIXTIME": column value is unix time in seconds.
  - "UNIXTIME_MS": column value is unix time in milliseconds.
  - "UNIXTIME_NANO": column value is unix time in nanoseconds.
  - "SERIAL_NUMBER": column value is in serial number format, which is usually produced by Excel or Google Spreadsheet data.
  - "C++ time format": a string pattern which Nebula uses "std::get_time" to convert, such as "%m/%d/%Y %H:%M:%S" or "%Y-%m-%d"; check pattern details at https://en.cppreference.com/w/cpp/io/manip/get_time
Example 4: use a macro value for the time column

To support rolling data sources from cloud storage, Nebula supports a list of macros in its source template, which is usually constructed from different time units.
```yaml
nebula.daily:
  # the source has a supported macro in it
  source: s3://nebula/ephemeral/dt={date}/
  time:
    type: macro
    pattern: daily
```
In this case, Nebula will generate a real source like `s3://nebula/ephemeral/dt=2021-01-01/`, and use the day-level time value to set its time column, which is equivalent to 2021-01-01.
Similarly, the time macros supported in a path could be a combination of any of these granularities (a hypothetical hourly example follows the list):
- DATE
- HOUR
- MINUTE
- SECOND
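For example, an hourly-rolling table could combine DATE and HOUR in its path; note that the path layout and the `pattern: hourly` value in this sketch are assumptions for illustration, since only the daily pattern is shown above:

```yaml
# hypothetical hourly table - path layout and pattern value are assumptions
nebula.hourly:
  source: s3://nebula/ephemeral/dt={DATE}/hr={HOUR}/
  time:
    type: macro
    pattern: hourly
```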
## Settings

Lastly, Nebula has a very generic config called `settings`, which acts as an extension point accepting arbitrary key-value pairs that affect Nebula's behavior for the defined table. Both keys and values are strings, though sometimes we use a JSON string to hold a structure for more complex configurations.
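As a shape-only sketch, a `settings` block might look like the following; the keys shown here are made up purely to illustrate the string key-value (and JSON-in-string) convention:

```yaml
tables:
  table-name-1:
    settings:
      # hypothetical keys for illustration only - real keys depend on the feature in question
      some.flag: "true"
      some.options: '{"threshold": 10, "mode": "fast"}'
```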
We will touch on these "misc" configs while discussing other scenarios.