ENTRADA executes three main workflows at regular intervals:
The ENTRADA container should run continuously as a server process; its built-in scheduler executes the workflows.
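The scheduling behaviour can be pictured as a fixed-interval loop per workflow. The sketch below is illustrative only: the function names are made up for this example, and only the 60-second Processing interval is documented.

```python
import threading

def run_periodically(name, interval_seconds, task, stop_event):
    """Invoke `task` every `interval_seconds` until `stop_event` is set.

    stop_event.wait() doubles as the sleep: it returns False on timeout
    (run the task again) and True once the event is set (shut down).
    """
    while not stop_event.wait(interval_seconds):
        task()

def start_scheduler(workflows):
    """Start one daemon thread per (name, interval, task) workflow tuple."""
    stop = threading.Event()
    for name, interval, task in workflows:
        threading.Thread(
            target=run_periodically,
            args=(name, interval, task, stop),
            daemon=True,
        ).start()
    return stop  # set this event to stop all workflow threads
```

A server process would register all three workflows this way, e.g. `start_scheduler([("processing", 60, process_pcaps), ...])`, with the task callables doing the actual work.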
The Processing workflow processes PCAP files. ENTRADA runs this workflow every 60 seconds, going through the following steps for each configured name server.
The last step of this workflow saves the workflow state to disk; this state is read back from disk at the start of the next workflow execution. The state contains:

- unmatched DNS queries
- active TCP flows
- fragmented IP packets waiting for more fragments
- unsent Graphite metrics
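The save/restore cycle could be sketched as follows. The file name, JSON format, and field names are illustrative assumptions, not ENTRADA's actual on-disk representation:

```python
import json
from pathlib import Path

STATE_FILE = Path("workflow-state.json")  # hypothetical location

def save_state(state, path=STATE_FILE):
    """Persist workflow state at the end of a Processing run."""
    path.write_text(json.dumps(state))

def load_state(path=STATE_FILE):
    """Restore the previous run's state, or start empty on the first run."""
    if path.exists():
        return json.loads(path.read_text())
    return {
        "unmatched_queries": [],   # DNS queries still waiting for a response
        "tcp_flows": [],           # TCP flows that are still open
        "ip_fragments": [],        # fragments waiting for the rest of the packet
        "graphite_metrics": [],    # metrics not yet sent to Graphite
    }
```

Persisting this state is what lets a query in one PCAP file be matched against a response that only arrives in a file processed on a later run.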
In a common ENTRADA deployment, PCAP files are created every few minutes on a name server and then sent to the ENTRADA input location. This means ENTRADA processes only a few files each time the Processing workflow runs, resulting in many relatively small Parquet files.
Having many small Parquet files is inefficient when querying the data with Impala or Athena; fewer, larger files perform better. The process of merging many small files into fewer larger files is called compaction.
This is where the Compaction workflow comes in: it monitors the created database partitions and, when it detects that no new data is being added to a partition, compacts the files in that partition. A partition must be at least 1 day old and must have received no new data for a configurable amount of time, see ENTRADA_PARQUET_COMPACTION_AGE.
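The eligibility rule above could be expressed roughly like this. The function, its parameters, and the 4-hour default are assumptions for illustration; only the one-day minimum age and the ENTRADA_PARQUET_COMPACTION_AGE setting come from this document:

```python
from datetime import datetime, timedelta

def is_compaction_candidate(partition_date, last_write, now,
                            compaction_age=timedelta(hours=4)):
    """A partition qualifies for compaction when it is at least one day
    old AND no new data has arrived for `compaction_age`
    (cf. ENTRADA_PARQUET_COMPACTION_AGE; the 4h default is hypothetical).
    """
    old_enough = now - partition_date >= timedelta(days=1)
    quiet_long_enough = now - last_write >= compaction_age
    return old_enough and quiet_long_enough
```

Requiring both conditions avoids compacting a partition that is still receiving late-arriving data, which would immediately re-create small files next to the compacted ones.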
This workflow has the following steps:
Between deleting the old partition data, moving the new Parquet files into the partition, and updating the metadata, there is a small window during which the database is inconsistent. Executing a SQL query against the partition being compacted during this window will fail because the data files cannot be found.
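One way a client can cope with this brief inconsistency window is to simply retry a failed query. This is a generic sketch, not something ENTRADA provides; the retry counts and delay are arbitrary choices:

```python
import time

def query_with_retry(run_query, attempts=3, delay_seconds=5):
    """Retry a query that may fail while its partition is being compacted.

    `run_query` is any zero-argument callable that raises on failure.
    The compaction window is short, so a few retries usually suffice.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return run_query()
        except Exception as exc:  # e.g. "data files not found" mid-compaction
            last_error = exc
            time.sleep(delay_seconds)
    raise last_error
```

For scheduled reporting queries, an alternative is to simply avoid querying the most recent partitions that are still candidates for compaction.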
The Maintenance workflow is used to clean up resources at regular intervals.
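A typical cleanup step is deleting files older than some retention period. The sketch below is a generic example of such a step, not ENTRADA's actual maintenance logic; the directory layout and retention value are assumptions:

```python
import time
from pathlib import Path

def delete_old_files(directory, max_age_days):
    """Remove files whose modification time is older than `max_age_days`.

    Returns the names of the deleted files, e.g. for logging.
    """
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for path in Path(directory).iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```

Run periodically, a step like this keeps already-processed input files or stale temporary data from accumulating on disk.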