ENTRADA executes three main workflows at regular intervals:

  • Processing
  • Compaction
  • Maintenance

The ENTRADA container should run continuously as a server process; the built-in scheduler executes the workflows.
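The scheduling idea can be pictured as a fixed-interval loop. This is a minimal sketch, not ENTRADA's actual scheduler: the interval values and function names here are illustrative assumptions.

```python
# Illustrative intervals in seconds; ENTRADA's real defaults may differ.
WORKFLOWS = {
    "processing": 60,      # scan for and process new PCAP files
    "compaction": 3600,    # compact partitions that are no longer growing
    "maintenance": 3600,   # clean up old metadata and archived files
}

def due_workflows(last_run, now):
    """Return the workflows whose interval has elapsed since their last run.

    `last_run` maps workflow name -> timestamp of its previous execution.
    """
    return [name for name, interval in WORKFLOWS.items()
            if now - last_run.get(name, 0) >= interval]
```

A server process would call `due_workflows` in a loop, run each returned workflow, and record the run time.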


The Processing workflow is used to process PCAP files. Every 60 seconds ENTRADA runs this workflow, which goes through the following steps for each configured name server.

  1. Scan for new PCAP files in the input directory
  2. Load previous run’s state from disk (contains previously unmatched queries and metrics)
  3. Skip files that have already been processed
  4. Process each new PCAP file, building results in the work directory
  5. Archive the PCAP file (options are none, move or delete)
    none: do nothing
    move: move the file to the archive location
    delete: delete the file from the input location
  6. Upload the generated parquet files (options are local, HDFS, S3)
    local: the files are moved to a directory on the local filesystem
    HDFS: the files are uploaded to an HDFS location
    S3: the files are uploaded to an S3 bucket location
  7. Create a new database partition if required
  8. Delete the uploaded Parquet files from the local work directory
  9. Save this run’s state to disk
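The per-server part of steps 1-5 can be sketched as follows. This is a hedged illustration only: the function and parameter names are invented for this example, and the actual PCAP decoding, Parquet upload, and state handling are omitted.

```python
from pathlib import Path

def process_server(input_dir, work_dir, processed, archive_mode="move", archive_dir=None):
    """Sketch of one Processing run for a single name server.

    `processed` is the set of filenames already handled (loaded from the
    previous run's state); `work_dir` is where Parquet output would be built.
    """
    # steps 1 + 3: scan the input directory, skipping already-processed files
    new_files = [p for p in sorted(Path(input_dir).glob("*.pcap*"))
                 if p.name not in processed]
    for pcap in new_files:
        # step 4: decode the PCAP and write Parquet output to work_dir (omitted)
        processed.add(pcap.name)
        # step 5: archive the input file
        if archive_mode == "move":
            pcap.rename(Path(archive_dir) / pcap.name)
        elif archive_mode == "delete":
            pcap.unlink()
        # "none": leave the file in place
    # steps 6-9: upload Parquet files, create partitions, clean up, save state (omitted)
    return [p.name for p in new_files]
```

Running it twice over the same input directory illustrates step 3: the second run finds nothing new to do.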

The last step of this workflow saves the workflow state to disk; this state is read from disk again at the start of the next workflow execution. The state contains:

  • unmatched DNS queries
  • active TCP flows
  • fragmented IP packets waiting for more fragments
  • unsent Graphite metrics
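The saved state could be modeled roughly as follows. This is only an illustrative shape: the field names and the JSON serialization are assumptions for this sketch, not ENTRADA's actual on-disk format.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class WorkflowState:
    """Illustrative shape of the per-run state carried between executions."""
    unmatched_queries: list = field(default_factory=list)  # DNS queries awaiting a response
    tcp_flows: list = field(default_factory=list)          # TCP streams still open
    ip_fragments: list = field(default_factory=list)       # packets waiting for more fragments
    pending_metrics: list = field(default_factory=list)    # Graphite metrics not yet sent

def save_state(state, path):
    """Step 9: persist this run's state to disk."""
    with open(path, "w") as f:
        json.dump(asdict(state), f)

def load_state(path):
    """Step 2: load the previous run's state; start empty on the first run."""
    try:
        with open(path) as f:
            return WorkflowState(**json.load(f))
    except FileNotFoundError:
        return WorkflowState()
```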


When running ENTRADA, a common scenario is for PCAP files to be created every few minutes on a name server and then sent to the ENTRADA input location. This means that ENTRADA processes only a few files each time the Processing workflow is executed. The result is many relatively small Parquet files.

Having many small Parquet files is not very efficient when querying the data using Impala or Athena. It is better to have fewer larger files. The process of transforming many small files into fewer larger files is called compaction.

This is where the Compaction workflow comes in: it monitors the created database partitions. If it detects that no new data is being added to a partition, it compacts the files in that partition. The partition must be at least one day old and must have received no new data for a configurable amount of time; see ENTRADA_PARQUET_COMPACTION_AGE.

This workflow has the following steps:

  1. Get a list of not yet compacted partitions from the database
  2. Check if no more data is actively being appended to the partition
  3. Create a new temporary database table using “CREATE TABLE AS SELECT” (CTAS), selecting all data from the partition being compacted; this produces a new table backed by compacted Parquet files
  4. Delete the Parquet files in the partition that is being compacted
  5. Move the Parquet files from the temporary table to the partition that is being compacted
  6. Drop the temporary table
  7. Refresh the database metadata
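The SQL side of these steps can be sketched as a list of statements. The table, partition predicate, and temporary-table names below are hypothetical, and steps 4-5 (deleting and moving the Parquet files) happen on the filesystem rather than in SQL:

```python
def compaction_statements(table, partition, tmp_table="tmp_compact"):
    """Return illustrative SQL statements for compacting one partition."""
    return [
        # step 3: CTAS rewrites the partition's many small files into fewer large ones
        f"CREATE TABLE {tmp_table} AS SELECT * FROM {table} WHERE {partition}",
        # steps 4-5: delete the old Parquet files and move in the new ones (filesystem, not SQL)
        # step 6: the temporary table is no longer needed
        f"DROP TABLE {tmp_table}",
        # step 7: make the query engine pick up the new file layout
        f"REFRESH {table}",
    ]
```

For example, `compaction_statements("dns.queries", "year=2024 AND month=1")` yields the CTAS, DROP, and REFRESH statements for that partition.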

Between deleting the partition data, moving the new Parquet files into the partition, and updating the metadata, there is a small window in which the database is not consistent. Executing a SQL query against the partition being compacted during this window will result in an error, because the data files cannot be found.


The Maintenance workflow is used to clean up resources at regular intervals.

  • Remove metadata about processed files from the database: entries referencing PCAP files older than x days are deleted from the database table. This keeps the number of rows in the table low and maintains performance.
  • Delete archived PCAP files from the archive location: files that were processed more than x days ago are deleted.
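The archive-cleanup idea can be sketched as an age-based sweep. Note the assumptions: this sketch keys on file modification times, whereas ENTRADA tracks the processing date in its database, and the function name and retention value are illustrative.

```python
import time
from pathlib import Path

def delete_old_archives(archive_dir, max_age_days, now=None):
    """Delete archived PCAP files older than max_age_days; return their names."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # retention window in seconds
    deleted = []
    for f in Path(archive_dir).glob("*.pcap*"):
        if f.stat().st_mtime < cutoff:   # older than the retention window
            f.unlink()
            deleted.append(f.name)
    return deleted
```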