All the steps required for getting data from a name server into the Entrada database are automated. Actions such as copying and moving data files and updating table metadata are implemented in Bash scripts. These scripts are executed periodically by the cron daemon: some, such as the script that appends new data to the staging table, run every few minutes; others run only once a night, e.g. the script that moves all data from the staging table to the warehouse table.
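As a rough illustration, the cron schedule could look like the fragment below. The script names and paths are assumptions for the sake of the example, not the actual ENTRADA script names.

```shell
# Hypothetical crontab for the ENTRADA workflow scripts.
# Script names and paths are illustrative assumptions.

# Append newly arrived pcap data to the staging table every 10 minutes
*/10 * * * * /opt/entrada/scripts/run_update_staging.sh

# Move staging data into the warehouse table once a night
30 3 * * * /opt/entrada/scripts/run_staging_to_warehouse.sh
```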
This diagram illustrates the ENTRADA workflow:
On the name server, or on another system that can read the network traffic going to and coming from the name server, pcap files are created with the tcpdump utility. See Data Capture for example scripts.
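A minimal capture command might look like the sketch below. The interface name, output path, and rotation interval are assumptions; see Data Capture for the actual scripts.

```shell
#!/usr/bin/env bash
# Hypothetical capture sketch: rotate to a new pcap file every 5 minutes.
# Interface, output directory, and rotation interval are assumptions.
IFACE=eth0
OUTDIR=/home/captures/pcap

# -G 300 rotates the output file every 300 seconds; the strftime
# pattern in -w gives each rotated file a timestamped name.
tcpdump -i "$IFACE" -w "$OUTDIR/dns_%Y%m%d_%H%M%S.pcap" -G 300 port 53
```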
Each pcap file contains 5-10 minutes of network data and needs to be sent to a central staging area. The server hosting this staging area must be able to connect to the Hadoop cluster (the server may also be a cluster member itself).
The pcap files are synced to the staging server with the rsync utility. On the staging server there is a directory for each name server, which is kept in sync with that name server's pcap files. So for a name server named ns1.dns.nl there is a directory /home/captures/ns1.dns.nl on the staging server, whose contents are synced with the name server's directory containing the pcap files.
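A sync step along these lines could be sketched as follows. The remote user, source path, and the second name server are assumptions; ns1.dns.nl and /home/captures are taken from the text.

```shell
#!/usr/bin/env bash
# Hypothetical sync sketch: pull pcap files from each name server into
# its own directory on the staging server. The remote user and source
# path are illustrative assumptions.
for ns in ns1.dns.nl ns2.dns.nl; do
  rsync -az \
      "capture@${ns}:/home/captures/pcap/" \
      "/home/captures/${ns}/"
done
```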
The pcap files in the staging directories that are synced with the name servers (/home/captures/*) are copied to the "incoming" directory and from there moved to the "processing" directory. Once the files are in the "processing" directory they can be processed by the pcap decoder.
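The "incoming" to "processing" hand-off can be sketched as below. The directory names follow the text; the base path and the demo pcap file are assumptions used to make the sketch self-contained.

```shell
#!/usr/bin/env bash
# Sketch of the two-stage hand-off described above. The base path and
# the demo file are assumptions, not the real ENTRADA configuration.
set -euo pipefail
BASE=/tmp/entrada-demo
mkdir -p "$BASE/captures/ns1.dns.nl" "$BASE/incoming" "$BASE/processing"
touch "$BASE/captures/ns1.dns.nl/sample.pcap"   # stand-in for a synced capture

# Copy new pcap files from the synced capture directories into "incoming"
cp "$BASE"/captures/*/*.pcap "$BASE/incoming/"

# Move them on to "processing"; a move within one filesystem is atomic,
# so the pcap decoder never sees a half-written file.
mv "$BASE"/incoming/*.pcap "$BASE/processing/"
```

The two-stage copy-then-move keeps the rsync target directories untouched while guaranteeing that files appearing in "processing" are complete.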
The pcap transformer is a Java application that transforms the pcap data into Parquet files. Before writing the data to disk it performs several sub-steps.
The pcap files are read into in-memory objects. This part is based on the Hadoop PCAP library from RIPE NCC; the library has been modified to make use of the SIDN Java DNS library.
Every DNS request is joined with the corresponding DNS response. By pre-joining the data there is no need to perform expensive SQL join operations when analyzing it. See the Data model to find out which columns come from the DNS request and which from the response.
Some attributes of the DNS response are removed because they are not required for the use case at SIDN. The answer, authority and additional sections of the DNS response are dropped because it is difficult to store them efficiently without nesting data; until recently, Impala and Parquet did not support nested data. For the use case of a ccTLD this is not a problem, because the response will be a referral in most cases. When Entrada is used to analyze DNS resolver data, this will become an issue that needs to be addressed.
The data is enriched by adding metadata about the IP address, such as its geographical location and autonomous system number. The MaxMind GeoIP database is used for the lookup. Other information is added as well; the META column type indicates that a column contains metadata. See the Data model.
After the data has been joined, filtered and enriched, it is written to disk in the Apache Parquet data format. Writing Parquet data is done with the KiteSDK library. The Java converter writes the data to local disk on the storage server; when the converter is finished, a workflow script moves the data to Hadoop HDFS.
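The final move into HDFS could be sketched with the standard hdfs client, as below. The local and HDFS paths are assumptions, not the paths used by ENTRADA.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the workflow step that moves freshly written
# Parquet files from local disk into HDFS. Both paths are assumptions.
hdfs dfs -mkdir -p /user/entrada/staging
hdfs dfs -put /data/entrada/parquet/*.parquet /user/entrada/staging/
```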
The staging table is the table to which new network data gets appended. Because new data arrives in relatively small batches (5-10 minute pcap files), it is not possible to create Parquet files of the optimal size. For this reason, all new data is written into smaller Parquet files that are appended to the staging table. Data in the staging table is immediately available for analysis.
Data from the staging table is moved to the warehouse table once every 24 hours (usually at night). This move operation combines a large number of smaller Parquet files from the staging table into a smaller number of larger files, which are added to the warehouse table. Aggregating the data into fewer but larger files increases Impala performance.
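The nightly move could be driven from a script via impala-shell, roughly as below. The database and table names are assumptions; the key point is that an INSERT ... SELECT rewrites the many small staging files as fewer, larger warehouse files.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the nightly staging-to-warehouse move.
# Database and table names are illustrative assumptions.
impala-shell -q "
  INSERT INTO dns.queries_warehouse
  SELECT * FROM dns.queries_staging;
"
# Once the copy has succeeded, the staging data can be cleared
# (e.g. with TRUNCATE) so the next batch starts from an empty table.
```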
After the pcap files have been processed and appended to the database, they are moved to the archive directory, where they are kept for a configurable number of days. A cleanup script automatically deletes pcap files that have expired.
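The expiry check can be sketched with find, as below. The archive path and retention period are assumptions, and the back-dated demo file only exists to make the sketch self-contained.

```shell
#!/usr/bin/env bash
# Sketch of the archive cleanup: delete pcap files older than a
# configurable number of days. Path, retention period, and the demo
# file are assumptions, not the actual ENTRADA configuration.
set -euo pipefail
ARCHIVE=/tmp/entrada-archive
RETENTION_DAYS=7
mkdir -p "$ARCHIVE"

# Create a demo file and back-date it past the retention window
touch "$ARCHIVE/old.pcap"
touch -d "10 days ago" "$ARCHIVE/old.pcap"

# Delete archived pcap files whose age exceeds the retention period
find "$ARCHIVE" -name '*.pcap' -mtime "+$RETENTION_DAYS" -delete
```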