In a nutshell, ENTRADA is an open-source big data platform designed to ingest and quickly analyze large amounts of network data, even on a small cluster. More technically, it is a high-performance data streaming warehouse (DSW). ENTRADA owes this performance to two major features: an optimized columnar file format (Parquet) and a massively parallel SQL query engine (Impala).
Please refer to our research paper for more details, a comparison to existing solutions, and a performance evaluation.
ENTRADA consists of multiple components, some generic and some ENTRADA-specific, and all of them are open source. Figure 1 below shows how they fit together.
At SIDN, we collect network traffic in pcap format from our .nl authoritative servers. To speed up queries, we convert these pcap files to Parquet files (see our research paper and workflow), which are ultimately stored in the Hadoop Distributed File System (HDFS) of the Hadoop cluster. These files can then be queried by Impala, a Massively Parallel Processing (MPP) SQL query engine, or by any other Parquet-compatible engine, such as Apache Spark.
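The conversion step is essentially a pivot from packet-by-packet records to a column-per-field layout. The following is a minimal sketch of that idea using only the Python standard library. It is not the ENTRADA converter, which also decodes the DNS messages and writes real Parquet files. It assumes a classic little-endian pcap file with microsecond timestamps, and the function and column names (`pcap_to_columns`, `captured_len`, and so on) are hypothetical.

```python
import io
import struct

# Classic pcap layout: a 24-byte global header, then a 16-byte header per packet.
PCAP_GLOBAL_HEADER = struct.Struct("<IHHiIII")  # magic, major, minor, thiszone, sigfigs, snaplen, linktype
PCAP_RECORD_HEADER = struct.Struct("<IIII")     # ts_sec, ts_usec, incl_len, orig_len
PCAP_MAGIC = 0xA1B2C3D4


def pcap_to_columns(stream):
    """Return per-packet metadata as columns (one list per field).

    Hypothetical helper: only packet headers are read, payloads are skipped.
    """
    header = stream.read(PCAP_GLOBAL_HEADER.size)
    if len(header) < PCAP_GLOBAL_HEADER.size:
        raise ValueError("truncated pcap global header")
    magic = PCAP_GLOBAL_HEADER.unpack(header)[0]
    if magic != PCAP_MAGIC:
        raise ValueError("unsupported pcap magic: %#x" % magic)

    columns = {"ts_sec": [], "ts_usec": [], "captured_len": [], "original_len": []}
    while True:
        record = stream.read(PCAP_RECORD_HEADER.size)
        if len(record) < PCAP_RECORD_HEADER.size:
            break  # end of file, or a truncated trailing record
        ts_sec, ts_usec, incl_len, orig_len = PCAP_RECORD_HEADER.unpack(record)
        stream.read(incl_len)  # skip the packet bytes themselves
        columns["ts_sec"].append(ts_sec)
        columns["ts_usec"].append(ts_usec)
        columns["captured_len"].append(incl_len)
        columns["original_len"].append(orig_len)
    return columns


def build_demo_pcap(packets):
    """Build a tiny in-memory pcap from (timestamp, payload) pairs for the demo."""
    out = PCAP_GLOBAL_HEADER.pack(PCAP_MAGIC, 2, 4, 0, 0, 65535, 1)
    for ts_sec, payload in packets:
        out += PCAP_RECORD_HEADER.pack(ts_sec, 0, len(payload), len(payload))
        out += payload
    return out


demo = build_demo_pcap([(1400000000, b"\x00" * 60), (1400000001, b"\x00" * 74)])
columns = pcap_to_columns(io.BytesIO(demo))
```

With the data laid out this way, a query that touches only one field (say, packet sizes) reads a single list instead of every packet, which is the property a columnar format such as Parquet exploits on disk.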
Applications and services are built on top of the platform and access it through a variety of standardized interfaces, such as SQL, JDBC for Java, and the Python DB-API.
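As an illustration of what access through a standardized interface looks like, here is a minimal Python DB-API sketch. To keep it self-contained it runs against an in-memory SQLite database from the standard library as a stand-in. Against an ENTRADA cluster, the connection would presumably come from an Impala DB-API driver instead (for example `impala.dbapi.connect` from the impyla package, which is an assumption and not something this page specifies). The table and column names (`queries`, `domainname`, `qtype`) and the helper `top_domains` are hypothetical.

```python
import sqlite3


def top_domains(conn, limit):
    """Return the most queried domain names as (domainname, count) tuples.

    Uses only cursor(), execute(), fetchall() and close(), which every
    DB-API 2.0 driver provides, so the connection type is interchangeable.
    """
    cur = conn.cursor()
    try:
        # The limit is inlined as an integer because the parameter
        # placeholder style differs between DB-API drivers.
        cur.execute(
            "SELECT domainname, COUNT(*) AS n FROM queries "
            "GROUP BY domainname ORDER BY n DESC, domainname ASC "
            "LIMIT %d" % int(limit)
        )
        return cur.fetchall()
    finally:
        cur.close()


# Stand-in database with a hypothetical schema and a few demo rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queries (domainname TEXT, qtype INTEGER)")
conn.executemany(
    "INSERT INTO queries VALUES (?, ?)",
    [("sidn.nl", 1), ("sidn.nl", 28), ("example.nl", 1)],
)

result = top_domains(conn, 2)
print(result)
conn.close()
```

Because `top_domains` depends only on the DB-API surface, swapping the SQLite connection for a connection to the cluster should not require changes to the function itself, provided the SQL stays within what both engines support.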