Overview

What is ENTRADA?

In a nutshell, ENTRADA is an open-source big data platform designed to ingest and quickly analyze large amounts of network data, even on a small cluster. More technically, it is a high-performance data streaming warehouse (DSW). ENTRADA delivers this performance thanks to two major features: an optimized columnar file format (Apache Parquet) and a massively parallel SQL query engine (Impala).

Please refer to our research paper for more details, a comparison to existing solutions, and a performance evaluation.

Features

  • Performance: analyze the equivalent of 52 TB of pcap data in under 3.5 minutes on a small 6-node cluster (4 data processing nodes).
  • Interface: benefit from easy SQL statements to analyze your data
  • Scalable: just add more nodes for faster processing
  • Deployment: ENTRADA has been operational at SIDN for over 2 years and has ingested more than 100 TB of pcap data containing over 100 billion DNS queries from the .nl authoritative DNS name servers. Among other things, we use it daily to provide automatic updates on .nl DNS traffic statistics.
  • Built-in conversion of DNS/IP/TCP/UDP/ICMP network data to Parquet data.
  • Open-source
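
To put the performance figure above in perspective, a back-of-the-envelope calculation gives the effective scan rate it implies. This is only a rough sketch: it assumes decimal units (1 TB = 10^12 bytes) and treats 3.5 minutes as the full query time, so the result is a lower bound. It describes pcap-equivalent data, not bytes physically read from disk, which are likely far fewer once the data is stored as Parquet.

```python
# Back-of-the-envelope effective throughput for the headline figure.
# Assumption (not from the ENTRADA documentation): decimal units, 1 TB = 10**12 bytes.
PCAP_EQUIVALENT_BYTES = 52 * 10**12   # 52 TB of pcap-equivalent data
QUERY_SECONDS = 3.5 * 60              # 3.5 minutes, taken as an upper bound
PROCESSING_NODES = 4                  # data processing nodes in the cluster


def effective_gb_per_second(total_bytes: float, seconds: float) -> float:
    """Return the effective scan rate in GB/s (pcap-equivalent, not bytes read from disk)."""
    return total_bytes / seconds / 10**9


cluster_rate = effective_gb_per_second(PCAP_EQUIVALENT_BYTES, QUERY_SECONDS)
per_node_rate = cluster_rate / PROCESSING_NODES

print(f"cluster: {cluster_rate:.1f} GB/s, per node: {per_node_rate:.1f} GB/s")
# prints: cluster: 247.6 GB/s, per node: 61.9 GB/s
```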

Components

ENTRADA consists of multiple components, some generic and some ENTRADA-specific, all of which are open source. Figure 1 below shows them.

At SIDN, we collect network traffic in pcap format from our .nl authoritative servers. To reduce query times, we convert these pcap files to Parquet files (see our research paper and workflow), which are then stored in the Hadoop Distributed File System (HDFS) of the cluster.
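
Much of the benefit of converting to Parquet comes from its columnar layout: a query that touches only a few fields needs to read only those columns. The sketch below illustrates the idea in plain Python rather than with Parquet itself. The field names (`qname`, `qtype`, `src_ip`) and the sample rows are hypothetical and do not reflect ENTRADA's actual schema.

```python
from collections import Counter

# Hypothetical decoded DNS queries, one dict per packet (row-oriented, like pcap).
rows = [
    {"qname": "example.nl.", "qtype": "A", "src_ip": "192.0.2.1"},
    {"qname": "example.nl.", "qtype": "AAAA", "src_ip": "192.0.2.2"},
    {"qname": "sidn.nl.", "qtype": "A", "src_ip": "192.0.2.1"},
]


def to_columns(records):
    """Pivot row-oriented records into one list per field (column-oriented)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Counting query types now touches only the "qtype" column;
# the "qname" and "src_ip" columns are never read.
qtype_counts = Counter(columns["qtype"])
print(dict(qtype_counts))
```

In a real columnar file the untouched columns are skipped on disk as well, which is where most of the I/O savings come from.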

These files can then be queried with Impala, a massively parallel processing (MPP) SQL query engine, or with any Parquet-compatible engine such as Apache Spark.

Applications and services are built on top of the platform and access the platform through a variety of standardized interfaces such as SQL, Java JDBC and Python DB API.
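
As a rough illustration of the Python DB API route, the sketch below runs an aggregate SQL query through a standard DB API 2.0 connection. To stay self-contained it uses Python's built-in `sqlite3` module as a stand-in for Impala; against a real cluster you would pass in a connection obtained from an Impala-capable DB API driver instead. The table `dns_queries`, its columns, and the helper `top_query_types` are hypothetical and do not reflect ENTRADA's actual schema or tooling.

```python
import sqlite3


def top_query_types(conn, limit=10):
    """Return (qtype, count) pairs, most frequent first, via any DB API 2.0 connection."""
    cur = conn.cursor()
    # Hypothetical table and column names. `limit` is inlined as an int because
    # parameter placeholder styles differ between DB API drivers.
    cur.execute(
        "SELECT qtype, COUNT(*) AS n FROM dns_queries "
        "GROUP BY qtype ORDER BY n DESC, qtype LIMIT " + str(int(limit))
    )
    result = cur.fetchall()
    cur.close()
    return result


# Stand-in database: an in-memory SQLite table with a few sample rows.
# (conn.execute / conn.executemany are sqlite3 conveniences used only for setup.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dns_queries (qname TEXT, qtype TEXT)")
conn.executemany(
    "INSERT INTO dns_queries VALUES (?, ?)",
    [("example.nl.", "A"), ("example.nl.", "AAAA"), ("sidn.nl.", "A")],
)

top = top_query_types(conn)
print(top)
conn.close()
```

Because the helper only relies on `cursor()`, `execute()`, and `fetchall()`, the same function should work unchanged with other DB API 2.0 drivers, provided the table and columns exist.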

Figure 1: ENTRADA components