Data model

ENTRADA stores its data in the Apache Parquet columnar data format. Parquet uses efficient encoding and compression algorithms to create compact data files that are fast to query. The Parquet data files are compressed with Snappy. The resulting data files can be analyzed with an analytical query engine such as Apache Impala or Apache Spark.

Schema

Two schemas are defined: the "dns" schema contains tables for DNS network data, and the "icmp" schema contains tables for ICMP network data. Support for other network protocols can be added by adding columns to these tables or by creating new tables. The protocol decoder would also need to be modified, but that is not part of the data model.

Staging and warehouse

Every schema contains at least two tables: a staging table and a warehouse table. New network data is appended to the staging table. Because new data arrives in relatively small batches (5-minute pcap files), it is not possible to create Parquet files of the optimal size right away. New data is therefore written to small Parquet files, which are appended to the staging table. Every night the data is moved from the staging table to the warehouse table; this move combines the many small Parquet files into a smaller number of larger files.

Partitioning

The staging and warehouse tables use a partitioning scheme to divide the data into separate partitions. The partition columns can be used in SQL queries to indicate which data files must be read when executing the query; skipping all other data files is called "partition pruning". The partitioning scheme uses 4 columns.

Column  Description
year    The capture year
month   The capture month
day     The capture day
server  The destination server

The following example SQL query only analyzes data for server "ns1.dns.nl" that was captured on December 5, 2015. All other data in the table is skipped. In this way partitioning functions as an index that enables fast data lookups.

select qname
from dns.queries
where year=2015 and month=12 and day=5 and server="ns1.dns.nl"
limit 10
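
Partition pruning can be pictured as a filter over partition directory names. The Python sketch below is illustrative only. It assumes a Hive-style "key=value" directory layout for the partitions; the helper names parse_partition and prune and the server name "ns2.dns.nl" are made up for this example. It mimics how a query engine could select the partitions that match the query above before reading any data file.

```python
from typing import Dict, List


def parse_partition(path: str) -> Dict[str, str]:
    """Split a Hive-style partition path into its key/value pairs."""
    return dict(part.split("=", 1) for part in path.strip("/").split("/"))


def prune(paths: List[str], **wanted: object) -> List[str]:
    """Keep only the partition paths that match every wanted value."""
    return [
        path
        for path in paths
        if all(parse_partition(path).get(key) == str(value)
               for key, value in wanted.items())
    ]


# Hypothetical partition directories (assumed layout, not taken from ENTRADA).
partitions = [
    "year=2015/month=12/day=4/server=ns1.dns.nl",
    "year=2015/month=12/day=5/server=ns1.dns.nl",
    "year=2015/month=12/day=5/server=ns2.dns.nl",
]

selected = prune(partitions, year=2015, month=12, day=5, server="ns1.dns.nl")
print(selected)
```

Only one of the three directories survives the filter, so only the data files in that directory would have to be scanned.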

DNS

For performance reasons, the DNS request and response packets are joined into a single row. This avoids expensive join operations on large tables at query time. Besides DNS information, each row also contains IP, TCP/UDP and meta information.
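
The idea of storing a request and its response in one row can be sketched in Python as follows. This is a minimal illustration and not the ENTRADA implementation: the matching key (message id, client address, client port and qname) and the function name join_packets are assumptions. The only detail taken from the column list is that rcode is set to -1 when no matching server response is found.

```python
from typing import Dict, List, Tuple

# Assumed matching key: (message id, client address, client port, qname).
Key = Tuple[int, str, int, str]


def join_packets(requests: List[dict], responses: List[dict]) -> List[dict]:
    """Combine each request with its response (if any) into one row."""
    # A response travels back to the client, so its destination address
    # and port equal the source address and port of the request.
    by_key: Dict[Key, dict] = {
        (r["id"], r["dst"], r["dstp"], r["qname"]): r for r in responses
    }
    rows = []
    for q in requests:
        resp = by_key.get((q["id"], q["src"], q["srcp"], q["qname"]))
        rows.append({
            "id": q["id"],
            "qname": q["qname"],
            "src": q["src"],
            "srcp": q["srcp"],
            # -1 marks a request without a matching server response.
            "rcode": resp["rcode"] if resp is not None else -1,
        })
    return rows


requests = [
    {"id": 1, "src": "192.0.2.10", "srcp": 40000, "qname": "example.nl."},
    {"id": 2, "src": "192.0.2.10", "srcp": 40001, "qname": "example.nl."},
]
responses = [
    {"id": 1, "dst": "192.0.2.10", "dstp": 40000, "qname": "example.nl.",
     "rcode": 0},
]

rows = join_packets(requests, responses)
print([row["rcode"] for row in rows])
```

Because the join is done once at load time, a query on the resulting table can read request and response fields from the same row.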

The table below lists all available columns. The protocol indicates the network protocol from which the column data is extracted. The type can be request, response or meta. The meta type is used for columns whose data is not extracted directly from network packet data. Metadata is descriptive data about the network data, such as the geographical location of an IP address.

Tables

  • Staging: Used for appending new data to the database.
  • Queries: The data warehouse table. Every night the data from the previous day is moved from the "staging" table to the "queries" table.

Columns

column                      protocol     type      description
id                          DNS                    query message id
rcode                       DNS                    response rcode (-1 if no matching server response is found)
opcode                      DNS                    query opcode
query_ts                    -            META      packet timestamp in UTC, uses TIMESTAMP datatype
unixtime                    -            META      packet timestamp, seconds since January 1, 1970, 00:00:00 UTC, uses BIGINT datatype
time                        -            META      milliseconds since January 1, 1970, 00:00:00 UTC
qname                       DNS          request   qname from request
qtype                       DNS          request   qtype from request
domainname                  DNS          META      second-level domain name (extracted from qname)
labels                      DNS          META      number of labels in the qname
src                         IP           request   source IP address
dst                         IP           request   destination IP address
ttl                         IP           request   TTL
frag                        IP           request   fragment count
ipv                         IP           request   IP version, 4 or 6
prot                        IP           request   protocol, 6 (TCP) or 17 (UDP)
srcp                        UDP/TCP      request   source port
dstp                        UDP/TCP      request   destination port
udp_sum                     UDP          request   checksum of the UDP request
dns_len                     DNS          request   length of the DNS request, excluding IP/TCP/UDP headers
dns_res_len                 DNS          response  length of the DNS response, excluding IP/TCP/UDP headers
len                         DNS/UDP/TCP  request   length of the request packet, including all headers
res_len                     DNS/UDP/TCP  response  length of the response packet, including all headers
aa                          DNS          response  header Authoritative Answer
tc                          DNS          response  header Truncation
rd                          DNS          request   header Recursion Desired
ra                          DNS          response  header Recursion Available
z                           DNS          request   header Zero
ad                          DNS          response  header Authenticated Data (DNSSEC)
cd                          DNS          request   header Checking Disabled (DNSSEC)
ancount                     DNS          response  header Answer Record Count
arcount                     DNS          response  header Additional Record Count
nscount                     DNS          response  header Authority Record Count
qdcount                     DNS          request   header Question Count
country                     IP           META      country location of the source IP address
asn                         IP           META      autonomous system number of the source IP address
edns_udp                    DNS          request   max UDP packet length supported by client
edns_version                DNS          request   EDNS0 version
edns_do                     DNS          request   DNSSEC DO bit
edns_ping                   DNS          request   EDNS0 ping option of PowerDNS
edns_nsid                   DNS          request   name server identifier (RFC 5001)
edns_dnssec_dau             DNS          request   DNSSEC algorithm signalling, DNSSEC Algorithm Understood (RFC 6975)
edns_dnssec_dhu             DNS          request   DNSSEC algorithm signalling, DS Hash Understood (RFC 6975)
edns_dnssec_n3u             DNS          request   DNSSEC algorithm signalling, NSEC3 Hash Understood (RFC 6975)
edns_client_subnet          DNS          request   client subnet option (draft-ietf-dnsop-edns-client-subnet-00)
edns_client_subnet_asn      -            META      ASN of the client subnet
edns_client_subnet_country  -            META      country location of the client subnet IP address
edns_other                  DNS          request   all other used EDNS0 options (concatenated as string)
time_micro                  -            META      the microseconds of the request timestamp (unixtime is rounded to seconds)
resp_frag                   IP           request   number of IP packet fragments required for the DNS response
proc_time                   -            META      number of microseconds between the request and the response
is_google                   -            META      true if the IP address matches one of the known Google resolver IP addresses
is_opendns                  -            META      true if the IP address matches one of the known OpenDNS resolver IP addresses
server_location             META         request   location of the anycast node, only if anycast encoding is used for the file input directory
query_ts                    DNS          request   request timestamp
edns_padding                DNS          request   is EDNS0 padding used
pcap_file                   DNS          request   name of the input pcap file
edns_keytag_count           DNS          request   number of EDNS0 keytags found
edns_keytag_list            DNS          request   EDNS0 keytags as comma-separated list
q_tc                        DNS          request   TC flag from request header
q_ra                        DNS          request   RA flag from request header
q_ad                        DNS          request   AD flag from request header
q_rcode                     DNS          request   RCODE field from request header
year                        META         request   year part of timestamp
month                       META         request   month part of timestamp
day                         META         request   day part of timestamp
server                      DNS          request   the name server the DNS request was sent to

The query_ts column is much more efficient to use in SQL queries than the unixtime column.

The lists of Google and OpenDNS resolver IP addresses, which are used to determine whether an IP address belongs to Google or OpenDNS, are fetched automatically every day.

For more information about the DNS fields, see DNS RFC 1035. The table column names match the RFC field names. For more information about possible DNS column values, see IANA DNS parameters.
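
The two qname-derived meta columns, domainname and labels, can be illustrated with a short Python sketch. It is a simplified approximation under a stated assumption: the second-level domain name is taken to be the last two labels of the qname, which is not correct for names registered below a multi-label suffix. The function name qname_meta is hypothetical.

```python
from typing import Tuple


def qname_meta(qname: str) -> Tuple[str, int]:
    """Return (domainname, labels) for a query name."""
    # Drop the trailing root dot and ignore empty labels.
    labels = [label for label in qname.rstrip(".").split(".") if label]
    # Naive assumption: the second-level domain is the last two labels.
    domainname = ".".join(labels[-2:])
    return domainname, len(labels)


domainname, label_count = qname_meta("www.example.nl.")
print(domainname, label_count)
```

Storing these values as columns means a query can group by domain name or filter on label count without parsing the qname for every row.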

ICMP

The ICMP schema contains tables for handling ICMP network data. For more information about ICMP, see ICMP v4 and ICMP v6.

Tables

  • Staging: Used for appending new ICMP data to the database.
  • Packets: The data warehouse table. Every night the data from the previous day is moved from the "staging" table to the "packets" table.

Columns

The protocol can be ICMP, IP, DNS, UDP or TCP, or a combination of these protocols. The meta type is used for columns whose data is not extracted directly from network packet data. Metadata is descriptive data about the network data, such as the geographical location of an IP address. Data in columns whose names start with "orig_" is extracted from the payload of ICMP packets. The payload of an ICMP packet may contain the first 8 bytes of the original DNS response packet sent out by the name server.

column                      protocol     type      description
query_ts                    -            META      packet timestamp in UTC, uses TIMESTAMP datatype
unixtime                    -            META      packet timestamp, seconds since January 1, 1970, 00:00:00 UTC
icmp_type                   ICMP         request   ICMP type
icmp_code                   ICMP         request   ICMP code
icmp_echo_client_type       ICMP         request   type of ICMP client: 1=RIPE Atlas, 2=Unix/Linux, 3=Windows, 4=PRTG
ip_ttl                      IP           request   TTL of the IP packet
ip_v                        IP           request   IP version, 4 or 6
ip_src                      IP           request   source IP address
ip_dst                      IP           request   destination IP address
ip_country                  IP           request   geographical location of the source IP
ip_asn                      IP           request   autonomous system number of the source IP
ip_len                      META         request   length of the IP packet
l4_prot                     UDP/TCP      request   layer 4 protocol: TCP, UDP or ICMP
l4_srcp                     UDP/TCP      request   TCP/UDP source port
l4_dstp                     UDP/TCP      request   TCP/UDP destination port
orig_ip_ttl                 IP           response  TTL of the IP packet
orig_ip_v                   IP           request   IP version, 4 or 6
orig_ip_src                 IP           response  source IP address
orig_ip_dst                 IP           response  destination IP address
orig_l4_prot                UDP/TCP      response  layer 4 protocol: TCP, UDP or ICMP
orig_l4_srcp                UDP/TCP      response  TCP/UDP source port
orig_l4_dstp                UDP/TCP      response  TCP/UDP destination port
orig_udp_sum                UDP          response  UDP checksum
orig_ip_len                 META         response  length of the packet
orig_icmp_type              ICMP         response  ICMP type
orig_icmp_code              ICMP         response  ICMP code
orig_icmp_echo_client_type  ICMP         response  type of ICMP client: 1=RIPE Atlas, 2=Unix/Linux, 3=Windows, 4=PRTG
orig_dns_id                 DNS          response  see DNS table above
orig_dns_qname              DNS          response  see DNS table above
orig_dns_domainname         DNS          response  see DNS table above
orig_dns_len                DNS          response  see DNS table above
orig_dns_aa                 DNS          response  see DNS table above
orig_dns_tc                 DNS          response  see DNS table above
orig_dns_rd                 DNS          response  see DNS table above
orig_dns_ra                 DNS          response  see DNS table above
orig_dns_z                  DNS          response  see DNS table above
orig_dns_ad                 DNS          response  see DNS table above
orig_dns_cd                 DNS          response  see DNS table above
orig_dns_ancount            DNS          response  see DNS table above
orig_dns_arcount            DNS          response  see DNS table above
orig_dns_nscount            DNS          response  see DNS table above
orig_dns_qdcount            DNS          response  see DNS table above
orig_dns_rcode              DNS          response  see DNS table above
orig_dns_qtype              DNS          response  see DNS table above
orig_dns_opcode             DNS          response  see DNS table above
orig_dns_qclass             DNS          response  see DNS table above
orig_dns_edns_udp           DNS          response  see DNS table above
orig_dns_edns_version       DNS          response  see DNS table above
orig_dns_edns_do            DNS          response  see DNS table above
orig_dns_labels             DNS          META      see DNS table above
server_location             META         request   location of the anycast node, only if anycast encoding is used for the file input directory
svr                         META         request   the name server the packet was sent to
pcap_file                   DNS          request   name of the input pcap file
year                        META         query     year part of timestamp
month                       META         query     month part of timestamp
day                         META         query     day part of timestamp
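
The "orig_" columns come from the part of the original packet that is quoted inside the ICMP payload. As a rough illustration, the Python sketch below parses such a quoted packet, assuming IPv4 and UDP: per RFC 792, an ICMPv4 error message carries the IP header of the original datagram plus at least the first 8 bytes of its data, which for UDP is exactly the UDP header. The function name parse_orig_ipv4_udp, the mapping of header fields to column names (in particular the IP total length to orig_ip_len) and the sample values are assumptions made for this example, not the ENTRADA decoder.

```python
import socket
import struct
from typing import Dict, Union


def parse_orig_ipv4_udp(payload: bytes) -> Dict[str, Union[int, str]]:
    """Parse the IPv4 and UDP headers quoted in an ICMP error payload."""
    if len(payload) < 20:
        raise ValueError("payload too short for an IPv4 header")
    # Fixed 20-byte part of the IPv4 header, in network byte order.
    (ver_ihl, _tos, total_len, _ident, _flags_frag,
     ttl, proto, _checksum, src, dst) = struct.unpack(
        "!BBHHHBBH4s4s", payload[:20])
    version = ver_ihl >> 4
    header_len = (ver_ihl & 0x0F) * 4  # IHL is counted in 32-bit words
    if version != 4 or header_len < 20:
        raise ValueError("not a valid IPv4 header")
    row: Dict[str, Union[int, str]] = {
        "orig_ip_v": version,
        "orig_ip_ttl": ttl,
        "orig_ip_len": total_len,  # assumed mapping to the column
        "orig_l4_prot": proto,
        "orig_ip_src": socket.inet_ntoa(src),
        "orig_ip_dst": socket.inet_ntoa(dst),
    }
    # The 8 bytes after the IP header hold the UDP header (protocol 17).
    if proto == 17 and len(payload) >= header_len + 8:
        srcp, dstp, _udp_len, udp_sum = struct.unpack(
            "!HHHH", payload[header_len:header_len + 8])
        row.update(orig_l4_srcp=srcp, orig_l4_dstp=dstp,
                   orig_udp_sum=udp_sum)
    return row


# Hand-built sample: a DNS response sent from port 53 (made-up values).
sample = struct.pack(
    "!BBHHHBBH4s4s", 0x45, 0, 60, 1, 0, 64, 17, 0,
    socket.inet_aton("192.0.2.1"), socket.inet_aton("198.51.100.7"),
) + struct.pack("!HHHH", 53, 40000, 40, 0xBEEF)

row = parse_orig_ipv4_udp(sample)
print(row)
```

The orig_dns_* columns can only be filled when the sender of the ICMP message quoted more of the original packet than this minimum, which is why those columns may be empty.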