Installation

Dependencies

The following dependencies must be installed before installing ENTRADA.

Hadoop

Cloudera Hadoop or Apache Hadoop with Apache Impala; this manual assumes the use of Cloudera Hadoop. See Cloudera requirements. For evaluation purposes it is also possible to install ENTRADA on a Cloudera QuickStart VM, a virtual machine containing the complete Cloudera Hadoop platform. The performance of such a single-node Hadoop setup will be poor, but it is sufficient to get an idea of the ENTRADA functionality.

Java

Java: the minimum supported JDK version is 1.7.0_55.
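
To check which JDK version is installed:

java -version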

GNU Parallel

GNU Parallel

Curl

Curl
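
Both tools are available from the standard package repositories of most Linux distributions; for example, on a Debian-based system (package names assumed):

sudo apt-get install parallel curl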

Create ENTRADA user

This manual assumes the ENTRADA components will be installed in the home directory of the "entrada" user. Create a new user for ENTRADA with:

sudo useradd -d /home/entrada -m -s /bin/bash entrada
sudo passwd entrada

Download and extract the latest release

Go to the ENTRADA releases page for the latest release. As the entrada user, download the package and uncompress it with:

wget https://github.com/SIDN/entrada/releases/download/v0.0.5/entrada-0.0.5.tar.gz
tar -xvzf entrada-0.0.5.tar.gz
#create a symlink to the latest version
ln -s entrada-0.0.5 entrada-latest

Configuration options

There are two configuration files:
1. entrada-latest/scripts/run/config.sh
2. entrada-latest/scripts/config/entrada-settings.properties

See configuration for more information about all the available options.

This installation manual uses the default values for most configuration options; there are, however, some options that must be set now.

config.sh

Option          Description
DATA_RSYNC_DIR  Directory where rsync will copy remote pcap files to on this server
IMPALA_NODE     Hostname of a node running the Impala daemon
NAMESERVERS     List of name servers (colon separated)
KRB_USER        If Kerberos is used, the username@REALM
KEYTAB_FILE     If Kerberos is used, the path to the Kerberos keytab file of the user
ERROR_MAIL      The e-mail address to which error notifications are sent
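
For example, a minimal config.sh for a single name server could look like this (all values are illustrative):

DATA_RSYNC_DIR="/home/entrada/captures"
IMPALA_NODE="impala01.example.com"
NAMESERVERS="ns1_ams"
ERROR_MAIL="ops@example.com"
#KRB_USER and KEYTAB_FILE are only needed when Kerberos is used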

entrada-settings.properties

Option         Description
graphite.host  Hostname of the server running Graphite

If the graphite.host option is left empty, ENTRADA will still work, but no metrics will be sent to Graphite.
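
For example, to send metrics to a Graphite server (hostname is illustrative):

graphite.host=graphite.example.com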

Data directories

Rsync

Make sure the directory configured with the DATA_RSYNC_DIR option mentioned above exists and has a subdirectory for each name server.

mkdir /home/entrada/captures
mkdir /home/entrada/captures/<nameserver>

Create a root directory for pcap file processing, add a sub-directory for each name server:

mkdir /home/entrada/pcap
mkdir /home/entrada/pcap/<nameserver>

The <nameserver> value can have the format <name>_<location>. Any input directory using this format will be parsed into a server component and a location component: the server component is used for the Parquet "server" partition, and the location component is stored in the "server_location" column. This allows for partition pruning at the logical server level while still being able to determine the anycast location of every packet.
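
As an illustration, the <name>_<location> split can be reproduced in the shell; the directory name ns1_ams is hypothetical:

dir="ns1_ams"
server="${dir%%_*}"    #"ns1", used for the Parquet "server" partition
location="${dir#*_}"   #"ams", stored in the "server_location" column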

HDFS directory

Create the base directory for ENTRADA data on HDFS.

export HADOOP_USER_NAME=hdfs
hdfs dfs -mkdir /user/hive/entrada
hdfs dfs -chown impala:hive /user/hive/entrada
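
Verify the result with:

hdfs dfs -ls /user/hive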

Impala tables

Create the Impala database tables by running the following script:

entrada-latest/scripts/install/create_impala_tables.sh
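
To verify that the tables were created, list them with impala-shell (run from the scripts/run directory so config.sh can be sourced); the database name dns matches the query examples later in this manual:

source config.sh && impala-shell -i $IMPALA_NODE -q "show tables in dns;"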

Logging

As root, create a new directory for the logfiles:

sudo mkdir -p /var/log/entrada
sudo chown entrada:entrada /var/log/entrada

Set up logrotate to keep only the logfiles of the last 10 days and to compress old logfiles. Create a logrotate config file named /etc/logrotate.d/entrada with the following contents:

/var/log/entrada/*.log {
  size 10k
  daily
  maxage 10
  compress
  missingok
}
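
Test the configuration with a dry run; the -d flag makes logrotate only show what it would do:

sudo logrotate -d /etc/logrotate.d/entrada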

Update Maxmind database

ENTRADA uses the Maxmind IP databases for geographical and ASN lookups. The run_update_geo_ip_db.sh script is scheduled with cron to periodically update the database files, but the first time it must be executed manually:

cd /home/entrada/entrada-latest/scripts/run/
source config.sh && ./run_update_geo_ip_db.sh

Schedule jobs

The ENTRADA processing is executed by a series of bash scripts that are scheduled with cron. As the entrada user, execute the following command:

crontab -e

Choose your favourite editor and paste the following lines.

#set path to ENTRADA scripts
PATH=/home/entrada/entrada-latest/scripts/run/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

#Copy pcap to incoming dir every minute
*/1 * * * * . config.sh && run_00_copy-pcap-to-staging_bootstrap.sh >> $ENTRADA_LOG_DIR/entrada-copy-pcap.log 2>&1

#move pcap files from the incoming dir to the processing dir, every minute.
*/1 * * * * . config.sh && run_01_move_to_processing_bootstrap.sh >> $ENTRADA_LOG_DIR/entrada-move-pcap.log 2>&1

#process the pcap files into parquet files and import them into the staging database table, every 2 minutes
*/2 * * * * . config.sh && run_02_partial_loader_bootstrap.sh >> $ENTRADA_LOG_DIR/entrada-partial-loader.log 2>&1

#move the data from the staging table to the data warehouse table, every night at 4 am
#the parquet data uses UTC time, do not schedule this job before 2 am to avoid partitioning issues.
#Update the Maxmind GEO-IP database every 1st Wednesday of the month
0 4 * * * . config.sh && run_03_staging_to_warehouse.sh >> $ENTRADA_LOG_DIR/entrada-staging-2-warehouse.log 2>&1 && run-if-today.sh 1 Wed && run_update_geo_ip_db.sh

#cleanup pcap archive, remove files older than 15 days
0 9 * * * . config.sh && run_04_pcap_cleanup.sh >> $ENTRADA_LOG_DIR/entrada-pcap-cleanup.log 2>&1

#Update staging stats every hour
0 * * * * . config.sh && run_05_update_stats_staging.sh >> $ENTRADA_LOG_DIR/entrada-update-stats.log 2>&1

When the above cron jobs are scheduled, the ENTRADA scripts should pick up new pcap files from the data directories and process them. Check the logfiles in /var/log/entrada for any errors.

To avoid picking up partial files, the newest pcap file in each captures directory is ignored because it might still be being written by rsync. This means that at least 2 pcap files must be available before ENTRADA will start pcap processing.

Monitoring

Graphite

Installation instructions for Graphite can be found here.

The storage configuration of the Graphite Carbon database must be modified for the ENTRADA metrics; add the following lines to /etc/carbon/storage-schemas.conf:

[entrada_svr_metrics]
pattern = entrada.*
retentions = 10s:5d,1m:14d,10m:5y
aggregationMethod = sum

[entrada_test_metrics]
pattern = test.entrada.*
retentions = 10s:5d,1m:14d,10m:5y
aggregationMethod = sum

Grafana

Installation instructions for Grafana can be found here.

Configure Grafana to retrieve metrics from the Graphite server. ENTRADA ships with default Grafana monitoring dashboards; these can be found in the entrada-latest/grafana-dashboard directory.

Adding additional metrics

The ENTRADA PCAP-to-Parquet converter generates a basic set of metrics. Additional metrics can be sent to Graphite by querying the database and sending the results to the Graphite server.

This example is provided by our friends at nic.lv.

#!/bin/bash

set -e

#Graphite server to send the metrics to and the name server label used in the metric path
GRAPHITE_SERVER=0.0.0.0
NAMESERVER="ns1"

#count the distinct countries that queried example.com, in four consecutive 24-hour windows
query4days="select count(distinct country) as countries from dns.queries where domainname=\"example.com\" and unixtime BETWEEN unix_timestamp(now()-interval 4 days) and unix_timestamp(now()-interval 3 days);"
query3days="select count(distinct country) as countries from dns.queries where domainname=\"example.com\" and unixtime BETWEEN unix_timestamp(now()-interval 3 days) and unix_timestamp(now()-interval 2 days);"
query2days="select count(distinct country) as countries from dns.queries where domainname=\"example.com\" and unixtime BETWEEN unix_timestamp(now()-interval 2 days) and unix_timestamp(now()-interval 1 days);"
query1day="select count(distinct country) as countries from dns.queries where domainname=\"example.com\" and unixtime BETWEEN unix_timestamp(now()-interval 1 days) and unix_timestamp(now());"

#run each query with impala-shell in plain delimited output mode (-B) and capture the result
impala-shell -B -q "$query4days" -o output.txt
day4=$(cat output.txt)

impala-shell -B -q "$query3days" -o output.txt
day3=$(cat output.txt)

impala-shell -B -q "$query2days" -o output.txt
day2=$(cat output.txt)

impala-shell -B -q "$query1day" -o output.txt
day1=$(cat output.txt)

#send the metrics to Graphite using the plaintext protocol on port 2003
timestamp=$(date +%s)

echo "entrada.${NAMESERVER}.countries.4daysAgo ${day4} ${timestamp}" | nc ${GRAPHITE_SERVER} 2003
echo "entrada.${NAMESERVER}.countries.3daysAgo ${day3} ${timestamp}" | nc ${GRAPHITE_SERVER} 2003
echo "entrada.${NAMESERVER}.countries.2daysAgo ${day2} ${timestamp}" | nc ${GRAPHITE_SERVER} 2003
echo "entrada.${NAMESERVER}.countries.1dayAgo ${day1} ${timestamp}" | nc ${GRAPHITE_SERVER} 2003

exit 0
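
To collect these metrics periodically, the script can be scheduled with cron just like the ENTRADA jobs above; the script path and the daily schedule below are only an example:

#send the custom metrics to Graphite, every day at 6 am
0 6 * * * /home/entrada/scripts/graphite-metrics.sh >> /var/log/entrada/entrada-custom-metrics.log 2>&1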