Querying ENTRADA

ENTRADA data analysis is performed with standard SQL syntax which follows the SQL-92 standard.

There are multiple ways to submit queries to ENTRADA. Below are four examples that serve different personal preferences and use cases.

Hue

The Hue web interface provides an easy access to the data stored in ENTRADA. It enables the user to submit SQL queries directly in a query editor and allows exploring the data interactvely. Based on the results, Hue can create simple charts for a quick analysis. Additionally, data can be exported to other formats like csv or xls.

Hue screenshot

Shell

You can submit commands for setting up tables, inserting data and querying tables through a shell as well. Therefore, the impala-shell is provided. Besides submitting SQL commands, the shell also lets you run shell-only commands for tuning the performance and running diagnostics. impala-shell can be invoked from inside a shell script as well.

Impyla

A more comfortable way to submit SQL commands to impala from a python script is using the library impyla. Below, there is an example on how to connect to impala, submit a query and handle the response:

from impala.dbapi import connect

conn = connect(host='myhost.example.nl', port=21050) 

impala_cursor = conn.cursor()

impala_cursor.execute('''
  select qname, src from dns.staging
  limit 100
         ''')

rows = impala_cursor.fetchall()

for row in rows:
    print rows

In case you want to authenticate at your server using Kerberos, add the auth_mechanism parameter to the connect function:

conn = connect(host='my.host.nl', port=21050, auth_mechanism="GSSAPI")        

Impyla further supports python pandas which is a data structure to handle and analyze large data sets easily. If you want to load the results of a query directly into a panda data structure, modify the code as follows:

from impala.dbapi import connect
from impala.util import as_pandas
import pandas as pd


conn = connect(host='myhost.example.nl', port=21050)                                   

impala_cursor = conn.cursor()

impala_cursor.execute('''
  select qname, src from dns.staging
  limit 100
         ''')

df = as_pandas(impala_cursor)

Java

With the help of a JDBC driver, ENTRADA can be queried directly from Java programs as well. After following the instructions here, one can connect to Impala using:

jdbc:hive2://host:port/;auth=noSasl

With Kerberos authentication use:

jdbc:hive2://host:port;AuthMech=1;KrbHostFQDN=host;KrbServiceName=impala;KrbRealm=realm_name