Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Drill >> mail # user >> Using Apache Drill - Sample Data and Sample Queries - Examples


Copy link to this message
-
Using Apache Drill - Sample Data and Sample Queries - Examples
I recently gave a presentation on how to use Apache Drill with some
demonstrations.

The video for the presentation is available here:
http://vimeo.com/chug/using-apache-drill
The slides are available here:
http://www.slideshare.net/ChicagoHUG/using-apache-drill-chug-august-2014-jim-scott

*The data used in this presentation can be found and downloaded (1.3GB CSV
file and a 100KB CSV file) *
*here:*
https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
https://data.cityofchicago.org/Public-Safety/Sex-Offenders/vc9r-bqvy

Once you start your drill session you can connect to it via:
http://localhost:8047/
*Queries (pay attention to back ticks) *
To see a full stack trace when errors occur:
alter session set `exec.errors.verbose`=true;
*Looking through the information schema:*SELECT * from
INFORMATION_SCHEMA.SCHEMATA;
SELECT * from INFORMATION_SCHEMA.`TABLES`;
SELECT * from INFORMATION_SCHEMA.`COLUMNS`;
SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE
TABLE_SCHEMA = 'sys' and TABLE_NAME = 'options';
*How many of each type of incident:*select count(*) as incidents,
columns[5] as category from
dfs.`/home/jscott/Downloads/chicago/Crimes_-_2001_to_present.csv` group by
columns[5] order by incidents desc limit 10;
*How many of each type of incident and subtype:*select count(*) as
incidents, columns[5] as type, columns[6] as subtype from
dfs.`/home/jscott/Downloads/chicago/Crimes_-_2001_to_present.csv` group by
columns[6], columns[5] order by incidents desc limit 10;
*CREATE A TABLE in PARQUET format:*

*Please note that I created a workspace "db" at my path of /opt/drill/db
and I gave it a default storage format of parquet. *

create table dfs.db.`crime_data` as select columns[0] as REFID, columns[1]
as CaseNumber, columns[2] as DateTime, columns[3] as Block, columns[4] as
IUCR, columns[5] as PrimaryType, columns[6] as Description, columns[7] as
LocationDescription, columns[8] as Arrest, columns[9] as Domestic,
columns[10] as Beat, columns[11] as District, columns[12] as Ward,
columns[13] as CommunityArea, columns[14] as FBICode, columns[15] as
XCoordinate, columns[16] as YCoordinate, columns[17] as YEAR_DATE,
columns[18] as UpdatedOn, columns[19] as Latitude, columns[20] as
Longitude, columns[21] as Location from
dfs.`/home/jscott/Downloads/chicago/Crimes_-_2001_to_present.csv`;
create table dfs.db.`sex_offenders` as select columns[0] as LAST,columns[1]
as FIRST,columns[2] as BLOCK,columns[3] as GENDER,columns[4] as
RACE,columns[5] as BIRTHDATE,columns[6] as AGE,columns[7] as
HEIGHT,columns[8] as WEIGHT,columns[9] as VICTIM_MINOR from
dfs.`/home/jscott/Downloads/chicago/Sex_Offenders.csv`;

​
*Run the previous query on the newly created table:*select count(*) as
incidents, primarytype, description from dfs.db.`crime_data` group by
description, primarytype order by incidents desc limit 10;
*How many crimes occurred on a block with a registered sex offender?*SELECT
count(*) as incidents FROM dfs.db.`crime_data` crimes JOIN
dfs.db.`sex_offenders` offenders ON crimes.block = offenders.block ORDER BY
incidents DESC;
*What types of incidents occurred?*SELECT count(*) as incidents,
crimes.primarytype as type, crimes.description as subtype FROM
dfs.db.`crime_data` crimes JOIN dfs.db.`sex_offenders` offenders ON
crimes.block = offenders.block GROUP BY crimes.description,
crimes.primarytype ORDER BY incidents DESC LIMIT 20;
*What all is in the BATTERY category?*SELECT count(*) as incidents,
crimes.description as subtype FROM dfs.db.`crime_data` crimes JOIN
dfs.db.`sex_offenders` offenders ON crimes.block = offenders.block WHERE
crimes.primarytype = 'BATTERY' GROUP BY crimes.description ORDER BY
incidents DESC;
*How many total BATTERY incidents?*SELECT count(*) as incidents FROM
dfs.db.`crime_data` crimes JOIN dfs.db.`sex_offenders` offenders ON
crimes.block = offenders.block WHERE crimes.primarytype = 'BATTERY';
*How many "BATTERY" incidents occurred on those blocks by year (has it
gotten better or worse over time)?*SELECT count(*) as incidents,
extract(year from to_date(crimes.datetime, 'MM/DD/YYYY hh:mm:ss a')) as
`year` FROM dfs.db.`crime_data` crimes JOIN dfs.db.`sex_offenders`
offenders ON crimes.block = offenders.block WHERE crimes.primarytype =
'BATTERY' and crimes.datetime not like '%Date%' GROUP BY extract(year from
to_date(crimes.datetime, 'MM/DD/YYYY hh:mm:ss a')) ORDER BY `year` ASC;
​

​Do you have queries against this data set that you have written that you
would like to share with others? Reply to the thread... ​
*Jim Scott*
Director, Enterprise Strategy & Architecture

 <http://www.mapr.com/>
[image: MapR Technologies] <http://www.mapr.com>

 
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB