Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Drill >> mail # user >> Using Apache Drill - Sample Data and Sample Queries - Examples


Copy link to this message
-
Using Apache Drill - Sample Data and Sample Queries - Examples
I recently gave a presentation on how to use Apache Drill with some
demonstrations.

The video for the presentation is available here:
http://vimeo.com/chug/using-apache-drill
The slides are available here:
http://www.slideshare.net/ChicagoHUG/using-apache-drill-chug-august-2014-jim-scott

*The data used in this presentation can be found and downloaded (1.3GB CSV
file and a 100KB CSV file) *
*here:*
https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
https://data.cityofchicago.org/Public-Safety/Sex-Offenders/vc9r-bqvy

Once you start your drill session you can connect to it via:
http://localhost:8047/
*Queries (pay attention to back ticks) *
To see a full stack trace when errors occur:
alter session set `exec.errors.verbose`=true;
*Looking through the information schema:*SELECT * from
INFORMATION_SCHEMA.SCHEMATA;
SELECT * from INFORMATION_SCHEMA.`TABLES`;
SELECT * from INFORMATION_SCHEMA.`COLUMNS`;
SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE
TABLE_SCHEMA = 'sys' and TABLE_NAME = 'options';
*How many of each type of incident:*select count(*) as incidents,
columns[5] as category from
dfs.`/home/jscott/Downloads/chicago/Crimes_-_2001_to_present.csv` group by
columns[5] order by incidents desc limit 10;
*How many of each type of incident and subtype:*select count(*) as
incidents, columns[5] as type, columns[6] as subtype from
dfs.`/home/jscott/Downloads/chicago/Crimes_-_2001_to_present.csv` group by
columns[6], columns[5] order by incidents desc limit 10;
*CREATE A TABLE in PARQUET format:*

*Please note that I created a workspace "db" at my path of /opt/drill/db
and I gave it a default storage format of parquet. *

create table dfs.db.`crime_data` as select columns[0] as REFID, columns[1]
as CaseNumber, columns[2] as DateTime, columns[3] as Block, columns[4] as
IUCR, columns[5] as PrimaryType, columns[6] as Description, columns[7] as
LocationDescription, columns[8] as Arrest, columns[9] as Domestic,
columns[10] as Beat, columns[11] as District, columns[12] as Ward,
columns[13] as CommunityArea, columns[14] as FBICode, columns[15] as
XCoordinate, columns[16] as YCoordinate, columns[17] as YEAR_DATE,
columns[18] as UpdatedOn, columns[19] as Latitude, columns[20] as
Longitude, columns[21] as Location from
dfs.`/home/jscott/Downloads/chicago/Crimes_-_2001_to_present.csv`;
create table dfs.db.`sex_offenders` as select columns[0] as LAST,columns[1]
as FIRST,columns[2] as BLOCK,columns[3] as GENDER,columns[4] as
RACE,columns[5] as BIRTHDATE,columns[6] as AGE,columns[7] as
HEIGHT,columns[8] as WEIGHT,columns[9] as VICTIM_MINOR from
dfs.`/home/jscott/Downloads/chicago/Sex_Offenders.csv`;

​
*Run the previous query on the newly created table:*select count(*) as
incidents, primarytype, description from dfs.db.`crime_data` group by
description, primarytype order by incidents desc limit 10;
*How many crimes occurred on a block with a registered sex offender?*SELECT
count(*) as incidents FROM dfs.db.`crime_data` crimes JOIN
dfs.db.`sex_offenders` offenders ON crimes.block = offenders.block ORDER BY
incidents DESC;
*What types of incidents occurred?*SELECT count(*) as incidents,
crimes.primarytype as type, crimes.description as subtype FROM
dfs.db.`crime_data` crimes JOIN dfs.db.`sex_offenders` offenders ON
crimes.block = offenders.block GROUP BY crimes.description,
crimes.primarytype ORDER BY incidents DESC LIMIT 20;
*What all is in the BATTERY category?*SELECT count(*) as incidents,
crimes.description as subtype FROM dfs.db.`crime_data` crimes JOIN
dfs.db.`sex_offenders` offenders ON crimes.block = offenders.block WHERE
crimes.primarytype = 'BATTERY' GROUP BY crimes.description ORDER BY
incidents DESC;
*How many total BATTERY incidents?*SELECT count(*) as incidents FROM
dfs.db.`crime_data` crimes JOIN dfs.db.`sex_offenders` offenders ON
crimes.block = offenders.block WHERE crimes.primarytype = 'BATTERY';
*How many "BATTERY" incidents occurred on those blocks by year (has it
gotten better or worse over time)?*SELECT count(*) as incidents,
extract(year from to_date(crimes.datetime, 'MM/DD/YYYY hh:mm:ss a')) as
`year` FROM dfs.db.`crime_data` crimes JOIN dfs.db.`sex_offenders`
offenders ON crimes.block = offenders.block WHERE crimes.primarytype =
'BATTERY' and crimes.datetime not like '%Date%' GROUP BY extract(year from
to_date(crimes.datetime, 'MM/DD/YYYY hh:mm:ss a')) ORDER BY `year` ASC;
​

​Do you have queries against this data set that you have written that you
would like to share with others? Reply to the thread... ​
*Jim Scott*
Director, Enterprise Strategy & Architecture

 <http://www.mapr.com/>
[image: MapR Technologies] <http://www.mapr.com>