Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Noob question on creating tables


Copy link to this message
-
Re: Noob question on creating tables
That's a drawback of external tables, but it's actually not as difficult as
it sounds. It's easy to write a nightly "cron" job that creates the
partition for the next day (or a job per month...), if someone on your team
has some bash experience. Other job scheduling tools should support this
too. Here's an example. First, a hive script that uses parameters for the
date (Hive v0.8 or newer):

-- addlogpartition.hql
ALTER TABLE log ADD IF NOT EXISTS PARTITION (year = ${YEAR}, month ${MONTH}, day = ${DAY});

Then, run this bash script AFTER MIDNIGHT:

#!/bin/bash
YEAR=$(date +%Y)       # returns the string "2013" today.
MONTH=$(date +%m)   # returns the string "03" today, with the leading zero.
DAY=$(date +%d)          # returns the string "29" today. Will prefix with
0 for dates < 10.

# Assumes /path/to/2013/03/29 is the correct directory name:
/path/to/hive -f /path/to/addlogpartition.hql -d YEAR=$YEAR -d MON=$MONTH
-d DAY=$DAY
(Of course, all the /path/to will be different...)

So, be careful of how how "03" vs. "3" is handled in both the ALTER
TABLE statement and the path. Off hand, I don't know if Hive will complain
if you use 03 as an integer value in the ALTER TABLE statement.
On Fri, Mar 29, 2013 at 1:16 PM, Mark <[EMAIL PROTECTED]> wrote:

> Thanks
>
> Does this mean I need to create a partition for each day manually? There
> is no way to have infer that from my directory structure?
>
> On Mar 29, 2013, at 10:40 AM, Sanjay Subramanian <
> [EMAIL PROTECTED]> wrote:
>
> > Hi
> >
> > CREATE EXTERNAL TABLE IF NOT EXISTS log_data(col1 datatype1, col2
> > datatype2, . . . colN datatypeN) PARTITIONED BY (YEAR INT, MONTH INT, DAY
> > INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
> >
> >
> > ALTER table log_data ADD PARTITION (YEAR=2013 , MONTH=2, DAY=27) LOCATION
> > '/path/to/YEAR/MONTH/DAY/directory/ON/HDFS';"
> >
> > Hive will read gzip and bz2 files out of the box.(so suppose you had
> > hourly log files in gzip format in your /YEAR/MONTH/DAY directory then it
> > will be read)
> > Snappy and LZO will need some jar installs and configs
> > https://github.com/toddlipcon/hadoop-lzo
> >
> > https://code.google.com/p/snappy/
> >
> >
> > Note that for example - gzip format is not splittable..so huge gzip files
> > without splits are not recommended as input to maps
> >
> > Hope this helps
> >
> > sanjay
> >
> >
> > On 3/29/13 10:19 AM, "Mark" <[EMAIL PROTECTED]> wrote:
> >
> >> We have existing log data in directories in the format of
> YEAR/MONTH/DAY.
> >>
> >> - How can we create a table over this table without hive modifying
> and/or
> >> moving it?
> >> - How can we tell Hive to partition this data so it knows about each day
> >> of logs?
> >> - Does hive out of the box work with reading compressed files?
> >>
> >> Thanks
> >
> >
> > CONFIDENTIALITY NOTICE
> > =====================> > This email message and any attachments are for the exclusive use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or distribution is
> prohibited. If you are not the intended recipient, please contact the
> sender by reply email and destroy all copies of the original message along
> with any attachments, from your computer system. If you are the intended
> recipient, please be advised that the content of this message is subject to
> access, review and disclosure by the sender's Email System Administrator.
> >
>
>
--
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB