Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Create Partitioned Table w/ Partition= Substring of Raw Data


Copy link to this message
-
Re: Create Partitioned Table w/ Partition= Substring of Raw Data
Hi Dan,
What Gabi is right.

To solve your problem, you could have a non-partitioned table on the raw data and run a Hive query that reads this raw data and inserts it into a partitioned table. Dynamic partitioning could come in handy in that case. Look at https://cwiki.apache.org/Hive/tutorial.html#Tutorial-DynamicpartitionInsert for details.

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com

"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.
----- Original Message -----
From: "Gabi D" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Thursday, March 22, 2012 4:16:58 AM
Subject: Re: Create Partitioned Table w/ Partition= Substring of Raw Data
Dan,
the partition value does not look at your raw data, you assign a value to the partition when you put the data in.
So what you need to do is this:

Create table mytable (Time string, OtherData string)
Partition by (danDate string); (never a good idea to give fields a name that's a reserved word, such as 'date')

and when you put the data in using load for example, you need to specify the actual value you want to assign to 'danDate'.Meaning you should know what are the date values that are inside your file (and choose only one).
(e.g.,
load data local inpath '<file location>' into table mytable partition( danDate='01-01-2000' );
)

Note that danDate is not a field from within your files, but it is actually used by hive to create a subdirectory under your table's hdfs location which will be named:
'danDate=01-01-2000' (if the value you gave it is 01-01-2000)
hive always shows it as the last field in your 'describe <table>' commands though,again, it is not a regular field.

In this respect, it's better to use a date format that will be comparable, such as 'yyyy-mm-dd' so you will be able to run selects such as:
select count(*) form mytable where danDate >='2012-01-01' and danDate <'2012-02-01'

and hive will be able to run this using partition pruning (which means only read files in the partition directories needed to satisfy your query).

I hope I didn't go over stuff you already know and that this helps...

On Wed, Mar 21, 2012 at 5:07 PM, Dan Y < [EMAIL PROTECTED] > wrote:
Hi All,
My raw data looks like this:
DateTime,OtherData
01-01-2000-01:00:00,blablabla1
01-01-2000-04:00:00,blablabla2
01-02-2000-02:00:00,blablabla3
I would like to partition on the datepart of DateTime. What does not work, unfortunately, is this:

Create table mytable (DateTime string, OtherData string)
Partition by ( substr(DateTime,1,10) string);
I wish my raw data instead looked like:

Date ,Time ,OtherData

01-01-2000 ,01:00:00 ,blablabla1
01-01-2000 ,04:00:00 ,blablabla2
01-02-2000 ,02:00:00 ,blablabla3
...with Time a distinct field. Then I could use:
Create table mytable (Time string, OtherData string)
Partition by (Date string);
Any ideas for the best way to load my raw data into a hive table partitioned by the datepart of DateTime? The files are gynormous, so manipulating the raw data outside of Hive is not feasible for this problem. I would like to avoid using Select in the solution as well since my hive table will refer to zipped data (and the Select therefore would come with a big runtime cost).
Thanks!!
Dan