Hive, mail # user - Best practices for storing data on Hive


Mark Grover 2011-09-02, 17:18
wd 2011-09-04, 08:01
Mark Grover 2011-09-06, 19:39
RE: Best practices for storing data on Hive
Travis Powell 2011-09-06, 19:50
Hi Mark,

When we load data into Hive, we use a staging table to dynamically partition our data. This might help you too.

We create our staging table and our final table:

DROP TABLE IF EXISTS staging_data;
CREATE TABLE staging_data ( ... )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

CREATE TABLE data ( ... )
PARTITIONED BY (dt STRING, hour INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE data PARTITION (dt, hour)
SELECT q.*, to_date(q.session_timestamp) AS dt, hour(q.session_timestamp) AS hour
FROM staging_data q
DISTRIBUTE BY user_id SORT BY user_id;
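One caveat: for a fully dynamic partition insert like the one above, dynamic partitioning has to be enabled first, since some versions default it off:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;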
So...
By distributing by (and sorting by) user_id, we minimize seek time because Hive knows where all the entries for a given user are stored. Partitioning by time gives the best performance, because almost every query has some time-related component, and time spreads the data across partitions fairly evenly.

Let me know if this works for you. We start every job with those first few lines of Hive script. It works well for us.

Thanks,

Travis Powell

-----Original Message-----
From: Mark Grover [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, September 06, 2011 12:39 PM
To: [EMAIL PROTECTED]
Cc: wd; Bob Tiernay; Baiju Devani
Subject: Re: Best practices for storing data on Hive

Thanks for the response, wd.

I would REALLY APPRECIATE it if other people could share their views as well.

Here are the possible solutions I have considered for the problem
(see original email for a description of the problem):

1) Multiple partitions: We would partition the table by day and userId.
However, given the number of users that visit our website (hundreds of
thousands of unique users every day), this would lead to a very large
number of partitions (and rather small files, ranging from a couple of
bytes to a couple of KB). From the documentation I've read online,
Hive/Hadoop weren't designed to deal with such small files, and the
situation should be avoided if possible.
We previously had a scenario where we partitioned by day and hour, and
because of the sheer number of partitions, queries like "select * from
<table> LIMIT 1;" took very long and even failed with "Java out of Heap
space" errors. My guess is that the master node was churning through
all those partitions and couldn't cope with their number.

2) Use of data locality: We could keep the data partitioned by day and
bucketed by userId, with the data in each bucket sorted by (userId,
time). This way the data for each userId stays together within a daily
partition, and if Hive could be made aware of this sort order and use
it to improve query times, that would alleviate the problem quite a bit
(see the first sketch after this list). The big question here is: does
Hive leverage the sort order of data within a partition's buckets when
running (most/all?) queries, where possible?
3) Using an index: As wd mentioned, Hive 0.7 introduces the notion of
an index. If I index on userId, given that we can have hundreds of
thousands of unique users per day, would indexing prove to be a good
move? (See the second sketch after this list.) Are there people who are
using it for similar purposes or at a similar scale?
4) Using 2 "orthogonal tables": As mentioned in my original email (see
below), we could have 2 independent tables, one partitioned by day and
the other partitioned by userId. For maintaining partitions in the
userId-partitioned table, I am planning the following: in the nightly
job, if userId=X visited the website the previous day, we create a
partition for userId=X if it doesn't already exist. Once the partition
exists, all clicks for that userId on the day in question are put in a
single file and dropped into the userId=X folder on HDFS. This would
simulate an "append" to the Hive table (see the last sketch after this
list). Each file would only be a few bytes to a few KB, and the table
format would be sequence file.
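To make option 2 concrete, here is a minimal sketch of the kind of DDL I have in mind; the column names (user_id, event_time, url) and the bucket count are made up for illustration:

-- Make inserts honor the bucketing/sorting declared on the table.
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

-- Daily partitions; rows bucketed by user and sorted by (user, time)
-- within each bucket.
CREATE TABLE clicks (
  user_id    BIGINT,
  event_time STRING,
  url        STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) SORTED BY (user_id, event_time) INTO 256 BUCKETS
STORED AS SEQUENCEFILE;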
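For option 3, the Hive 0.7 compact index would be created along these lines (same illustrative table and column):

-- Build a compact index on user_id; DEFERRED REBUILD means the index
-- is only populated by the explicit REBUILD that follows.
CREATE INDEX clicks_user_idx ON TABLE clicks (user_id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

ALTER INDEX clicks_user_idx ON clicks REBUILD;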
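And the "append" in option 4 could look something like this in the nightly job (table name, userId value, and HDFS path are hypothetical):

-- Create the per-user partition only if this is a new user.
ALTER TABLE clicks_by_user ADD IF NOT EXISTS PARTITION (userId='12345');
-- Move the day's click file for this user into the partition's folder.
LOAD DATA INPATH '/staging/clicks/12345' INTO TABLE clicks_by_user
PARTITION (userId='12345');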

What are your thoughts about the above 4 methods? Any particular likes
or dislikes? Any comments, suggestions would be helpful.

Thank you again in advance!

Mark

On 11-09-04 04:01 AM, wd wrote:

Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com
e: [EMAIL PROTECTED]

"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.
Mark Grover 2011-09-06, 21:35
Steven Wong 2011-09-09, 00:30
Edward Capriolo 2011-09-09, 01:26
Mark Grover 2011-09-09, 11:17
Steven Wong 2011-09-09, 23:00
Mark Grover 2011-09-12, 17:09
Steven Wong 2011-09-12, 19:12
Aggarwal, Vaibhav 2011-09-06, 20:15
Aggarwal, Vaibhav 2011-09-06, 20:09