|
Mark Grover
2011-09-02, 17:18
wd
2011-09-04, 08:01
Mark Grover
2011-09-06, 19:39
Travis Powell
2011-09-06, 19:50
Mark Grover
2011-09-06, 21:35
Steven Wong
2011-09-09, 00:30
Edward Capriolo
2011-09-09, 01:26
Mark Grover
2011-09-09, 11:17
Steven Wong
2011-09-09, 23:00
Mark Grover
2011-09-12, 17:09
Steven Wong
2011-09-12, 19:12
Aggarwal, Vaibhav
2011-09-06, 20:15
Aggarwal, Vaibhav
2011-09-06, 20:09
|
-
Best practices for storing data on HiveMark Grover 2011-09-02, 17:18
Hello folks,
I am fairly new to Hive and am wondering if you could share some of the best practices for storing/querying data with Hive. Here is an example of the problem I am trying to solve. The traffic to our website is logged in files that contain information about clicks from various users. Simplified, the log file looks like: t_1, ip_1, userid_1 t_2, ip_2, userid_2 t_3, ip_3, userid_3 ... where t_i represents time of the click, ip_i represents ip address where the click originated from, and userid_i represents the user ID of the user. Since the clicks are logged on an ongoing basis, partitioning our Hive table by day seemed like the obvious choice. Every night we upload the data from the previous day into a new partition. However, we would also want the capability to find all log lines corresponding to a particular user. With our present partitioning scheme, all day partitions are searched for that user ID but this takes a long time. I am looking for ideas/suggestions/thoughts/comments on how to reduce this time. As a solution, I am thinking that perhaps we could have 2 independent tables, one which stores data partitioned by day and the other partitioned by userId. With the second table partitioned by userId, I will have to find some way of maintaining the partitions since Hive doesn't support appending of files. Also, this seems suboptimal, since we are doubling that the amount of data that we store. What do you folks think of this idea? Do you have any other suggestions on how we can approach this problem? What have other people in similar situations done? Please share. Thank you in advance! Mark +
Mark Grover 2011-09-02, 17:18
-
Re: Best practices for storing data on Hivewd 2011-09-04, 08:01
Hive support more than one partitions, have your tried? Maybe you can
create to partitions named as date and user. Hive 0.7 also support index, maybe you can have a try. On Sat, Sep 3, 2011 at 1:18 AM, Mark Grover <[EMAIL PROTECTED]> wrote: > Hello folks, > I am fairly new to Hive and am wondering if you could share some of the best practices for storing/querying data with Hive. > > Here is an example of the problem I am trying to solve. > > The traffic to our website is logged in files that contain information about clicks from various users. > Simplified, the log file looks like: > t_1, ip_1, userid_1 > t_2, ip_2, userid_2 > t_3, ip_3, userid_3 > ... > > where t_i represents time of the click, ip_i represents ip address where the click originated from, and userid_i represents the user ID of the user. > > Since the clicks are logged on an ongoing basis, partitioning our Hive table by day seemed like the obvious choice. Every night we upload the data from the previous day into a new partition. > > However, we would also want the capability to find all log lines corresponding to a particular user. With our present partitioning scheme, all day partitions are searched for that user ID but this takes a long time. I am looking for ideas/suggestions/thoughts/comments on how to reduce this time. > > As a solution, I am thinking that perhaps we could have 2 independent tables, one which stores data partitioned by day and the other partitioned by userId. With the second table partitioned by userId, I will have to find some way of maintaining the partitions since Hive doesn't support appending of files. Also, this seems suboptimal, since we are doubling that the amount of data that we store. What do you folks think of this idea? > > Do you have any other suggestions on how we can approach this problem? > > What have other people in similar situations done? Please share. > > Thank you in advance! > Mark > +
wd 2011-09-04, 08:01
-
Re: Best practices for storing data on HiveMark Grover 2011-09-06, 19:39
Thanks for the response, wd.
I would REALLY APPRECIATE if other people can share their views as well. Here are the possible solutions that I have thought about to the problem (see original email for description of problem): 1) Multiple partitions: We would partition the table by day and userId. However, given the amount of users that visit our website (hundreds of thousands of unique users every day), this would lead to a large number of partitions (and rather small file sizes, ranging from a couple of bytes to a couple of KB). From the documentation I've read online, it seems that Hive/Hadoop weren't designed to deal with such small file sizes and such a situation should be avoided if possible. We had a scenario previously where we were partitioning by day and hour and because of the sheer number of partitions queries like "select * from <table> LIMIT 1;" were taking very long and even failed because of "Java out of Heap space" errors. My guess is that the master node was munching through all these partitions and couldn't deal with the large number of partitions. 2) Use of data locality: We could keep the data partitioned by day and bucketed by userId. Within each bucket sort the data by the (userId, time). This way we could keep the data related to each userId together within a daily partition and if Hive could be made aware of this sorting order and could make use of this order to improve search/query times, that would alleviate the problem quite a bit. The big question here is: Does Hive leverage sorting order of data within a partition bucket when running (most/all?) queries, where possible? 3) Using an index: As wd mentioned, Hive 0.7 introduces the notion on an index. If I do index on userId, given that we can hundreds of thousands of unique users per day, would indexing prove to be a good move? Are there people who are using it for similar purposes or on a similar scale? 4) Using 2 "orthogonal tables": As mentioned in my original email (see below), we could have 2 independent tables, one which stores data partitioned by day and other partitioned by userId. For maintaining partitions in userId partitioned table, I am planning to do the following: In the nightly job, if userId=X visited the website previous day, we create a partition for userId=X if it doesn't already exist. Once the partition is created, all clicks for that user Id on the day for in question are put in a single file and dropped in the userId=X folder on HDFS. This method could be used to simulate an "append" to the Hive table. The file would only be a few bytes to a few KB and the format of the table would be sequence file. What are your thoughts about the above 4 methods? Any particular likes or dislikes? Any comments, suggestions would be helpful. Thank you again in advance! Mark On 11-09-04 04:01 AM, wd wrote: > Hive support more than one partitions, have your tried? Maybe you can > create to partitions named as date and user. > > Hive 0.7 also support index, maybe you can have a try. > > On Sat, Sep 3, 2011 at 1:18 AM, Mark Grover<[EMAIL PROTECTED]> wrote: >> Hello folks, >> I am fairly new to Hive and am wondering if you could share some of the best practices for storing/querying data with Hive. >> >> Here is an example of the problem I am trying to solve. >> >> The traffic to our website is logged in files that contain information about clicks from various users. >> Simplified, the log file looks like: >> t_1, ip_1, userid_1 >> t_2, ip_2, userid_2 >> t_3, ip_3, userid_3 >> ... >> >> where t_i represents time of the click, ip_i represents ip address where the click originated from, and userid_i represents the user ID of the user. >> >> Since the clicks are logged on an ongoing basis, partitioning our Hive table by day seemed like the obvious choice. Every night we upload the data from the previous day into a new partition. >> >> However, we would also want the capability to find all log lines corresponding to a particular user. With our present partitioning scheme, all day partitions are searched for that user ID but this takes a long time. I am looking for ideas/suggestions/thoughts/comments on how to reduce this time. Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: [EMAIL PROTECTED] "Best Trading Platform" - World Finance's Forex Awards 2009. "The One to Watch" - Treasury Today's Adam Smith Awards 2009. +
Mark Grover 2011-09-06, 19:39
-
RE: Best practices for storing data on HiveTravis Powell 2011-09-06, 19:50
Hi Mark,
When we load data into Hive, we use a staging table to dynamically partition our data. This might help you too. We create our initial table and our staging table: DROP TABLE IF EXISTS staging_data; CREATE TABLE staging_data ( ... ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; CREATE TABLE data ( ... ) PARTITIONED BY (dt STRING, hour, INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE; INSERT OVERWRITE TABLE data PARTITION(dt, hour) SELECT q.*, to_date(q.session_timestamp) AS dt, hour(q.session_timestamp) AS hour FROM staging_session q ORDER BY user_id DISTRIBUTE BY user_id; So..... By distributing by (and preferably ordering by) user_id, we can minimize seek time in the table because Hive knows where all entries pertaining to a specific user are stored. Partitions by time have the best performance, because chances are almost every query will have some time-related component in it (and it spreads out the data among partitions fairly well.) Let me know if this works for you. We start every job with those first few lines of Hive script. It works well for us. Thanks, Travis Powell -----Original Message----- From: Mark Grover [mailto:[EMAIL PROTECTED]] Sent: Tuesday, September 06, 2011 12:39 PM To: [EMAIL PROTECTED] Cc: wd; Bob Tiernay; Baiju Devani Subject: Re: Best practices for storing data on Hive Thanks for the response, wd. I would REALLY APPRECIATE if other people can share their views as well. Here are the possible solutions that I have thought about to the problem (see original email for description of problem): 1) Multiple partitions: We would partition the table by day and userId. However, given the amount of users that visit our website (hundreds of thousands of unique users every day), this would lead to a large number of partitions (and rather small file sizes, ranging from a couple of bytes to a couple of KB). From the documentation I've read online, it seems that Hive/Hadoop weren't designed to deal with such small file sizes and such a situation should be avoided if possible. We had a scenario previously where we were partitioning by day and hour and because of the sheer number of partitions queries like "select * from <table> LIMIT 1;" were taking very long and even failed because of "Java out of Heap space" errors. My guess is that the master node was munching through all these partitions and couldn't deal with the large number of partitions. 2) Use of data locality: We could keep the data partitioned by day and bucketed by userId. Within each bucket sort the data by the (userId, time). This way we could keep the data related to each userId together within a daily partition and if Hive could be made aware of this sorting order and could make use of this order to improve search/query times, that would alleviate the problem quite a bit. The big question here is: Does Hive leverage sorting order of data within a partition bucket when running (most/all?) queries, where possible? 3) Using an index: As wd mentioned, Hive 0.7 introduces the notion on an index. If I do index on userId, given that we can hundreds of thousands of unique users per day, would indexing prove to be a good move? Are there people who are using it for similar purposes or on a similar scale? 4) Using 2 "orthogonal tables": As mentioned in my original email (see below), we could have 2 independent tables, one which stores data partitioned by day and other partitioned by userId. For maintaining partitions in userId partitioned table, I am planning to do the following: In the nightly job, if userId=X visited the website previous day, we create a partition for userId=X if it doesn't already exist. Once the partition is created, all clicks for that user Id on the day for in question are put in a single file and dropped in the userId=X folder on HDFS. This method could be used to simulate an "append" to the Hive table. The file would only be a few bytes to a few KB and the format of the table would be sequence file. What are your thoughts about the above 4 methods? Any particular likes or dislikes? Any comments, suggestions would be helpful. Thank you again in advance! Mark On 11-09-04 04:01 AM, wd wrote: Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: [EMAIL PROTECTED] "Best Trading Platform" - World Finance's Forex Awards 2009. "The One to Watch" - Treasury Today's Adam Smith Awards 2009. +
Travis Powell 2011-09-06, 19:50
-
Re: Best practices for storing data on HiveMark Grover 2011-09-06, 21:35
Thanks for your reply, Travis.
I was under the impression that for Hive to make use of sorted structure of data (i.e. for the table named "data" in your example), the metadata of the table (specified during table creation) has to advertise such property. However, I don't see any special metadata specifying such property when "data" table was created. Is that true? If so, is such metadata specified by using CLUSTERED BY and SORTED BY clauses during table creation? On 11-09-06 03:50 PM, Travis Powell wrote: > Hi Mark, > > When we load data into Hive, we use a staging table to dynamically partition our data. This might help you too. > > We create our initial table and our staging table: > > DROP TABLE IF EXISTS staging_data; > CREATE TABLE staging_data ( ... ) > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; > CREATE TABLE data ( ... ) > PARTITIONED BY (dt STRING, hour, INT) > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE; > > INSERT OVERWRITE TABLE data PARTITION(dt, hour) SELECT q.*, to_date(q.session_timestamp) AS dt, hour(q.session_timestamp) AS hour FROM staging_session q ORDER BY user_id DISTRIBUTE BY user_id; > > > So..... > By distributing by (and preferably ordering by) user_id, we can minimize seek time in the table because Hive knows where all entries pertaining to a specific user are stored. Partitions by time have the best performance, because chances are almost every query will have some time-related component in it (and it spreads out the data among partitions fairly well.) > > Let me know if this works for you. We start every job with those first few lines of Hive script. It works well for us. > > Thanks, > > Travis Powell > > -----Original Message----- > From: Mark Grover [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, September 06, 2011 12:39 PM > To: [EMAIL PROTECTED] > Cc: wd; Bob Tiernay; Baiju Devani > Subject: Re: Best practices for storing data on Hive > > Thanks for the response, wd. > > I would REALLY APPRECIATE if other people can share their views as well. > > Here are the possible solutions that I have thought about to the problem > (see original email for description of problem): > > 1) Multiple partitions: We would partition the table by day and userId. > However, given the amount of users that visit our website (hundreds of > thousands of unique users every day), this would lead to a large number > of partitions (and rather small file sizes, ranging from a couple of > bytes to a couple of KB). From the documentation I've read online, it > seems that Hive/Hadoop weren't designed to deal with such small file > sizes and such a situation should be avoided if possible. > We had a scenario previously where we were partitioning by day and hour > and because of the sheer number of partitions queries like "select * > from<table> LIMIT 1;" were taking very long and even failed because of > "Java out of Heap space" errors. My guess is that the master node was > munching through all these partitions and couldn't deal with the large > number of partitions. > > 2) Use of data locality: We could keep the data partitioned by day and > bucketed by userId. Within each bucket sort the data by the (userId, > time). This way we could keep the data related to each userId together > within a daily partition and if Hive could be made aware of this sorting > order and could make use of this order to improve search/query times, > that would alleviate the problem quite a bit. The big question here is: > Does Hive leverage sorting order of data within a partition bucket when > running (most/all?) queries, where possible? > > > 3) Using an index: As wd mentioned, Hive 0.7 introduces the notion on an > index. If I do index on userId, given that we can hundreds of thousands > of unique users per day, would indexing prove to be a good move? Are > there people who are using it for similar purposes or on a similar scale? > > > 4) Using 2 "orthogonal tables": As mentioned in my original email (see Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: [EMAIL PROTECTED] "Best Trading Platform" - World Finance's Forex Awards 2009. "The One to Watch" - Treasury Today's Adam Smith Awards 2009. +
Mark Grover 2011-09-06, 21:35
-
RE: Best practices for storing data on HiveSteven Wong 2011-09-09, 00:30
I think this statement is not true: "By distributing by (and preferably ordering by) user_id, we can minimize seek time in the table because Hive knows where all entries pertaining to a specific user are stored." I think it is not true whether the table is bucketed on user_id or not (assuming that user_id is not a partition column or indexed column).
-----Original Message----- From: Mark Grover [mailto:[EMAIL PROTECTED]] Sent: Tuesday, September 06, 2011 2:36 PM To: [EMAIL PROTECTED] Cc: Travis Powell; Baiju Devani; Bob Tiernay Subject: Re: Best practices for storing data on Hive Thanks for your reply, Travis. I was under the impression that for Hive to make use of sorted structure of data (i.e. for the table named "data" in your example), the metadata of the table (specified during table creation) has to advertise such property. However, I don't see any special metadata specifying such property when "data" table was created. Is that true? If so, is such metadata specified by using CLUSTERED BY and SORTED BY clauses during table creation? On 11-09-06 03:50 PM, Travis Powell wrote: > Hi Mark, > > When we load data into Hive, we use a staging table to dynamically partition our data. This might help you too. > > We create our initial table and our staging table: > > DROP TABLE IF EXISTS staging_data; > CREATE TABLE staging_data ( ... ) > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; > CREATE TABLE data ( ... ) > PARTITIONED BY (dt STRING, hour, INT) > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE; > > INSERT OVERWRITE TABLE data PARTITION(dt, hour) SELECT q.*, to_date(q.session_timestamp) AS dt, hour(q.session_timestamp) AS hour FROM staging_session q ORDER BY user_id DISTRIBUTE BY user_id; > > > So..... > By distributing by (and preferably ordering by) user_id, we can minimize seek time in the table because Hive knows where all entries pertaining to a specific user are stored. Partitions by time have the best performance, because chances are almost every query will have some time-related component in it (and it spreads out the data among partitions fairly well.) > > Let me know if this works for you. We start every job with those first few lines of Hive script. It works well for us. > > Thanks, > > Travis Powell > > -----Original Message----- > From: Mark Grover [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, September 06, 2011 12:39 PM > To: [EMAIL PROTECTED] > Cc: wd; Bob Tiernay; Baiju Devani > Subject: Re: Best practices for storing data on Hive > > Thanks for the response, wd. > > I would REALLY APPRECIATE if other people can share their views as well. > > Here are the possible solutions that I have thought about to the problem > (see original email for description of problem): > > 1) Multiple partitions: We would partition the table by day and userId. > However, given the amount of users that visit our website (hundreds of > thousands of unique users every day), this would lead to a large number > of partitions (and rather small file sizes, ranging from a couple of > bytes to a couple of KB). From the documentation I've read online, it > seems that Hive/Hadoop weren't designed to deal with such small file > sizes and such a situation should be avoided if possible. > We had a scenario previously where we were partitioning by day and hour > and because of the sheer number of partitions queries like "select * > from<table> LIMIT 1;" were taking very long and even failed because of > "Java out of Heap space" errors. My guess is that the master node was > munching through all these partitions and couldn't deal with the large > number of partitions. > > 2) Use of data locality: We could keep the data partitioned by day and > bucketed by userId. Within each bucket sort the data by the (userId, > time). This way we could keep the data related to each userId together > within a daily partition and if Hive could be made aware of this sorting > order and could make use of this order to improve search/query times, Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: [EMAIL PROTECTED] "Best Trading Platform" - World Finance's Forex Awards 2009. "The One to Watch" - Treasury Today's Adam Smith Awards 2009. +
Steven Wong 2011-09-09, 00:30
-
Re: Best practices for storing data on HiveEdward Capriolo 2011-09-09, 01:26
On Thu, Sep 8, 2011 at 8:30 PM, Steven Wong <[EMAIL PROTECTED]> wrote:
> I think this statement is not true: "By distributing by (and preferably > ordering by) user_id, we can minimize seek time in the table because Hive > knows where all entries pertaining to a specific user are stored." I think > it is not true whether the table is bucketed on user_id or not (assuming > that user_id is not a partition column or indexed column). > > > -----Original Message----- > From: Mark Grover [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, September 06, 2011 2:36 PM > To: [EMAIL PROTECTED] > Cc: Travis Powell; Baiju Devani; Bob Tiernay > Subject: Re: Best practices for storing data on Hive > > Thanks for your reply, Travis. > > I was under the impression that for Hive to make use of sorted structure > of data (i.e. for the table named "data" in your example), the metadata > of the table (specified during table creation) has to advertise such > property. However, I don't see any special metadata specifying such > property when "data" table was created. > > Is that true? If so, is such metadata specified by using CLUSTERED BY > and SORTED BY clauses during table creation? > > On 11-09-06 03:50 PM, Travis Powell wrote: > > Hi Mark, > > > > When we load data into Hive, we use a staging table to dynamically > partition our data. This might help you too. > > > > We create our initial table and our staging table: > > > > DROP TABLE IF EXISTS staging_data; > > CREATE TABLE staging_data ( ... ) > > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; > > CREATE TABLE data ( ... ) > > PARTITIONED BY (dt STRING, hour, INT) > > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS > SEQUENCEFILE; > > > > INSERT OVERWRITE TABLE data PARTITION(dt, hour) SELECT q.*, > to_date(q.session_timestamp) AS dt, hour(q.session_timestamp) AS hour FROM > staging_session q ORDER BY user_id DISTRIBUTE BY user_id; > > > > > > So..... > > By distributing by (and preferably ordering by) user_id, we can minimize > seek time in the table because Hive knows where all entries pertaining to a > specific user are stored. Partitions by time have the best performance, > because chances are almost every query will have some time-related component > in it (and it spreads out the data among partitions fairly well.) > > > > Let me know if this works for you. We start every job with those first > few lines of Hive script. It works well for us. > > > > Thanks, > > > > Travis Powell > > > > -----Original Message----- > > From: Mark Grover [mailto:[EMAIL PROTECTED]] > > Sent: Tuesday, September 06, 2011 12:39 PM > > To: [EMAIL PROTECTED] > > Cc: wd; Bob Tiernay; Baiju Devani > > Subject: Re: Best practices for storing data on Hive > > > > Thanks for the response, wd. > > > > I would REALLY APPRECIATE if other people can share their views as well. > > > > Here are the possible solutions that I have thought about to the problem > > (see original email for description of problem): > > > > 1) Multiple partitions: We would partition the table by day and userId. > > However, given the amount of users that visit our website (hundreds of > > thousands of unique users every day), this would lead to a large number > > of partitions (and rather small file sizes, ranging from a couple of > > bytes to a couple of KB). From the documentation I've read online, it > > seems that Hive/Hadoop weren't designed to deal with such small file > > sizes and such a situation should be avoided if possible. > > We had a scenario previously where we were partitioning by day and hour > > and because of the sheer number of partitions queries like "select * > > from<table> LIMIT 1;" were taking very long and even failed because of > > "Java out of Heap space" errors. My guess is that the master node was > > munching through all these partitions and couldn't deal with the large > > number of partitions. > > > > 2) Use of data locality: We could keep the data partitioned by day and > > bucketed by userId. Within each bucket sort the data by the (userId, Hive does not optimize when the table is defined with ordereded by. The only way to optimize is to use partitions, bucketing, or indexes. +
Edward Capriolo 2011-09-09, 01:26
-
Re: Best practices for storing data on HiveMark Grover 2011-09-09, 11:17
Edward, Steven or anyone else on the mailing list:
Is it possible to optimize queries like the one below with bucketing? select * from <table> where user_id='blah' and dt >= '2011-05-26' and dt <= '2011-05-28'; where <table> is partitioned by dt (which represents day), bucketed by user_id and within each bucket data is sorted by user_id, time. The Hive wiki says, that bucketing can improve performance on certain kinds of queries. What kinds of queries are these? Only Sampling queries? Group by on bucketed column? Where clause on bucketed column? All of the above? Thanks in advance! Mark ----- Original Message ----- From: "Edward Capriolo" <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Cc: "Travis Powell" <[EMAIL PROTECTED]>, "Baiju Devani" <[EMAIL PROTECTED]>, "Bob Tiernay" <[EMAIL PROTECTED]> Sent: Thursday, September 8, 2011 9:26:10 PM Subject: Re: Best practices for storing data on Hive On Thu, Sep 8, 2011 at 8:30 PM, Steven Wong < [EMAIL PROTECTED] > wrote: I think this statement is not true: "By distributing by (and preferably ordering by) user_id, we can minimize seek time in the table because Hive knows where all entries pertaining to a specific user are stored." I think it is not true whether the table is bucketed on user_id or not (assuming that user_id is not a partition column or indexed column). -----Original Message----- From: Mark Grover [mailto: [EMAIL PROTECTED] ] Sent: Tuesday, September 06, 2011 2:36 PM To: [EMAIL PROTECTED] Cc: Travis Powell; Baiju Devani; Bob Tiernay Subject: Re: Best practices for storing data on Hive Thanks for your reply, Travis. I was under the impression that for Hive to make use of sorted structure of data (i.e. for the table named "data" in your example), the metadata of the table (specified during table creation) has to advertise such property. However, I don't see any special metadata specifying such property when "data" table was created. Is that true? If so, is such metadata specified by using CLUSTERED BY and SORTED BY clauses during table creation? On 11-09-06 03:50 PM, Travis Powell wrote: > Hi Mark, > > When we load data into Hive, we use a staging table to dynamically partition our data. This might help you too. > > We create our initial table and our staging table: > > DROP TABLE IF EXISTS staging_data; > CREATE TABLE staging_data ( ... ) > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; > CREATE TABLE data ( ... ) > PARTITIONED BY (dt STRING, hour, INT) > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE; > > INSERT OVERWRITE TABLE data PARTITION(dt, hour) SELECT q.*, to_date(q.session_timestamp) AS dt, hour(q.session_timestamp) AS hour FROM staging_session q ORDER BY user_id DISTRIBUTE BY user_id; > > > So..... > By distributing by (and preferably ordering by) user_id, we can minimize seek time in the table because Hive knows where all entries pertaining to a specific user are stored. Partitions by time have the best performance, because chances are almost every query will have some time-related component in it (and it spreads out the data among partitions fairly well.) > > Let me know if this works for you. We start every job with those first few lines of Hive script. It works well for us. > > Thanks, > > Travis Powell > > -----Original Message----- > From: Mark Grover [mailto: [EMAIL PROTECTED] ] > Sent: Tuesday, September 06, 2011 12:39 PM > To: [EMAIL PROTECTED] > Cc: wd; Bob Tiernay; Baiju Devani > Subject: Re: Best practices for storing data on Hive > > Thanks for the response, wd. > > I would REALLY APPRECIATE if other people can share their views as well. > > Here are the possible solutions that I have thought about to the problem > (see original email for description of problem): > > 1) Multiple partitions: We would partition the table by day and userId. > However, given the amount of users that visit our website (hundreds of > thousands of unique users every day), this would lead to a large number Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: [EMAIL PROTECTED] "Best Trading Platform" - World Finance's Forex Awards 2009. "The One to Watch" - Treasury Today's Adam Smith Awards 2009. Hive does not optimize when the table is defined with ordereded by. The only way to optimize is to use partitions, bucketing, or indexes. +
Mark Grover 2011-09-09, 11:17
-
RE: Best practices for storing data on HiveSteven Wong 2011-09-09, 23:00
Bucketing only speeds up sampling queries. Hive doesn't know/remember the hash function(s) you use to bucket the data, so it cannot use that to speed up lookups by a bucketed column.
-----Original Message----- From: Mark Grover [mailto:[EMAIL PROTECTED]] Sent: Friday, September 09, 2011 4:18 AM To: [EMAIL PROTECTED] Cc: Travis Powell; Baiju Devani; Bob Tiernay Subject: Re: Best practices for storing data on Hive Edward, Steven or anyone else on the mailing list: Is it possible to optimize queries like the one below with bucketing? select * from <table> where user_id='blah' and dt >= '2011-05-26' and dt <= '2011-05-28'; where <table> is partitioned by dt (which represents day), bucketed by user_id and within each bucket data is sorted by user_id, time. The Hive wiki says, that bucketing can improve performance on certain kinds of queries. What kinds of queries are these? Only Sampling queries? Group by on bucketed column? Where clause on bucketed column? All of the above? Thanks in advance! Mark ----- Original Message ----- From: "Edward Capriolo" <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Cc: "Travis Powell" <[EMAIL PROTECTED]>, "Baiju Devani" <[EMAIL PROTECTED]>, "Bob Tiernay" <[EMAIL PROTECTED]> Sent: Thursday, September 8, 2011 9:26:10 PM Subject: Re: Best practices for storing data on Hive On Thu, Sep 8, 2011 at 8:30 PM, Steven Wong < [EMAIL PROTECTED] > wrote: I think this statement is not true: "By distributing by (and preferably ordering by) user_id, we can minimize seek time in the table because Hive knows where all entries pertaining to a specific user are stored." I think it is not true whether the table is bucketed on user_id or not (assuming that user_id is not a partition column or indexed column). -----Original Message----- From: Mark Grover [mailto: [EMAIL PROTECTED] ] Sent: Tuesday, September 06, 2011 2:36 PM To: [EMAIL PROTECTED] Cc: Travis Powell; Baiju Devani; Bob Tiernay Subject: Re: Best practices for storing data on Hive Thanks for your reply, Travis. I was under the impression that for Hive to make use of sorted structure of data (i.e. for the table named "data" in your example), the metadata of the table (specified during table creation) has to advertise such property. However, I don't see any special metadata specifying such property when "data" table was created. Is that true? If so, is such metadata specified by using CLUSTERED BY and SORTED BY clauses during table creation? On 11-09-06 03:50 PM, Travis Powell wrote: > Hi Mark, > > When we load data into Hive, we use a staging table to dynamically partition our data. This might help you too. > > We create our initial table and our staging table: > > DROP TABLE IF EXISTS staging_data; > CREATE TABLE staging_data ( ... ) > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; > CREATE TABLE data ( ... ) > PARTITIONED BY (dt STRING, hour, INT) > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE; > > INSERT OVERWRITE TABLE data PARTITION(dt, hour) SELECT q.*, to_date(q.session_timestamp) AS dt, hour(q.session_timestamp) AS hour FROM staging_session q ORDER BY user_id DISTRIBUTE BY user_id; > > > So..... > By distributing by (and preferably ordering by) user_id, we can minimize seek time in the table because Hive knows where all entries pertaining to a specific user are stored. Partitions by time have the best performance, because chances are almost every query will have some time-related component in it (and it spreads out the data among partitions fairly well.) > > Let me know if this works for you. We start every job with those first few lines of Hive script. It works well for us. > > Thanks, > > Travis Powell > > -----Original Message----- > From: Mark Grover [mailto: [EMAIL PROTECTED] ] > Sent: Tuesday, September 06, 2011 12:39 PM > To: [EMAIL PROTECTED] > Cc: wd; Bob Tiernay; Baiju Devani > Subject: Re: Best practices for storing data on Hive > > Thanks for the response, wd. Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: [EMAIL PROTECTED] "Best Trading Platform" - World Finance's Forex Awards 2009. "The One to Watch" - Treasury Today's Adam Smith Awards 2009. Hive does not optimize when the table is defined with ordereded by. The only way to optimize is to use partitions, bucketing, or indexes. +
Steven Wong 2011-09-09, 23:00
-
Re: Best practices for storing data on HiveMark Grover 2011-09-12, 17:09
Thanks, Steven.
So, am I correct in understanding that even the sorting columns (user_id, time in my example) would not be used to optimize the query shown below? On 11-09-09 07:00 PM, Steven Wong wrote: > Bucketing only speeds up sampling queries. Hive doesn't know/remember the hash function(s) you use to bucket the data, so it cannot use that to speed up lookups by a bucketed column. > > > -----Original Message----- > From: Mark Grover [mailto:[EMAIL PROTECTED]] > Sent: Friday, September 09, 2011 4:18 AM > To: [EMAIL PROTECTED] > Cc: Travis Powell; Baiju Devani; Bob Tiernay > Subject: Re: Best practices for storing data on Hive > > Edward, Steven or anyone else on the mailing list: > > Is it possible to optimize queries like the one below with bucketing? > select * from<table> where user_id='blah' and dt>= '2011-05-26' and dt<= '2011-05-28'; > > where<table> is partitioned by dt (which represents day), bucketed by user_id and within each bucket data is sorted by user_id, time. > > The Hive wiki says, that bucketing can improve performance on certain kinds of queries. What kinds of queries are these? > Only Sampling queries? Group by on bucketed column? Where clause on bucketed column? All of the above? > > Thanks in advance! > Mark > > ----- Original Message ----- > From: "Edward Capriolo"<[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Cc: "Travis Powell"<[EMAIL PROTECTED]>, "Baiju Devani"<[EMAIL PROTECTED]>, "Bob Tiernay"<[EMAIL PROTECTED]> > Sent: Thursday, September 8, 2011 9:26:10 PM > Subject: Re: Best practices for storing data on Hive > > > > > On Thu, Sep 8, 2011 at 8:30 PM, Steven Wong< [EMAIL PROTECTED]> wrote: > > > I think this statement is not true: "By distributing by (and preferably ordering by) user_id, we can minimize seek time in the table because Hive knows where all entries pertaining to a specific user are stored." I think it is not true whether the table is bucketed on user_id or not (assuming that user_id is not a partition column or indexed column). > > > -----Original Message----- > From: Mark Grover [mailto: [EMAIL PROTECTED] ] > Sent: Tuesday, September 06, 2011 2:36 PM > To: [EMAIL PROTECTED] > Cc: Travis Powell; Baiju Devani; Bob Tiernay > Subject: Re: Best practices for storing data on Hive > > Thanks for your reply, Travis. > > I was under the impression that for Hive to make use of sorted structure > of data (i.e. for the table named "data" in your example), the metadata > of the table (specified during table creation) has to advertise such > property. However, I don't see any special metadata specifying such > property when "data" table was created. > > Is that true? If so, is such metadata specified by using CLUSTERED BY > and SORTED BY clauses during table creation? > > On 11-09-06 03:50 PM, Travis Powell wrote: >> Hi Mark, >> >> When we load data into Hive, we use a staging table to dynamically partition our data. This might help you too. >> >> We create our initial table and our staging table: >> >> DROP TABLE IF EXISTS staging_data; >> CREATE TABLE staging_data ( ... ) >> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; >> CREATE TABLE data ( ... ) >> PARTITIONED BY (dt STRING, hour, INT) >> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE; >> >> INSERT OVERWRITE TABLE data PARTITION(dt, hour) SELECT q.*, to_date(q.session_timestamp) AS dt, hour(q.session_timestamp) AS hour FROM staging_session q ORDER BY user_id DISTRIBUTE BY user_id; >> >> >> So..... >> By distributing by (and preferably ordering by) user_id, we can minimize seek time in the table because Hive knows where all entries pertaining to a specific user are stored. Partitions by time have the best performance, because chances are almost every query will have some time-related component in it (and it spreads out the data among partitions fairly well.) >> >> Let me know if this works for you. We start every job with those first few lines of Hive script. It works well for us. >> >> Thanks, Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: [EMAIL PROTECTED] "Best Trading Platform" - World Finance's Forex Awards 2009. "The One to Watch" - Treasury Today's Adam Smith Awards 2009. +
Mark Grover 2011-09-12, 17:09
-
RE: Best practices for storing data on HiveSteven Wong 2011-09-12, 19:12
I haven't used sorted columns before. I expect sorted columns to give you some query optimization by avoiding some sorting in some cases, but not by pruning input. In this sense, sorted columns give you a very different kind of optimization than partitioned, bucketed, or indexed columns do.
-----Original Message----- From: Mark Grover [mailto:[EMAIL PROTECTED]] Sent: Monday, September 12, 2011 10:09 AM To: [EMAIL PROTECTED] Cc: Steven Wong; Travis Powell; Baiju Devani; Bob Tiernay Subject: Re: Best practices for storing data on Hive Thanks, Steven. So, am I correct in understanding that even the sorting columns (user_id, time in my example) would not be used to optimize the query shown below? On 11-09-09 07:00 PM, Steven Wong wrote: > Bucketing only speeds up sampling queries. Hive doesn't know/remember the hash function(s) you use to bucket the data, so it cannot use that to speed up lookups by a bucketed column. > > > -----Original Message----- > From: Mark Grover [mailto:[EMAIL PROTECTED]] > Sent: Friday, September 09, 2011 4:18 AM > To: [EMAIL PROTECTED] > Cc: Travis Powell; Baiju Devani; Bob Tiernay > Subject: Re: Best practices for storing data on Hive > > Edward, Steven or anyone else on the mailing list: > > Is it possible to optimize queries like the one below with bucketing? > select * from<table> where user_id='blah' and dt>= '2011-05-26' and dt<= '2011-05-28'; > > where<table> is partitioned by dt (which represents day), bucketed by user_id and within each bucket data is sorted by user_id, time. > > The Hive wiki says, that bucketing can improve performance on certain kinds of queries. What kinds of queries are these? > Only Sampling queries? Group by on bucketed column? Where clause on bucketed column? All of the above? > > Thanks in advance! > Mark > > ----- Original Message ----- > From: "Edward Capriolo"<[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Cc: "Travis Powell"<[EMAIL PROTECTED]>, "Baiju Devani"<[EMAIL PROTECTED]>, "Bob Tiernay"<[EMAIL PROTECTED]> > Sent: Thursday, September 8, 2011 9:26:10 PM > Subject: Re: Best practices for storing data on Hive > > > > > On Thu, Sep 8, 2011 at 8:30 PM, Steven Wong< [EMAIL PROTECTED]> wrote: > > > I think this statement is not true: "By distributing by (and preferably ordering by) user_id, we can minimize seek time in the table because Hive knows where all entries pertaining to a specific user are stored." I think it is not true whether the table is bucketed on user_id or not (assuming that user_id is not a partition column or indexed column). > > > -----Original Message----- > From: Mark Grover [mailto: [EMAIL PROTECTED] ] > Sent: Tuesday, September 06, 2011 2:36 PM > To: [EMAIL PROTECTED] > Cc: Travis Powell; Baiju Devani; Bob Tiernay > Subject: Re: Best practices for storing data on Hive > > Thanks for your reply, Travis. > > I was under the impression that for Hive to make use of sorted structure > of data (i.e. for the table named "data" in your example), the metadata > of the table (specified during table creation) has to advertise such > property. However, I don't see any special metadata specifying such > property when "data" table was created. > > Is that true? If so, is such metadata specified by using CLUSTERED BY > and SORTED BY clauses during table creation? > > On 11-09-06 03:50 PM, Travis Powell wrote: >> Hi Mark, >> >> When we load data into Hive, we use a staging table to dynamically partition our data. This might help you too. >> >> We create our initial table and our staging table: >> >> DROP TABLE IF EXISTS staging_data; >> CREATE TABLE staging_data ( ... ) >> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; >> CREATE TABLE data ( ... ) >> PARTITIONED BY (dt STRING, hour, INT) >> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE; >> >> INSERT OVERWRITE TABLE data PARTITION(dt, hour) SELECT q.*, to_date(q.session_timestamp) AS dt, hour(q.session_timestamp) AS hour FROM staging_session q ORDER BY user_id DISTRIBUTE BY user_id; Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: [EMAIL PROTECTED] "Best Trading Platform" - World Finance's Forex Awards 2009. "The One to Watch" - Treasury Today's Adam Smith Awards 2009. +
Steven Wong 2011-09-12, 19:12
-
RE: Best practices for storing data on HiveAggarwal, Vaibhav 2011-09-06, 20:15
>However, given the amount of users that visit our website (hundreds of thousands of unique users every day), this would lead to a large number of partitions (and rather small file sizes, ranging from a >couple of bytes to a couple of KB). From the documentation I've read online, it seems that Hive/Hadoop weren't designed to deal with such small file sizes and such a situation should be avoided if >possible. Hive uses CombineHiveInputFormat which can be used to combine splits across files (if they are not compressed). The new version of Hive (0.7.1) also combines splits across partition boundaries. This might be worth an experiment. >We had a scenario previously where we were partitioning by day and hour and because of the sheer number of partitions queries like "select * from <table> LIMIT 1;" were taking very long and even >failed because of "Java out of Heap space" errors. My guess is that the master node was munching through all these partitions and couldn't deal with the large number of partitions. This is a real issue. In case you don't see an explosive growth in number of partitions, you can choose to increase the HADOOP_HEAPSIZE on just the master node. Thanks Vaibhav On 11-09-04 04:01 AM, wd wrote: > Hive support more than one partitions, have your tried? Maybe you can > create to partitions named as date and user. > > Hive 0.7 also support index, maybe you can have a try. > > On Sat, Sep 3, 2011 at 1:18 AM, Mark Grover<[EMAIL PROTECTED]> wrote: >> Hello folks, >> I am fairly new to Hive and am wondering if you could share some of the best practices for storing/querying data with Hive. >> >> Here is an example of the problem I am trying to solve. >> >> The traffic to our website is logged in files that contain information about clicks from various users. >> Simplified, the log file looks like: >> t_1, ip_1, userid_1 >> t_2, ip_2, userid_2 >> t_3, ip_3, userid_3 >> ... >> >> where t_i represents time of the click, ip_i represents ip address where the click originated from, and userid_i represents the user ID of the user. >> >> Since the clicks are logged on an ongoing basis, partitioning our Hive table by day seemed like the obvious choice. Every night we upload the data from the previous day into a new partition. >> >> However, we would also want the capability to find all log lines corresponding to a particular user. With our present partitioning scheme, all day partitions are searched for that user ID but this takes a long time. I am looking for ideas/suggestions/thoughts/comments on how to reduce this time. >> >> As a solution, I am thinking that perhaps we could have 2 independent tables, one which stores data partitioned by day and the other partitioned by userId. With the second table partitioned by userId, I will have to find some way of maintaining the partitions since Hive doesn't support appending of files. Also, this seems suboptimal, since we are doubling that the amount of data that we store. What do you folks think of this idea? >> >> Do you have any other suggestions on how we can approach this problem? >> >> What have other people in similar situations done? Please share. >> >> Thank you in advance! >> Mark >> -- Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: [EMAIL PROTECTED] "Best Trading Platform" - World Finance's Forex Awards 2009. "The One to Watch" - Treasury Today's Adam Smith Awards 2009. +
Aggarwal, Vaibhav 2011-09-06, 20:15
-
RE: Best practices for storing data on HiveAggarwal, Vaibhav 2011-09-06, 20:09
Hi
You could choose to have the second table (for user ids) partitioned by date also. table_root/userid=ab/date=2010-12-31/ That way you can split your data set by both a userid and a date. You can use dynamic partitions to transform existing date partitioned table into userid/date partitioned table. This table could help you execute fast queries across both a particular user and a particular date (or date range). Hive does partition pruning based on where clause. Thanks Vaibhav -----Original Message----- From: Mark Grover [mailto:[EMAIL PROTECTED]] Sent: Friday, September 02, 2011 10:19 AM To: [EMAIL PROTECTED] Cc: Baiju Devani; Bob Tiernay Subject: Best practices for storing data on Hive Hello folks, I am fairly new to Hive and am wondering if you could share some of the best practices for storing/querying data with Hive. Here is an example of the problem I am trying to solve. The traffic to our website is logged in files that contain information about clicks from various users. Simplified, the log file looks like: t_1, ip_1, userid_1 t_2, ip_2, userid_2 t_3, ip_3, userid_3 ... where t_i represents time of the click, ip_i represents ip address where the click originated from, and userid_i represents the user ID of the user. Since the clicks are logged on an ongoing basis, partitioning our Hive table by day seemed like the obvious choice. Every night we upload the data from the previous day into a new partition. However, we would also want the capability to find all log lines corresponding to a particular user. With our present partitioning scheme, all day partitions are searched for that user ID but this takes a long time. I am looking for ideas/suggestions/thoughts/comments on how to reduce this time. As a solution, I am thinking that perhaps we could have 2 independent tables, one which stores data partitioned by day and the other partitioned by userId. With the second table partitioned by userId, I will have to find some way of maintaining the partitions since Hive doesn't support appending of files. Also, this seems suboptimal, since we are doubling that the amount of data that we store. What do you folks think of this idea? Do you have any other suggestions on how we can approach this problem? What have other people in similar situations done? Please share. Thank you in advance! Mark +
Aggarwal, Vaibhav 2011-09-06, 20:09
|