Hive, mail # user - Questions for the future work of Hive


Schubert Zhang 2009-08-05, 07:06
Zheng Shao 2009-08-05, 07:26
Schubert Zhang 2009-08-05, 18:42
Ashish Thusoo 2009-08-05, 19:08

Re: Questions for the future work of Hive
Schubert Zhang 2009-08-06, 04:01
Ashish,

Yes, we need to append new data to an existing partition.

I think the approach in Zheng Shao's reply, placing different rows into
different partitions, is ineffective, since we must run many SELECT ...
WHERE ... mapreduce jobs. And in many cases, we cannot list the partitions
in the source dataset in advance.
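
To make the limitation concrete, here is a minimal sketch (in Python, only for illustration) that generates the multi-insert statement from Zheng's reply. The helper name build_multi_insert and its parameters are hypothetical; the table and column names (src, tgt, pcol, ts) are the ones used in the quoted example. The point is that the list of partition values must be known before the statement can even be written.

```python
from typing import List


def build_multi_insert(src: str, tgt: str, pcol: str, ts_col: str,
                       values: List[str]) -> str:
    """Build one Hive multi-insert statement with one INSERT clause per value.

    Hypothetical helper: it only formats text in the shape of the quoted
    workaround and does not talk to Hive.
    """
    clauses = [
        f'INSERT OVERWRITE TABLE {tgt} PARTITION({pcol}="{v}") '
        f'SELECT * WHERE {ts_col} = "{v}"'
        for v in values
    ]
    # Every partition value has to be listed explicitly by the caller.
    return f"FROM {src}\n" + "\n".join(clauses) + ";"


if __name__ == "__main__":
    print(build_multi_insert("src", "tgt", "pcol", "ts",
                             ["2009-08-01", "2009-08-02"]))
```

If the values are not known until each row is read, there is nothing to pass in as the list, which is the case described above.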

In my project, we have implemented a mapreduce job to achieve this, but
it is very specific (we have not found a good way to generalize it).
Here is what we did:
(1) Sort rows by key = PartitionColumn+TheKeyColumnToSort
(2) Detect the partition changes in MyOutputFormat and write to
different files in different partitions.
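
The two steps above can be sketched roughly as follows. This is not our actual job: it is a plain Python illustration, the row layout (partition value, sort key, payload) is assumed, and a dict of lists stands in for the per-partition output files. In a real mapreduce job the sort in step (1) would come from the shuffle, and step (2) would live in the record writer of the output format.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed row layout for this sketch: (partition value, sort key, payload).
Row = Tuple[str, str, str]


def write_partitioned(rows: Iterable[Row]) -> Dict[str, List[Tuple[str, str]]]:
    """Route rows to per-partition outputs by watching for partition changes."""
    # Step (1): sort by the composite key PartitionColumn + TheKeyColumnToSort.
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))

    outputs: Dict[str, List[Tuple[str, str]]] = {}  # stand-in for output files
    current_partition: Optional[str] = None
    current_file: List[Tuple[str, str]] = []

    for partition, key, payload in ordered:
        # Step (2): the input is sorted, so a different partition value means
        # the previous partition is complete and a new output must be opened.
        if partition != current_partition:
            current_partition = partition
            current_file = outputs.setdefault(partition, [])
        current_file.append((key, payload))
    return outputs


if __name__ == "__main__":
    sample = [
        ("2009-08-02", "b", "x"),
        ("2009-08-01", "z", "y"),
        ("2009-08-01", "a", "w"),
    ]
    for part, records in write_partitioned(sample).items():
        print(part, records)
```

Because the rows arrive sorted, only one output needs to be open at a time, and the partition values do not have to be listed beforehand.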

Schubert
On Thu, Aug 6, 2009 at 3:08 AM, Ashish Thusoo <[EMAIL PROTECTED]> wrote:

>  Do you also need to be able to append the new data to an
> existing partition?
>
> Ashish
>
>  ------------------------------
> *From:* Schubert Zhang [mailto:[EMAIL PROTECTED]]
> *Sent:* Wednesday, August 05, 2009 11:43 AM
> *To:* [EMAIL PROTECTED]
> *Subject:* Re: Questions for the future work of Hive
>
>   Thanks Zheng.
>
> Regarding automatic-multi-partition insertion, is it the future work item
> "Inserts without listing partitions"?
> In our applications, we really want this feature, since our data comes
> into the data warehouse continually and we cannot know the partition before
> reading each row.
>
> Regarding Hive backed by HBase, I think it can also store persistent
> data in HBase, with the following advantages:
> 1. The placement of each row is handled by HBase.
> 2. The stored rows are sorted and indexed by HBase, and the index is a
> global table index.
> 3. The data in HBase can provide SQL query interface via Hive.
> Schubert
> On Wed, Aug 5, 2009 at 3:26 PM, Zheng Shao <[EMAIL PROTECTED]> wrote:
>
>> 1) We have not started working on cost-based optimizer yet. Index is
>> one of the ongoing works on the performance side. We are working on a
>> couple more, e.g. more compact on-disk format (LazyBinarySerDe
>> https://issues.apache.org/jira/browse/HIVE-640 ) which gives a nice
>> speed-up for queries with multiple map-reduce jobs.
>>
>> 2) We don't have a short-term plan for automatic-multi-partition
>> insertion. However there is a simple workaround if you know the
>> partition values (and Hive can do multiple inserts in a single
>> map-reduce job!). "src" can be a sub query as well.
>> FROM src
>> INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-01") SELECT * WHERE
>> ts = "2009-08-01"
>> INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-02") SELECT * WHERE
>> ts = "2009-08-02"
>> INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-03") SELECT * WHERE
>> ts = "2009-08-03"
>> INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-04") SELECT * WHERE
>> ts = "2009-08-04"
>> INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-05") SELECT * WHERE
>> ts = "2009-08-05"
>> INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-06") SELECT * WHERE
>> ts = "2009-08-06"
>> INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-07") SELECT * WHERE
>> ts = "2009-08-07";
>>
>> There is some ongoing work for integrating HBase tables with Hive:
>> https://issues.apache.org/jira/browse/HIVE-705
>> We won't know which storage backend is the best until we have them
>> done and tested, but at the least HBase looks very promising for
>> datasets that fit in memory.
>>
>> Here are the slides, which contain examples of how to add a new storage
>> backend (file format) to Hive:
>> http://www.slideshare.net/ragho/hive-user-meeting-august-2009-facebookpage
>> Hive is completely open and we hope Hive can have more storage
>> backends, because it's not likely that one storage backend will be the
>> best for all kinds of applications.
>>
>> Zheng
>>
>> On Wed, Aug 5, 2009 at 12:06 AM, Schubert Zhang<[EMAIL PROTECTED]> wrote:
>> > In the Hive paper <Hive - A Warehousing Solution Over a MapReduce
>> > Framework>, section 5 describes the FUTURE WORK of Hive. I want to get
>> > more detail on the following two points:

Andraz Tori 2009-08-10, 08:11
Ashish Thusoo 2009-08-10, 19:34