RE: Questions for the future work of Hive
Ashish Thusoo 2009-08-05, 19:08
Do you also need to be able to append the new data to an existing partition?
From: Schubert Zhang [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, August 05, 2009 11:43 AM
To: [EMAIL PROTECTED]
Subject: Re: Questions for the future work of Hive
Regarding automatic multi-partition insertion: is this the planned "Inserts without listing partitions" feature?
In our applications we really want this feature, since data arrives in our data warehouse continually and we cannot know the target partition before reading each row.
Regarding Hive backed by HBase: I think Hive could also store persistent data in HBase, with the following advantages:
1. The placement of each row is handled by HBase.
2. The stored rows are sorted and indexed by HBase, and the index is a global table index.
3. The data in HBase gets a SQL query interface via Hive.
On Wed, Aug 5, 2009 at 3:26 PM, Zheng Shao <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote:
1) We have not started working on a cost-based optimizer yet. Indexing
is one of the ongoing efforts on the performance side. We are working
on a couple more, e.g. a more compact on-disk format (LazyBinarySerDe
https://issues.apache.org/jira/browse/HIVE-640 ), which gives a nice
speed-up for queries with multiple map-reduce jobs.
2) We don't have a short-term plan for automatic-multi-partition
insertion. However, there is a simple workaround if you know the
partition values (and Hive can do multiple inserts in a single
map-reduce job!). "src" can be a subquery as well.
FROM src
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-01")
  SELECT * WHERE ts = "2009-08-01"
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-02")
  SELECT * WHERE ts = "2009-08-02"
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-03")
  SELECT * WHERE ts = "2009-08-03"
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-04")
  SELECT * WHERE ts = "2009-08-04"
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-05")
  SELECT * WHERE ts = "2009-08-05"
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-06")
  SELECT * WHERE ts = "2009-08-06"
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-07")
  SELECT * WHERE ts = "2009-08-07";
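
When the partition values are known up front, writing one INSERT clause per value by hand gets tedious, so the statement can be generated instead. The following is only a sketch: the helper names `daily_partitions` and `build_multi_insert` are hypothetical, the table and column names are taken from the example above, and it assumes the multi-insert form with a single leading FROM clause naming the source (which is what the remark about "src" implies). Values are pasted into the statement without escaping, so it is only suitable for trusted inputs such as generated dates.

```python
from datetime import date, timedelta


def daily_partitions(start: date, days: int) -> list[str]:
    """Return ISO-formatted dates for `days` consecutive days from `start`."""
    return [(start + timedelta(days=i)).isoformat() for i in range(days)]


def build_multi_insert(source: str, target: str, part_col: str,
                       filter_col: str, values: list[str]) -> str:
    """Build one multi-insert statement with one INSERT clause per value.

    Hypothetical helper: no quoting or escaping is done, so every
    argument is assumed to come from a trusted source.
    """
    if not values:
        raise ValueError("at least one partition value is required")
    lines = [f"FROM {source}"]
    for v in values:
        lines.append(
            f'INSERT OVERWRITE TABLE {target} PARTITION({part_col}="{v}") '
            f'SELECT * WHERE {filter_col} = "{v}"'
        )
    # A single terminating semicolon: the clauses form one statement.
    return "\n".join(lines) + ";"


if __name__ == "__main__":
    values = daily_partitions(date(2009, 8, 1), 7)
    print(build_multi_insert("src", "tgt", "pcol", "ts", values))
```

Run as a script, this prints the generated statement for the seven days starting 2009-08-01. It does not remove the need to know the partition values before the job runs; it only automates the typing.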
There is some ongoing work for integrating HBase tables with Hive:
We won't know which storage backend is the best until we have them
done and tested, but at the least HBase looks very promising for
datasets that fit in memory.
Here are the slides, which contain examples of how to add a new storage
backend (file format) to Hive:
Hive is completely open and we hope Hive can have more storage
backends, because it's not likely that one storage backend will be the
best for all kinds of applications.
On Wed, Aug 5, 2009 at 12:06 AM, Schubert Zhang<[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote:
> In the Hive paper <Hive - A Warehousing Solution Over a MapReduce
> Framework>, section 5 describes the FUTURE WORK of Hive. I would like more
> detail on the following two points:
> (1) Hive currently has a naive rule-based optimizer with a small number of
> simple rules. We plan to build a cost-based optimizer and adaptive
> optimization techniques to come up with more efficient plans.
> Q: Is the ongoing "Indexing" work part of this improvement?
> Q: Are there any others?
> (2) We are exploring columnar storage and more intelligent data placement to
> improve scan performance.
> Q: We found that current Hive cannot place data into different partitions
> intelligently (we must specify the partition value in statements). Is
> intelligent/dynamic placement into partitions part of this improvement? For
> example, we have many input files that contain records with different
> timestamps, and we want to place each record into the proper partition
> according to the timestamp column.
> Q: Do you think Bigtable/HBase is a good columnar storage which provides