Re: Questions for the future work of Hive
Schubert Zhang 2009-08-05, 18:42
Regarding automatic multi-partition insertion, is it the planned feature
"Inserts without listing partitions"?
In our applications we really want this feature, since our data arrives in
the data warehouse continually and we cannot know which partition a row
belongs to before reading it.
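
Until such a feature exists, the multi-insert workaround quoted below can be generated by a script once the partition values are known. The following is only a minimal sketch under stated assumptions: the helper names build_multi_insert and date_range are hypothetical, the table and column names (src, tgt, pcol, ts) are taken from the quoted example, and the partition values are assumed to be trusted strings that are known in advance (for example from an earlier pass over the data).

```python
from datetime import date, timedelta


def build_multi_insert(src, tgt, part_col, filter_col, values):
    """Build one HiveQL multi-insert statement with one INSERT clause per value.

    Hypothetical helper: values are interpolated as-is, so they are assumed
    to be trusted strings that contain no double quotes.
    """
    if not values:
        raise ValueError("at least one partition value is required")
    clauses = [
        f'INSERT OVERWRITE TABLE {tgt} PARTITION({part_col}="{v}") '
        f'SELECT * WHERE {filter_col} = "{v}"'
        for v in values
    ]
    # A single FROM clause is shared by every INSERT clause, as in the
    # quoted example, followed by the terminating semicolon.
    return f"FROM {src}\n" + "\n".join(clauses) + ";"


def date_range(start, days):
    """Return `days` consecutive ISO dates (YYYY-MM-DD) starting at `start`."""
    return [(start + timedelta(days=i)).isoformat() for i in range(days)]


if __name__ == "__main__":
    # Regenerates the seven-day statement from the quoted reply.
    print(build_multi_insert("src", "tgt", "pcol", "ts",
                             date_range(date(2009, 8, 1), 7)))
```

This does not remove the limitation: the script still has to be told every partition value before the statement runs, which is exactly what we cannot do for continually arriving data.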
Regarding Hive backed by HBase, I think Hive could also store persistent
data in HBase, with the following advantages:
1. The placement of each row is handled by HBase.
2. The stored rows are sorted and indexed by HBase, and the index is a
global table index.
3. The data in HBase can be queried with SQL via Hive.
On Wed, Aug 5, 2009 at 3:26 PM, Zheng Shao <[EMAIL PROTECTED]> wrote:
> 1) We have not started working on a cost-based optimizer yet. Indexing
> is one of the ongoing efforts on the performance side. We are working on
> a couple more, e.g. a more compact on-disk format (LazyBinarySerDe
> https://issues.apache.org/jira/browse/HIVE-640 ) which gives a nice
> speed-up for queries with multiple map-reduce jobs.
> 2) We don't have a short-term plan for automatic-multi-partition
> insertion. However there is a simple workaround if you know the
> partition values (and Hive can do multiple inserts in a single
> map-reduce job!). "src" can be a sub query as well.
> FROM src
> INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-01") SELECT * WHERE
> ts = "2009-08-01"
> INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-02") SELECT * WHERE
> ts = "2009-08-02"
> INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-03") SELECT * WHERE
> ts = "2009-08-03"
> INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-04") SELECT * WHERE
> ts = "2009-08-04"
> INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-05") SELECT * WHERE
> ts = "2009-08-05"
> INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-06") SELECT * WHERE
> ts = "2009-08-06"
> INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-07") SELECT * WHERE
> ts = "2009-08-07";
> There is some ongoing work for integrating HBase tables with Hive:
> We won't know which storage backend is the best until we have them
> done and tested, but at the very least HBase looks very promising for
> datasets that fit in memory.
> Here are the slides, which contain examples of how to add a new storage
> backend (file format) to Hive:
> Hive is completely open and we hope Hive can have more storage
> backends, because it's not likely that one storage backend will be the
> best for all kinds of applications.
> On Wed, Aug 5, 2009 at 12:06 AM, Schubert Zhang<[EMAIL PROTECTED]> wrote:
> > In the Hive paper <Hive - A Warehousing Solution Over a MapReduce
> > Framework>, section 5 describes the FUTURE WORK of Hive. I want to
> > know more detail about the following two points:
> > (1) Hive currently has a naive rule-based optimizer with a small number
> > of simple rules. We plan to build a cost-based optimizer and adaptive
> > optimization techniques to come up with more efficient plans.
> > Q: Is the ongoing "Indexing" work one of these improvements?
> > Q: Are there any more?
> > (2) We are exploring columnar storage and more intelligent data placement
> > to improve scan performance.
> > Q: We found that current Hive cannot place the data in different
> > partitions intelligently (we must specify the partition value in
> > statements). Is the intelligent/dynamic placement of partitions one of
> > these improvements? For example, we have many input files which contain
> > many records for each timestamp, and we want to place each record into
> > the proper partition according to the timestamp column.
> > Q: Do you think Bigtable/HBase is a good columnar storage which provides
> > a good model of intelligent data placement?
> > Schubert