Hive, mail # user - Multi-GroupBy-Insert optimization


Re: Multi-GroupBy-Insert optimization
Jan Dolinár 2012-06-05, 05:42
On 6/4/12, shan s <[EMAIL PROTECTED]> wrote:
> Thanks for the explanation Jan.
> If I understand correctly, the input will be read one single time and will
> be preprocessed in some form,  and this intermediate data is used for
> subsequent group-by..
> Not sure if my scenario will help this single step, since group-by varies
> across vast entities.

Yes, that is correct. The simplest use case is when you
only scan a part of the table. But if you are interested in all the data,
it is not going to help you much.

> If I were to implement group-by,manually, generally  we could club them
> together in single program. Can I do better with hive, with some
> hints/optimizations?
> Or  is there a possibility that Pig might perform better in this case.(
> Assuming Pig would probably handle this in a single job?)

In some cases you might be able to outsmart the Hive optimizer and
write the MapReduce job directly in Java in such a way that it
performs better. In most cases though, it is probably not worth the
trouble. You might easily end up in a situation where buying more
machines is cheaper than developing a low-level solution that might
or might not be slightly faster... I'm not familiar with Pig or any
other tools that might be of use in your situation.
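
To illustrate what "clubbing the group-bys together in a single program"
means, here is a minimal sketch in plain Python. It is not Hive or
MapReduce code, and the function name, field names and sample rows are
made up for the example. It only shows the idea of reading the input
once and updating several groupings from each row.

```python
from collections import defaultdict


def multi_group_by(rows, keys, value_field):
    """Scan rows once and sum value_field per distinct value of each key."""
    totals = {key: defaultdict(int) for key in keys}
    for row in rows:        # single pass over the input
        for key in keys:    # every grouping is updated from the same row
            totals[key][row[key]] += row[value_field]
    return {key: dict(groups) for key, groups in totals.items()}


# Hypothetical sample data, assumed only for this illustration.
rows = [
    {"country": "CZ", "browser": "firefox", "hits": 3},
    {"country": "CZ", "browser": "chrome", "hits": 2},
    {"country": "IN", "browser": "firefox", "hits": 5},
]

result = multi_group_by(rows, ["country", "browser"], "hits")
print(result["country"])
print(result["browser"])
```

A real job would also have to shuffle and merge the partial results
across machines, which is where most of the development effort goes.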

Jan