Hive, mail # user - mapper is slower than hive' mapper


Re: mapper is slower than hive' mapper
Edward Capriolo 2012-08-01, 15:49
Hive does not use combiners; it uses map-side aggregation. Hive does
use Writables: sometimes the ones from Hadoop, sometimes its own
custom Writables for things like timestamps.
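
To make "map-side aggregation" concrete, here is a minimal sketch in plain Java. It is not Hive's actual implementation, and it deliberately avoids Hadoop classes so it can run standalone. The class name, the flush threshold, and the emitter callback (which stands in for context.write) are all hypothetical, and it assumes each line holds exactly three delimited fields in the order group column, group column, value. The idea is that the mapper keeps partial sums in a hash table and emits one record per group instead of one record per input row.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/**
 * Sketch of map-side aggregation for:
 *   select sum(column1) from table group by column2, column3
 * Partial sums live in an in-memory hash table and are emitted only when
 * the table grows too large or the input ends. All names are hypothetical.
 */
final class MapSideAggregationSketch {
    private final Map<String, Double> partialSums = new HashMap<>();
    private final int maxEntries;                      // assumed flush threshold
    private final char delimiter;                      // field delimiter of the input lines
    private final BiConsumer<String, Double> emitter;  // stands in for context.write

    public MapSideAggregationSketch(int maxEntries, char delimiter,
                                    BiConsumer<String, Double> emitter) {
        this.maxEntries = maxEntries;
        this.delimiter = delimiter;
        this.emitter = emitter;
    }

    /** Called once per input line, in the role of Mapper.map. */
    public void map(String line) {
        int first = line.indexOf(delimiter);
        int second = line.indexOf(delimiter, first + 1);
        // The two grouping columns together form the key; no String[] is allocated.
        String groupKey = line.substring(0, second);
        double value = Double.parseDouble(line.substring(second + 1));
        partialSums.merge(groupKey, value, Double::sum);
        if (partialSums.size() >= maxEntries) {
            flush(); // bound memory use by emitting partial results early
        }
    }

    /** Called at end of input, in the role of Mapper.cleanup. */
    public void flush() {
        partialSums.forEach(emitter);
        partialSums.clear();
    }
}
```

Because a flush may happen before the input ends, the same group can be emitted more than once, so the reducer still has to sum the partial results it receives.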

On Wed, Aug 1, 2012 at 11:40 AM, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:
> I am not sure about Hive but if you look at Cascading they use a pseudo
> combiner instead of the standard (I mean Hadoop's) combiner.
> I guess Hive has a similar strategy.
>
> The point is that when you use a compiler, the compiler does smart things
> that you don't need to think about (like loop unrolling).
> The result is that your code is still readable but optimized and in most
> cases the compiler will do better than you.
>
> Even your naive implementation of the Mapper (without the Reducer and the
> configuration) is more complicated than the whole Hive query.
>
> As Chuck said, Hive is basically a MapReduce compiler. It is fun to look at
> how it works. But it is often best to let the compiler work for you instead
> of trying to beat it.
>
> For simple cases, like a 'select', Hive (or any other alternative at the
> same level) is helpful. And for complex cases, with multiple joins, you will
> want something like Hive too, because with the vanilla MapReduce API
> it can become quite hard to grasp everything. Basically, there are two
> reasons: it is faster to express and cheaper to maintain.
>
> One reason not to use Hive is if your approach is more programmatic, for
> example if you want to do machine learning, which requires a highly specific
> workflow and user-defined functions.
>
> It would be interesting to know your goal: are you trying to benchmark
> Hive (and yourself)? Or do you have other reasons?
>
> Bertrand
>
>
> On Wed, Aug 1, 2012 at 5:13 PM, Edward Capriolo <[EMAIL PROTECTED]>
> wrote:
>>
>> As mentioned, if you avoid using new by re-using objects, and possibly
>> use buffer objects, you may be able to match or beat the speed. But in
>> the general case Hive saves you time by allowing you not to worry
>> about low-level details like this.
>>
>> On Wed, Aug 1, 2012 at 10:35 AM, Connell, Chuck
>> <[EMAIL PROTECTED]> wrote:
>> > This is actually not surprising. Hive is essentially a MapReduce
>> > compiler. It is common for regular compilers (C, C#, Fortran) to emit faster
>> > assembler code than you write yourself. Compilers know the tricks of their
>> > target language.
>> >
>> > Chuck Connell
>> > Nuance R&D Data Team
>> > Burlington, MA
>> >
>> >
>> > -----Original Message-----
>> > From: Yue Guan [mailto:[EMAIL PROTECTED]]
>> > Sent: Wednesday, August 01, 2012 10:29 AM
>> > To: [EMAIL PROTECTED]
>> > Subject: mapper is slower than hive' mapper
>> >
>> > Hi, there
>> >
>> > I'm writing a MapReduce job to replace some Hive queries, and I find
>> > that my mapper is slower than Hive's mapper. The Hive query is like:
>> >
>> > select sum(column1) from table group by column2, column3;
>> >
>> > My MapReduce program looks like this:
>> >
>> >      public static class HiveTableMapper
>> >              extends Mapper<BytesWritable, Text, MyKey, DoubleWritable> {
>> >
>> >          public void map(BytesWritable key, Text value, Context context)
>> >                  throws IOException, InterruptedException {
>> >              String[] sLine = StringUtils.split(value.toString(),
>> >                      StringUtils.ESCAPE_CHAR, HIVE_FIELD_DELIMITER_CHAR);
>> >              context.write(
>> >                      new MyKey(Integer.parseInt(sLine[0]), sLine[1]),
>> >                      new DoubleWritable(Double.parseDouble(sLine[2])));
>> >          }
>> >
>> >      }
>> >
>> > I assume hive is doing something similar. Is there any trick in hive to
>> > speed this thing up? Thank you!
>> >
>> > Best,
>> >
>
>
>
>
> --
> Bertrand Dechoux