Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
Guys,

Ok... You're putting a lot of thought into this, which is a good thing.

I really haven't looked at the bulk load, so I have some homework :-)
In response to your discussion...
1) How fast is fast enough?
I mean, sure, if you create a temp table on the fly, you could end up with a single region becoming a hot spot. Is it more than just a bottleneck, or can you hurt your RS and HBase? If it's only a bottleneck, remember that this is only a temp table. You have control over setting the max file size and pre-splitting.
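
To make the pre-splitting point concrete, here is a minimal sketch of computing split points for a temp table. It assumes the temp table's row keys start with a single, roughly uniformly distributed byte (for example a hash prefix); that key layout, the class name SplitKeySketch, and the method evenSplitKeys are all hypothetical and not something from this thread.

```java
// Sketch only: computes evenly spaced single-byte split points.
// Assumption: row keys begin with one uniformly distributed byte (e.g. a hash prefix).
class SplitKeySketch {

    // Returns numRegions - 1 split keys that divide the range 0x00..0xFF evenly.
    static byte[][] evenSplitKeys(int numRegions) {
        if (numRegions < 2 || numRegions > 256) {
            throw new IllegalArgumentException("numRegions must be between 2 and 256");
        }
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            // The boundary between region i-1 and region i, as one unsigned byte.
            splits[i - 1] = new byte[] {(byte) (i * 256 / numRegions)};
        }
        return splits;
    }
}
```

In a real job, an array like this would be handed to the table-creation call that accepts split keys, so the temp table starts out with several regions instead of one.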

2) KISS.
The first step is starting to realize that you have a database so why do you not want to take advantage of it? :-)
Your first iteration may not be the most efficient solution, but it should be faster than using a reducer and/or combiner/reducer. Sure, there's no free lunch, but using the HBase tables should be more efficient. I'm not suggesting that this is always going to be faster, or better, but that from the problem sets we have worked with... It made more sense.
( ok, I'm an old database guy... So my opinion is skewed... )

3) Keeping data till the end of the task may work for some jobs.
In the cleanup() method you could write out the data, provided you have enough memory...
I'm sure there are pros and cons to it... But it's a good design idea to think about.
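
A minimal sketch of the idea in point 3, as described further down the thread: aggregate in a HashMap while rows are being mapped, then write once per map task in cleanup(). This is plain Java standing in for a Hadoop Mapper so that it runs on its own. The class name AggregatingMapperSketch and the sink callback are hypothetical; in a real job the sink would be the HBase write.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Plain-Java stand-in for a mapper: map() is called once per row,
// cleanup() once at the end of the task.
class AggregatingMapperSketch {

    // Aggregate key -> running count. Holds one entry per distinct key,
    // not one per input row, which is what keeps the heap use bounded.
    private final Map<String, Long> counts = new HashMap<>();

    // Hypothetical sink standing in for the HBase write.
    private final BiConsumer<String, Long> sink;

    AggregatingMapperSketch(BiConsumer<String, Long> sink) {
        this.sink = sink;
    }

    // Per-row work: update the in-memory count only, no write yet.
    void map(String aggregateKey) {
        counts.merge(aggregateKey, 1L, Long::sum);
    }

    // End-of-task work: a single pass over the map, one write per aggregate.
    void cleanup() {
        counts.forEach(sink);
        counts.clear();
    }

    public static void main(String[] args) {
        Map<String, Long> written = new HashMap<>();
        AggregatingMapperSketch mapper = new AggregatingMapperSketch(written::put);
        for (String key : new String[] {"a", "b", "a", "a"}) {
            mapper.map(key);
        }
        mapper.cleanup();
        System.out.println(written);
    }
}
```

The memory cost scales with the number of distinct aggregate keys seen by one task, so this only fits jobs where that number is small relative to the map-task heap.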

It's really cool that people are now thinking about this...
Sent from a remote device. Please excuse any typos...

Mike Segel

On Sep 16, 2011, at 8:47 PM, Doug Meil <[EMAIL PROTECTED]> wrote:

>
> Map-task heap size would definitely be a concern, but since the hashmap
> would only contain aggregations, ostensibly this map would be holding a
> far smaller number of the rows that were passed into the mapper.
>
> At least that's how I'd use it.
>
>
>
> On 9/16/11 9:39 PM, "Sam Seigal" <[EMAIL PROTECTED]> wrote:
>
>> Aren't there memory considerations with this approach? I would assume
>> the HashMap can get pretty big if it retains in memory every record
>> that passes through... (Apologies if I am being ignorant with my
>> limited knowledge of Hadoop's internal workings...)
>>
>> On Fri, Sep 16, 2011 at 6:14 PM, Doug Meil
>> <[EMAIL PROTECTED]> wrote:
>>>
>>> However, if the aggregations in the mapper were kept in a HashMap (key
>>> being the aggregate, value being the count), and then the mapper made a
>>> single pass over this map during the cleanup method and then did the
>>> checkAndPuts, it would mean that the writes would only happen once per
>>> map-task, and not do it on a per-row basis (which would be really
>>> expensive).
>>>
>>> A single region on a single RS could handle that no problem.
>>>
>>>
>>>
>>>
>>> On 9/16/11 9:00 PM, "Sam Seigal" <[EMAIL PROTECTED]> wrote:
>>>
>>>> I see what you are saying about the temp table being hosted on a
>>>> single region server, especially for a limited set of rows that
>>>> just care about the aggregations but receive a lot of traffic. I
>>>> wonder if this would also be the case if I were to use the source table
>>>> to maintain these temporary records, and not create a temp table on
>>>> the fly...
>>>>
>>>> On Fri, Sep 16, 2011 at 5:24 PM, Doug Meil
>>>> <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>> I'll add this to the book in the MR section.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 9/16/11 8:22 PM, "Doug Meil" <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>>
>>>>>> I was in the middle of responding to Mike's email when yours arrived,
>>>>>> so I'll respond to both.
>>>>>>
>>>>>> I think the temp-table idea is interesting.  The caution is that a
>>>>>> default temp-table creation will be hosted on a single RS and thus be
>>>>>> a bottleneck for aggregation.  So I would imagine that you would need
>>>>>> to tune the temp-table for the job and pre-create regions.
>>>>>>
>>>>>> Doug
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 9/16/11 8:16 PM, "Sam Seigal" <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> I am trying to do something similar with HBase Map/Reduce.