Re: How to best decide mapper output/reducer input for a huge string?
@Rahul, yes, you are right. 21 mappers are spawned, and all 21 run at the same
time. @Pradeep, I should do the compression like you say; I'll give it a shot.
As far as I can see, I think I'll also need to implement Writable and write out
the mapper key using specific data types instead of writing it out as a string,
which might be slowing the operation down.
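Something along these lines is what I have in mind (just a sketch; the fields are
placeholders for the actual key parts):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Typed composite map-output key instead of a concatenated string.
public class CompositeKey implements WritableComparable<CompositeKey> {
    private long id;        // placeholder field
    private int bucket;     // placeholder field

    public CompositeKey() {}                  // no-arg constructor required by Hadoop

    public CompositeKey(long id, int bucket) {
        this.id = id;
        this.bucket = bucket;
    }

    public void write(DataOutput out) throws IOException {
        out.writeLong(id);                    // compact binary fields,
        out.writeInt(bucket);                 // no string parsing on the reduce side
    }

    public void readFields(DataInput in) throws IOException {
        id = in.readLong();
        bucket = in.readInt();
    }

    public int compareTo(CompositeKey other) {
        if (id != other.id) {
            return id < other.id ? -1 : 1;
        }
        if (bucket != other.bucket) {
            return bucket < other.bucket ? -1 : 1;
        }
        return 0;
    }

    public int hashCode() {                   // used by the default HashPartitioner
        return (int) (id * 163) + bucket;
    }

    public boolean equals(Object o) {
        if (!(o instanceof CompositeKey)) {
            return false;
        }
        CompositeKey k = (CompositeKey) o;
        return id == k.id && bucket == k.bucket;
    }
}
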
On Mon, Sep 23, 2013 at 9:29 AM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:

> Pavan,
>
> It's hard to tell whether there's anything wrong with your design or not
> since you haven't given us specific enough details. The best thing you can
> do is instrument your code and see what is taking a long time. Rahul
> mentioned a problem that I myself have seen before, with only one region
> (or a couple) having any data. So even if you have 21 regions, only one
> mapper might be doing the heavy lifting.
>
> A combiner is hugely helpful in terms of reducing the data output of
> mappers. Writing a combiner is a best practice and you should almost always
> have one. Compression can be turned on by setting the following properties
> in your job config.
>  <property>
>      <name>mapreduce.map.output.compress</name>
>      <value>true</value>
>  </property>
>  <property>
>      <name>mapreduce.map.output.compress.codec</name>
>      <value>org.apache.hadoop.io.compress.GzipCodec</value>
>  </property>
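> The same settings can also be applied programmatically in the job driver; a rough
> sketch (assuming the new mapreduce API, with MyMapper/MyCombiner/MyReducer as
> placeholders for your own classes):
>
>     // Inside the job driver (imports: org.apache.hadoop.conf.Configuration,
>     // org.apache.hadoop.mapreduce.Job).
>     Configuration conf = new Configuration();
>     conf.setBoolean("mapreduce.map.output.compress", true);
>     conf.setClass("mapreduce.map.output.compress.codec",
>         org.apache.hadoop.io.compress.GzipCodec.class,
>         org.apache.hadoop.io.compress.CompressionCodec.class);
>
>     Job job = new Job(conf, "my job");        // Job.getInstance(conf, "my job") on newer releases
>     job.setMapperClass(MyMapper.class);       // placeholder mapper
>     job.setCombinerClass(MyCombiner.class);   // placeholder combiner; the reducer class can often be reused
>     job.setReducerClass(MyReducer.class);     // placeholder reducer
>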
> You can also try other compression codecs such as Lzo, Snappy, Bzip2, etc.
> depending on your use cases. Gzip is really slow but gets the best
> compression ratios. Snappy/Lzo are a lot faster but don't have as good of a
> compression ratio. If your computations are CPU bound, then you'd probably
> want to use Snappy/Lzo. If your computations are I/O bound, and your CPUs
> are idle, you can use Gzip. You'll have to experiment and find the best
> settings for you. There are a lot of other tweaks that you can try to get
> the best performance out of your cluster.
>
> One of the best things you can do is to install Ganglia (or some other
> similar tool) on your cluster and monitor usage of resources while your job
> is running. This will tell you if your job is I/O bound or CPU bound.
>
> Take a look at this paper by Intel about optimizing your Hadoop cluster
> and see if that fits your deployment.
> http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf
>
> If your cluster is already optimized and your job is not I/O bound, then
> there might be a problem with your algorithm and might warrant a redesign.
>
> Hope this helps!
> - Pradeep
>
>
> On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee <
> [EMAIL PROTECTED]> wrote:
>
>> One mapper is spawned per HBase table region. You can use the HBase admin UI
>> to find the number of regions per table. It might happen that all the
>> data is sitting in a single region, so a single mapper is spawned and you
>> are not getting enough parallel work done.
>>
>> If that is the case then you can recreate the tables with predefined
>> splits to create more regions.
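>>
>> A rough sketch of pre-splitting with the Java client (the table name, column
>> family and split points are placeholders; pick split points from your actual
>> row-key distribution):
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.hbase.HBaseConfiguration;
>> import org.apache.hadoop.hbase.HColumnDescriptor;
>> import org.apache.hadoop.hbase.HTableDescriptor;
>> import org.apache.hadoop.hbase.client.HBaseAdmin;
>> import org.apache.hadoop.hbase.util.Bytes;
>>
>> public class PreSplitTable {
>>     public static void main(String[] args) throws Exception {
>>         Configuration conf = HBaseConfiguration.create();
>>         HBaseAdmin admin = new HBaseAdmin(conf);
>>
>>         HTableDescriptor desc = new HTableDescriptor("my_table");   // placeholder table name
>>         desc.addFamily(new HColumnDescriptor("cf"));                // placeholder column family
>>
>>         // One split point per region boundary: N split points -> N + 1 regions.
>>         byte[][] splits = new byte[][] {
>>             Bytes.toBytes("1"), Bytes.toBytes("2"), Bytes.toBytes("3")
>>         };
>>         admin.createTable(desc, splits);
>>         admin.close();
>>     }
>> }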
>>
>> Thanks,
>> Rahul
>>
>>
>> On Sun, Sep 22, 2013 at 4:38 AM, John Lilley <[EMAIL PROTECTED]> wrote:
>>
>>> Pavan,
>>>
>>> How large are the rows in HBase?  22 million rows is not very much but
>>> you mentioned “huge strings”.  Can you tell which part of the processing is
>>> the limiting factor (read from HBase, mapper output, reducers)?
>>>
>>> John
>>>
>>> From: Pavan Sudheendra [mailto:[EMAIL PROTECTED]]
>>> Sent: Saturday, September 21, 2013 2:17 AM
>>> To: [EMAIL PROTECTED]
>>> Subject: Re: How to best decide mapper output/reducer input for a huge string?
>>>
>>> No, I don't have a combiner in place. Is it necessary? How do I make my
>>> map output compressed? Yes, the tables in HBase are compressed.
>>>
>>> Although there's no real bottleneck, the time it takes to process the
>>> entire table is huge. I have to constantly check if I can optimize it.
Regards-
Pavan