

Re: Cluster Tuning
Set mapred.reduce.slowstart.completed.maps to a number close to 1.0.
1.0 means the maps have to completely finish before the reduce starts
copying any data. I often run jobs with this set to .90-.95.
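
For example, in mapred-site.xml (this is the stock property name in Hadoop
0.20/1.x; it can also be set per job):

<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <!-- reducers don't start fetching until 95% of maps are done -->
  <value>0.95</value>
</property>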

-Joey

On Fri, Jul 8, 2011 at 11:25 AM, Juan P. <[EMAIL PROTECTED]> wrote:
> Here's another thought. I realized that the reduce operation in my
> map/reduce jobs finishes in a flash, but it goes really slowly until the
> mappers end. Is there a way to configure the cluster to make the reduce wait
> for the map operations to complete? Especially considering my hardware
> constraints.
>
> Thanks!
> Pony
>
> On Fri, Jul 8, 2011 at 11:41 AM, Juan P. <[EMAIL PROTECTED]> wrote:
>
>> Hey guys,
>> Thanks all of you for your help.
>>
>> Joey,
>> I tweaked my MapReduce to serialize/deserialize only essential values and
>> added a combiner, and that helped a lot. Previously I had a domain object
>> which was being passed between Mapper and Reducer when I only needed a
>> single value.
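
A minimal sketch of that kind of driver change, using the newer
org.apache.hadoop.mapreduce API (the HostBytes* class names are illustrative,
not the actual job):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = new Job(conf, "per-host byte counts");
job.setMapperClass(HostBytesMapper.class);       // illustrative class name
// Reusing the reducer as a combiner pre-aggregates map output locally,
// so far less data is serialized and shuffled across the network.
// (Safe here because summing bytes is associative and commutative.)
job.setCombinerClass(HostBytesReducer.class);
job.setReducerClass(HostBytesReducer.class);
// Emit just the single value needed (host -> bytes), not a domain object.
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);

The combiner runs on map output before the shuffle, so with roughly 65,000
distinct hosts each mapper ships at most one record per host instead of one
per log line.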
>>
>> Esteban,
>> I think you underestimate the constraints of my cluster. Running multiple
>> tasks per JVM really kills me in terms of memory. Not to mention that with
>> a single core there's not much to gain in terms of parallelism (other than
>> perhaps while a process is waiting on an I/O operation). Still, I gave it a
>> shot, but even though I kept changing the config I always ended up with a
>> Java heap space error.
>>
>> Is it me, or is performance tuning mostly a per-job task? I mean it will, in
>> the end, depend on the data you are processing (structure, size, whether
>> it's in one file or many, etc.). If my jobs have different sets of data,
>> which are in different formats and organized in different file structures,
>> do you guys recommend moving some of the configuration to Java code?
>>
>> Thanks!
>> Pony
>>
>> On Thu, Jul 7, 2011 at 7:25 PM, Ceriasmex <[EMAIL PROTECTED]> wrote:
>>
>>> Are you the Esteban I know?
>>>
>>>
>>>
>>> On 07/07/2011, at 15:53, Esteban Gutierrez <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>> > Hi Pony,
>>> >
>>> > There is a good chance that your boxes are doing some heavy swapping, and
>>> > that is a killer for Hadoop. Have you tried
>>> > mapred.job.reuse.jvm.num.tasks=-1 and limiting the heap on those boxes
>>> > as much as possible?
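
In mapred-site.xml that combination would look something like this
(mapred.child.java.opts controls the per-task heap; 200m is just an
example value for 600MB boxes):

<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <!-- -1 = reuse each task JVM for an unlimited number of tasks -->
  <value>-1</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <!-- keep task heaps small so the boxes don't swap -->
  <value>-Xmx200m</value>
</property>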
>>> >
>>> > Cheers,
>>> > Esteban.
>>> >
>>> > --
>>> > Get Hadoop!  http://www.cloudera.com/downloads/
>>> >
>>> >
>>> >
>>> > On Thu, Jul 7, 2011 at 1:29 PM, Juan P. <[EMAIL PROTECTED]> wrote:
>>> >
>>> >> Hi guys!
>>> >>
>>> >> I'd like some help fine-tuning my cluster. I currently have 20 boxes
>>> >> exactly alike: single-core machines with 600MB of RAM. No chance of
>>> >> upgrading the hardware.
>>> >>
>>> >> My cluster is made out of 1 NameNode/JobTracker box and 19
>>> >> DataNode/TaskTracker boxes.
>>> >>
>>> >> All my config is default, except I've set the following in my
>>> >> mapred-site.xml in an effort to avoid choking my boxes:
>>> >> <property>
>>> >>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>> >>   <value>1</value>
>>> >> </property>
>>> >>
>>> >> I'm running a MapReduce job which reads a proxy server log file (2GB),
>>> >> maps hosts to each record, and then in the reduce task accumulates the
>>> >> amount of bytes received from each host.
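
A minimal sketch of that kind of job, again with the newer mapreduce API
(the class names and the log-field positions are illustrative assumptions,
not the actual code):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (host, bytes) for each log line.
class HostBytesMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private final Text host = new Text();
  private final LongWritable bytes = new LongWritable();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\\s+");
    host.set(fields[0]);                    // assumed: host in field 0
    bytes.set(Long.parseLong(fields[1]));   // assumed: byte count in field 1
    ctx.write(host, bytes);
  }
}

// Sums byte counts per host; also usable as the combiner.
class HostBytesReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
  private final LongWritable total = new LongWritable();

  @Override
  protected void reduce(Text host, Iterable<LongWritable> values, Context ctx)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable v : values) {
      sum += v.get();
    }
    total.set(sum);
    ctx.write(host, total);
  }
}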
>>> >>
>>> >> Currently it's producing about 65,000 keys.
>>> >>
>>> >> The whole job takes forever to complete, especially the reduce part. I've
>>> >> tried different tuning configs but I can't bring it down under 20 minutes.
>>> >>
>>> >> Any ideas?
>>> >>
>>> >> Thanks for your help!
>>> >> Pony
>>> >>
>>>
>>
>>
>

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434