Hadoop >> mail # user >> Limit number of records or total size in combiner input using jobconf?


Re: Limit number of records or total size in combiner input using jobconf?
Thank you.
On Fri, Feb 20, 2009 at 5:34 PM, Chris Douglas <[EMAIL PROTECTED]> wrote:
>> So here are my questions:
>> (1) Is there a jobconf hint to limit the number of records in kviter?
>> I have already changed my code so that the combiner processes the
>> values in batches (i.e., takes N at a time, processes them, and
>> repeats), but was wondering if I could just set an option.
>
> Approximately and indirectly, yes. You can limit the amount of memory
> allocated to storing serialized records in memory (io.sort.mb) and the
> percentage of that space reserved for storing record metadata
> (io.sort.record.percent, IIRC). That can be used to limit the number of
> records in each spill, though you may also need to disable the combiner
> during the merge, where you may run into the same problem.
>
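A rough sketch of how those two settings bound the record count in a spill. It assumes each record costs 16 bytes of metadata in the sort buffer, which is what the map-side buffer of Hadoop releases from around this time appears to use; treat that constant as an assumption and check it against your version. The class and method names are made up for illustration. The result is an upper bound, since a spill can start before the buffer is completely full.

```java
// Hypothetical helper, not part of Hadoop: estimates how many records
// fit in the metadata portion of the map-side sort buffer.
final class SpillEstimate {
    // ASSUMPTION: bytes of accounting data kept per record.
    static final int BYTES_PER_RECORD_META = 16;

    // ioSortMb      -> value of io.sort.mb (megabytes)
    // recordPercent -> value of io.sort.record.percent (fraction, 0..1)
    static long maxRecordsPerSpill(int ioSortMb, double recordPercent) {
        if (ioSortMb <= 0 || recordPercent <= 0.0 || recordPercent >= 1.0) {
            throw new IllegalArgumentException("invalid buffer settings");
        }
        long totalBytes = (long) ioSortMb << 20;             // MB -> bytes
        long metaBytes = (long) (totalBytes * recordPercent);
        metaBytes -= metaBytes % BYTES_PER_RECORD_META;       // round down to whole records
        return metaBytes / BYTES_PER_RECORD_META;
    }

    public static void main(String[] args) {
        // Example settings only; substitute the values from your own job.
        System.out.println(maxRecordsPerSpill(100, 0.05));
        System.out.println(maxRecordsPerSpill(100, 0.01));
    }
}
```

Lowering either value lowers the bound, at the cost of more spills and more merge work.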
> You're almost certainly better off designing your combiner to scale well (as
> you have), since you'll hit this in the reduce, too.
>
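The batching approach described in the question might look roughly like the sketch below. It is plain Java with no Hadoop types so it can be run on its own; the class name, the method names, and the use of a sum as the per-batch operation are all assumptions for illustration, not the poster's actual code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: consume a combiner's value iterator in bounded batches
// so that at most batchSize values are held in memory at once.
final class BatchedCombine {
    static List<Long> combineInBatches(Iterator<Long> values, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Long> partials = new ArrayList<>();
        List<Long> batch = new ArrayList<>();
        while (values.hasNext()) {
            batch.add(values.next());
            if (batch.size() == batchSize) {
                partials.add(process(batch)); // emit one partial result per full batch
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            partials.add(process(batch));     // emit the final, possibly short, batch
        }
        return partials;
    }

    // Stand-in for the real per-batch work; here it just sums the values.
    static long process(List<Long> batch) {
        long sum = 0;
        for (long v : batch) {
            sum += v;
        }
        return sum;
    }
}
```

This emits several partial results per key instead of one. That is safe only because a reducer already has to cope with the combiner running any number of times, including not at all.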
>> Since this occurred in the MapContext, changing the number of reducers
>> won't help.
>> (2) How does changing the number of reducers help at all? I have 7
>> machines, so I feel 11 (a prime close to 7; why a prime?) is good
>> enough (some machines have 16GB, others 32GB).
>
> Your combiner will look at all the records for a partition and only those
> records in a partition. If your partitioner distributes your records evenly
> in a particular spill, then increasing the total number of partitions will
> decrease the number of records your combiner considers in each call. For
> most partitioners, whether the number of reducers is prime should be
> irrelevant. -C
>
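To illustrate the point about partitions, here is a small standalone sketch. The partition formula mirrors the one used by Hadoop's default HashPartitioner; everything else (the class name, the synthetic Integer keys, the helper methods) is an assumption made up for the example, and real key distributions will be less even.

```java
// Hypothetical sketch: how many records land in each partition for a
// given partition count, using hash-modulo partitioning.
final class PartitionSpread {
    // Same formula as Hadoop's default HashPartitioner.
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Counts records per partition for synthetic keys 0..numKeys-1.
    static int[] countPerPartition(int numKeys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (int i = 0; i < numKeys; i++) {
            counts[partitionFor(Integer.valueOf(i), numPartitions)]++;
        }
        return counts;
    }

    // Largest partition, i.e. the most records a single combiner call would see.
    static int largest(int[] counts) {
        int max = 0;
        for (int c : counts) {
            max = Math.max(max, c);
        }
        return max;
    }

    public static void main(String[] args) {
        // Example key count and partition counts only.
        for (int partitions : new int[] {7, 11, 22}) {
            System.out.println(partitions + " partitions -> largest holds "
                    + largest(countPerPartition(100000, partitions)) + " records");
        }
    }
}
```

With evenly distributed keys, the largest partition shrinks roughly in proportion to the number of partitions, which is the effect described above.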