Hadoop >> mail # user >> Counting records

Re: Counting records
If a task fails, the counters from that failed attempt are not used.

So if you have speculative execution turned on and the JobTracker kills a duplicate task, it won't affect your end results.

Again, the only major caveat is that the counters are held in memory, so if you have a lot of counters...
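The counter-only pattern suggested later in this thread can be sketched without a cluster. This is a plain-Java simulation under made-up names (the `"ERROR"` criterion is purely illustrative): incrementing a local `long` stands in for `reporter.getCounter(...).increment(1)`, and nothing is ever emitted.

```java
import java.util.List;

public class CounterOnlySketch {
    // Hypothetical criterion -- stand-in for whatever record test you need.
    static boolean matches(String record) {
        return record.contains("ERROR");
    }

    // Simulates a map-only pass over one split: increment a counter for each
    // matching record and emit nothing (the analogue of never calling
    // output.collect()).
    static long countMatching(List<String> split) {
        long counter = 0; // stands in for getCounter("MyRecords", "RecordTypeToCount")
        for (String record : split) {
            if (matches(record)) {
                counter++; // stands in for .increment(1)
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        List<String> split = List.of("ERROR disk full", "INFO ok", "ERROR timeout");
        System.out.println(countMatching(split)); // prints 2
    }
}
```

On a real cluster the same shape runs as a map-only job (zero reduce tasks), and the total is read from the job's aggregated counters after completion.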

On Jul 23, 2012, at 4:52 PM, Peter Marron wrote:

> Yeah, I thought about using counters but I was worried about
> what happens if a Mapper task fails. Does the counter get adjusted to
> remove any contributions that the failed Mapper made before
> a replacement Mapper is started? Otherwise, in the case of any
> Mapper failure, I'm going to get an overcount, am I not?
> Or is there some way to make sure that counters have
> the correct semantics in the face of failures?
> Peter Marron
>> -----Original Message-----
>> From: Dave Shine [mailto:Dave.Shine@channelintelligence.com]
>> Sent: 23 July 2012 15:35
>> Subject: RE: Counting records
>>
>> You could just use a counter and never emit anything from the map() method. Use reporter.getCounter("MyRecords", "RecordTypeToCount").increment(1) whenever you find the type of record you are looking for. Never call output.collect(). Run the job with setNumReduceTasks(0). When the job finishes, you can programmatically get the values of all counters, including the one you incremented in the map() method.
>>
>> Dave Shine
>> Sr. Software Engineer
>> 321.939.5093 direct | 407.314.0122 mobile
>> CI Boost(tm) Clients Outperform Online(tm)
>> www.ciboost.com
>> -----Original Message-----
>> From: Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
>> Sent: Monday, July 23, 2012 10:25 AM
>> Subject: Counting records
>>
>> Hi,
>>
>> I am a complete noob with Hadoop and MapReduce and I have a question that is probably silly, but I still don't know the answer.
>>
>> For the purposes of discussion I'll assume that I'm using a standard TextInputFormat. (I don't think that this changes things too much.)
>> To simplify (a fair bit): I want to count all the records that meet specific criteria. I would like to use MapReduce because I anticipate large sources and I want the performance and reliability that MapReduce offers.
>>
>> So the obvious and simple approach is to have my Mapper check whether each record meets the criteria and emit a 0 or a 1. Then I could use a combiner which accumulates (like a LongSumReducer) and use this as the reducer as well, and I am sure that that would work fine. However, it seems massive overkill to have all those "1"s and "0"s emitted and stored on disc.
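The emit-and-sum approach above can be simulated in plain Java (no Hadoop on the classpath; the criterion and class names are illustrative): the map phase emits a 1 per matching record, and a LongSumReducer-style sum runs once per split as the combiner and once more as the reducer.

```java
import java.util.ArrayList;
import java.util.List;

public class SumCombinerSketch {
    // Hypothetical criterion, for illustration only.
    static boolean matches(String record) {
        return !record.isBlank();
    }

    // Map phase over one split: emit a 1 for every matching record.
    static List<Long> mapSplit(List<String> split) {
        List<Long> emitted = new ArrayList<>();
        for (String record : split) {
            if (matches(record)) {
                emitted.add(1L);
            }
        }
        return emitted;
    }

    // Combiner and reducer share this logic, like a LongSumReducer.
    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> splits = List.of(List.of("a", "b"), List.of("c", " ", "d"));
        List<Long> combined = new ArrayList<>();
        for (List<String> split : splits) {
            combined.add(sum(mapSplit(split))); // combiner: one partial sum per split
        }
        System.out.println(sum(combined)); // reducer: prints 4
    }
}
```

The combiner collapses each split's stream of 1s down to a single partial sum, which is exactly what cuts the disc traffic the next paragraph worries about.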
>> It seems tempting to have the Mapper accumulate the count for all of the records that it sees and then emit the total value just once at the end. This seems simple enough, except that the Mapper doesn't seem to have any easy way to know when it is presented with the last record.
>>
>> Now I could just have the Mapper keep a reference to the OutputCollector passed in on each map() call and then do a single emit in the close() method. However, although this looks like it would work with the current implementation, there seems to be no guarantee that the collector is still valid at the time that close() is called. This just seems ugly.
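The accumulate-then-emit-in-close() idea can be simulated like this (plain Java; names and the criterion are made up): the mapper counts in a field across map() calls and performs its single emit in close(), mirroring the stashed-OutputCollector trick.

```java
import java.util.ArrayList;
import java.util.List;

public class EmitOnCloseSketch {
    // Stand-in for the OutputCollector the mapper would stash from map().
    private final List<Long> collector = new ArrayList<>();
    private long count = 0;

    // Hypothetical criterion, for illustration only.
    static boolean matches(String record) {
        return record.contains("ERROR");
    }

    // map(): accumulate only, emit nothing per record.
    void map(String record) {
        if (matches(record)) {
            count++;
        }
    }

    // close(): the single emit, using the stashed collector.
    void close() {
        collector.add(count);
    }

    List<Long> output() {
        return collector;
    }

    public static void main(String[] args) {
        EmitOnCloseSketch mapper = new EmitOnCloseSketch();
        for (String record : List.of("ERROR x", "ok", "ERROR y")) {
            mapper.map(record);
        }
        mapper.close();
        System.out.println(mapper.output()); // prints [2]
    }
}
```

The simulation sidesteps the concern in the paragraph above: here the "collector" trivially outlives close(), whereas the real framework makes no such guarantee.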
>> Or I could have the Mapper record the first offset that it sees, read the split length using reporter.getInputSplit().getLength(), and then monitor how far it is through the split; it should then be able to detect the last record. It looks like the MapRunner class creates one Mapper object and uses it to process a whole split, so it seems safe to store state in the Mapper class between invocations of the map() method. (But is this just an implementation artefact? Is the Mapper class supposed to be completely stateless?)
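The split-length idea in the last paragraph can be sketched as follows (plain Java; the byte accounting assumes one byte per character plus one for the newline, an illustrative simplification -- real TextInputFormat splits also need care around records that straddle split boundaries):

```java
import java.util.List;

public class LastRecordSketch {
    // Returns the index of the record detected as the last one in the split,
    // by comparing bytes consumed against the split length.
    static int detectLastRecord(List<String> records, long splitLength) {
        long consumed = 0;
        int last = -1;
        for (int i = 0; i < records.size(); i++) {
            consumed += records.get(i).length() + 1; // +1 for the newline (simplification)
            if (consumed >= splitLength) {
                last = i;
                break;
            }
        }
        return last;
    }

    public static void main(String[] args) {
        List<String> records = List.of("aaa", "bb", "c");
        long splitLength = 9; // 3+1 + 2+1 + 1+1 bytes
        System.out.println(detectLastRecord(records, splitLength)); // prints 2
    }
}
```

Once the last record is identified this way, the mapper can do its single emit from inside map() itself, avoiding any reliance on the collector being valid in close().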