The first thing I would check is that your mappers are processing the
same amount of data. I'm not familiar with the Cassandra InputFormat,
but if it doesn't properly split the data, then you could end up with
this behavior. If the data is split properly, I'd look into swapping
as a possible cause.
Is it always the same nodes that are slow?
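If it helps with comparing input sizes, here is a rough Python sketch. It
assumes you have already copied each map task's input byte (or record) count
out of the job counters into a dict by hand. The task ids, the numbers, and
the 5x threshold are all made up for illustration; nothing here talks to
Hadoop or Cassandra.

```python
from statistics import median


def find_skewed_tasks(bytes_per_task, factor=5.0):
    """Return the ids of tasks whose input is more than `factor` times the median.

    `bytes_per_task` maps a task id to the amount of input that task read.
    The threshold `factor` is an arbitrary starting point, not a Hadoop setting.
    """
    if not bytes_per_task:
        return []
    typical = median(bytes_per_task.values())
    return sorted(
        task for task, size in bytes_per_task.items() if size > factor * typical
    )


# Hypothetical counter values: three ordinary tasks and one oversized one.
sample = {
    "m_000001": 64_000_000,
    "m_000002": 66_000_000,
    "m_000003": 63_000_000,
    "m_003998": 2_400_000_000,
}

skewed = find_skewed_tasks(sample)
print(skewed)
```

If the slow tasks show up in that list, the splits are probably uneven. If
they don't, swapping or a bad node becomes the more likely explanation.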
On Thu, Nov 3, 2011 at 10:43 AM, Brendan W. <[EMAIL PROTECTED]> wrote:
> The input is actually performed by the apache-cassandra 0.6.9 api for
> map-reduce. And yes, the cassandra row that is read into the mapper
> consists of a block of 100 compressed lines of text. So maybe that
> accounts for the progress report.
> Any idea what the huge time difference might be due to (2 minutes average
> vs. 20 hrs for the last 3 tasks)? Does that sound like swapping to you?
> On Thu, Nov 3, 2011 at 9:44 AM, Joey Echeverria <[EMAIL PROTECTED]> wrote:
>> Is your input data compressed? There have been some bugs in the past
>> with reporting progress when reading compressed data.
>> On Thu, Nov 3, 2011 at 9:18 AM, Brendan W. <[EMAIL PROTECTED]> wrote:
>> > Hi,
>> > Running 0.20.2:
>> > A job with about 4000 map tasks quickly blew through all but 3 in a
>> > matter of hours, with the tasks taking about two minutes each. The remaining
>> > three, however, inched along, with their progress passing 100% and still
>> > going. After 20 hours or so, I killed the running task attempts.
>> > They restarted, and same thing: they inched their way past 100%, getting up
>> > past 400% and continuing. They finally finished in the middle of last
>> > night.
>> > What does progress > 100% indicate?
>> > Thanks for any help.
>> Joseph Echeverria
>> Cloudera, Inc.