Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hadoop, mail # user - Linear slowdown producing streaming output

Copy link to this message
Re: Linear slowdown producing streaming output
Keith Wiley 2011-03-10, 09:00
I'm still trying to solve this problem.  One person mentioned that mappers have to sort the data and that the sort buffer may be relevant but I'm seeing the same linear slowdown from the reducer, and more importantly, my data sizes are so small (a few MBs) that if the Hadoop settings out-of-the-box didn't work at these scales, Hadoop wouldn't work at all.  I'm only seeing this linear slowdown with streaming jobs btw.

The bottom line is -- as quoted below from my earlier post on this issue -- that the time required to write a block of a given size (say 1 MB, it doesn't matter for the purpose of this discussion) goes up linearly with each additional block written.  Even at scales of a few 10s of MBs total data, this results in write-times of several minutes and given the superlinearity, only a few more 10s of MBs quickly stretches to half an hour!

Does no one have any idea what could possibly be causing this?  As far as I can tell, anyone running a streaming job ought to see this behavior because my program doesn't do anything weird w.r.t. writing output data, it simply dumps it straight onto stdout.  Consequently, it seems to me that all streaming jobs should produce this behavior...and therefore that streaming should be virtually useless.  I'm sure I'm doing something silly, but my approach is so simple I just don't know what I could be doing wrong.

Here is the basic outline of my output routine.  It just shows that I'm not doing anything "funny", I'm just dumping bytes onto stdout:

double accumulatedWriteBlockTime = 0;
int lastPosToWrite = bufferLength - 1; //Inclusive
int pos = 0;
while (pos <= lastPosToWrite)
long timeWriteOneBlockStart = getTimestamp_100s();

int writeBlockSize = kWriteBlockSize; //1048576 B (1 MB)
if (pos + writeBlockSize > bufferLength)
writeBlockSize = bufferLength - pos;

cout.write(buffer + pos, writeBlockSize);
pos += writeBlockSize;

long timeWriteOneBlockEnd = getTimestamp_100s();
double diffTimeWriteOneBlock = diffTimes_100s(timeWriteOneBlockStart, timeWriteOneBlockEnd);
accumulatedWriteBlockTime += diffTimeWriteOneBlock;

OUT "diffTimeWriteOneBlock: "
<< setw(8) << secsToTimeStr(diffTimeWriteOneBlock)
<< "        cumulativeWriteBlockTime: "
<< setw(8) << secsToTimeStr(accumulatedWriteBlockTime)
<< "        Progress: " << setw(8) << setprecision(2)
<< (((double)pos / (double)bufferLength) * 100.0) << "%"
<< endl;

I only broke it up into an iteration over blocks after I discovered this problem.  My original code was much simpler.  It just called write() once with the entire buffer.  I broke it up like this to investigate the problem and that was when I discovered that it took longer to write the same amount of data as time went on (actually I suspected this after observing the superlinear relationship between the datasize and the write time).  Here's a sample output from the code with the middle section removed for brevity:

                diffTimeWriteOneBlock:    0.35s        cumulativeWriteBlockTime:     0.35s        Progress:     3.99%
                diffTimeWriteOneBlock:    0.85s        cumulativeWriteBlockTime:      1.2s        Progress:     7.98%
                diffTimeWriteOneBlock:    1.28s        cumulativeWriteBlockTime:     2.48s        Progress:    11.97%
                diffTimeWriteOneBlock:    1.83s        cumulativeWriteBlockTime:     4.31s        Progress:    15.96%
                diffTimeWriteOneBlock:    2.34s        cumulativeWriteBlockTime:     6.65s        Progress:    19.95%
                diffTimeWriteOneBlock:    2.99s        cumulativeWriteBlockTime:     9.64s        Progress:    23.95%
[middle section removed for brevity]
                diffTimeWriteOneBlock:   10.25s        cumulativeWriteBlockTime: 1m 45.17s        Progress:    79.82%
                diffTimeWriteOneBlock:   10.72s        cumulativeWriteBlockTime: 1m 55.89s        Progress:    83.81%
                diffTimeWriteOneBlock:   11.32s        cumulativeWriteBlockTime:  2m 7.21s        Progress:    87.80%
                diffTimeWriteOneBlock:   12.25s        cumulativeWriteBlockTime: 2m 19.46s        Progress:    91.79%
                diffTimeWriteOneBlock:   12.48s        cumulativeWriteBlockTime: 2m 31.94s        Progress:    95.78%
                diffTimeWriteOneBlock:   13.06s        cumulativeWriteBlockTime:    2m 45s        Progress:    99.77%
Thanks for any help.

Keith Wiley     [EMAIL PROTECTED]     keithwiley.com    music.keithwiley.com

"I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me."
                                           --  Abe (Grandpa) Simpson