I'm still trying to solve this problem. One person mentioned that mappers have to sort the data and that the sort buffer may be relevant but I'm seeing the same linear slowdown from the reducer, and more importantly, my data sizes are so small (a few MBs) that if the Hadoop settings out-of-the-box didn't work at these scales, Hadoop wouldn't work at all. I'm only seeing this linear slowdown with streaming jobs btw.
The bottom line is -- as quoted below from my earlier post on this issue -- that the time required to write a block of a given size (say 1 MB, it doesn't matter for the purpose of this discussion) goes up linearly with each additional block written. Even at scales of a few 10s of MBs total data, this results in write-times of several minutes and given the superlinearity, only a few more 10s of MBs quickly stretches to half an hour!
Does no one have any idea what could possibly be causing this? As far as I can tell, anyone running a streaming job ought to see this behavior because my program doesn't do anything weird w.r.t. writing output data, it simply dumps it straight onto stdout. Consequently, it seems to me that all streaming jobs should produce this behavior...and therefore that streaming should be virtually useless. I'm sure I'm doing something silly, but my approach is so simple I just don't know what I could be doing wrong.
Here is the basic outline of my output routine. It just shows that I'm not doing anything "funny", I'm just dumping bytes onto stdout:
double accumulatedWriteBlockTime = 0;
int lastPosToWrite = bufferLength - 1; //Inclusive
int pos = 0;
while (pos <= lastPosToWrite)
long timeWriteOneBlockStart = getTimestamp_100s();
int writeBlockSize = kWriteBlockSize; //1048576 B (1 MB)
if (pos + writeBlockSize > bufferLength)
writeBlockSize = bufferLength - pos;
cout.write(buffer + pos, writeBlockSize);
pos += writeBlockSize;
long timeWriteOneBlockEnd = getTimestamp_100s();
double diffTimeWriteOneBlock = diffTimes_100s(timeWriteOneBlockStart, timeWriteOneBlockEnd);
accumulatedWriteBlockTime += diffTimeWriteOneBlock;
OUT "diffTimeWriteOneBlock: "
<< setw(8) << secsToTimeStr(diffTimeWriteOneBlock)
<< " cumulativeWriteBlockTime: "
<< setw(8) << secsToTimeStr(accumulatedWriteBlockTime)
<< " Progress: " << setw(8) << setprecision(2)
<< (((double)pos / (double)bufferLength) * 100.0) << "%"
I only broke it up into an iteration over blocks after I discovered this problem. My original code was much simpler. It just called write() once with the entire buffer. I broke it up like this to investigate the problem and that was when I discovered that it took longer to write the same amount of data as time went on (actually I suspected this after observing the superlinear relationship between the datasize and the write time). Here's a sample output from the code with the middle section removed for brevity:
diffTimeWriteOneBlock: 0.35s cumulativeWriteBlockTime: 0.35s Progress: 3.99%
diffTimeWriteOneBlock: 0.85s cumulativeWriteBlockTime: 1.2s Progress: 7.98%
diffTimeWriteOneBlock: 1.28s cumulativeWriteBlockTime: 2.48s Progress: 11.97%
diffTimeWriteOneBlock: 1.83s cumulativeWriteBlockTime: 4.31s Progress: 15.96%
diffTimeWriteOneBlock: 2.34s cumulativeWriteBlockTime: 6.65s Progress: 19.95%
diffTimeWriteOneBlock: 2.99s cumulativeWriteBlockTime: 9.64s Progress: 23.95%
[middle section removed for brevity]
diffTimeWriteOneBlock: 10.25s cumulativeWriteBlockTime: 1m 45.17s Progress: 79.82%
diffTimeWriteOneBlock: 10.72s cumulativeWriteBlockTime: 1m 55.89s Progress: 83.81%
diffTimeWriteOneBlock: 11.32s cumulativeWriteBlockTime: 2m 7.21s Progress: 87.80%
diffTimeWriteOneBlock: 12.25s cumulativeWriteBlockTime: 2m 19.46s Progress: 91.79%
diffTimeWriteOneBlock: 12.48s cumulativeWriteBlockTime: 2m 31.94s Progress: 95.78%
diffTimeWriteOneBlock: 13.06s cumulativeWriteBlockTime: 2m 45s Progress: 99.77%
Thanks for any help.
Keith Wiley [EMAIL PROTECTED] keithwiley.com music.keithwiley.com
"I used to be with it, but then they changed what it was. Now, what I'm with
isn't it, and what's it seems weird and scary to me."
-- Abe (Grandpa) Simpson