Re: Slow MR time and high network utilization with all local data
Harsh J 2013-02-25, 07:22
Hi Robert,
How are you measuring the network usage? Note that unless short-circuit reads are enabled, data reads also go through a local socket, so they may show up in network traffic monitoring tools too (that does not mean they are going over the network).

On Mon, Feb 25, 2013 at 2:35 AM, Robert Dyer <[EMAIL PROTECTED]> wrote:
> I have a small 6 node dev cluster. I use a 1GB SequenceFile as input to a
> MapReduce job, using a custom split size of 10MB (to increase the number of
> maps). Each map call will read random entries out of a shared MapFile (that
> is around 50GB).
>
> I set replication to 6 on both of these files, so all of the data should be
> local for each map task. I verified via fsck that no blocks are
> under-replicated.
>
> Despite this, for some reason the MR job maxes out the network and takes an
> extremely long time. What could be causing this?
>
> Note that the total number of map outputs for this job is around 400 and the
> reducer just passes the values through, so there shouldn't be much network
> utilized by the output.
>
> As an experiment, I switched from the SeqFile input to an HBase table and
> now see almost no network used. I also tried leaving the SeqFile as input
> and switched the MapFile to an HBase table and see about 30% network used
> (which makes sense, as now that 50GB data isn't always local).
>
> What is going on here? How can I debug to see what data is being
> transferred over the network?

--
Harsh J
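
One rough way to check whether the traffic you see is really leaving the node is to compare the loopback counters with the physical interface counters while the job runs. The sketch below is only an illustration and is not part of Hadoop. It assumes a Linux node where `/proc/net/dev` is available and where reads from the local DataNode are counted on the `lo` interface. The function names `parse_net_dev` and `traffic_delta` and the 10 second sampling window are made up for this example.

```python
import time


def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev contents."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 9:
            continue  # not a counter line we understand
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def traffic_delta(before, after):
    """Return (loopback_bytes, other_bytes) moved between two samples."""
    loopback = other = 0
    for name, (rx, tx) in after.items():
        rx0, tx0 = before.get(name, (0, 0))
        moved = (rx - rx0) + (tx - tx0)
        if name == "lo":
            loopback += moved
        else:
            other += moved
    return loopback, other


def read_counters(path="/proc/net/dev"):
    with open(path) as handle:
        return parse_net_dev(handle.read())


if __name__ == "__main__":
    first = read_counters()
    time.sleep(10)  # hypothetical sampling window, adjust as needed
    second = read_counters()
    loopback, other = traffic_delta(first, second)
    print("loopback bytes:", loopback)
    print("other interfaces bytes:", other)
```

If most of the bytes land on the loopback side while the maps are running, the reads are local socket reads and not real network transfers. If the physical interfaces carry most of it, then data really is moving between nodes.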