Hadoop user mailing list - Problems with timeout when a Hadoop job generates a large number of key-value pairs


Re: Problems with timeout when a Hadoop job generates a large number of key-value pairs
Michael Segel 2012-01-20, 23:43
Steve,
Ok, first, your client connection to the cluster is a non-issue.

If you go into /etc/Hadoop/conf
That's supposed to be a lowercase h, but my iPhone knows what's best...

Look and see what you have set for your bandwidth... I forget which parameter, but there are only a couple that deal with bandwidth.
I think it's set to 1 MB or 10 MB by default. You need to up it to 100-200 MB if you're on a 1 Gb network.

That would solve your balancing issue.
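For reference, the parameter in question is most likely dfs.balance.bandwidthPerSec
in hdfs-site.xml, which defaults to 1048576 bytes/sec (1 MB/s) in Hadoop 1.x - take
the exact name as an assumption, since the thread never pins it down. A sketch of
raising it to roughly 100 MB/s on each datanode:

    <!-- hdfs-site.xml on each datanode; property name assumed, not confirmed in this thread -->
    <property>
      <name>dfs.balance.bandwidthPerSec</name>
      <!-- 104857600 bytes/sec = ~100 MB/s; default is 1048576 (1 MB/s) -->
      <value>104857600</value>
    </property>

The datanodes generally need a restart before the new balancer bandwidth takes effect.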

See if that helps...

Sent from my iPhone

On Jan 20, 2012, at 4:57 PM, "Steve Lewis" <[EMAIL PROTECTED]> wrote:

> On Fri, Jan 20, 2012 at 12:18 PM, Michel Segel <[EMAIL PROTECTED]> wrote:
>
>> Steve,
>> If you want me to debug your code, I'll be glad to set up a billable
>> contract... ;-)
>>
>> What I am willing to do is help you debug your code...
>
>
> The code seems to work well for small input files and is basically a
> standard sample.
>
>>
>> Did you time how long it takes in the Mapper.map() method?
>> The reason I asked this is to first confirm that you are failing within a
>> map() method.
>> It could be that you're just not updating your status...
>>
>
> The map method starts out running very fast - generateSubstrings, the
> only interesting part, runs in milliseconds. The only other thing the mapper
> does is context.write, which SHOULD update status.
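A minimal way to rule the status question in or out is to report progress
explicitly inside the write loop. The sketch below is not the sample code from
this thread; it assumes a Mapper<LongWritable, Text, Text, IntWritable> and
guesses at the signature of the generateSubstrings helper mentioned above:

    // Sketch only - shows explicit progress reporting, not the actual sample.
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        long emitted = 0;
        // generateSubstrings(...) is the helper named above; signature assumed.
        for (String sub : generateSubstrings(value.toString())) {
            context.write(new Text(sub), ONE);
            if (++emitted % 10000 == 0) {
                context.progress();                          // explicit heartbeat
                context.setStatus("emitted " + emitted + " records");
            }
        }
    }

If the task still times out even with an explicit heartbeat, then a single
blocked call (for example a write during a spill) is taking longer than the
whole timeout window, which points below the user code rather than at it.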
>
>>
>> You said that you are writing many output records for a single input.
>>
>> So let's take a look at your code.
>> Are all writes of the same length? Meaning that in each iteration of
>> Mapper.map() you will always write K rows?
>>
>
> Because in my sample the input strings are the same length, every call to
> the mapper will write the same number of records.
>
>>
>> If so, ask yourself why some iterations are taking longer and longer?
>>
>
> I believe the issue may relate to local storage getting filled and Hadoop
> taking a LOT of time to rebalance the output. Assuming the string length is
> the same on each map, there is no reason for some iterations to be longer
> than others.
>
>>
>> Note: I'm assuming that the time for each iteration is taking longer than
>> the previous...
>>
> I assume so as well, since in my cluster the first 50% of mapping goes
> pretty fast.
>
>> Or am I missing something?
>>
> How do I get timing of map iterations??
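One way to answer that (a sketch, assuming the same mapper shape as above) is
to wrap the body of map() with a clock and surface the numbers through counters
and the task status, so they show up in the job's web UI:

    // Sketch only - times each map() call; emitSubstrings() stands in for
    // whatever the real map body does and is hypothetical.
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        long start = System.currentTimeMillis();

        emitSubstrings(value.toString(), context);

        long elapsed = System.currentTimeMillis() - start;
        context.getCounter("timing", "map.calls").increment(1);
        context.getCounter("timing", "map.millis").increment(elapsed);
        if (elapsed > 10000) {                    // flag unusually slow calls
            context.setStatus("slow map call: " + elapsed + " ms");
        }
    }

Dividing the two counters gives the average milliseconds per map() call;
logging the elapsed time per call (or bucketing it into several counters)
would show whether calls really do get slower as the job progresses.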
>
>> -Mike
>>
>> Sent from a remote device. Please excuse any typos...
>>
>> Mike Segel
>>
>> On Jan 20, 2012, at 11:16 AM, Steve Lewis <[EMAIL PROTECTED]> wrote:
>>
>>> We have been having problems with mappers timing out after 600 sec when
>>> the mapper writes many more records - say thousands - for every input
>>> record, even when the code in the mapper is small and fast. I have no
>>> idea what could cause the system to be so slow and am reluctant to raise
>>> the 600 sec limit without understanding why there should be a timeout
>>> when all MY code is very fast.
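For context, the 600 sec figure comes from mapred.task.timeout, which is
specified in milliseconds and defaults to 600000 in Hadoop 1.x. If one did
decide to raise it rather than chase the underlying slowness, a per-job sketch
(job name and value are illustrative only) would be:

    // Sketch only - raises the task timeout for one job; this treats the
    // symptom, not the cause of the slow writes.
    Configuration conf = new Configuration();
    conf.setLong("mapred.task.timeout", 1800000L);   // 30 min instead of the 600000 ms default
    Job job = new Job(conf, "substring-count");      // hypothetical job name

Setting the value to 0 disables the timeout entirely, which is usually a bad
idea for exactly the reason given above.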
>>>
>>> I am enclosing a small sample which illustrates the problem. It will
>>> generate a 4 GB text file on HDFS if the input file does not exist or
>>> is not at least that size, and this will take some time (hours in my
>>> configuration). Then the code is essentially wordcount, but instead of
>>> finding and emitting words, the mapper emits all substrings of the
>>> input data - this generates much more output data and many more output
>>> records than wordcount generates.
>>> Still, the amount of data emitted is no larger than other data sets I
>>> know Hadoop can handle.
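To put rough numbers on that (my arithmetic, not figures from the sample): a
line of n characters has n(n+1)/2 substrings, so a single 100-character line
already produces 5,050 output records, which matches the "thousands of records
for every input record" above. If substring length is not capped, the emitted
bytes grow roughly with the cube of the line length, so the map output can end
up very much larger than the 4 GB input.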
>>>
>>> All mappers on my 8-node cluster eventually time out after 600 sec,
>>> even though I see nothing in the code which is even a little slow, and
>>> I suspect that any slow behavior is in the Hadoop code being called.
>>> This is similar to a problem we have in bioinformatics, where a
>>> colleague saw timeouts on his 50-node cluster.
>>>
>>> I would appreciate any help from the group. Note - if you have a text
>>> file of at least 4 GB, the program will take that as an input without
>>> trying to