Pig, mail # user - distributing hdfs put


Re: distributing hdfs put
Ashutosh Chauhan 2010-01-30, 04:23
You can set it through Pig as well, as you mentioned. The advantage is that
instead of setting it permanently to a high value through hadoop-site.xml
(which would then affect all subsequent Hadoop jobs on your cluster), through
Pig you can set it on a per-job basis.

Ashutosh
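
For concreteness, the per-job override discussed below in this thread might
look like the following sketch (the 30-minute value is only illustrative, and
the cluster-wide form is shown just for contrast):

    # Per-job: pass the property on the Pig command line, so only this
    # script's MapReduce jobs are affected (a value of 0 disables the
    # timeout entirely):
    pig -Dmapred.task.timeout=1800000 -f myfile.pig

    # Cluster-wide alternative: the same property in hadoop-site.xml /
    # mapred-site.xml, which then applies to every subsequent job:
    #   <property>
    #     <name>mapred.task.timeout</name>
    #     <value>1800000</value>
    #   </property>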

On Wed, Jan 27, 2010 at 21:55, prasenjit mukherjee
<[EMAIL PROTECTED]> wrote:
> Not sure I understand. Are you saying that Pig takes -D<> parameters
> directly? Will the following work:
>
> "pig -Dmapred.task.timeout=0 -f myfile.pig"
>
>
> On Thu, Jan 28, 2010 at 11:08 AM, Amogh Vasekar <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>> You should be able to pass this as a command-line argument using -D ... If
>> you want to change it for all jobs on your own cluster, it would go in
>> mapred-site.xml.
>>
>> Amogh
>>
>>
>> On 1/28/10 11:03 AM, "prasenjit mukherjee" <[EMAIL PROTECTED]>
>> wrote:
>>
>> Thanks Amogh for your quick response. Will changing this property only on
>> the master's hadoop-site.xml do, or do I need to change it on all the
>> slaves as well?
>>
>> Is there any way I can do this from Pig (or I guess I am asking too much
>> here :) )?
>>
>> On Thu, Jan 28, 2010 at 10:57 AM, Amogh Vasekar <[EMAIL PROTECTED]>
>> wrote:
>>
>> > Yes, the parameter is mapred.task.timeout, in milliseconds.
>> > You can also update status / write output to stdout every so often to
>> > avoid this :)
>> >
>> > Amogh
>> >
>> >
>> > On 1/28/10 10:52 AM, "prasenjit mukherjee" <[EMAIL PROTECTED]> wrote:
>> >
>> > Now I see. The tasks are failing with the following error message:
>> >
>> > *Task attempt_201001272359_0001_r_000000_0 failed to report status for
>> > 600 seconds. Killing!*
>> >
>> > Looks like Hadoop kills/restarts tasks that take more than 600 seconds
>> > without reporting status. Is there any way I can increase that to some
>> > very high number?
>> >
>> > -Thanks,
>> > Prasenjit
>> >
>> >
>> >
>> > On Tue, Jan 26, 2010 at 9:55 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]>
>> > wrote:
>> > >
>> > > Do you know why the jobs are failing? Take a look at the logs. I
>> > > suspect it may be due to s3, not hadoop.
>> > >
>> > > -D
>> > >
>> > > On Tue, Jan 26, 2010 at 7:57 AM, prasenjit mukherjee
>> > > <[EMAIL PROTECTED]> wrote:
>> > > > Hi Mridul,
>> > > > Thanks, your approach works fine. This is how my current Pig script
>> > > > looks:
>> > > >
>> > > > DEFINE CMD `s3fetch.py` SHIP('/root/s3fetch.py');
>> > > > r1 = LOAD '/ip/s3fetch_input_files' AS (filename:chararray);
>> > > > grp_r1 = GROUP r1 BY filename PARALLEL 5;
>> > > > r2 = FOREACH grp_r1 GENERATE FLATTEN(r1);
>> > > > r3 = STREAM r2 THROUGH CMD;
>> > > > STORE r3 INTO '/op/s3fetch_debug_log';
>> > > >
>> > > > And here is my s3fetch.py:
>> > > > import os
>> > > > import sys
>> > > >
>> > > > for word in sys.stdin:
>> > > >     word = word.rstrip()
>> > > >     # one `hadoop fs -cp` per input line: copy that file from S3 into HDFS
>> > > >     cmd = ('/usr/local/hadoop-0.20.0/bin/hadoop fs -cp '
>> > > >            's3n://<s3-credentials>@bucket/dir-name/' + word + ' /ip/data/.')
>> > > >     sys.stdout.write('\n\n' + word + ':\t' + cmd + '\n')
>> > > >     (input_str, out_err) = os.popen4(cmd)
>> > > >     for line in out_err.readlines():
>> > > >         sys.stdout.write('\t' + word + '::\t' + line)
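
Folding Amogh's suggestion from earlier in the thread (emit output at regular
intervals so the task is not killed for inactivity) into the script above, a
heartbeat version might look roughly like this. It is only a sketch: it swaps
os.popen4 for subprocess.Popen so the loop can keep writing while the copy
runs, the 60-second interval and heartbeat text are arbitrary, and the paths
and credentials placeholders are carried over from the script above.

    import os
    import subprocess
    import sys
    import time

    def run_with_heartbeat(cmd, interval=60):
        # Run `cmd`, writing a line to stdout every `interval` seconds while
        # it is still running, so the surrounding task keeps producing output
        # instead of sitting silent until the 600-second timeout fires.
        devnull = open(os.devnull, 'w')
        proc = subprocess.Popen(cmd, shell=True, stdout=devnull, stderr=devnull)
        while proc.poll() is None:
            sys.stdout.write('still copying...\n')
            sys.stdout.flush()
            time.sleep(interval)
        devnull.close()
        return proc.returncode

    for word in sys.stdin:
        word = word.rstrip()
        cmd = ('/usr/local/hadoop-0.20.0/bin/hadoop fs -cp '
               's3n://<s3-credentials>@bucket/dir-name/' + word + ' /ip/data/.')
        rc = run_with_heartbeat(cmd)
        sys.stdout.write(word + ':\t' + cmd + '\texit=' + str(rc) + '\n')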
>> > > >
>> > > >
>> > > >
>> > > > So, the job starts fine and I see that my Hadoop directory
>> > > > (/ip/data/.) starts filling up with S3 files. But after some time it
>> > > > gets stuck. I see lots of failed/restarted jobs in the jobtracker,
>> > > > and the number of files doesn't increase in /ip/data.
>> > > >
>> > > > Could this be happening because the parallel HDFS writes (via
>> > > > hadoop fs -cp <> <>) make the primary namenode a blocking server?
>> > > >
>> > > > Any help is greatly appreciated.
>> > > >
>> > > > -Thanks,
>> > > > Prasen
>> > > >
>> > > > On Mon, Jan 25, 2010 at 8:58 AM, Mridul Muralidharan
>> > > > <[EMAIL PROTECTED]> wrote:
>> > > >
>> > > >>
>> > > >> If each line from your file has to be processed by a different
>> > > >> mapper - other than by writing a custom slicer, a very dirty hack
>> > > >> would be to:
>> > > >