-
trunk is 3 times slower than 0.9? Yang Ling 2011-12-24, 05:01
I have a Pig job that typically finishes in 20 minutes. I tried the Pig code from trunk, and it takes more than 1 hour to finish. My input and output are on Amazon S3. One interesting thing is that it takes about 40 minutes to start the MapReduce job, whereas with the 0.9.1 release it takes less than 1 minute. Any idea?
-
Re: trunk is 3 times slower than 0.9? Thejas Nair 2011-12-29, 02:43
I haven't seen or heard of this issue.
Do you mean that the extra time is actually a delay before the MR job is launched? Did you have free map/reduce slots when you ran the Pig job from trunk?

Thanks,
Thejas
-
Re: trunk is 3 times slower than 0.9? Dmitriy Ryaboy 2011-12-29, 19:52
In the past, when I've observed this kind of insane behavior (no job should take 40 minutes to submit), it's been due to the NameNode or the JobTracker being extremely overloaded and responding slowly, causing timeouts and retries.
-
Re: Re: trunk is 3 times slower than 0.9? Yang Ling 2012-01-01, 02:05
Thanks for the reply. I spent yesterday on this and found that my 40 minutes is spent in JsonMetadata.findMetaFile. It seems this is new in trunk. In my setup, I have several thousand files/folders in my input; findMetaFile reads them one by one, and that takes a long time. I also see there is a PigStorage option, "-noschema", that disables it. Once I use "-noschema", I get my 40 minutes back. Can we do something so that others do not fall into this pitfall?
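A rough back-of-envelope check on why per-input lookups can add up to a delay of this size. The 40-minute figure is from the message above; the input count of 3000 is an assumption standing in for "several thousand", so the resulting per-input cost is only indicative.

```python
def seconds_per_input(total_delay_s, num_inputs):
    """Average time per input if the whole delay is spent on sequential lookups."""
    return total_delay_s / num_inputs


# Reported in the thread: roughly 40 minutes before the job starts.
startup_delay_s = 40 * 60

# Assumption: "several thousand" inputs taken as 3000 for illustration.
assumed_inputs = 3000

per_input = seconds_per_input(startup_delay_s, assumed_inputs)
print(f"about {per_input:.2f} s per input")
```

Under that assumption, each lookup would need to take somewhat under a second, which is plausible for one sequential round trip per input to a remote store.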
-
Re: Re: trunk is 3 times slower than 0.9? Dmitriy Ryaboy 2012-01-02, 01:17
Ah. That's unfortunate. Yeah, reading thousands of small files is suboptimal (it's always suboptimal, but in this case, it's extra bad).

Pig committers -- currently JsonMetadata.findMetaFile looks for a metadata file for each file. What do you think about making it look at directories instead?

Yang -- what's the ratio between the number of directories and the number of files in your case?

D
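The proposal above (probing for a metadata file once per directory rather than once per input file) can be illustrated with a small sketch. This is not Pig's actual JsonMetadata code; the function names and the directory layout are hypothetical, and the sketch only counts how many probes each strategy would issue.

```python
import posixpath


def lookups_per_file(paths):
    """One metadata probe per input file (the behaviour described in the thread)."""
    return len(paths)


def lookups_per_directory(paths):
    """One metadata probe per distinct parent directory (the proposed change)."""
    return len({posixpath.dirname(p) for p in paths})


# Hypothetical layout: 3000 files spread over 3 directories.
paths = [f"/data/day{d}/part-{i:05d}" for d in range(3) for i in range(1000)]

print(lookups_per_file(paths))       # 3000
print(lookups_per_directory(paths))  # 3
```

The saving depends on the ratio of directories to files, which is why that ratio matters for judging the fix.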
-
Re: Re: trunk is 3 times slower than 0.9? Dmitriy Ryaboy 2012-01-02, 02:02
Filed https://issues.apache.org/jira/browse/PIG-2453
-
Re: Re: trunk is 3 times slower than 0.9? Dmitriy Ryaboy 2012-01-02, 03:39
Yang, can you send the load statement you are using and a rough description of the directory structure you are loading? That'll help test the fix.

Thanks,
D
-
Re: Re: trunk is 3 times slower than 0.9? Dmitriy Ryaboy 2012-01-02, 04:14
Patch available. Please test whether it fixes the issue.
https://issues.apache.org/jira/browse/PIG-2453
-
Re: Re: Re: trunk is 3 times slower than 0.9? Yang Ling 2012-01-02, 05:55
Thanks, I tried the patch, and it now takes no time for me to launch the job. As for the directory structure, I only have files, no directories. Actually, I can stick with "-noschema" myself; this is to help other S3 users.