Vincent BARAT
2009-10-07, 14:54
Ashutosh Chauhan
2009-10-07, 16:33
zaki rahaman
2009-10-07, 20:08
Thejas Nair
2009-10-07, 20:16
Vincent BARAT
2009-10-08, 09:43
Vincent BARAT
2009-10-08, 13:33
Alan Gates
2009-10-12, 18:50
-
storing intermediate results ?
Vincent BARAT, 2009-10-07, 14:54
Hello,
I'm new to PIG, and I have a bunch of statements that process the same input, which is actually the result of a JOIN between two very big data sets (millions of entries).

I wonder if it is better (faster) to save the result of this JOIN into a Hadoop file and then LOAD it, instead of just relying on PIG optimizations?

Thanks a lot for your help.
-
Re: storing intermediate results ?
Ashutosh Chauhan, 2009-10-07, 16:33
Hi Vincent,
Pig has a multi-query optimization which, when it fires, automatically figures out that the join needs to be done only once, so no work is repeated. If Pig determines that it is not safe to apply that optimization, then it is possible that your join is being computed more than once. If that is the case, it will be better to do the join and store it. You can do that within the same script using "exec":
http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#exec

You can read more about multi-query optimization here:
http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#Multi-Query+Execution

Hope it helps,
Ashutosh
-
Re: storing intermediate results ?
zaki rahaman, 2009-10-07, 20:08
Vincent,
I've run into this problem before. If you know beforehand that you're going to reuse this joined dataset for several different operations or pipelines, it is worth your time to simply store it as an intermediate result. While Pig can definitely handle this and the Multiquery Optimizer is great, I've run into problems with it before (can't remember exactly what now), and pre-joining has worked well for me.

Hopefully you found some part of that useful.

-- Zaki Rahaman
-
Re: storing intermediate results ?
Thejas Nair, 2009-10-07, 20:16
Hi Zaki,
Please file a jira if you are able to identify the problem you were facing and the steps to reproduce it.

Thanks,
Thejas
-
Re: storing intermediate results ?
Vincent BARAT, 2009-10-08, 09:43
Hello,
Thanks for your answer. Actually, I use PIG by running it from Java (using a series of registerQuery() calls). The exec you mention cannot be used in that context (AFAIK).
-
Re: storing intermediate results ?
Vincent BARAT, 2009-10-08, 13:33
Ok, then I did some testing.

Actually, if I store my first JOIN into a file, I see a 50% increase in the speed of all my subsequent computations.

I guess that it may be related to the fact that I use PIG from Java (maybe the optimizer doesn't work in that mode?).

Here is my code (including just the JOIN and the first computation):

Data loading:
-------------

    Analytics.pigServer
        .registerQuery("start_sessions = LOAD 'startSession_sample' USING PigStorage(',') "
            + "AS (sid:chararray, infoid:chararray, imei:chararray, start:long);");
    Analytics.pigServer
        .registerQuery("end_sessions = LOAD 'endSession_sample' USING PigStorage(',') "
            + "AS (sid:chararray, infoid:chararray, imei:chararray, end:long);");

First join (with storage):
--------------------------

    Analytics.pigServer
        .registerQuery("sessions = JOIN start_sessions BY sid, end_sessions BY sid;");
    Analytics.pigServer.store("sessions", "sessions");
    Analytics.pigServer
        .registerQuery("sessions = LOAD 'sessions' "
            + "AS (start_sessions::sid:chararray, start_sessions::infoid:chararray, "
            + "start_sessions::imei:chararray, start_sessions::start:long, "
            + "end_sessions::sid:chararray, end_sessions::infoid:chararray, "
            + "end_sessions::imei:chararray, end_sessions::end:long);");

First join (without storage):
-----------------------------

    Analytics.pigServer
        .registerQuery("sessions = JOIN start_sessions BY sid, end_sessions BY sid;");

First computation:
------------------

    Analytics.pigServer.registerQuery("session_periods = FOREACH sessions "
        + "GENERATE FLATTEN(SessionPeriods('" + timeBucket.toString() + "', start, end)) "
        + "AS (periodid:int, inner_length:long, outer_length:long);");
    Analytics.pigServer.registerQuery("period_sessions = GROUP session_periods BY periodid;");
    Analytics.pigServer.registerQuery("session_count_and_length"
        + " = FOREACH period_sessions "
        + "GENERATE group, "
        + "COUNT(session_periods), "
        + "SUM(session_periods.inner_length), "
        + "SUM(session_periods.outer_length);");

    Analytics.pigServer.store("session_count_and_length",
        Analytics.getHadoopOutputFile("session_count_and_length", timeBucket));
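
[Editor's sketch] The reload step in the message above spells out the joined schema by hand, repeating every field with its `alias::` prefix. One way to avoid retyping that clause is to build it from the two input schemas. The following is only a minimal sketch in plain Java: `JoinSchema`, `prefixed` and `reloadStatement` are hypothetical names, not part of Pig, and the code only produces the statement string that would then be passed to registerQuery(). It assumes, as the message's own code does, that the `alias::field:type` form is accepted in the AS clause.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Pig): builds the AS (...) clause used when
// re-LOADing a stored JOIN, prefixing each field with its source alias.
final class JoinSchema {

    // Turns an ordered field-name -> type map into "alias::name:type, ...".
    static String prefixed(String alias, Map<String, String> fields) {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            parts.add(alias + "::" + field.getKey() + ":" + field.getValue());
        }
        return String.join(", ", parts);
    }

    // Builds the full LOAD statement for the stored join of two inputs.
    static String reloadStatement(String target, String path,
                                  String leftAlias, Map<String, String> left,
                                  String rightAlias, Map<String, String> right) {
        return target + " = LOAD '" + path + "' AS ("
                + prefixed(leftAlias, left) + ", "
                + prefixed(rightAlias, right) + ");";
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, which must match column order.
        Map<String, String> start = new LinkedHashMap<>();
        start.put("sid", "chararray");
        start.put("infoid", "chararray");
        start.put("imei", "chararray");
        start.put("start", "long");

        Map<String, String> end = new LinkedHashMap<>();
        end.put("sid", "chararray");
        end.put("infoid", "chararray");
        end.put("imei", "chararray");
        end.put("end", "long");

        // Prints the statement; in the thread's setup it would be handed to
        // registerQuery() instead.
        System.out.println(reloadStatement(
                "sessions", "sessions", "start_sessions", start, "end_sessions", end));
    }
}
```

The maps are insertion-ordered on purpose: the generated clause has to list the fields in the same order in which the JOIN wrote its columns.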
-
Re: storing intermediate results ?
Alan Gates, 2009-10-12, 18:50
The optimizer runs when Pig is invoked from Java. However, until recently, join and multi-query optimization did not work together. See http://issues.apache.org/jira/browse/PIG-983

Alan.