Re: storing intermediate results?
Vincent BARAT 2009-10-08, 13:33
Ok, then I did some testing.
Actually, if I store my first JOIN into a file, I see a 50% increase in the speed of all my subsequent computations. I guess that it may be related to the fact that I use Pig from Java (maybe the optimizer doesn't work in that mode?).

Here is my code (including just the JOIN and the first computation):

Data loading:
-------------

Analytics.pigServer
    .registerQuery("start_sessions = LOAD 'startSession_sample' USING PigStorage(',') "
        + "AS (sid:chararray, infoid:chararray, imei:chararray, start:long);");
Analytics.pigServer
    .registerQuery("end_sessions = LOAD 'endSession_sample' USING PigStorage(',') "
        + "AS (sid:chararray, infoid:chararray, imei:chararray, end:long);");

First join (with storage):
--------------------------

Analytics.pigServer
    .registerQuery("sessions = JOIN start_sessions BY sid, end_sessions BY sid;");
Analytics.pigServer.store("sessions", "sessions");
Analytics.pigServer
    .registerQuery("sessions = LOAD 'sessions' "
        + "AS (start_sessions::sid:chararray, start_sessions::infoid:chararray, start_sessions::imei:chararray, start_sessions::start:long, "
        + "end_sessions::sid:chararray, end_sessions::infoid:chararray, end_sessions::imei:chararray, end_sessions::end:long);");

First join (without storage):
-----------------------------

Analytics.pigServer
    .registerQuery("sessions = JOIN start_sessions BY sid, end_sessions BY sid;");

First computation:
------------------

Analytics.pigServer.registerQuery("session_periods = FOREACH sessions "
    + "GENERATE FLATTEN(SessionPeriods('" + timeBucket.toString() + "', start, end)) "
    + "AS (periodid:int, inner_length:long, outer_length:long);");
Analytics.pigServer.registerQuery("period_sessions = GROUP session_periods BY periodid;");
Analytics.pigServer.registerQuery("session_count_and_length"
    + " = FOREACH period_sessions "
    + "GENERATE group, "
    + "COUNT(session_periods), "
    + "SUM(session_periods.inner_length), "
    + "SUM(session_periods.outer_length);");
Analytics.pigServer.store("session_count_and_length",
    Analytics.getHadoopOutputFile("session_count_and_length", timeBucket));

Thejas Nair wrote:
> Hi Zaki,
> Please file a JIRA if you are able to identify the problem you were facing
> and the steps to reproduce it.
> Thanks,
> Thejas
>
> On 10/7/09 1:08 PM, "zaki rahaman" <[EMAIL PROTECTED]> wrote:
>
>> Vincent,
>>
>> I've run into this problem before. If you know beforehand that you're going
>> to recycle this joined dataset for several different operations or
>> pipelines, it is worth your time to simply store it intermediately. While
>> Pig can definitely handle this and the Multiquery Optimizer is great, I've
>> run into problems with it before (can't remember what now exactly), and
>> pre-joining has worked well for me.
>>
>> Hopefully you found some part of that useful.
>>
>> On Wed, Oct 7, 2009 at 12:33 PM, Ashutosh Chauhan <
>> [EMAIL PROTECTED]> wrote:
>>
>>> Hi Vincent,
>>>
>>> Pig has a multi-query optimization which, if it fires, will automatically
>>> figure out that the join needs to be done only once, and there will not be
>>> any repetition of work. If Pig determines that it's not safe to do that
>>> optimization, then it's possible that your join is getting computed more
>>> than once. If that's the case, then it will be better to do the join and
>>> store it. You can do that within the same script using "exec":
>>> http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#exec
>>>
>>> You can read more about multi-query optimization here:
>>>
>>> http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#Multi-Query+Execution
>>>
>>> Hope it helps,
>>> Ashutosh
>>>
>>> On Wed, Oct 7, 2009 at 10:54, Vincent BARAT <[EMAIL PROTECTED]
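Editor's sketch, relating to the "with storage" variant in the message above: the statement that reloads the stored join writes out every `alias::field` name by hand, which is easy to get wrong when the input schemas change. The Java below is a minimal sketch of a helper that assembles that same LOAD statement as a string. The class `StoredJoinLoader` and its methods are hypothetical names introduced here for illustration; they are not part of Pig or of the original code, and the sketch only builds text without calling Pig. It assumes the field lists are given as `name:type` strings, as in the LOAD statements from the message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Pig): builds the LOAD statement used to
// reload a stored JOIN result, mirroring the hand-written one in the message.
class StoredJoinLoader {

    // Prefixes each "name:type" field with "alias::", the form used in the
    // original reload statement to tell the two join inputs apart.
    static List<String> qualify(String alias, List<String> fields) {
        List<String> out = new ArrayList<>();
        for (String field : fields) {
            out.add(alias + "::" + field);
        }
        return out;
    }

    // Returns "<target> = LOAD '<path>' AS (<left fields>, <right fields>);"
    static String loadStatement(String target, String path,
                                String leftAlias, List<String> leftFields,
                                String rightAlias, List<String> rightFields) {
        List<String> all = new ArrayList<>(qualify(leftAlias, leftFields));
        all.addAll(qualify(rightAlias, rightFields));
        return target + " = LOAD '" + path + "' AS ("
                + String.join(", ", all) + ");";
    }

    public static void main(String[] args) {
        List<String> start = Arrays.asList(
                "sid:chararray", "infoid:chararray", "imei:chararray", "start:long");
        List<String> end = Arrays.asList(
                "sid:chararray", "infoid:chararray", "imei:chararray", "end:long");
        // Prints the statement that would be passed to registerQuery(...).
        System.out.println(loadStatement(
                "sessions", "sessions", "start_sessions", start, "end_sessions", end));
    }
}
```

The resulting string could then be passed to `Analytics.pigServer.registerQuery(...)` in place of the hand-concatenated literal, assuming the rest of the code stays as posted.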