Pig >> mail # user >> storing intermediate results ?

Re: storing intermediate results ?
OK, so I did some testing.

Actually, if I store the result of my first JOIN into a file, I see a 50%
speed increase in all my subsequent computations.

I guess this may be related to the fact that I use Pig from Java
(maybe the optimizer doesn't work in that mode?).
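
For what it's worth, one thing I still need to check on the Java side:
as far as I understand, PigServer only gets a chance to apply multi-query
optimization when the queries are submitted as one batch. A minimal sketch
of what I mean (setBatchOn()/executeBatch() are PigServer methods; the
STORE targets here are just placeholders, not my real output paths):

           // Batch mode lets Pig plan all registered queries together,
           // so the multi-query optimizer can share the JOIN across outputs.
           Analytics.pigServer.setBatchOn();
           Analytics.pigServer.registerQuery(
             "sessions = JOIN start_sessions BY sid, end_sessions BY sid;");
           Analytics.pigServer.registerQuery("STORE sessions INTO 'sessions';");
           // ... register the other computations and their STOREs here ...
           Analytics.pigServer.executeBatch();  // nothing runs until this call

Without the batch, each store triggers its own plan, which would explain
the join being recomputed.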

Here is my code (including just the JOIN and the first computation):

Data loading:

           Analytics.pigServer
             .registerQuery("start_sessions = LOAD 'startSession_sample' USING PigStorage(',') "
               + "AS (sid:chararray, infoid:chararray, imei:chararray, start:long);");
           Analytics.pigServer
             .registerQuery("end_sessions = LOAD 'endSession_sample' USING PigStorage(',') "
               + "AS (sid:chararray, infoid:chararray, imei:chararray, end:long);");

First Join (with storage):

           Analytics.pigServer
             .registerQuery("sessions = JOIN start_sessions BY sid, end_sessions BY sid;");
           Analytics.pigServer.store("sessions", "sessions");
           Analytics.pigServer
             .registerQuery("sessions = LOAD 'sessions' "
               + "AS (start_sessions::sid:chararray, start_sessions::infoid:chararray, "
               + "start_sessions::imei:chararray, start_sessions::start:long, "
               + "end_sessions::sid:chararray, end_sessions::infoid:chararray, "
               + "end_sessions::imei:chararray, end_sessions::end:long);");

First Join (without storage):

           Analytics.pigServer
             .registerQuery("sessions = JOIN start_sessions BY sid, end_sessions BY sid;");

First computation:

           Analytics.pigServer.registerQuery("session_periods = FOREACH sessions "
             + "GENERATE FLATTEN(SessionPeriods('" + timeBucket.toString() + "', start, end)) "
             + "AS (periodid:int, inner_length:long, ...);");
           Analytics.pigServer.registerQuery("period_sessions = "
             + "GROUP session_periods BY periodid;");
           Analytics.pigServer.registerQuery("session_count_and_length = "
             + "FOREACH period_sessions "
             + "GENERATE group, "
             + "COUNT(session_periods), "
             + "SUM(session_periods.inner_length), ...;");
           Analytics.pigServer.store("session_count_and_length", ...(timeBucket));
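
As a comparison point, the store-then-reuse pattern I'm testing would look
roughly like this in a plain Pig Latin script, where the STORE materializes
the join before the later computations reload it (only a sketch; schemas
abbreviated to the columns above, and the reload alias name is made up):

           start_sessions = LOAD 'startSession_sample' USING PigStorage(',')
               AS (sid:chararray, infoid:chararray, imei:chararray, start:long);
           end_sessions = LOAD 'endSession_sample' USING PigStorage(',')
               AS (sid:chararray, infoid:chararray, imei:chararray, end:long);
           sessions = JOIN start_sessions BY sid, end_sessions BY sid;
           STORE sessions INTO 'sessions';
           -- later computations reload the materialized join instead of recomputing it
           sessions2 = LOAD 'sessions';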

Thejas Nair wrote:
> Hi Zaki,
> Please file a jira if you are able to identify the problem you were facing
> and the steps to reproduce it.
> Thanks,
> Thejas
> On 10/7/09 1:08 PM, "zaki rahaman" <[EMAIL PROTECTED]> wrote:
>> Vincent,
>> I've run into this problem before. If you know beforehand that you're going
>> to recycle this joined dataset for several different operations or
>> pipelines, it is worth your time to simply store it as an intermediate
>> result. While Pig can definitely handle this and the multi-query optimizer
>> is great, I've run into problems with it before (can't remember exactly
>> what now), and pre-joining has worked well for me.
>> Hopefully you found some part of that useful.
>> On Wed, Oct 7, 2009 at 12:33 PM, Ashutosh Chauhan <
>> [EMAIL PROTECTED]> wrote:
>>> Hi Vincent,
>>> Pig has a multi-query optimization which, if it fires, will automatically
>>> figure out that the join needs to be done only once, so there will not be
>>> any repetition of work. If Pig determines that it's not safe to do that
>>> optimization, then it's possible that your join is getting computed more
>>> than once. If that's the case, then it is better to do the join and store
>>> it.
>>> You can do that within same script using "exec"
>>> http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#exec
>>> You can read more about multi-query optimization here:
>>> http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#Multi-Query+Execution
>>> Hope it helps,
>>> Ashutosh
>>> On Wed, Oct 7, 2009 at 10:54, Vincent BARAT <[EMAIL PROTECTED]