Pig >> mail # user >> storing intermediate results ?


Re: storing intermediate results ?
Ok, then I did some testing.

Actually, if I store my first JOIN into a file, I see a 50% increase
in the speed of all my subsequent computations.

I guess it may be related to the fact that I use Pig from Java
(maybe the optimizer doesn't work in that mode?).

Here is my code (including just the JOIN and the first computation):

Data loading:
-------------

         Analytics.pigServer
           .registerQuery("start_sessions = LOAD 'startSession_sample' "
             + "USING PigStorage(',') "
             + "AS (sid:chararray, infoid:chararray, imei:chararray, start:long);");
         Analytics.pigServer
           .registerQuery("end_sessions = LOAD 'endSession_sample' "
             + "USING PigStorage(',') "
             + "AS (sid:chararray, infoid:chararray, imei:chararray, end:long);");

First Join (with storage):
---------------------------

         Analytics.pigServer
           .registerQuery("sessions = JOIN start_sessions BY sid, end_sessions BY sid;");
         Analytics.pigServer.store("sessions", "sessions");
         Analytics.pigServer
           .registerQuery("sessions = LOAD 'sessions' "
             + "AS (start_sessions::sid:chararray, start_sessions::infoid:chararray, "
             + "start_sessions::imei:chararray, start_sessions::start:long, "
             + "end_sessions::sid:chararray, end_sessions::infoid:chararray, "
             + "end_sessions::imei:chararray, end_sessions::end:long);");
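
The reload schema above repeats every field by hand with its `alias::` prefix, which is easy to get wrong when the input schemas change. A minimal sketch of generating that LOAD statement instead, assuming a hypothetical helper class `JoinSchema` (my own name, not part of Pig) that only builds the query string:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Pig: builds the LOAD statement for a stored two-way JOIN. */
final class JoinSchema {

    /** Prefixes each "name:type" field with "alias::", as in the hand-written schema. */
    static List<String> qualify(String alias, List<String> fields) {
        List<String> out = new ArrayList<>();
        for (String field : fields) {
            out.add(alias + "::" + field);
        }
        return out;
    }

    /** Builds "relation = LOAD 'path' AS (...);" with left fields first, then right fields. */
    static String reloadQuery(String relation, String path,
                              String leftAlias, List<String> leftFields,
                              String rightAlias, List<String> rightFields) {
        List<String> all = new ArrayList<>(qualify(leftAlias, leftFields));
        all.addAll(qualify(rightAlias, rightFields));
        return relation + " = LOAD '" + path + "' AS (" + String.join(", ", all) + ");";
    }

    public static void main(String[] args) {
        // Field lists copied from the two LOAD statements in "Data loading".
        List<String> start = Arrays.asList(
            "sid:chararray", "infoid:chararray", "imei:chararray", "start:long");
        List<String> end = Arrays.asList(
            "sid:chararray", "infoid:chararray", "imei:chararray", "end:long");
        System.out.println(
            reloadQuery("sessions", "sessions", "start_sessions", start, "end_sessions", end));
    }
}
```

The resulting string could then be passed to `Analytics.pigServer.registerQuery(...)` in place of the hand-written literal. This only removes the duplication in the schema; it does not change how Pig executes the join.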

First join (without storage):
-----------------------------

         Analytics.pigServer
           .registerQuery("sessions = JOIN start_sessions BY sid, end_sessions BY sid;");

First computation:
------------------

         Analytics.pigServer.registerQuery("session_periods = FOREACH sessions "
             + "GENERATE FLATTEN(SessionPeriods('" + timeBucket.toString() + "', start, end)) "
             + "AS (periodid:int, inner_length:long, outer_length:long);");
         Analytics.pigServer.registerQuery("period_sessions = GROUP session_periods BY periodid;");
         Analytics.pigServer.registerQuery("session_count_and_length = FOREACH period_sessions "
             + "GENERATE group, COUNT(session_periods), "
             + "SUM(session_periods.inner_length), "
             + "SUM(session_periods.outer_length);");

         Analytics.pigServer.store("session_count_and_length",
             Analytics.getHadoopOutputFile("session_count_and_length", timeBucket));

Thejas Nair wrote:
> Hi Zaki,
> Please file a jira if you are able to identify the problem you were facing
> and the steps to reproduce it.
> Thanks,
> Thejas
>
>
>
>
> On 10/7/09 1:08 PM, "zaki rahaman" <[EMAIL PROTECTED]> wrote:
>
>> Vincent,
>>
>> I've run into this problem before. If you know beforehand that you're going
>> to recycle this joined dataset for several different operations or
>> pipelines, it is worth your time to simply store it as an intermediate
>> result. While Pig can definitely handle this and the Multiquery Optimizer
>> is great, I've run into problems with it before (can't remember exactly
>> what now), and pre-joining has worked well for me.
>>
>> Hopefully you found some part of that useful.
>>
>> On Wed, Oct 7, 2009 at 12:33 PM, Ashutosh Chauhan <
>> [EMAIL PROTECTED]> wrote:
>>
>>> Hi Vincent,
>>>
>>> Pig has a multi-query optimization which, when it fires, automatically
>>> figures out that the join needs to be done only once, so no work is
>>> repeated. If Pig determines that it's not safe to apply that
>>> optimization, then it's possible that your join is being computed more
>>> than once. If that's the case, it is better to do the join and store it.
>>> You can do that within the same script using "exec":
>>> http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#exec
>>>
>>> You can read more about multi-query optimization here:
>>>
>>> http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#Multi-Query+Execution
>>>
>>> Hope it helps,
>>> Ashutosh
>>>
>>> On Wed, Oct 7, 2009 at 10:54, Vincent BARAT <[EMAIL PROTECTED]