Pig >> mail # user >> storing intermediate results ?


Re: storing intermediate results ?
Ok, then I did some testing.

Actually, if I store my first JOIN into a file, I see a 50% speed
increase in all my subsequent computations.

I guess it may be related to the fact that I use Pig from Java
(maybe the optimizer doesn't work in that mode?).

Here is my code (including just the JOIN and the first computation):

Data loading:
-------------

         Analytics.pigServer
           .registerQuery("start_sessions = LOAD 'startSession_sample' "
             + "USING PigStorage(',') "
             + "AS (sid:chararray, infoid:chararray, imei:chararray, start:long);");
         Analytics.pigServer
           .registerQuery("end_sessions = LOAD 'endSession_sample' "
             + "USING PigStorage(',') "
             + "AS (sid:chararray, infoid:chararray, imei:chararray, end:long);");

First Join (with storage):
---------------------------

         Analytics.pigServer
           .registerQuery("sessions = JOIN start_sessions BY sid, end_sessions BY sid;");
         Analytics.pigServer.store("sessions", "sessions");
         Analytics.pigServer
           .registerQuery("sessions = LOAD 'sessions' "
             + "AS (start_sessions::sid:chararray, start_sessions::infoid:chararray, "
             + "start_sessions::imei:chararray, start_sessions::start:long, "
             + "end_sessions::sid:chararray, end_sessions::infoid:chararray, "
             + "end_sessions::imei:chararray, end_sessions::end:long);");

First join (without storage):
-----------------------------

         Analytics.pigServer
           .registerQuery("sessions = JOIN start_sessions BY sid, end_sessions BY sid;");

First computation:
------------------

         Analytics.pigServer.registerQuery("session_periods = FOREACH sessions "
             + "GENERATE FLATTEN(SessionPeriods('" + timeBucket.toString() + "', start, end)) "
             + "AS (periodid:int, inner_length:long, outer_length:long);");
         Analytics.pigServer.registerQuery(
             "period_sessions = GROUP session_periods BY periodid;");
         Analytics.pigServer.registerQuery("session_count_and_length"
             + " = FOREACH period_sessions " + "GENERATE group, "
             + "COUNT(session_periods), "
             + "SUM(session_periods.inner_length), "
             + "SUM(session_periods.outer_length);");

         Analytics.pigServer.store("session_count_and_length",
             Analytics.getHadoopOutputFile("session_count_and_length", timeBucket));
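
Side note on the reload step in the "with storage" variant: the joined
schema has to be repeated by hand in the LOAD statement, which is easy to
get wrong. A minimal sketch of a helper that builds that statement from a
list of (name, type) pairs could look like the following. PigReload and
reloadStatement are hypothetical names, not part of Pig; the result is
only a string that would then be passed to registerQuery.

```java
import java.util.StringJoiner;

/** Hypothetical helper (not part of Pig): builds a "alias = LOAD ... AS (...)" statement. */
final class PigReload {
    private PigReload() {}

    /**
     * @param alias  relation name to assign, e.g. "sessions"
     * @param path   location the intermediate result was stored to
     * @param fields {name, type} pairs, in the order they were stored
     */
    static String reloadStatement(String alias, String path, String[][] fields) {
        // Join "name:type" entries with ", " and wrap them in parentheses.
        StringJoiner schema = new StringJoiner(", ", "(", ")");
        for (String[] field : fields) {
            schema.add(field[0] + ":" + field[1]);
        }
        return alias + " = LOAD '" + path + "' AS " + schema + ";";
    }
}
```

The field list can then be kept in one place and reused whenever the
stored join is reloaded.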

Thejas Nair wrote:
> Hi Zaki,
> Please file a jira if you are able to identify the problem you were facing
> and the steps to reproduce it.
> Thanks,
> Thejas
>
>
>
>
> On 10/7/09 1:08 PM, "zaki rahaman" <[EMAIL PROTECTED]> wrote:
>
>> Vincent,
>>
>> I've run into this problem before. If you know beforehand that you're going
>> to recycle this joined dataset for several different operations or
>> pipelines, it is worth your time to simply store it as an intermediate
>> result. While Pig can definitely handle this and the Multiquery Optimizer
>> is great, I've run into problems with it before (can't remember exactly
>> what now), and pre-joining has worked well for me.
>>
>> Hopefully you found some part of that useful.
>>
>> On Wed, Oct 7, 2009 at 12:33 PM, Ashutosh Chauhan <
>> [EMAIL PROTECTED]> wrote:
>>
>>> Hi Vincent,
>>>
>>> Pig has a multi-query optimization which, if it fires, will automatically
>>> figure out that the join needs to be done only once, so there will not be
>>> any repetition of work. If Pig determines that it's not safe to do that
>>> optimization, then it's possible that your join is getting computed more
>>> than once. If that's the case, then it will be better to do the join and
>>> store it.
>>> You can do that within the same script using "exec":
>>> http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#exec
>>>
>>> You can read more about multi-query optimization here:
>>>
>>> http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#Multi-Query+Execution
>>>
>>> Hope it helps,
>>> Ashutosh
>>>
>>> On Wed, Oct 7, 2009 at 10:54, Vincent BARAT <[EMAIL PROTECTED]