
Pig, mail # user - LOAD once, use multiple times


Re: LOAD once, use multiple times
Thejas Nair 2011-10-04, 23:35
Removing the filter (B) in the example will still result in a similar
plan; the filter operator will simply be absent.

For example, the following query uses only 1 MR job. If your explain
plan showed more MR jobs, your query probably has something that does
not allow the work to be done in a single MR job.

cat multi.pig
l = load 'x' as (a,b,c);
g1 = group l by a;
f1 = foreach g1 generate group as a, MAX(l.b);
store f1 into 'out1';

g2 = group l by b;
f2 = foreach g2 generate group as b, MAX(l.c);
store f2 into 'out2';

g3 = group l by c;
f3 = foreach g3 generate group as c, MAX(l.a);
store f3 into 'out3';

I used the following explain command. I disabled the combiner so that the
plan is smaller (to make it easier to read/understand):

java -Dpig.exec.nocombiner=true -cp pig.jar org.apache.pig.Main \
    -x local -e explain -script multi.pig

I pasted the explain plan output here so that the formatting stays intact -
http://pastebin.com/yGJs8Q4t

-Thejas
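
A hand-rolled Java MapReduce version of what is described further down in
this thread -- tag each map output key with an index, and demultiplex in
the reduce -- might look roughly like the sketch below. It assumes
tab-separated numeric fields (a, b, c) as in the script above; the class
names, the "index\tkey" composite-key encoding, and the MultipleOutputs
names are illustrative choices, not Pig's actual internals.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MultiGroupBy {

  // One named output per logical pipeline (group by a, b, c).
  static final String[] OUTS = { "out1", "out2", "out3" };

  // Read each line once and emit it three times, each copy tagged with the
  // index of the pipeline it belongs to.
  public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");            // fields a, b, c
      ctx.write(new Text("0\t" + f[0]), new Text(f[1]));   // group by a, MAX(b)
      ctx.write(new Text("1\t" + f[1]), new Text(f[2]));   // group by b, MAX(c)
      ctx.write(new Text("2\t" + f[2]), new Text(f[0]));   // group by c, MAX(a)
    }
  }

  // The reducer plays the demux role: the leading index in the key decides
  // which pipeline's aggregate to compute and which output file to write.
  public static class DemuxReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context ctx) {
      mos = new MultipleOutputs<Text, Text>(ctx);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> vals, Context ctx)
        throws IOException, InterruptedException {
      String[] k = key.toString().split("\t", 2); // k[0] = index, k[1] = group key
      long max = Long.MIN_VALUE;                  // assumes numeric fields
      for (Text v : vals) max = Math.max(max, Long.parseLong(v.toString()));
      mos.write(OUTS[Integer.parseInt(k[0])],
                new Text(k[1]), new Text(Long.toString(max)));
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
      mos.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "multi-group-by");
    job.setJarByClass(MultiGroupBy.class);
    job.setMapperClass(TagMapper.class);
    job.setReducerClass(DemuxReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Output files appear as out1-r-*, out2-r-*, out3-r-* in the output dir.
    for (String name : OUTS) {
      MultipleOutputs.addNamedOutput(job, name, TextOutputFormat.class,
                                     Text.class, Text.class);
    }
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With one job and a single scan of the input, all three group-bys share the
same shuffle; the default hash partitioner spreads the tagged keys across
reducers, which is essentially the one-MR-job plan the explain output
above shows.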

On 10/4/11 4:13 PM, Something Something wrote:
> Thanks, Thejas!  Great presentation.  But slides 16 & 17 are a bit
> different from my example.  In your example on slide 16, the output from
> the filter (B) is used twice.  The question is what happens when 'A' (the
> output from LOAD) is used multiple times.
>
> I ran my pig script with command similar to the following:
>
> pig -e 'explain -dot -out ./my.dot -script myscript.pig' >> explain.txt
>
> I noticed in the graph that 'LOAD' splits the input into 3 SplitOutput[log]
> and then runs 3 MR jobs - which is what I expected.  Now I am just trying
> to understand how I can create 3 SplitOutput like this outside Pig & feed
> them into 3 different MR jobs in Java.  Knowing this will make me
> appreciate Pig more :-)
>
>
> On Tue, Oct 4, 2011 at 10:15 AM, Thejas Nair<[EMAIL PROTECTED]>  wrote:
>
>> See slides 16 and 17 in
>> http://www.slideshare.net/thejasmn/apache-pig-performance-optimizations-talk-at-apachecon-2010.
>>
>> For the query in the example, Pig includes an index in the map output
>> key, and the reduce side has a PODemux operator that sends the records to
>> the appropriate reduce plans. There are no distinct reduce tasks for each
>> group operation.
>>
>> -Thejas
>>
>>
>>
>> On 10/3/11 9:35 PM, Something Something wrote:
>>
>>> Let me ask the question differently.  Let's say I was not using Pig, and
>>> I wanted to do this just using Java MapReduce.  The input file is HUGE.
>>> One obvious way to do this would be to write 3 different MR jobs.  But
>>> that means this huge file would be read 3 times, which is what I am
>>> trying to avoid.
>>>
>>> Is there a way to write a Mapper that will read this file only once, and
>>> then write to 3 different Reducers with different keys?
>>>
>>> Going back to Pig, when I LOAD this file & then later 'group by' 3
>>> different keys, how does Pig do this?  Does it "LOAD" this input file
>>> into some interim file & call 3 different MapReduce jobs?
>>>
>>>
>>> If this makes no sense, please ignore it.  I will try to use 'Explain'
>>> and 'Describe' to learn the internals.  Thanks.
>>>
>>>
>>> On Mon, Oct 3, 2011 at 6:04 PM, Jonathan Coveney<[EMAIL PROTECTED]> wrote:
>>>
>>>> If you want to know more about the internals, I'd check out the paper
>>>> Yahoo put out on the topic (or, of course, buy the book Programming Pig).
>>>>
>>>> The answer to this is pretty simple: if you load a file multiple times
>>>> into different relations, then it will be scanned multiple times. So...
>>>>
>>>> a = load 'thing';
>>>> b = load 'thing';
>>>>
>>>> {..stuff using a..}
>>>> {..stuff using b..}
>>>>
>>>> would load 'thing' twice. This is done for joins and whatnot -- there are
>>>> cases when you need to load the same file separately, twice. What happens
>>>> is essentially that you're going to load and scan the data twice.
>>>>
>>>> However, as in your case, if you instead combine the load, then you'd
>>>> have
>