Re: LOAD once, use multiple times
Removing the filter (B) in the example will still result in a similar
plan; only the filter operator will be absent.

For example, the following query uses only 1 MR job. If your explain
plan showed more MR jobs, your query probably has something that does
not allow the work to be done in a single MR job.

cat multi.pig
l = load 'x' as (a,b,c);
g1 = group l by a;
f1 = foreach g1 generate group as a, MAX(l.b);
store f1 into 'out1';

g2 = group l by b;
f2 = foreach g2 generate group as b, MAX(l.c);
store f2 into 'out2';

g3 = group l by c;
f3 = foreach g3 generate group as c, MAX(l.a);
store f3 into 'out3';
I used the following explain command. I disabled the combiner so that the
plan is smaller (to make it easier to read/understand):

java -Dpig.exec.nocombiner=true -cp pig.jar org.apache.pig.Main -x local -e explain -script multi.pig

I pasted the explain plan output here so that the formatting is preserved:
http://pastebin.com/yGJs8Q4t

-Thejas
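
For intuition, here is a rough Java sketch (not Pig's actual generated
code; all class names are hypothetical) of the single-scan idea discussed
above and in the quoted replies below: one mapper tags its output keys
with an index for each of the three group-bys, and one reducer demuxes on
that tag, much like Pig's PODemux operator. It assumes the input is
tab-separated with numeric columns a, b, c, as in multi.pig.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MultiGroupBy {

  // Map: read each record once and emit one tagged key per grouping.
  // Key layout: "<tag>\t<groupValue>", e.g. "1\t<value of a>".
  public static class TagMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");   // columns a, b, c
      if (f.length < 3) return;                   // skip malformed rows
      ctx.write(new Text("1\t" + f[0]), line);    // feeds "group l by a"
      ctx.write(new Text("2\t" + f[1]), line);    // feeds "group l by b"
      ctx.write(new Text("3\t" + f[2]), line);    // feeds "group l by c"
    }
  }

  // Reduce: demux on the tag and compute the aggregate that belongs to
  // that grouping: MAX(b), MAX(c), or MAX(a) respectively.
  public static class DemuxReducer
      extends Reducer<Text, Text, Text, LongWritable> {
    @Override
    protected void reduce(Text taggedKey, Iterable<Text> rows, Context ctx)
        throws IOException, InterruptedException {
      char tag = taggedKey.toString().charAt(0);
      int col = (tag == '1') ? 1 : (tag == '2') ? 2 : 0;
      long max = Long.MIN_VALUE;
      for (Text row : rows) {
        String[] f = row.toString().split("\t");
        max = Math.max(max, Long.parseLong(f[col])); // assumes numeric data
      }
      ctx.write(taggedKey, new LongWritable(max));
    }
  }
}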

On 10/4/11 4:13 PM, Something Something wrote:
> Thanks, Thejas!  Great presentation.  But slides 16 & 17 are a bit
> different from my example.  In your example on slide 16, the output from
> the filter (B) is used twice.  The question is what happens when 'A' (the
> output from LOAD) is used multiple times.
>
> I ran my pig script with a command similar to the following:
>
> pig -e 'explain -dot -out ./my.dot -script myscript.pig' >> explain.txt
>
> I noticed in the graph that 'LOAD' splits the input into 3 SplitOutput[log]
> and then runs 3 MR jobs - which is what I expected.  Now I am just trying to
> understand how I can create 3 SplitOutput like this outside Pig & feed them
> into 3 different MR jobs in Java.  Knowing this will make me appreciate Pig
> even more :-)
>
>
> On Tue, Oct 4, 2011 at 10:15 AM, Thejas Nair <[EMAIL PROTECTED]> wrote:
>
>> See slides 16,17 in
>> http://www.slideshare.net/thejasmn/apache-pig-performance-optimizations-talk-at-apachecon-2010
>>
>> For the query in the example, Pig includes an index in the map output key,
>> and the reduce stage has a PODemux operator that sends the records to the
>> appropriate reduce plans. There are no distinct reduce tasks for each group
>> operation.
>>
>> -Thejas
>>
>>
>>
>> On 10/3/11 9:35 PM, Something Something wrote:
>>
>>> Let me ask the question differently.  Let's say I was not using Pig.  I
>>> wanted to do this using just Java MapReduce.  The input file is HUGE.  One
>>> obvious way to do this would be to write 3 different MR jobs.  But that
>>> means this huge file would be read 3 times, which is what I am trying to
>>> avoid.
>>>
>>> Is there a way to write a Mapper that will read this file only once, and
>>> then write to 3 different Reducers with different keys?
>>>
>>> Going back to Pig, when I LOAD this file & then later 'group by' 3 different
>>> keys, how does Pig do this?  Does it "LOAD" this input file into some
>>> interim file & call 3 different MapReduce jobs?
>>>
>>>
>>> If this makes no sense, please ignore it.  I will try to use 'Explain',
>>> 'Describe' to learn the internals.  Thanks.
>>>
>>>
>>> On Mon, Oct 3, 2011 at 6:04 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
>>>
>>>> If you want to know more about the internals, I'd check out the paper
>>>> Yahoo put out on the topic (or, of course, buy the book Programming Pig).
>>>>
>>>> The answer to this is pretty simple: if you load a file multiple times
>>>> into different relations, then it will be scanned multiple times. So...
>>>>
>>>> a = load 'thing';
>>>> b = load 'thing';
>>>>
>>>> {..stuff using a..}
>>>> {..stuff using b..}
>>>>
>>>> would load 'thing' twice. This is done for joins and whatnot -- there are
>>>> cases when you need to load the same file separately, twice. What happens
>>>> is essentially that you're going to load and scan the data twice.
>>>>
>>>> However, as in your case, if you instead combine the load, then you'd have
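
To round out the sketch above: a hypothetical driver that wires the
tagged-key mapper and demux reducer into a single job, so the input is
scanned only once (the class and path names are, again, assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiGroupByDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "multi-group-by");
    job.setJarByClass(MultiGroupByDriver.class);
    job.setMapperClass(MultiGroupBy.TagMapper.class);
    job.setReducerClass(MultiGroupBy.DemuxReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path("x"));              // same input as multi.pig
    FileOutputFormat.setOutputPath(job, new Path("out-combined")); // hypothetical output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because the tag is the leading part of the map output key, records for all
three groupings are hash-partitioned across the same set of reduce tasks,
which matches Thejas's point above that there are no distinct reduce tasks
per group operation.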