Re: LOAD once, use multiple times
Thejas Nair 2011-10-04, 23:35

Removing the filter (B) in the example will still result in a similar
plan, just that the filter operator will be absent.

For example, the following query uses only 1 MR job. If your explain plan
showed more MR jobs, your query probably has something that does not allow
the work to be done in a single MR job.

cat mutli.pig
l = load 'x' as (a,b,c);
g1 = group l by a;
f1 = foreach g1 generate group as a, MAX(l.b);
store f1 into 'out1';
g2 = group l by b;
f2 = foreach g2 generate group as b, MAX(l.c);
store f2 into 'out2';
g3 = group l by c;
f3 = foreach g3 generate group as c, MAX(l.a);
store f3 into 'out3';

I used the following explain command. I disabled the combiner so that the
plan is smaller (to make it easy to read/understand) -

java -Dpig.exec.nocombiner=true -cp pig.jar org.apache.pig.Main -x local -e explain -script mutli.pig

Pasted the explain plan output here so that formatting is fine -
http://pastebin.com/yGJs8Q4t

-Thejas

On 10/4/11 4:13 PM, Something Something wrote:
> Thanks, Thejas! Great presentation. But the slides 16 & 17 are a bit
> different from my example. In your example on slide 16, the output from
> filter (B) is used twice. Question is what happens when 'A' (output from
> LOAD) is used multiple times.
>
> I ran my pig script with command similar to the following:
>
> pig -e 'explain -dot -out ./my.dot -script myscript.pig' >> explain.txt
>
> I noticed in the Graph that 'LOAD' splits the input into 3 SplitOutput[log]
> and then runs 3 MR jobs - which is what I expected. Now I am just trying to
> understand how I can create 3 SplitOutput like this outside Pig & feed them
> in 3 different MR jobs in Java. Knowing this will make me appreciate Pig
> more -:)
>
>
> On Tue, Oct 4, 2011 at 10:15 AM, Thejas Nair<[EMAIL PROTECTED]> wrote:
>
>> See slides 16,17 in
>> http://www.slideshare.net/thejasmn/apache-pig-performance-optimizations-talk-at-apachecon-2010 .
>>
>> For the query in example, pig includes an index in the map output key, and
>> the reduce has a PODemux operator that sends the records to appropriate
>> reduce plans. There are no distinct reduce tasks for each group operation.
>>
>> -Thejas
>>
>>
>> On 10/3/11 9:35 PM, Something Something wrote:
>>
>>> Let me ask the question differently. Let's say I was not using Pig. I
>>> wanted to do this just using Java MapReduce. The input file is HUGE. One
>>> obvious way to do this would be to write 3 different MR jobs. But that
>>> means this huge file would be read 3 times, which is what I am trying to
>>> avoid.
>>>
>>> Is there a way to write a Mapper that will read this file only once, and
>>> then write to 3 different Reducers with different keys?
>>>
>>> Going back to Pig, when I LOAD this file & then later 'group by' 3
>>> different keys, how does Pig do this? Does it "LOAD" this input file into
>>> some interim file & call 3 different Map Reduce jobs?
>>>
>>> If this makes no sense, please ignore it. I will try to use 'Explain',
>>> 'Describe' to learn the internals. Thanks.
>>>
>>>
>>> On Mon, Oct 3, 2011 at 6:04 PM, Jonathan Coveney<[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> If you want to know more about the internals, I'd check out the paper
>>>> Yahoo put out on the topic (or, of course, buy the book Programming Pig).
>>>>
>>>> The answer to this is pretty simple: if you load a file multiple times
>>>> into different relations, then it will be scanned multiple times. So...
>>>>
>>>> a = load 'thing';
>>>> b = load 'thing';
>>>>
>>>> {..stuff using a..}
>>>> {..stuff using b..}
>>>>
>>>> would load 'thing' twice. This is done for joins and whatnot -- there are
>>>> cases when you need to load the same file separately, twice. What happens
>>>> is essentially that you're going to load and scan the data twice.
>>>>
>>>> However, as in your case, if you instead combine the load, then you'd
>>>> have
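
The mechanism described in this thread (an index in the map output key, and
a demultiplexing step in the reduce) can be illustrated with a minimal
plain-Java sketch. It is a simulation only: it uses no Hadoop classes and is
not Pig's actual PODemux code. The class name SingleScanMultiGroup, the
composite key built with List.of(planIndex, groupValue), and the sample rows
in main are all assumptions made for this example. The three branches mirror
the three group/MAX statements in mutli.pig.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation (no Hadoop, not Pig's code) of one scan feeding three group-bys. */
final class SingleScanMultiGroup {

    /**
     * "Map" phase: every input row (a, b, c) is read exactly once and emitted
     * three times, each time under a key tagged with the index of the group-by
     * it belongs to. The returned map stands in for the shuffle, which brings
     * all values with the same key together.
     */
    static Map<List<Integer>, List<Integer>> mapAndShuffle(List<int[]> rows) {
        Map<List<Integer>, List<Integer>> shuffled = new HashMap<>();
        for (int[] row : rows) {
            int a = row[0], b = row[1], c = row[2];
            emit(shuffled, 0, a, b); // branch 0: group by a, MAX(b)
            emit(shuffled, 1, b, c); // branch 1: group by b, MAX(c)
            emit(shuffled, 2, c, a); // branch 2: group by c, MAX(a)
        }
        return shuffled;
    }

    /** Hypothetical composite key: [planIndex, groupValue]. */
    private static void emit(Map<List<Integer>, List<Integer>> out,
                             int planIndex, int groupValue, int value) {
        out.computeIfAbsent(List.of(planIndex, groupValue), k -> new ArrayList<>()).add(value);
    }

    /**
     * "Reduce" phase: one loop handles all keys and uses the index stored in
     * the key to route each result to the output of the matching branch.
     */
    static List<Map<Integer, Integer>> reduce(Map<List<Integer>, List<Integer>> shuffled) {
        List<Map<Integer, Integer>> outputs = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            outputs.add(new TreeMap<>());
        }
        for (Map.Entry<List<Integer>, List<Integer>> group : shuffled.entrySet()) {
            int planIndex = group.getKey().get(0);
            int groupValue = group.getKey().get(1);
            int max = Collections.max(group.getValue());
            outputs.get(planIndex).put(groupValue, max);
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Made-up sample rows (a, b, c), for illustration only.
        List<int[]> rows = List.of(
                new int[] {1, 10, 100},
                new int[] {1, 20, 200},
                new int[] {2, 10, 300});
        List<Map<Integer, Integer>> out = reduce(mapAndShuffle(rows));
        System.out.println("out1 (a -> MAX(b)): " + out.get(0));
        System.out.println("out2 (b -> MAX(c)): " + out.get(1));
        System.out.println("out3 (c -> MAX(a)): " + out.get(2));
    }
}
```

In a real MapReduce job the same idea would presumably become a composite map
output key plus a reducer that switches on the index. Hadoop's MultipleOutputs
class is one way to write each branch to its own output, though that is a
suggestion and not something stated in this thread.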