Re: LOAD once, use multiple times
Thejas Nair 2011-10-04, 17:15
See slides 16 and 17 in
http://www.slideshare.net/thejasmn/apache-pig-performance-optimizations-talk-at-apachecon-2010 .
For the query in the example, Pig includes an index in the map output key, and the reduce side has a PODemux operator that sends the records to the appropriate reduce plans. There are no distinct reduce tasks for each group operation.

-Thejas

On 10/3/11 9:35 PM, Something Something wrote:
> Let me ask the question differently. Let's say I was not using Pig. I
> wanted to do this just using Java MapReduce. The input file is HUGE. One
> obvious way to do this would be to write 3 different MR jobs. But that
> means this huge file be read 3 times which is what I am trying to avoid.
>
> Is there a way to write a Mapper that will read this file only once, and
> then write to 3 different Reducers with different keys?
>
> Going back to Pig, when I LOAD this file & then later 'group by' 3 different
> keys, how does Pig do this? Does it "LOAD" this input file into some
> interim file & call 3 different Map Reduce jobs?
>
> If this makes no sense, please ignore it. I will try to use 'Explain',
> 'Describe' to learn the internals. Thanks.
>
>
> On Mon, Oct 3, 2011 at 6:04 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
>
>> If you want to know more about the internals, I'd check out the paper Yahoo
>> put out on the topic (or, of course, buy the book Programming Pig).
>>
>> The answer to this is pretty simple: if you load a file multiple times into
>> different relations, then it will be scanned multiple times. So...
>>
>> a = load 'thing';
>> b = load 'thing';
>>
>> {..stuff using a..}
>> {..stuff using b..}
>>
>> would load 'thing' twice. This is done for joins and whatnot -- there are
>> cases when you need to load the same file separately, twice. What happens is
>> essentially that you're going to load and scan the data twice.
>>
>> However, as in your case, if you instead combine the load, then you'd have
>>
>> a = load 'thing';
>> {..stuff using a..}
>> {..stuff using a (which previously used b)..}
>>
>> Now it will just scan a once, and then go into each of the pipelines you
>> defined.
>>
>> Obviously it's more complex than that, but that's the general gist.
>>
>> 2011/10/3 Something Something <[EMAIL PROTECTED]>
>>
>>> I have 3 Pig scripts that load data from the same log file, but filter &
>>> group this data differently. If I combine these 3 into one & LOAD only
>>> once, performance seems to have improved, but now I am curious exactly what
>>> does LOAD do?
>>>
>>> How does LOAD work internally? Does Pig save results of the LOAD into some
>>> separate location in HDFS? Someone please explain how LOAD relates to
>>> MapReduce? Thanks.
>>>
>>
>
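
For anyone reading this thread later: the idea in the top reply (an index in the map output key, and a demux step on the reduce side that routes records to the right reduce plan) can be sketched without Hadoop at all. The Python below is only a toy simulation under stated assumptions, not Pig's actual implementation. The sample records, the three grouping keys, and the function names are all hypothetical, and each "reduce plan" is assumed to be a plain count. It shows a single pass over the input feeding three group-by pipelines through one shuffle.

```python
from collections import defaultdict

# Hypothetical log records: (user, url, status). Not taken from the thread.
RECORDS = [
    ("alice", "/home", 200),
    ("bob", "/home", 404),
    ("alice", "/cart", 200),
]

# One key extractor per "group by" pipeline. The list position is the
# pipeline index that gets attached to the map output key.
PIPELINES = [
    lambda rec: rec[0],  # pipeline 0: group by user
    lambda rec: rec[1],  # pipeline 1: group by url
    lambda rec: rec[2],  # pipeline 2: group by status
]


def map_phase(records):
    """Read each record once and emit one tagged pair per pipeline."""
    for rec in records:
        for index, key_of in enumerate(PIPELINES):
            yield (index, key_of(rec)), rec


def shuffle(pairs):
    """Stand-in for the shuffle: collect values under each tagged key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Demux on the index in the key, then count the records per group."""
    results = [dict() for _ in PIPELINES]
    for (index, key), values in grouped.items():
        # The index picks the pipeline; this assumed "reduce plan" just counts.
        results[index][key] = len(values)
    return results


by_user, by_url, by_status = reduce_phase(shuffle(map_phase(RECORDS)))
print(by_user)    # {'alice': 2, 'bob': 1}
print(by_url)     # {'/home': 2, '/cart': 1}
print(by_status)  # {200: 2, 404: 1}
```

The input is iterated exactly once in map_phase, and the index in the key is the only thing that keeps the three groupings apart. The cost is that every record is emitted once per pipeline, so the shuffle carries more data than a single group-by would.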