Pig user mailing list: LOAD once, use multiple times


Re: LOAD once, use multiple times
See slides 16 and 17 in
http://www.slideshare.net/thejasmn/apache-pig-performance-optimizations-talk-at-apachecon-2010

For the query in the example, Pig includes an index in the map output
key, and the reduce side has a PODemux operator that routes each record
to the appropriate reduce plan. There are no separate reduce tasks for
each group operation; all of the groups are processed within a single
MapReduce job.
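
(For illustration, here is a minimal sketch of the kind of script being
discussed; the file name and field names are hypothetical.)  One LOAD
feeds three GROUP pipelines, and the multi-query optimization merges
them into a single MapReduce job:

-- hypothetical log file and schema; one scan of the input feeds all three pipelines
logs = LOAD 'access_log' AS (user:chararray, url:chararray, ip:chararray);

by_user = GROUP logs BY user;
user_counts = FOREACH by_user GENERATE group, COUNT(logs);

by_url = GROUP logs BY url;
url_counts = FOREACH by_url GENERATE group, COUNT(logs);

by_ip = GROUP logs BY ip;
ip_counts = FOREACH by_ip GENERATE group, COUNT(logs);

-- multiple STOREs in one script are what allow the multi-query optimization
STORE user_counts INTO 'out/by_user';
STORE url_counts INTO 'out/by_url';
STORE ip_counts INTO 'out/by_ip';

Running 'explain' on a script like this shows one merged map-reduce job
rather than three: the map output key carries an index identifying the
pipeline, and the demux operator on the reduce side routes each record
accordingly, as described above.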

-Thejas
On 10/3/11 9:35 PM, Something Something wrote:
> Let me ask the question differently.  Let's say I was not using Pig, and I
> wanted to do this just using Java MapReduce.  The input file is HUGE.  One
> obvious way to do this would be to write 3 different MR jobs, but that
> means this huge file would be read 3 times, which is what I am trying to avoid.
>
> Is there a way to write a Mapper that will read this file only once, and
> then write to 3 different Reducers with different keys?
>
> Going back to Pig, when I LOAD this file & then later 'group by' 3 different
> keys, how does Pig do this?  Does it "LOAD" this input file into some
> interim file & call 3 different MapReduce jobs?
>
> If this makes no sense, please ignore it.  I will try to use 'Explain' and
> 'Describe' to learn the internals.  Thanks.
>
>
> On Mon, Oct 3, 2011 at 6:04 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
>
>> If you want to know more about the internals, I'd check out the paper Yahoo
>> put out on the topic (or, of course, buy the book Programming Pig).
>>
>> The answer to this is pretty simple: if you load a file multiple times into
>> different relations, then it will be scanned multiple times. So...
>>
>> a = load 'thing';
>> b = load 'thing';
>>
>> {..stuff using a..}
>> {..stuff using b..}
>>
>> would load 'thing' twice. This is done for joins and whatnot -- there are
>> cases when you need to load the same file separately, twice. What happens
>> is
>> essentially that you're going to load and scan the data twice.
>>
>> However, as in your case, if you instead combine the load, then you'd have
>>
>> a = load 'thing';
>> {..stuff using a..}
>> {..stuff using a (which previously used b)..}
>>
>> Now it will just scan a once, and then go into each of the pipelines you
>> defined.
>>
>> Obviously it's more complex than that, but that's the general gist.
>>
>> 2011/10/3 Something Something <[EMAIL PROTECTED]>
>>
>>> I have 3 Pig scripts that load data from the same log file, but filter &
>>> group this data differently.  If I combine these 3 into one & LOAD only
>>> once, performance seems to have improved, but now I am curious: what
>>> exactly does LOAD do?
>>>
>>> How does LOAD work internally?  Does Pig save the results of the LOAD into
>>> some separate location in HDFS?  Could someone please explain how LOAD
>>> relates to MapReduce?  Thanks.
>>>
>>
>