Re: LOAD once, use multiple times
See slides 16 and 17 in
http://www.slideshare.net/thejasmn/apache-pig-performance-optimizations-talk-at-apachecon-2010

For the query in that example, Pig includes an index in the map output key,
and the reduce side has a PODemux operator that sends the records to the
appropriate reduce plans. There are no separate reduce tasks for each
group operation.
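
For concreteness, here is a minimal sketch of the kind of script this
optimization applies to (file, field, and relation names are hypothetical):
a single LOAD feeding two GROUP pipelines, which multi-query execution
compiles into one combined MapReduce job.

logs     = LOAD 'input' AS (user:chararray, url:chararray);
by_user  = GROUP logs BY user;   -- reduce plan 1
by_url   = GROUP logs BY url;    -- reduce plan 2
cnt_user = FOREACH by_user GENERATE group, COUNT(logs);
cnt_url  = FOREACH by_url GENERATE group, COUNT(logs);
STORE cnt_user INTO 'out_by_user';
STORE cnt_url INTO 'out_by_url';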

-Thejas
On 10/3/11 9:35 PM, Something Something wrote:
> Let me ask the question differently.  Let's say I was not using Pig.  I
> wanted to do this just using Java MapReduce.  The input file is HUGE.  One
> obvious way to do this would be to write 3 different MR jobs.  But that
> means this huge file would be read 3 times, which is what I am trying to avoid.
>
> Is there a way to write a Mapper that will read this file only once, and
> then write to 3 different Reducers with different keys?
>
> Going back to Pig, when I LOAD this file & then later 'group by' 3 different
> keys, how does Pig do this?  Does it "LOAD" this input file into some
> interim file & call 3 different MapReduce jobs?
>
> If this makes no sense, please ignore it.  I will try to use 'Explain' and
> 'Describe' to learn the internals.  Thanks.
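>
> For reference, DESCRIBE and EXPLAIN can be run on any alias from the Grunt
> shell (the alias and file names below are hypothetical):
>
> grunt> a = LOAD 'mylog' AS (user:chararray, url:chararray);
> grunt> DESCRIBE a;   -- prints the schema of a
> grunt> b = GROUP a BY user;
> grunt> EXPLAIN b;    -- prints the logical, physical, and MapReduce plans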
>
>
> On Mon, Oct 3, 2011 at 6:04 PM, Jonathan Coveney<[EMAIL PROTECTED]>  wrote:
>
>> If you want to know more about the internals, I'd check out the paper Yahoo
>> put out on the topic (or, of course, buy the book Programming Pig).
>>
>> The answer to this is pretty simple: if you load a file multiple times into
>> different relations, then it will be scanned multiple times. So...
>>
>> a = load 'thing';
>> b = load 'thing';
>>
>> {..stuff using a..}
>> {..stuff using b..}
>>
>> would load 'thing' twice. This is done for joins and whatnot -- there are
>> cases when you need to load the same file separately, twice. What happens is
>> essentially that you're going to load and scan the data twice.
>>
>> However, as in your case, if you instead combine the load, then you'd have
>>
>> a = load 'thing';
>> {..stuff using a..}
>> {..stuff using a (which previously used b)..}
>>
>> Now it will just scan a once, and then go into each of the pipelines you
>> defined.
>>
>> Obviously it's more complex than that, but that's the general gist.
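>>
>> To make the combined version concrete (field positions and output paths
>> here are hypothetical):
>>
>> a = load 'thing';
>> x = filter a by $0 is not null;
>> y = group a by $1;
>> store x into 'out_x';
>> store y into 'out_y';
>>
>> Both pipelines hang off the single load of 'thing', so the file is scanned
>> once and the two stores share that one scan.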
>>
>> 2011/10/3 Something Something<[EMAIL PROTECTED]>
>>
>>> I have 3 Pig scripts that load data from the same log file, but filter &
>>> group this data differently.  If I combine these 3 into one & LOAD only
>>> once, performance seems to have improved, but now I am curious exactly
>>> what LOAD does.
>>>
>>> How does LOAD work internally?  Does Pig save the results of the LOAD into
>>> some separate location in HDFS?  Can someone please explain how LOAD
>>> relates to MapReduce?  Thanks.
>>>
>>
>