Pig >> mail # user >> reuse same Tuple and ArrayList for every getNext call in LoadFunc?

Re: reuse same Tuple and ArrayList for every getNext call in LoadFunc?
Anything that builds a bag. For example, I was just looking at the
DefaultDataBag code (and, by extension, DistinctDataBag, etc.), and it
does not do any tuple copies. We could, of course, change all the Pig
code to respect the assumption that tuples need to be copied if you
want to keep them across multiple getNext calls, but we'd still get
into trouble with UDFs that other people wrote before this change.
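
A minimal sketch of the failure mode, assuming plain Java collections as
hypothetical stand-ins for the loader's tuple and for a bag that stores
references without copying (this is not Pig code, and the class and method
names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TupleReuseSketch {

    // Hypothetical stand-in for a loader feeding a bag.
    // When reuse is true, one List instance is overwritten for every record,
    // which mimics returning the same Tuple from every getNext call.
    static List<List<String>> collect(String[][] records, boolean reuse) {
        // Stand-in for a bag that keeps references and never copies.
        List<List<String>> bag = new ArrayList<>();
        List<String> shared = new ArrayList<>();
        for (String[] record : records) {
            List<String> row = reuse ? shared : new ArrayList<>();
            row.clear();
            row.addAll(Arrays.asList(record));
            bag.add(row); // stores the reference, no defensive copy
        }
        return bag;
    }

    public static void main(String[] args) {
        String[][] records = { {"a", "1"}, {"b", "2"}, {"c", "3"} };
        // Reused row: every bag entry aliases the last record.
        System.out.println(collect(records, true));  // [[c, 3], [c, 3], [c, 3]]
        // Fresh row per record: entries stay distinct.
        System.out.println(collect(records, false)); // [[a, 1], [b, 2], [c, 3]]
    }
}
```

With reuse, the collector ends up holding three references to one object,
so only the last record survives. Either the producer allocates per record
or the collector copies before storing.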

I am curious why you are interested in this particular inefficiency.
Are you seeing severely degraded performance due to object allocation?


On Sun, Sep 16, 2012 at 10:16 PM, Jim Donofrio <[EMAIL PROTECTED]> wrote:
> Even if I make new tuples and lists, I guess that also means I cannot safely
> reuse a DataByteArray object inside a Tuple across getNext calls?
> Also, wouldn't the conversion to a Bag likely happen only in a reducer, which
> would not be affected by the loader, which only supplies input to the mapper?
> When you talk about downstream code from the loader that assumes that
> each tuple is a new Tuple, is there any code in Pig that assumes that, or are
> you just talking about UDFs and other third-party libs that people write for
> Pig?
> On 09/17/2012 12:44 AM, Dmitriy Ryaboy wrote:
>> I looked into this a while back -- trouble comes when something
>> downstream from the loader tries to collect inputs into a bag, and
>> doesn't do its own copies. One can easily argue that if someone wants
>> to do such collection, it should be their responsibility to ensure
>> they aren't just collecting the same object that keeps being
>> overwritten, but at this point, I think it's too late to convert
>> everyone who might be making the "each tuple is a new tuple"
>> assumption.
>> D
>> On Sun, Sep 16, 2012 at 9:33 PM, Jim Donofrio <[EMAIL PROTECTED]>
>> wrote:
>>> Is it OK to reuse the same Tuple and List of inputs from RecordReader
>>> across all getNext calls in a LoadFunc? I notice that PigStorage creates
>>> a new List, mProtoTuple, for every record along with a new tuple. Since
>>> PigMapBase just uses newTupleNoCopy to copy the List, creating a new
>>> Tuple for every getNext seems unnecessary.