-
reuse same Tuple and ArrayList for every getNext call in LoadFunc?
Jim Donofrio 2012-09-17, 04:33
Is it OK to reuse the same Tuple and List of inputs from the RecordReader across all getNext calls in a LoadFunc? I notice that PigStorage creates a new List, mProtoTuple, for every record, along with a new Tuple. Since PigMapBase just uses newTupleNoCopy to copy the List, creating a new Tuple for every getNext seems unnecessary.
-
Re: reuse same Tuple and ArrayList for every getNext call in LoadFunc?
Dmitriy Ryaboy 2012-09-17, 04:44
I looked into this a while back -- trouble comes when something downstream from the loader tries to collect inputs into a bag, and doesn't do its own copies. One can easily argue that if someone wants to do such collection, it should be their responsibility to ensure they aren't just collecting the same object that keeps being overwritten, but at this point, I think it's too late to convert everyone who might be making the "each tuple is a new tuple" assumption.
D
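To illustrate the failure mode described above, here is a minimal sketch in plain Java. It does not use Pig's classes: the `List<Object>` record stands in for a Tuple, the outer list stands in for a bag that keeps references without copying, and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in classes; not Pig's Tuple or DataBag.
class TupleReuseDemo {

    // "Loader" that overwrites one shared record on every call,
    // while the "bag" keeps a reference to it without copying.
    static List<List<Object>> collectReused(int n) {
        List<Object> shared = new ArrayList<>();
        List<List<Object>> bag = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            shared.clear();
            shared.add(i);
            bag.add(shared); // same object added every time
        }
        return bag;
    }

    // "Loader" that allocates a new record for every call.
    static List<List<Object>> collectFresh(int n) {
        List<List<Object>> bag = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            List<Object> record = new ArrayList<>();
            record.add(i);
            bag.add(record);
        }
        return bag;
    }

    public static void main(String[] args) {
        System.out.println(collectReused(3)); // [[2], [2], [2]]
        System.out.println(collectFresh(3));  // [[0], [1], [2]]
    }
}
```

With the reused record, every entry in the collection ends up showing the last value written, which is the "same object that keeps being overwritten" problem.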
-
Re: reuse same Tuple and ArrayList for every getNext call in LoadFunc?
Jim Donofrio 2012-09-17, 05:16
Even if I make new Tuples and Lists, I guess that also means I cannot safely reuse a DataByteArray object inside a Tuple across getNext calls?
Also, wouldn't the conversion to a Bag most likely happen only in a reducer, which would not be affected by the loader, since the loader only supplies input to the mapper?
When you talk about downstream code from the loader that assumes each tuple is a new Tuple, is there any code in Pig itself that assumes that, or do you just mean UDFs and other third-party libraries that people write for Pig?
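The DataByteArray question can be sketched the same way. This is plain Java rather than Pig code: `Bytes` is a hypothetical mutable byte holder, loosely in the spirit of DataByteArray, and `load` is an invented helper. It assumes the consumer keeps references to the field objects without copying them.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in classes; not Pig's DataByteArray or Tuple.
class SharedFieldDemo {

    // Mutable byte holder whose contents can be overwritten in place.
    static final class Bytes {
        private byte[] data = new byte[0];

        void set(String s) {
            data = s.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public String toString() {
            return new String(data, StandardCharsets.UTF_8);
        }
    }

    // Builds a fresh record (list) per input, but optionally reuses
    // one Bytes object as the field inside every record.
    static List<List<Bytes>> load(String[] inputs, boolean reuseField) {
        Bytes shared = new Bytes();
        List<List<Bytes>> kept = new ArrayList<>();
        for (String in : inputs) {
            Bytes field = reuseField ? shared : new Bytes();
            field.set(in);
            List<Bytes> record = new ArrayList<>(); // new container each time
            record.add(field);
            kept.add(record);
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] inputs = {"a", "b", "c"};
        System.out.println(load(inputs, true));  // [[c], [c], [c]]
        System.out.println(load(inputs, false)); // [[a], [b], [c]]
    }
}
```

Under these assumptions, a new container per record is not enough on its own: a mutable field shared between records is overwritten just like a shared record would be.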
-
Re: reuse same Tuple and ArrayList for every getNext call in LoadFunc?
Dmitriy Ryaboy 2012-09-17, 05:30
Anything that builds a bag -- for example, I was just looking at the DefaultDataBag code (and by extension, DistinctDataBag, etc.) and it does not do any tuple copies. We could, of course, change all the Pig code to respect the assumption that tuples need to be copied if you want to keep them across multiple getNext calls, but we'd still get into trouble with UDFs that other people wrote before this change.
I am curious why you are interested in this particular inefficiency. Are you seeing severely degraded performance due to object allocation?
D
-
Re: reuse same Tuple and ArrayList for every getNext call in LoadFunc?
Jim Donofrio 2012-09-17, 13:15
OK, thanks for the clarification.
I am interested in this because I am new to Pig and am used to writing RecordReaders for MapReduce that reuse the same objects, so I thought the same logic would apply here. I have not done any performance tests.