Jonathan Coveney 2012-05-23, 23:50
Jonathan Coveney 2012-05-24, 00:01
Jonathan Coveney 2012-05-24, 02:10
Jonathan Coveney 2012-05-25, 21:07
Re: Some questions on intermediate serialization in Pig
I am not sure, but I will have a look at it (I implemented the raw
comparator for secondary sort).
I don't remember having to deal with this issue.
On Fri, May 25, 2012 at 11:07 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
> I'll just bump this once. The main thing I'm still unsure about is the
> relationship between the various raw comparators, Pig, and Hadoop. If we're serializing
> RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2, RECORD_3,
> Tuple, RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2, RECORD_3,
> Tuple, RECORD_1, RECORD_2, RECORD_3, Tuple, and so on, how come it appears
> that the raw comparators aren't aware of it?
> 2012/5/23 Jonathan Coveney <[EMAIL PROTECTED]>
> > And one more question to pile on:
> > What defines the binary data that the raw tuple comparator will be run
> > on? It seems like it comes from Hadoop, and the format generally makes
> > sense (you get bytes and do with them what you will). The thing that
> > confuses me is why you don't have to deal with the
> > RECORD_1/RECORD_2/RECORD_3/etc hoopla. InterStorage deals with all of
> > that and reads a deserialized tuple...so at what point do you get binary Tuple
> > data that doesn't have all of the split stuff? I'll keep digging through,
> > but this is where my ignorance of the technicalities of the MR layer
> > comes in...
> > 2012/5/23 Jonathan Coveney <[EMAIL PROTECTED]>
> >> Another question is clarifying what BinStorage does compared to
> >> InterStorage. It looks like it might just be a legacy storage format?
> >> I'm assuming that you do the R_1/R_2/R_3 to be able to find the next
> >> Tuple in the stream, but once you do that, can't you just read a tuple,
> >> then skip 12 bytes (3 ints), and keep reading?
> >> 2012/5/23 Jonathan Coveney <[EMAIL PROTECTED]>
> >>> I'm trying to understand how intermediate serialization in Pig works at
> >>> a deeper level (understanding the whole code path, not just
> >>> in its own vacuum). Right now I am looking at
> >>> InterRecordReader/InterRecordWriter/InterStorage. Is that the right
> >>> place to look for understanding how BinInterSedes is actually called?
> >>> Further, I'm trying to better understand the
> >>> RECORD_1/RECORD_2/RECORD_3 thing. My guess is that it's to make the
> >>> splittable? But I'm not really sure. I'd love any pointers about where
> >>> to look for how BinInterSedes is used, and how intermediate storage works.
> >>> Thanks!
> >>> Jon
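
For anyone following the thread, here is a minimal sketch of the sync-marker idea guessed at above: a fixed marker before every record lets a reader that starts at an arbitrary split offset find the next record boundary, and once it is in sync it can simply step over each following marker. This is not Pig's actual code. The class name, the marker values, and the length-prefixed layout are all assumptions made for illustration, and the sketch ignores the case where the marker bytes happen to occur inside a payload.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not InterRecordReader/InterRecordWriter.
class SyncMarkerSketch {
    // Assumed 3-byte marker written before every record.
    static final byte[] MARKER = {0x01, 0x02, 0x03};

    // Layout per record: MARKER, 4-byte big-endian length, UTF-8 payload.
    static byte[] write(List<String> records) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String record : records) {
            byte[] payload = record.getBytes(StandardCharsets.UTF_8);
            out.write(MARKER, 0, MARKER.length);
            out.write(ByteBuffer.allocate(4).putInt(payload.length).array(), 0, 4);
            out.write(payload, 0, payload.length);
        }
        return out.toByteArray();
    }

    // Index of the first byte after the next marker at or after 'from', or -1.
    static int seekPastMarker(byte[] data, int from) {
        for (int i = Math.max(from, 0); i + MARKER.length <= data.length; i++) {
            if (data[i] == MARKER[0] && data[i + 1] == MARKER[1] && data[i + 2] == MARKER[2]) {
                return i + MARKER.length;
            }
        }
        return -1;
    }

    // Reads every record whose marker starts in [start, end), like one input split.
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> result = new ArrayList<>();
        int pos = seekPastMarker(data, start); // the only scan: sync to a boundary
        while (pos != -1 && pos - MARKER.length < end) {
            int length = ByteBuffer.wrap(data, pos, 4).getInt();
            pos += 4;
            result.add(new String(data, pos, length, StandardCharsets.UTF_8));
            pos += length;
            // Already in sync: the next marker, if any, starts right here, so skip it.
            pos = (pos + MARKER.length <= data.length) ? pos + MARKER.length : -1;
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = write(Arrays.asList("alpha", "beta", "gamma"));
        // Offset 13 falls in the middle of the second record's header.
        System.out.println(readSplit(data, 0, 13));
        System.out.println(readSplit(data, 13, data.length));
    }
}
```

A record belongs to the split in which its marker starts, so two adjacent splits neither drop nor duplicate a record even when the split offset lands mid-record. None of the markers need to reach a comparator in this layout, because the reader consumes them before handing back a payload; whether Pig's shuffle path works the same way is exactly the open question in this thread.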