-
Some questions on intermediate serialization in Pig
Jonathan Coveney 2012-05-23, 23:50
I'm trying to understand how intermediate serialization in Pig works at a deeper level (understanding the whole code path, not just BinInterSedes in its own vacuum). Right now I am looking at InterRecordReader/InterRecordWriter/InterStorage. Is that the right place to look for understanding how BinInterSedes is actually called?
Further, I'm trying to better understand the RECORD_1/RECORD_2/RECORD_3 thing. My guess is that it's there to make the file splittable, but I'm not really sure. I'd love any pointers about where to look for how BinInterSedes is used, and how intermediate storage happens.
Thanks! Jon
-
Re: Some questions on intermediate serialization in Pig
Jonathan Coveney 2012-05-24, 00:01
Another question is clarifying what BinStorage does compared to InterStorage. It looks like it might just be a legacy storage format?
I'm assuming that you write the R_1/R_2/R_3 markers to be able to find the next Tuple in the stream, but once you have found it, can't you just read a tuple, then skip 12 bytes (3 ints), and keep reading?
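The layout in question can be illustrated with a minimal sketch. It assumes a hypothetical 3-byte marker and a length-prefixed payload standing in for a serialized tuple; neither is taken from Pig's source, and the real marker width and tuple encoding may differ. It only shows that, once a reader is positioned on a record boundary, it can consume marker, record, marker, record without searching.

```python
import io
import struct

# Hypothetical stand-in for RECORD_1/RECORD_2/RECORD_3; not Pig's actual constants.
SYNC = b"\x01\x02\x03"

def write_records(out, payloads):
    """Write each payload as: sync marker, 4-byte big-endian length, payload."""
    for payload in payloads:
        out.write(SYNC)
        out.write(struct.pack(">I", len(payload)))
        out.write(payload)

def read_records(inp):
    """Read sequentially from a position known to be a record boundary."""
    records = []
    while True:
        marker = inp.read(len(SYNC))
        if not marker:
            break  # clean end of stream
        if marker != SYNC:
            raise ValueError("expected sync marker")
        (length,) = struct.unpack(">I", inp.read(4))
        records.append(inp.read(length))
    return records

buf = io.BytesIO()
write_records(buf, [b"alpha", b"beta"])
buf.seek(0)
print(read_records(buf))
```

The length prefix is an assumption made to keep the sketch short; a self-describing tuple encoding would make it unnecessary.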
-
Re: Some questions on intermediate serialization in Pig
Jonathan Coveney 2012-05-24, 02:10
And one more question to pile on:
What defines the binary data that the raw tuple comparator will be run on? It seems that it comes from Hadoop, and the format generally makes sense (you get bytes and do with them what you will). What confuses me is why you don't have to deal with the RECORD_1/RECORD_2/RECORD_3 hoopla. InterStorage deals with all of that and reads a deserialized tuple, so at what point do you get binary Tuple data that doesn't have all of the split markers? I'll keep digging, but this is where my ignorance of the technicalities of the MR layer comes in...
-
Re: Some questions on intermediate serialization in Pig
Jonathan Coveney 2012-05-25, 21:07
I'll just bump this once. The main thing I'm still unsure about is the relationship between the various raw comparators, Pig, and Hadoop. If we're serializing RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2, RECORD_3, Tuple, and so on, how come the raw comparators don't appear to be aware of it?
-
Re: Some questions on intermediate serialization in Pig
Russell Jurney 2012-05-25, 21:20
Maverick goes in RECORD_1, Goose goes in RECORD_2 and Goose's dipshit ejection seat goes in RECORD_3. 1 has crooked teeth. 2 is a bloody corpse. And 3... well 3 is to blame for it all. Russell Jurney http://datasyndrome.com
-
Re: Some questions on intermediate serialization in Pig
Gianmarco De Francisci Mo... 2012-05-26, 20:14
I am not sure, but I will have a look at it (I implemented the raw comparator for secondary sort). I don't remember having to deal with this issue.
Cheers, -- Gianmarco
-
Re: Some questions on intermediate serialization in Pig
Jonathan Coveney 2012-05-27, 04:04
I appreciate it, Gianmarco :)
-
Re: Some questions on intermediate serialization in Pig
Ashutosh Chauhan 2012-05-27, 07:48
Hey Jon,
You raised some interesting questions. I don't have answers for all of them, but I do for a few.
* BinStorage is a legacy format that was formerly used for intermediate serialization between MR jobs. It is no longer used for that, but it remains because, unfortunately, people have stored their end data using BinStorage, even though it was considered an internal format and subject to change. People chose it because BinStorage was schema-aware: once you wrote end data with it, you could reload it without specifying a schema. This feature led to its (mis)use. See https://issues.apache.org/jira/browse/PIG-798 for some related bugs.
* I think your intuition is correct: in addition to identifying tuple boundaries, R1, R2, R3 are also used to identify block boundaries, that is, to make the file splittable. You can then split the files arbitrarily among multiple mappers, and each mapper will know where its first record starts.
Hope it helps,
Ashutosh
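The splittability point above can be sketched as follows. This is a minimal illustration under stated assumptions, not Pig's implementation: the marker bytes, the length-prefixed record format, and the helper names `encode` and `read_split` are all hypothetical. The rule it demonstrates is that a reader handed an arbitrary byte range scans forward to the first marker and owns every record whose marker begins inside its range, so two adjacent splits neither drop nor duplicate records.

```python
import struct

# Hypothetical sync marker; not Pig's actual constants.
SYNC = b"\x01\x02\x03"

def encode(payloads):
    """Serialize payloads as: marker, 4-byte big-endian length, payload."""
    return b"".join(SYNC + struct.pack(">I", len(p)) + p for p in payloads)

def read_split(data, start, end):
    """Return the records whose marker begins inside [start, end)."""
    records = []
    pos = data.find(SYNC, start)  # skip the partial record at the split start
    while pos != -1 and pos < end:
        header = pos + len(SYNC)
        (length,) = struct.unpack(">I", data[header:header + 4])
        body = header + 4
        records.append(data[body:body + length])  # may run past `end`
        pos = data.find(SYNC, body + length)
    return records

data = encode([b"alpha", b"beta", b"gamma"])
cut = len(data) // 2  # an arbitrary split point, usually mid-record
print(read_split(data, 0, cut), read_split(data, cut, len(data)))
```

The sketch ignores one real-world complication: a payload that happens to contain the marker bytes would be mistaken for a record boundary when scanning from the middle of a record.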
-
Re: Some questions on intermediate serialization in Pig
Jonathan Coveney 2012-05-27, 08:28
Ashutosh, that definitely does help. Thanks for lending your insight. The thing I still have little color on is the relationship between those raw bits, i.e. RECORD_1 RECORD_2 RECORD_3 TUPLE_BITS and so on, and the various byte[] compare functions.
-
Re: Some questions on intermediate serialization in Pig
Gianmarco De Francisci Mo... 2012-06-09, 17:11
So, to recap.
InterSedes writes the R1/R2/R3 thing. I am quite sure it is done for splittability purposes. The RawComparators, as well as InterStorage, operate on binary data that does not contain these markers.
DISCLAIMER: Guesswork below!
My wild guess is that InterRecordReader has nothing to do with the RawComparator. The former is used at the input of the map phase (and InterRecordWriter is used at the output of the reduce phase), while the RawComparator is used at the boundary between the map and reduce phases. If you look at PigGenericMapReduce you will see that the mapper/reducer always writes a PigNullableWritable as a key. I think this is where the RawComparator actually plays a role, which means it does not directly see the R1/R2/R3 markers because they are stripped out by the RecordReader.
Cheers, -- Gianmarco
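To illustrate the idea of a raw comparator in the recap above, here is a minimal sketch. It does not use Hadoop's or Pig's classes, and the key encoding is a hypothetical one chosen for the example, not the BinInterSedes format. Each key is serialized on its own, with no record markers around it, and the comparator orders keys by looking only at their bytes.

```python
import struct
from functools import cmp_to_key

def serialize_key(n):
    """Encode a signed 64-bit int so that byte order matches numeric order."""
    return struct.pack(">Q", n + 2**63)  # bias shifts the range to unsigned

def deserialize_key(raw):
    """Inverse of serialize_key; used here only to display the result."""
    return struct.unpack(">Q", raw)[0] - 2**63

def raw_compare(a, b):
    """Compare two serialized keys byte-wise, without deserializing them."""
    return (a > b) - (a < b)

keys = [serialize_key(n) for n in (3, -1, 10)]
ordered = sorted(keys, key=cmp_to_key(raw_compare))
print([deserialize_key(k) for k in ordered])
```

The comparator never calls `deserialize_key`; sorting works because the encoding was chosen so that lexicographic byte order agrees with numeric order.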
-
Re: Some questions on intermediate serialization in Pig
Jonathan Coveney 2012-06-09, 20:48
Gianmarco, I think you're absolutely correct. Thanks for weighing in :)