Re: Some questions on intermediate serialization in Pig
So, to recap.

InterSedes writes the R1/R2/R3 markers.
I am quite sure this is done for splittability purposes.
The RawComparators, as well as InterStorage, operate on binary data that
does not contain these markers.
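
To make the framing concrete, here is a toy sketch of the idea. It is not
the actual Pig code, and the 3-byte marker values are invented purely for
illustration.

    // Toy sketch of record framing (NOT Pig's real InterRecordWriter;
    // the marker byte values are made up for this example).
    import java.io.DataOutputStream;
    import java.io.IOException;

    class ToyRecordWriter {
        private static final byte R1 = 0x01, R2 = 0x02, R3 = 0x03;

        void writeRecord(DataOutputStream out, byte[] serializedTuple)
                throws IOException {
            out.writeByte(R1);          // the marker makes record starts
            out.writeByte(R2);          // recognizable when scanning from
            out.writeByte(R3);          // an arbitrary byte offset
            out.write(serializedTuple); // the tuple bytes themselves
        }
    }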

DISCLAIMER: Guesswork below!

My wild guess is that InterRecordReader has nothing to do with the
RawComparator.
The former is used at the input of the map phase (and InterRecordWriter is
used at the output of the reduce phase), while the RawComparator is used at
the boundary between the map and reduce phases.
If you look at PigGenericMapReduce you will see that the mapper/reducer
always writes a PigNullableWritable as the key. I think this is where the
RawComparator actually plays a role, which means it never directly sees
R1/R2/R3, because they are stripped out by the RecordReader.
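
Schematically, the hook would be Hadoop's sort comparator, something like
the sketch below. PigLikeRawComparator is a made-up stand-in; only the
setSortComparatorClass() wiring is real Hadoop API.

    // Where a raw comparator plugs in between map and reduce.
    // PigLikeRawComparator is hypothetical; setSortComparatorClass()
    // is the real Hadoop hook for the shuffle sort.
    import org.apache.hadoop.io.RawComparator;
    import org.apache.hadoop.mapreduce.Job;

    class ShuffleWiring {
        static class PigLikeRawComparator implements RawComparator<Object> {
            // Hadoop calls this on the serialized map-output key bytes.
            // Those bytes never contain R1/R2/R3: the markers live in
            // the intermediate files, already parsed by the RecordReader.
            public int compare(byte[] b1, int s1, int l1,
                               byte[] b2, int s2, int l2) {
                return 0; // a real comparator would order the key bytes
            }
            public int compare(Object a, Object b) {
                return 0; // object fallback, unused on the raw path
            }
        }

        static void wire(Job job) {
            job.setSortComparatorClass(PigLikeRawComparator.class);
        }
    }
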
Cheers,

--
Gianmarco

On Sun, May 27, 2012 at 10:28 AM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> Ashutosh, that definitely does help. Thanks for lending your insight. I
> think the thing I have little color on at the moment is the relationship
> between those raw bits, i.e. RECORD_1, RECORD_2, RECORD_3, TUPLE_BITS,
> and so on, and the various byte[] compare functions.
>
> 2012/5/27 Ashutosh Chauhan <[EMAIL PROTECTED]>
>
> > Hey Jon,
> >
> > You raised some interesting questions. I don't have answers for all of
> > them, but I do for a few.
> >
> > * BinStorage is a legacy format that was used for intermediate
> > serialization between MR jobs. It is no longer used for that, but it is
> > still around because, unfortunately, folks have stored their end-data
> > using BinStorage, even though it was considered an internal format and
> > subject to change. The reason folks chose to store data with it was
> > that BinStorage is schema-aware, so once you wrote end-data with it,
> > you could reload it without specifying a schema. This feature led to
> > its (mis)use. See https://issues.apache.org/jira/browse/PIG-798 for
> > some related bugs around this.
> >
> > * I think you have the correct intuition that, in addition to
> > identifying tuple boundaries, R1/R2/R3 are also used to identify block
> > boundaries, that is, to make the file splittable. You can then
> > arbitrarily split the file among multiple mappers, and each one will
> > know where its first record starts.
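> >
> > As a rough sketch of that seek-to-first-record logic (again with
> > made-up marker bytes, not the actual InterRecordReader):
> >
> >     class SplitSync {
> >         // Scan forward from an arbitrary split offset until the
> >         // (hypothetical) 3-byte record marker is seen; records can
> >         // be read from that point on.
> >         static long syncToFirstRecord(java.io.DataInputStream in)
> >                 throws java.io.IOException {
> >             int a = in.read(), b = in.read(), c = in.read();
> >             long pos = 3;
> >             while (!(a == 0x01 && b == 0x02 && c == 0x03)) {
> >                 int next = in.read();
> >                 if (next == -1) return -1; // no record in this split
> >                 a = b; b = c; c = next;
> >                 pos++;
> >             }
> >             return pos; // offset just past the marker
> >         }
> >     }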
> >
> > Hope it helps,
> > Ashutosh
> >
> > On Sat, May 26, 2012 at 9:04 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
> >
> > > I appreciate it, Gianmarco :)
> > >
> > > 2012/5/26 Gianmarco De Francisci Morales <[EMAIL PROTECTED]>
> > >
> > > > I am not sure, but I will have a look at it (I implemented the raw
> > > > comparator for secondary sort).
> > > > I don't remember having to deal with this issue.
> > > >
> > > > Cheers,
> > > > --
> > > > Gianmarco
> > > >
> > > > On Fri, May 25, 2012 at 11:07 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > I'll just bump this once. The main thing I'm still unsure on is
> > > > > the relationship between the various raw comparators, Pig, and
> > > > > Hadoop. If we're serializing RECORD_1, RECORD_2, RECORD_3, Tuple,
> > > > > RECORD_1, RECORD_2, RECORD_3, Tuple, and so on, how come it
> > > > > appears that the raw comparators aren't aware of it?
> > > > >
> > > > > 2012/5/23 Jonathan Coveney <[EMAIL PROTECTED]>
> > > > >
> > > > > > And one more question to pile on:
> > > > > >
> > > > > > What defines the binary data that the raw tuple comparator will
> > > > > > be run on? It seems like it comes from Hadoop, and the format
> > > > > > generally makes sense (you get bytes and do with them what you
> > > > > > will). The thing that confuses me is why you don't have to deal
> > > > > > with the RECORD_1/RECORD_2/RECORD_3/etc. hooplah. InterStorage
> > > > > > deals with all of that and reads a deserialized tuple... so at
> > > > > > what point do you get