Pig >> mail # dev >> Some questions on intermediate serialization in Pig


Jonathan Coveney 2012-05-23, 23:50
Jonathan Coveney 2012-05-24, 00:01
Jonathan Coveney 2012-05-24, 02:10
Jonathan Coveney 2012-05-25, 21:07
Gianmarco De Francisci Mo... 2012-05-26, 20:14
Jonathan Coveney 2012-05-27, 04:04
Re: Some questions on intermediate serialization in Pig
Hey Jon,

You raised some interesting questions. I don't have answers for all of them,
but I do for a few.

* BinStorage is a legacy format that was used earlier for intermediate
serialization between MR jobs. It is no longer used for that, but it remains
because, unfortunately, folks have stored their end-data using BinStorage,
even though it was considered an internal format and subject to change. The
reason folks chose to store data with it was that BinStorage is schema-aware:
once you wrote end-data with it, you could reload it without specifying a
schema. This feature led to its (mis)use. See
https://issues.apache.org/jira/browse/PIG-798 for some related bugs around
this.

* I think your intuition is correct: in addition to identifying tuple
boundaries, R1, R2, R3 are also used to identify block boundaries, that is,
to make the file splittable. You can then arbitrarily split the file among
multiple mappers, and each one will know where its first record starts.

Hope it helps,
Ashutosh

On Sat, May 26, 2012 at 9:04 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> I appreciate it, Gianmarco :)
>
> 2012/5/26 Gianmarco De Francisci Morales <[EMAIL PROTECTED]>
>
> > I am not sure, but I will have a look at it (I implemented the raw
> > comparator for secondary sort).
> > I don't remember having to deal with this issue.
> >
> > Cheers,
> > --
> > Gianmarco
> >
> >
> >
> >
> > On Fri, May 25, 2012 at 11:07 PM, Jonathan Coveney
> > <[EMAIL PROTECTED]> wrote:
> >
> > > I'll just bump this once. The main thing I'm still unsure on is the
> > > relationship between the various raw comparators, Pig, and Hadoop. If
> > > we're serializing RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1,
> > > RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2, RECORD_3, Tuple,
> > > RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2, RECORD_3,
> > > Tuple, and so on, how come it appears that the raw comparators aren't
> > > aware of it?
> > >
> > > 2012/5/23 Jonathan Coveney <[EMAIL PROTECTED]>
> > >
> > > > And one more question to pile on:
> > > >
> > > > What defines the binary data that the raw tuple comparator will be
> > > > run on? It seems like it comes from Hadoop, and the format generally
> > > > makes sense (you get bytes and do with them what you will). The thing
> > > > that confuses me is why you don't have to deal with the
> > > > RECORD_1/RECORD_2/RECORD_3/etc hoopla? InterStorage deals with all of
> > > > that and reads a deserialized tuple... so at what point do you get
> > > > binary Tuple data that doesn't have all of the split stuff? I'll keep
> > > > digging through, but this is where my ignorance of the technicalities
> > > > of the MR layer comes in...
> > > >
> > > > 2012/5/23 Jonathan Coveney <[EMAIL PROTECTED]>
> > > >
> > > >> Another question is clarifying what BinStorage does compared to
> > > >> InterStorage. It looks like it might just be a legacy storage
> > > >> format?
> > > >>
> > > >> I'm assuming that you do the R_1/R_2/R_3 to be able to find the next
> > > >> Tuple in the stream, but once you do that, can't you just read a
> > > >> tuple, then skip 12 bytes (3 ints), and keep reading?
> > > >>
> > > >>
> > > >> 2012/5/23 Jonathan Coveney <[EMAIL PROTECTED]>
> > > >>
> > > >>> I'm trying to understand how intermediate serialization in Pig
> > > >>> works at a deeper level (understanding the whole code path, not
> > > >>> just BinInterSedes in its own vacuum). Right now I am looking at
> > > >>> InterRecordReader/InterRecordWriter/InterStorage. Is that the right
> > > >>> place to look for understanding how BinInterSedes is actually
> > > >>> called?
> > > >>>
> > > >>> Further, I'm trying to better understand the
> > > >>> RECORD_1/RECORD_2/RECORD_3 thing. My guess is that it's to make the
> > > >>> file splittable? But I'm not really sure. I'd love any pointers
> > > >>> about where to look for how BinInterSedes is used, and how
> > > >>> intermediate storage happens.
> > > >>>
Jonathan Coveney 2012-05-27, 08:28
Gianmarco De Francisci Mo... 2012-06-09, 17:11
Jonathan Coveney 2012-06-09, 20:48
Russell Jurney 2012-05-25, 21:20