Pig >> mail # user >> problems with .gz


Re: problems with .gz
Thanks for sending me this! Glad you found your issue. And though my
mysterious bug stays mysterious, better that than a problem with Pig's
gzip handling.
2013/6/12 William Oberman <[EMAIL PROTECTED]>

> I know what's going on, and it's kind of dumb on my part, but I'll post
> anyway to help anyone else who might be puzzled.  To review, I had data
> that looked like this (and yes, it's corrupt, but it happens sometimes):
> "json\njson\n...json\n\0\0\0...\0\0\0\0middle_of_json\njson\n...json\n"
>
> I.e. a huge block of null characters in the middle of \n-separated JSON.
> Usually the last character before the null block was a \n, but the
> first character after the null block was in the middle of a JSON string.
>
> My custom UDF returned Pig-friendly data structures given JSON.  The "dumb"
> thing was that I returned null on a bad parse, instead of throwing
> IOException.  For Pig, returning null is a signal to stop loading data
> (I should have paid closer attention to the javadoc).
>
> That's why my uncompressed count was greater than my compressed count: it's
> the difference between splits vs. no splits (since gz doesn't allow
> splitting).
>
> In the uncompressed case, blocks before AND AFTER the nulls were OK and
> contributed data to my COUNT(*).
>
> In the compressed case, only data before the nulls contributed to my
> COUNT(*).
>
> will
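Will's explanation can be simulated outside Pig. The sketch below (in Python, with a made-up data layout and split boundary, purely for illustration) contrasts a loader that treats the first bad parse as end-of-input over one unsplittable stream — the gzip case — with the same loader run per split, where records after the corruption still get counted:

```python
import json

# Hypothetical reconstruction of the corrupt layout described above:
# newline-separated JSON with a block of NUL bytes mid-record.
lines = (
    ['{"id": %d}' % i for i in range(5)]
    + ["\0" * 16 + '{"id": 5']           # corrupt line: NULs, then a truncated record
    + ['{"id": %d}' % i for i in range(6, 10)]
)

def count_stop_on_bad(records):
    """Like the .gz case: a single unsplittable stream, and a reader
    that treats the first bad parse as end-of-input."""
    n = 0
    for line in records:
        try:
            json.loads(line)
        except ValueError:
            return n  # null-return semantics: loading stops here
        n += 1
    return n

def count_per_split(splits):
    """Like the uncompressed case: each split is read independently,
    so data after the corruption is still counted."""
    return sum(count_stop_on_bad(s) for s in splits)

whole = count_stop_on_bad(lines)                  # 5: stops at the NUL block
split = count_per_split([lines[:6], lines[6:]])   # 9: second split recovers
print(whole, split)
```

Running it shows the compressed-style count falling short of the split-style count, matching the COUNT(*) discrepancy in the thread.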
>
>
> On Tue, Jun 11, 2013 at 8:46 AM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
>
> > William,
> >
> > It would be really awesome if you could furnish a file that replicates
> > this issue that we can attach to a bug in jira. A long time ago I had a
> > very weird issue with some gzip files and never got to the bottom of
> > it... I'm wondering if this could be it!
> >
> >
> > 2013/6/10 Niels Basjes <[EMAIL PROTECTED]>
> >
> > > Bzip2 is only splittable in newer versions of hadoop.
> > > On Jun 10, 2013 10:28 PM, "Alan Crosswell" <[EMAIL PROTECTED]> wrote:
> > >
> > > > Ignore what I said and see
> > > > https://forums.aws.amazon.com/thread.jspa?threadID=51232
> > > >
> > > > bzip2 was documented somewhere as being splittable, but this appears
> > > > to not actually be implemented, at least in AWS S3.
> > > > /a
> > > >
> > > >
> > > > On Mon, Jun 10, 2013 at 12:41 PM, Alan Crosswell <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Suggest that if you have a choice, you use bzip2 compression instead
> > > > > of gzip, as bzip2 is block-based and Pig can split reading a large
> > > > > bzipped file across multiple mappers, while gzip can't be split that
> > > > > way.
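Alan's point about bzip2 being block-based can be illustrated with Python's `bz2` module: a .bz2 file may contain multiple independent streams, and a decompressor can detect a boundary, stop, and hand back the untouched remainder for another reader. This is only an analogy for why splittable readers can start mid-file, not Hadoop's actual BZip2Codec logic:

```python
import bz2

part1 = b"json line 1\njson line 2\n"
part2 = b"json line 3\njson line 4\n"

# Two independently compressed bzip2 streams, concatenated into one "file".
blob = bz2.compress(part1) + bz2.compress(part2)

# The decompressor stops at the first stream boundary; the leftover bytes
# in unused_data are a valid stream a second reader could pick up.
d = bz2.BZ2Decompressor()
first = d.decompress(blob)
rest = d.unused_data
print(first == part1, bz2.decompress(rest) == part2)
```

A typical .gz file, by contrast, is written as one continuous stream, so a reader has no safe point to jump to partway through — hence no splitting.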
> > > > >
> > > > >
> > > > > On Mon, Jun 10, 2013 at 12:06 PM, William Oberman <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > >> I still don't fully understand (and am still debugging), but I have
> > > > >> a "problem file" and a theory.
> > > > >>
> > > > >> The file has a "corrupt line" that is a huge block of null
> > > > >> characters followed by a "\n" (other lines are json followed by
> > > > >> "\n").  I'm thinking that's a problem with my cassandra -> s3
> > > > >> process, but that is out of scope for this thread....  I wrote
> > > > >> scripts to examine the file directly, and if I stop counting at the
> > > > >> weird line, I get the "gz" count.  If I count all lines (i.e. don't
> > > > >> fail at the corrupt line) I get the "uncompressed" count.
> > > > >>
> > > > >> I don't know how to debug hadoop/pig quite as well, though I'm
> > > > >> trying now.  But my working theory is that some combination of
> > > > >> pig/hadoop aborts processing the gz stream on a null character, but
> > > > >> keeps chugging on a non-gz stream.  Does that sound familiar?
> > > > >>
> > > > >> will
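A direct-examination script of the kind Will mentions might look like the following Python sketch (the byte layout is hypothetical): it counts newline-delimited records, locates the first NUL run, and reports the "stop at the weird line" count alongside the total. The gzip round-trip at the end confirms the NULs live in the data itself, not in the decompression:

```python
import gzip

def scan(data: bytes):
    """Return (total newline-delimited records, offset of first NUL,
    records before the NUL block)."""
    nul_at = data.find(b"\0")
    total = data.count(b"\n")
    before = data[:nul_at].count(b"\n") if nul_at != -1 else total
    return total, nul_at, before

# Hypothetical corrupt sample: two good records, a NUL block cutting
# into the middle of a record, then more records.
raw = b'{"a":1}\n{"a":2}\n' + b"\0" * 8 + b'":3}\n{"a":4}\n'
total, nul_at, before = scan(raw)
print(total, nul_at, before)  # total lines vs. lines before the NUL block

# Round-tripping through gzip preserves the NULs byte-for-byte.
assert gzip.decompress(gzip.compress(raw)) == raw
```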
> > > > >>
> > > > >>
> > > > >> On Sat, Jun 8, 2013 at 8:00 AM, William Oberman <[EMAIL PROTECTED]> wrote:
> > > > >>
> > > > >> > They are all *.gz, I confirmed that first :-)
> > > > >> >
> > > > >> >
> > > > >> > On Saturday, June 8, 2013, Niels Basjes wrote: