Pig, mail # user - problems with .gz


Re: problems with .gz
William Oberman 2013-06-12, 20:16
I know what's going on, and it's kind of dumb on my part, but I'll post
anyway to help someone else who might be puzzled.  To review, I had data
that looked like this (and yes, it's corrupt, but that happens sometimes):
"json\njson\n...json\n\0\0\0...\0\0\0\0middle_of_json\njson\n...json\n"

That is, a huge block of null characters in the middle of \n-separated
JSON.  Usually the last character before the null block was a \n, but the
first character after the null block fell in the middle of a JSON string.
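
For anyone who wants to check their own files, here's a rough sketch of
that kind of scan (illustrative only, not my actual script; the class name
is made up):

import java.io.FileInputStream;
import java.io.IOException;

// Scan a file and report runs of NUL (\0) bytes, which is how the
// corruption above shows up.  Reads one byte at a time for simplicity.
public class NulScan {
    public static void main(String[] args) throws IOException {
        long offset = 0;
        long runStart = -1;
        long runLength = 0;
        try (FileInputStream in = new FileInputStream(args[0])) {
            int b;
            while ((b = in.read()) != -1) {
                if (b == 0) {
                    if (runLength == 0) runStart = offset;
                    runLength++;
                } else if (runLength > 0) {
                    System.out.printf("NUL run of %d bytes at offset %d%n",
                            runLength, runStart);
                    runLength = 0;
                }
                offset++;
            }
            if (runLength > 0) {
                System.out.printf("NUL run of %d bytes at offset %d%n",
                        runLength, runStart);
            }
        }
    }
}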

My custom load UDF returned Pig-friendly data structures given JSON.  The
"dumb" thing was that I returned null on a bad parse, instead of throwing
an IOException.  For Pig, a loader returning null is the signal that there
is no more data to load (I should have paid closer attention to the
javadoc).
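
To make the bug concrete, here is a rough sketch of the loader contract
(class and helper names are made up, not my actual loader):

import java.io.IOException;
import org.apache.pig.LoadFunc;
import org.apache.pig.data.Tuple;

// Sketch of a line-oriented JSON loader; readNextLine() and parseJson()
// are hypothetical helpers standing in for the real reading/parsing code.
public abstract class JsonLineLoader extends LoadFunc {
    @Override
    public Tuple getNext() throws IOException {
        String line = readNextLine();
        if (line == null) {
            return null;  // correct: null tells Pig "no more input in this split"
        }
        Tuple t = parseJson(line);
        if (t == null) {
            // BUG (what I did): "return null;" here makes Pig think the
            // split is exhausted and silently drops everything after the
            // bad record.  Throwing (or skipping the record) is the fix.
            throw new IOException("unparseable record: " + line);
        }
        return t;
    }

    protected abstract String readNextLine() throws IOException;

    protected abstract Tuple parseJson(String line);
}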

Thus, why my uncompressed count > compressed count: it's the difference
between splits vs. no splits (since gz doesn't allow splitting), and
returning null only aborts the split being read, not the whole file.

In the uncompressed case, the file was divided into many splits, so the
splits before AND AFTER the nulls were read and contributed data to my
COUNT(*).

In the compressed case, the whole file was a single split, so only data
before the nulls contributed to my COUNT(*).

will
On Tue, Jun 11, 2013 at 8:46 AM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> William,
>
> It would be really awesome if you could furnish a file that replicates this
> issue that we can attach to a bug in jira. A long time ago I had a very
> weird issue with some gzip files and never got to the bottom of it...I'm
> wondering if this could be it!
>
>
> 2013/6/10 Niels Basjes <[EMAIL PROTECTED]>
>
> > Bzip2 is only splittable in newer versions of Hadoop.
> > On Jun 10, 2013 10:28 PM, "Alan Crosswell" <[EMAIL PROTECTED]> wrote:
> >
> > > Ignore what I said and see
> > > https://forums.aws.amazon.com/thread.jspa?threadID=51232
> > >
> > > bzip2 was documented somewhere as being splittable, but this appears
> > > to not actually be implemented, at least in AWS S3.
> > > /a
> > >
> > >
> > > On Mon, Jun 10, 2013 at 12:41 PM, Alan Crosswell <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > Suggest that if you have a choice, you use bzip2 compression instead
> > > > of gzip, as bzip2 is block-based and Pig can split reading a large
> > > > bzipped file across multiple mappers, while gzip can't be split that
> > > > way.
> > > >
> > > >
> > > > On Mon, Jun 10, 2013 at 12:06 PM, William Oberman <
> > > > [EMAIL PROTECTED]> wrote:
> > > >
> > > >> I still don't fully understand (and am still debugging), but I have
> > > >> a "problem file" and a theory.
> > > >>
> > > >> The file has a "corrupt line" that is a huge block of null
> > > >> characters followed by a "\n" (other lines are JSON followed by
> > > >> "\n").  I'm thinking that's a problem with my cassandra -> s3
> > > >> process, but that is out of scope for this thread....  I wrote
> > > >> scripts to examine the file directly, and if I stop counting at the
> > > >> weird line, I get the "gz" count.  If I count all lines (i.e.,
> > > >> don't fail at the corrupt line), I get the "uncompressed" count.
> > > >>
> > > >> I don't know how to debug hadoop/pig quite as well, though I'm
> > > >> trying now.  But my working theory is that some combination of
> > > >> pig/hadoop aborts processing the gz stream on a null character, but
> > > >> keeps chugging on a non-gz stream.  Does that sound familiar?
> > > >>
> > > >> will
> > > >>
> > > >>
> > > >> On Sat, Jun 8, 2013 at 8:00 AM, William Oberman
> > > >> <[EMAIL PROTECTED]> wrote:
> > > >>
> > > >> > They are all *.gz, I confirmed that first :-)
> > > >> >
> > > >> >
> > > >> > On Saturday, June 8, 2013, Niels Basjes wrote:
> > > >> >
> > > >> >> What are the exact filenames you used?
> > > >> >> The decompression of input files is based on the filename
> > > >> >> extension.
> > > >> >>
> > > >> >> Niels
> > > >> >> On Jun 7, 2013 11:11 PM, "William Oberman"
> > > >> >> <[EMAIL PROTECTED]> wrote:
> > > >> >>
> > > >> >> > I'm using pig 0.11.2.
> > > >> >> >
> > > >> >> > I had been processing ASCII files of json with schema: