Pig >> mail # user >> problems with .gz


William Oberman 2013-06-07, 21:10
Niels Basjes 2013-06-08, 05:23
William Oberman 2013-06-08, 12:00
William Oberman 2013-06-10, 16:06
Alan Crosswell 2013-06-10, 16:41
Alan Crosswell 2013-06-10, 20:27
Niels Basjes 2013-06-10, 20:38
Re: problems with .gz
William,

It would be really awesome if you could furnish a file that replicates this
issue, which we can attach to a bug in JIRA. A long time ago I had a very
weird issue with some gzip files and never got to the bottom of it... I'm
wondering if this could be it!
2013/6/10 Niels Basjes <[EMAIL PROTECTED]>

> Bzip2 is only splittable in newer versions of hadoop.
> On Jun 10, 2013 10:28 PM, "Alan Crosswell" <[EMAIL PROTECTED]> wrote:
>
> > Ignore what I said and see
> > https://forums.aws.amazon.com/thread.jspa?threadID=51232
> >
> > bzip2 was documented somewhere as being splittable, but this appears to
> > not actually be implemented, at least in AWS S3.
> > /a
> >
> >
> > On Mon, Jun 10, 2013 at 12:41 PM, Alan Crosswell <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Suggest that if you have a choice, you use bzip2 compression instead of
> > > gzip, as bzip2 is block-based and Pig can split reading a large bzipped
> > > file across multiple mappers, while gzip can't be split that way.
> > >
> > >
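A minimal Pig sketch of the trade-off above, using hypothetical S3 paths: both
LOADs work the same way because the codec is picked from the file suffix, but
only the .bz2 input can be split across several mappers, and only on Hadoop
versions whose bzip2 codec supports splitting (per the earlier note in this
thread).

-------
-- hypothetical paths; same LOAD statement either way, different split behavior
gz_data  = LOAD 's3://my-bucket/export/chunk-0001.gz'  USING TextLoader() AS (line:chararray);
bz2_data = LOAD 's3://my-bucket/export/chunk-0001.bz2' USING TextLoader() AS (line:chararray);
-- the single .gz chunk is read by one mapper; the .bz2 chunk can be split
-- across mappers where splittable bzip2 is available
-------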
> > > On Mon, Jun 10, 2013 at 12:06 PM, William Oberman <
> > > [EMAIL PROTECTED]> wrote:
> > >
> > >> I still don't fully understand (and am still debugging), but I have a
> > >> "problem file" and a theory.
> > >>
> > >> The file has a "corrupt line" that is a huge block of null characters
> > >> followed by a "\n" (other lines are json followed by "\n").  I'm thinking
> > >> that's a problem with my cassandra -> s3 process, but it is out of scope
> > >> for this thread....  I wrote scripts to examine the file directly, and if
> > >> I stop counting at the weird line, I get the "gz" count.  If I count all
> > >> lines (i.e. don't fail at the corrupt line) I get the "uncompressed" count.
> > >>
> > >> I don't know how to debug hadoop/pig quite as well, though I'm trying now.
> > >> But my working theory is that some combination of pig/hadoop aborts
> > >> processing the gz stream on a null character, but keeps chugging on a
> > >> non-gz stream.  Does that sound familiar?
> > >>
> > >> will
> > >>
> > >>
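A small debugging sketch along the lines described above, with hypothetical
paths: run the same raw line count against the suspect .gz part and against an
uncompressed copy of it. If the two totals diverge, the reader is stopping
early on the compressed stream (e.g. at the corrupt line) rather than the data
actually differing.

-------
-- hypothetical paths; count raw lines in the .gz part and in its uncompressed copy
gz_raw    = LOAD 's3://my-bucket/export/chunk-0001.gz'    USING TextLoader() AS (line:chararray);
plain_raw = LOAD 's3://my-bucket/export-plain/chunk-0001' USING TextLoader() AS (line:chararray);

gz_grp    = GROUP gz_raw ALL;
gz_cnt    = FOREACH gz_grp GENERATE COUNT_STAR(gz_raw);      -- comes up short if the reader bails early

plain_grp = GROUP plain_raw ALL;
plain_cnt = FOREACH plain_grp GENERATE COUNT_STAR(plain_raw);

DUMP gz_cnt;
DUMP plain_cnt;
-------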
> > >> On Sat, Jun 8, 2013 at 8:00 AM, William Oberman <[EMAIL PROTECTED]> wrote:
> > >>
> > >> > They are all *.gz, I confirmed that first :-)
> > >> >
> > >> >
> > >> > On Saturday, June 8, 2013, Niels Basjes wrote:
> > >> >
> > >> >> What are the exact filenames you used?
> > >> >> The decompression of input files is based on the filename extension.
> > >> >>
> > >> >> Niels
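A tiny illustration of the extension point above, with hypothetical names: the
suffix alone decides whether the input is decompressed, so the first LOAD below
is gunzipped on the fly, while a gzipped file that has lost its .gz suffix
would be read as raw compressed bytes and yield garbage lines.

-------
-- hypothetical names; decompression is chosen from the suffix, not the content
named_gz  = LOAD 's3://my-bucket/export/chunk-0001.gz'      USING TextLoader() AS (line:chararray);
no_suffix = LOAD 's3://my-bucket/export/chunk-0001-renamed' USING TextLoader() AS (line:chararray);
-------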
> > >> >> On Jun 7, 2013 11:11 PM, "William Oberman" <[EMAIL PROTECTED]> wrote:
> > >> >>
> > >> >> > I'm using pig 0.11.2.
> > >> >> >
> > >> >> > I had been processing ASCII files of json with schema:
> > >> >> > (key:chararray, columns:bag {column:tuple (timeUUID:chararray,
> > >> >> > value:chararray, timestamp:long)})
> > >> >> > For what it's worth, this is cassandra data, at a fairly low level.
> > >> >> >
> > >> >> > But, this was getting big, so I compressed it all with gzip (my "ETL"
> > >> >> > process is already chunking the data into 1GB parts, making the .gz
> > >> >> > files ~100MB).
> > >> >> >
> > >> >> > As a sanity check, I decided to do a quick check of pre/post, and the
> > >> >> > numbers aren't matching.  Then I've done a lot of messing around
> > >> >> > trying to figure out why and I'm getting more and more puzzled.
> > >> >> >
> > >> >> > My "quick check" was to get an overall count.  It looked like
> > >> >> > (assuming A is a LOAD given the schema above):
> > >> >> > -------
> > >> >> > allGrp = GROUP A ALL;
> > >> >> > aCount = FOREACH allGrp GENERATE group, COUNT(A);
> > >> >> > DUMP aCount;
> > >> >> > -------
> > >> >> >
> > >> >> > Basically the original data returned a number GREATER than the
> > >> >> > compressed data number (not by a lot, but still...).
> > >> >> >
> > >> >> > Then I uncompressed all of the compressed files, and did a size check
> > >> >> > of original vs. uncompressed.  They were the same.  Then I "quick
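One hedged aside on the counting snippet quoted above: Pig's COUNT skips
records whose first field is null, while COUNT_STAR counts every record, so a
variant like the following (same relation A as in the message) helps separate
"fewer records read from the .gz input" from "records read but with a null key".

-------
allGrp  = GROUP A ALL;
aCounts = FOREACH allGrp GENERATE COUNT(A) AS non_null_keys, COUNT_STAR(A) AS all_records;
DUMP aCounts;
-------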
William Oberman 2013-06-12, 20:16
Jonathan Coveney 2013-06-12, 21:09
William Oberman 2013-06-10, 17:12
Alan Crosswell 2013-06-10, 18:02