Pig user mailing list: problems with .gz


William Oberman 2013-06-07, 21:10
Niels Basjes 2013-06-08, 05:23
William Oberman 2013-06-08, 12:00
William Oberman 2013-06-10, 16:06
Alan Crosswell 2013-06-10, 16:41
Alan Crosswell 2013-06-10, 20:27
Re: problems with .gz
Bzip2 is only splittable in newer versions of hadoop.
On Jun 10, 2013 10:28 PM, "Alan Crosswell" <[EMAIL PROTECTED]> wrote:

> Ignore what I said and see
> https://forums.aws.amazon.com/thread.jspa?threadID=51232
>
> bzip2 was documented somewhere as being splittable but this appears to not
> actually be implemented at least in AWS S3.
> /a
>
>
> On Mon, Jun 10, 2013 at 12:41 PM, Alan Crosswell <[EMAIL PROTECTED]>
> wrote:
>
> > Suggest that if you have a choice, you use bzip2 compression instead of
> > gzip as bzip2 is block-based and Pig can split reading a large bzipped
> > file across multiple mappers while gzip can't be split that way.
> >
> >
> > On Mon, Jun 10, 2013 at 12:06 PM, William Oberman
> > <[EMAIL PROTECTED]> wrote:
> >
> >> I still don't fully understand (and am still debugging), but I have a
> >> "problem file" and a theory.
> >>
> >> The file has a "corrupt line" that is a huge block of null characters
> >> followed by a "\n" (other lines are json followed by "\n").  I'm thinking
> >> that's a problem with my cassandra -> s3 process, but is out of scope for
> >> this thread....  I wrote scripts to examine the file directly, and if I
> >> stop counting at the weird line, I get the "gz" count.  If I count all
> >> lines (e.g. don't fail at the corrupt line) I get the "uncompressed"
> >> count.
> >>
> >> I don't know how to debug hadoop/pig quite as well, though I'm trying now.
> >> But, my working theory is that some combination of pig/hadoop aborts
> >> processing the gz stream on a null character, but keeps chugging on a
> >> non-gz stream.  Does that sound familiar?
> >>
> >> will
> >>
> >>
> >> On Sat, Jun 8, 2013 at 8:00 AM, William Oberman
> >> <[EMAIL PROTECTED]> wrote:
> >>
> >> > They are all *.gz, I confirmed that first :-)
> >> >
> >> >
> >> > On Saturday, June 8, 2013, Niels Basjes wrote:
> >> >
> >> >> What are the exact filenames you used?
> >> >> The decompression of input files is based on the filename extension.
> >> >>
> >> >> Niels
> >> >> On Jun 7, 2013 11:11 PM, "William Oberman" <[EMAIL PROTECTED]>
> >> >> wrote:
> >> >>
> >> >> > I'm using pig 0.11.2.
> >> >> >
> >> >> > I had been processing ASCII files of json with schema: (key:chararray,
> >> >> > columns:bag {column:tuple (timeUUID:chararray, value:chararray,
> >> >> > timestamp:long)})
> >> >> > For what it's worth, this is cassandra data, at a fairly low level.
> >> >> >
> >> >> > But, this was getting big, so I compressed it all with gzip (my "ETL"
> >> >> > process is already chunking the data into 1GB parts, making the .gz
> >> >> > files ~100MB).
> >> >> >
> >> >> > As a sanity check, I decided to do a quick check of pre/post, and the
> >> >> > numbers aren't matching.  Then I've done a lot of messing around
> >> >> > trying to figure out why and I'm getting more and more puzzled.
> >> >> >
> >> >> > My "quick check" was to get an overall count.  It looked like
> >> >> > (assuming A is a LOAD given the schema above):
> >> >> > -------
> >> >> > allGrp = GROUP A ALL;
> >> >> > aCount = FOREACH allGrp GENERATE group, COUNT(A);
> >> >> > DUMP aCount;
> >> >> > -------
> >> >> >
> >> >> > Basically the original data returned a number GREATER than the
> >> >> > compressed data number (not by a lot, but still...).
> >> >> >
> >> >> > Then I uncompressed all of the compressed files, and did a size check
> >> >> > of original vs. uncompressed.  They were the same.  Then I "quick
> >> >> > checked" the uncompressed, and the count of that was == original!  So,
> >> >> > the way in which pig processes the gzip'ed data is actually somehow
> >> >> > different.
> >> >> >
> >> >> > Then I tried to see if there are nulls floating around, so I loaded
> >> >> > "orig" and "comp" and tried to catch the "missing keys" with outer
> >> >> > joins:
> >> >> > -----------
> >> >> > joined = JOIN orig by key LEFT OUTER, comp BY key;
> >> >> > filtered = FILTER joined BY (comp::key is null);
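
The outer-join snippet at the end of that first message is cut off by the archive view. A sketch of how the comparison might be finished, keeping William's relation names and adding a hypothetical projection and DUMP to list keys that only appear when the uncompressed copy is read:

-------
joined   = JOIN orig BY key LEFT OUTER, comp BY key;
filtered = FILTER joined BY (comp::key IS NULL);
-- Keys present in the original load but missing from the compressed load.
missing  = FOREACH filtered GENERATE orig::key;
sample   = LIMIT missing 100;
DUMP sample;
-------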
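
William's working theory above is that the gzip-decompressing input path stops at the run of null characters while the plain-text path keeps going. One way to narrow that down from inside Pig is to count raw lines of the same chunk in both forms, bypassing the JSON schema entirely. A rough sketch, with hypothetical paths for one compressed/uncompressed pair:

-------
-- Hypothetical paths: the same chunk, gzipped and uncompressed.
gz_lines    = LOAD 's3://my-bucket/chunks/part-0001.gz' USING TextLoader() AS (line:chararray);
plain_lines = LOAD 's3://my-bucket/chunks/part-0001' USING TextLoader() AS (line:chararray);

-- COUNT_STAR counts every tuple, even ones whose line is null or empty.
gz_count    = FOREACH (GROUP gz_lines ALL) GENERATE COUNT_STAR(gz_lines);
plain_count = FOREACH (GROUP plain_lines ALL) GENERATE COUNT_STAR(plain_lines);

DUMP gz_count;
DUMP plain_count;
-- If gz_count matches the lower "gz" number while plain_count matches wc -l
-- on the uncompressed file, the truncation is in the compressed-input path.
-------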
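
Alan's earlier suggestion amounts to recompressing the chunks as bzip2 so that each file can be split across several mappers; gzip can only be read by one mapper per file, and, as noted at the top of this message, splittable bzip2 also depends on the Hadoop version in use. A minimal Pig sketch, assuming hypothetical S3 paths and the schema from the original post (the codec is chosen from the file extension, so the two LOAD statements differ only in the path):

-------
-- Hypothetical locations; one relation per compression format.
gz_data = LOAD 's3://my-bucket/chunks/*.gz' AS (key:chararray,
    columns:bag {column:tuple (timeUUID:chararray, value:chararray, timestamp:long)});
-- Each .gz file is read by a single mapper: gzip streams are not splittable.

bz2_data = LOAD 's3://my-bucket/chunks/*.bz2' AS (key:chararray,
    columns:bag {column:tuple (timeUUID:chararray, value:chararray, timestamp:long)});
-- The same data as .bz2 can be split into multiple map tasks, provided the
-- cluster's Hadoop version supports splittable bzip2 input.
-------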
Jonathan Coveney 2013-06-11, 12:46
William Oberman 2013-06-12, 20:16
Jonathan Coveney 2013-06-12, 21:09
William Oberman 2013-06-10, 17:12
Alan Crosswell 2013-06-10, 18:02