Re: problems with .gz
I'm using gzip as I had a huge S3 bucket of uncompressed files, and
s3distcp only supported {gz, lzo, snappy}.

I haven't ever done this, but can I mix/match files?  My backup processes
add files to these buckets, so I could upload new files as *.bz.  But then
I'd have some files as *.gz, and others as *.bz.
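
(For what it's worth, a minimal Pig Latin sketch of loading such a mixed bucket; the path, loader, and filenames below are assumptions, and it relies on Hadoop choosing the decompression codec per file from its extension:)

-------
-- Hypothetical: one LOAD over a glob that matches both *.gz and *.bz2
-- parts; each file should be decompressed according to its own extension.
mixed = LOAD 's3://my-bucket/backups/part-*' USING TextLoader() AS (line:chararray);
-------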

will
On Mon, Jun 10, 2013 at 12:41 PM, Alan Crosswell <[EMAIL PROTECTED]> wrote:

> Suggest that if you have a choice, you use bzip2 compression instead of
> gzip as bzip2 is block-based and Pig can split reading a large bzipped file
> across multiple mappers while gzip can't be split that way.
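
A rough Pig Latin illustration of the difference (the paths and loader are invented; the point is only which input can be split):

-------
-- A .bz2 input can be split into blocks, so several mappers can read the
-- same file in parallel; a .gz input is one undivided stream, so it is
-- read by a single mapper no matter how large it is.
big_bz2 = LOAD 's3://my-bucket/dump/part-00000.bz2' USING TextLoader() AS (line:chararray);
big_gz  = LOAD 's3://my-bucket/dump/part-00000.gz'  USING TextLoader() AS (line:chararray);
-------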
>
>
> > On Mon, Jun 10, 2013 at 12:06 PM, William Oberman <[EMAIL PROTECTED]> wrote:
>
> > I still don't fully understand (and am still debugging), but I have a
> > "problem file" and a theory.
> >
> > The file has a "corrupt line" that is a huge block of null characters
> > followed by a "\n" (other lines are json followed by "\n").  I'm thinking
> > that's a problem with my cassandra -> s3 process, but is out of scope for
> > this thread...  I wrote scripts to examine the file directly, and if I
> > stop counting at the weird line, I get the "gz" count.  If I count all
> > lines (e.g. don't fail at the corrupt line) I get the "uncompressed"
> > count.
> >
> > I don't know how to debug hadoop/pig quite as well, though I'm trying
> > now.  But, my working theory is that some combination of pig/hadoop
> > aborts processing the gz stream on a null character, but keeps chugging
> > on a non-gz stream.  Does that sound familiar?
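
A hedged Pig Latin sketch of one way to drop such a record, assuming it is a line made up entirely of NUL characters as described above, and assuming the record reaches Pig at all (path and alias names are made up):

-------
-- Load raw lines and filter out any record consisting only of null
-- characters before parsing it further.
raw     = LOAD 's3://my-bucket/data/part-*' USING TextLoader() AS (line:chararray);
cleaned = FILTER raw BY NOT (line MATCHES '\\u0000+');
-------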
> >
> > will
> >
> >
> > On Sat, Jun 8, 2013 at 8:00 AM, William Oberman <[EMAIL PROTECTED]> wrote:
> >
> > > They are all *.gz, I confirmed that first :-)
> > >
> > >
> > > On Saturday, June 8, 2013, Niels Basjes wrote:
> > >
> > >> What are the exact filenames you used?
> >> The decompression of input files is based on the filename extension.
> > >>
> > >> Niels
> > >> On Jun 7, 2013 11:11 PM, "William Oberman" <[EMAIL PROTECTED]>
> > >> wrote:
> > >>
> > >> > I'm using pig 0.11.2.
> > >> >
> >> > I had been processing ASCII files of json with schema:
> >> > (key:chararray, columns:bag {column:tuple (timeUUID:chararray,
> >> > value:chararray, timestamp:long)})
> >> > For what it's worth, this is cassandra data, at a fairly low level.
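
For context, a sketch of what the LOAD behind alias A might look like with that schema; the path and the default loader are assumptions, since the thread doesn't show how the json files are actually read:

-------
-- Assumed LOAD for alias A, using the schema quoted above.
A = LOAD 's3://my-bucket/export/part-*'
    AS (key:chararray,
        columns:bag {column:tuple (timeUUID:chararray, value:chararray, timestamp:long)});
-------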
> > >> >
> >> > But, this was getting big, so I compressed it all with gzip (my "ETL"
> >> > process is already chunking the data into 1GB parts, making the .gz
> >> > files ~100MB).
> > >> >
> >> > As a sanity check, I decided to do a quick check of pre/post, and the
> >> > numbers aren't matching.  Then I've done a lot of messing around
> >> > trying to figure out why and I'm getting more and more puzzled.
> > >> >
> > >> > My "quick check" was to get an overall count.  It looked like
> > (assuming
> > >> A
> > >> > is a LOAD given the schema above):
> > >> > -------
> > >> > allGrp = GROUP A ALL;
> > >> > aCount = FOREACH allGrp GENERATE group, COUNT(A);
> > >> > DUMP aCount;
> > >> > -------
> > >> >
> >> > Basically the original data returned a number GREATER than the
> >> > compressed data number (not by a lot, but still...).
> > >> >
> >> > Then I uncompressed all of the compressed files, and did a size check
> >> > of original vs. uncompressed.  They were the same.  Then I "quick
> >> > checked" the uncompressed, and the count of that was == original!  So,
> >> > the way in which pig processes the gzip'ed data is actually somehow
> >> > different.
> > >> >
> >> > Then I tried to see if there are nulls floating around, so I loaded
> >> > "orig" and "comp" and tried to catch the "missing keys" with outer
> >> > joins:
> > >> > -----------
> > >> > joined = JOIN orig by key LEFT OUTER, comp BY key;
> > >> > filtered = FILTER joined BY (comp::key is null);
> > >> > -----------
> >> > And filtered was empty!  I then tried the reverse (which makes no sense