Re: problems with .gz
They are all *.gz, I confirmed that first :-)
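
(For what it's worth, nothing changed on the load side other than the paths:
the script points at the .gz files directly and relies on Hadoop picking the
codec from the suffix.  Roughly, with an illustrative path and loader name:)

-------
-- path and loader name below are placeholders, not the real ones
A = LOAD '/etl/chunks/part-*.gz' USING my.udfs.JsonColumnLoader();
-------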

On Saturday, June 8, 2013, Niels Basjes wrote:

> What are the exact filenames you used?
> The decompression of input files is based on the filename extension.
>
> Niels
> On Jun 7, 2013 11:11 PM, "William Oberman" <[EMAIL PROTECTED]> wrote:
>
> > I'm using pig 0.11.2.
> >
> > I had been processing ASCII files of json with schema: (key:chararray,
> > columns:bag {column:tuple (timeUUID:chararray, value:chararray,
> > timestamp:long)})
> > For what it's worth, this is cassandra data, at a fairly low level.
> >
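> > For concreteness, the load is along these lines (the path and loader
> > name here are placeholders):
> > -------
> > A = LOAD '/etl/chunks/part-*' USING my.udfs.JsonColumnLoader()
> >     AS (key:chararray,
> >         columns:bag {column:tuple (timeUUID:chararray, value:chararray,
> >             timestamp:long)});
> > -------
> >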
> > But, this was getting big, so I compressed it all with gzip (my "ETL"
> > process is already chunking the data into 1GB parts, making the .gz files
> > ~100MB).
> >
> > As a sanity check, I decided to do a quick check of pre/post, and the
> > numbers aren't matching.  Then I've done a lot of messing around trying
> > to figure out why and I'm getting more and more puzzled.
> >
> > My "quick check" was to get an overall count.  It looked like (assuming A
> > is a LOAD given the schema above):
> > -------
> > allGrp = GROUP A ALL;
> > aCount = FOREACH allGrp GENERATE group, COUNT(A);
> > DUMP aCount;
> > -------
> >
> > Basically the original data returned a number GREATER than the compressed
> > data number (not by a lot, but still...).
> >
> > Then I uncompressed all of the compressed files, and did a size check of
> > original vs. uncompressed.  They were the same.  Then I "quick checked"
> > the uncompressed, and the count of that was == original!  So, the way in
> > which pig processes the gzip'ed data is actually somehow different.
> >
> > Then I tried to see if there are nulls floating around, so I loaded
> > "orig" and "comp" and tried to catch the "missing keys" with outer joins:
> > -----------
> > joined = JOIN orig BY key LEFT OUTER, comp BY key;
> > filtered = FILTER joined BY (comp::key is null);
> > -----------
> > And filtered was empty!  I then tried the reverse (which makes no sense I
> > know, as this was the smaller set), and filtered is still empty!
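> > (The reverse being roughly the following sketch, not the exact
> > statements:)
> > -----------
> > joined2 = JOIN comp BY key LEFT OUTER, orig BY key;
> > filtered2 = FILTER joined2 BY (orig::key is null);
> > -----------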
> >
> > All of these loads are through a custom UDF that extends LoadFunc.  But,
> > there isn't much to that UDF (and it's been in use for many months now).
> > Basically, the "raw" data is JSON (from cassandra's sstable2json program).
> > And I parse the json and turn it into the pig structure of the schema
> > noted above.
> >
> > Does anything make sense here?
> >
> > Thanks!
> >
> > will
> >
>
--
Will Oberman
Civic Science, Inc.
6101 Penn Avenue, Fifth Floor
Pittsburgh, PA 15206
(M) 412-480-7835
(E) [EMAIL PROTECTED]