Re: problems with .gz
What are the exact filenames you used?
The decompression of input files is based on the filename extension.
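
To show why the names matter, here is a rough sketch in Python (not Pig or
Hadoop code). It assumes a reader that picks a decompressor purely from the
file extension; the helper names and the part-0001 file names are made up for
the illustration. The same gzip bytes give the right row count under a .gz
name and a meaningless one under a name without it.

```python
import gzip
import os
import tempfile


def open_by_extension(path):
    """Open a file as bytes, decompressing only if the name ends in .gz.

    Hypothetical helper: it stands in for any reader that chooses a
    codec from the filename alone and never inspects the content.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")


def count_lines(path):
    """Count newline-terminated records as the reader would see them."""
    with open_by_extension(path) as handle:
        return sum(1 for _ in handle)


def demo():
    """Write identical gzip bytes under two names and count rows in each."""
    rows = "".join("row-%d\n" % i for i in range(1000))
    payload = gzip.compress(rows.encode("utf-8"))
    with tempfile.TemporaryDirectory() as tmp:
        with_ext = os.path.join(tmp, "part-0001.json.gz")
        without_ext = os.path.join(tmp, "part-0001.json")
        for name in (with_ext, without_ext):
            with open(name, "wb") as out:
                out.write(payload)
        return count_lines(with_ext), count_lines(without_ext)


if __name__ == "__main__":
    good, bad = demo()
    print("with .gz extension:", good)
    print("without .gz extension:", bad)
```

If your compressed parts do not end in .gz, the loader may be reading the raw
compressed bytes, which is why the exact names are worth checking first.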

Niels
On Jun 7, 2013 11:11 PM, "William Oberman" <[EMAIL PROTECTED]> wrote:

> I'm using pig 0.11.2.
>
> I had been processing ASCII files of JSON with schema: (key:chararray,
> columns:bag {column:tuple (timeUUID:chararray, value:chararray,
> timestamp:long)})
> For what it's worth, this is Cassandra data, at a fairly low level.
>
> But, this was getting big, so I compressed it all with gzip (my "ETL"
> process is already chunking the data into 1GB parts, making the .gz files
> ~100MB).
>
> As a sanity check, I decided to do a quick check of pre/post, and the
> numbers aren't matching.  Then I've done a lot of messing around trying to
> figure out why and I'm getting more and more puzzled.
>
> My "quick check" was to get an overall count.  It looked like (assuming A
> is a LOAD given the schema above):
> -------
> allGrp = GROUP A ALL;
> aCount = FOREACH allGrp GENERATE group, COUNT(A);
> DUMP aCount;
> -------
>
> Basically the original data returned a number GREATER than the compressed
> data number (not by a lot, but still...).
>
> Then I uncompressed all of the compressed files, and did a size check of
> original vs. uncompressed.  They were the same.  Then I "quick checked" the
> uncompressed, and the count of that was == original!  So, the way in which
> pig processes the gzip'ed data is actually somehow different.
>
> Then I tried to see if there are nulls floating around, so I loaded "orig"
> and "comp" and tried to catch the "missing keys" with outer joins:
> -----------
> joined = JOIN orig BY key LEFT OUTER, comp BY key;
> filtered = FILTER joined BY (comp::key is null);
> -----------
> And filtered was empty!  I then tried the reverse (which makes no sense I
> know, as this was the smaller set), and filtered is still empty!
>
> All of these loads are through a custom UDF that extends LoadFunc.  But,
> there isn't much to that UDF (and it's been in use for many months now).
> Basically, the "raw" data is JSON (from Cassandra's sstable2json program),
> and I parse the JSON and turn it into the Pig structure of the schema
> noted above.
>
> Does anything make sense here?
>
> Thanks!
>
> will
>