Pig, mail # user - Reading Gzip Files


Re: Reading Gzip Files
Jacob Perkins 2011-02-22, 14:00
Here's what I just tried:

I gzipped a file:

'cat foo.tsv | gzip > foo.tsv.gz'

Uploaded it to my HDFS (hdfs://master:8020):

'hadoop fs -put foo.tsv.gz /tmp'

Then loaded it and dumped it with pig:

grunt> data = LOAD 'hdfs://master/tmp/foo.tsv.gz';
grunt> DUMP data;
(98384,559)
(98385,587)
(98386,573)
(98387,587)
(98388,589)
(98389,584)
(98390,572)
(98391,567)

Looks great. I'm going to blame it on your version; I'm using Pig 0.8
and Hadoop 0.20.2.

--jacob
@thedatachef
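
For reference: the transparent decompression in the test above comes from
Hadoop's codec lookup by file extension, which is what the text input format
behind PigStorage relies on. Below is a minimal sketch of that lookup; the
class name is illustrative, and it assumes a standard Hadoop client classpath
with core-site.xml (and its io.compression.codecs list) available.

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class GzipReadCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();         // picks up core-site.xml, including io.compression.codecs
        Path path = new Path(args[0]);                     // e.g. hdfs://master/tmp/foo.tsv.gz
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(path);   // null if no codec matches the extension

        System.out.println("codec: " + (codec == null ? "none" : codec.getClass().getName()));

        FileSystem fs = path.getFileSystem(conf);
        InputStream in = (codec == null) ? fs.open(path) : codec.createInputStream(fs.open(path));
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        System.out.println("first line: " + reader.readLine());
        reader.close();
    }
}

Run against something like hdfs://master/tmp/foo.tsv.gz, it should report
GzipCodec and print the first decompressed line; if it reports "none", the
file extension or the codec configuration is the first place to look.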

 
On Tue, 2011-02-22 at 08:21 -0500, Eric Lubow wrote:
> I apologize for the double mailing:
>
> grunt> Y = LOAD 'hdfs:///mnt/test.log.gz' AS (line:chararray);
> grunt> foo = LIMIT Y 5;
> grunt> dump foo
> <0\Mtest.log?]?o?H??}?)
>
> It didn't work out of HDFS.
>
> -e
>
> On Tue, Feb 22, 2011 at 08:18, Eric Lubow <[EMAIL PROTECTED]> wrote:
>
> > I'm not sure what you mean by testing it directly out of a normal HDFS. I
> > have added it to HDFS with 'hadoop fs -copyFromLocal', but then I can't
> > access it via Pig using file:///.  Am I doing something wrong, or are you
> > asking me to try something else?
> >
> > -e
> >
> >
> > On Mon, Feb 21, 2011 at 21:36, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> >
> >> He's on 0.6, so the interface is different. And for him even PigStorage
> >> doesn't decompress...
> >>
> >> It occurs to me the problem may be with the underlying fs. Eric, what happens
> >> when you try reading out of a normal HDFS (you can just run a
> >> pseudo-distributed cluster locally to test)?
> >>
> >> D
> >>
> >>
> >> On Mon, Feb 21, 2011 at 6:28 PM, Charles Gonçalves <[EMAIL PROTECTED]> wrote:
> >>
> >>> I'm not sure if it is the same problem.
> >>>
> >>> I wrote a custom loader and had a problem reading compressed files too.
> >>> I noticed that in PigStorage the getInputFormat function was:
> >>>
> >>> public InputFormat getInputFormat() throws IOException {
> >>>     if (loadLocation.endsWith(".bz2") || loadLocation.endsWith(".bz")) {
> >>>         return new Bzip2TextInputFormat();
> >>>     } else {
> >>>         return new PigTextInputFormat();
> >>>     }
> >>> }
> >>>
> >>> And in my custom loader it was:
> >>>
> >>> public InputFormat getInputFormat() {
> >>>     return new TextInputFormat();
> >>> }
> >>>
> >>>
> >>> I just copied the code from PigStorage and everything worked.
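
For anyone hitting the same thing with a custom loader on the newer (Pig
0.7/0.8) LoadFunc API, here is a minimal sketch of the fix described above:
remember the location handed to setLocation and branch on its extension the
way PigStorage does. The class name is illustrative, the import paths are as
they appear in Pig 0.8 and may differ in other versions, and .gz needs no
special case because PigTextInputFormat defers to Hadoop's compression codecs.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat;
import org.apache.pig.bzip2r.Bzip2TextInputFormat;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MyLineLoader extends LoadFunc {
    private final TupleFactory tupleFactory = TupleFactory.getInstance();
    private RecordReader reader;
    private String loadLocation;

    @Override
    public void setLocation(String location, Job job) throws IOException {
        loadLocation = location;                       // remembered for getInputFormat
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        // Same branching as PigStorage: bzip2 needs Pig's own input format;
        // everything else (including .gz) goes through PigTextInputFormat,
        // which lets Hadoop's codec factory decompress by extension.
        if (loadLocation.endsWith(".bz2") || loadLocation.endsWith(".bz")) {
            return new Bzip2TextInputFormat();
        }
        return new PigTextInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;                           // end of input
            }
            Text line = (Text) reader.getCurrentValue();
            return tupleFactory.newTuple(line.toString());  // one chararray field per line
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

Loaded with something like data = LOAD '/tmp/foo.tsv.gz' USING MyLineLoader();
it should yield one single-field tuple per decompressed line.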
> >>>
> >>>
> >>>
> >>> On Mon, Feb 21, 2011 at 8:46 PM, Eric Lubow <[EMAIL PROTECTED]> wrote:
> >>>
> >>> > I have been working my way through Pig recently with a lot of help from
> >>> > the folks in #hadoop-pig on Freenode.
> >>> >
> >>> > The problem I am having is with reading any gzip'd files from anywhere
> >>> > (either locally or from s3).  This is the case with pig in local mode.
> >>> > I am using Pig 0.6 on an Amazon EMR (Elastic Map Reduce) instance.  I
> >>> > have checked my core-site.xml and I have the following line for
> >>> > compression codecs:
> >>> >
> >>> > <property>
> >>> >   <name>io.compression.codecs</name>
> >>> >   <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
> >>> > </property>
> >>> >
> >>> > Gzip is listed there so I don't know why it won't decode properly.  I am
> >>> > trying to do the following as a test:
> >>> >
> >>> > --
> >>> > Y = LOAD 's3://$bucket/$path/log.*.gz' AS (line:chararray);
> >>> > foo = LIMIT Y 5;
> >>> > dump foo
> >>> > (?ks?F?6?)
> >>> >
> >>> > Y = LOAD 'file:///home/hadoop/logs/test.log.gz' AS (line:chararray);
> >>> > foo = LIMIT Y 5;
> >>> > dump foo
> >>> > (?ks?F?6?)
> >>> > --
> >>> >
> >>> > Both yield the same results.  What I am actually trying to parse is
> >>> > compressed JSON.  And up to this point Dmitriy has helped me, and the
> >>> > JSON loads and the scripts run perfectly as long as the logs are not
> >>> > compressed.  Since the logs are compressed, my hands are tied.  Any
> >>> > suggestions to