Pig >> mail # user >> Reading Gzip Files


Re: Reading Gzip Files
Here's what I just tried:

I gzipped a file:

'cat foo.tsv | gzip > foo.tsv.gz'

Uploaded it to my HDFS (hdfs://master:8020):

'hadoop fs -put foo.tsv.gz /tmp'

Then loaded it and dumped it with pig:

grunt> data = LOAD 'hdfs://master/tmp/foo.tsv.gz';
grunt> DUMP data;
(98384,559)
(98385,587)
(98386,573)
(98387,587)
(98388,589)
(98389,584)
(98390,572)
(98391,567)

Looks great, so I'm going to blame it on your version. I'm using pig-0.8
and hadoop 0.20.2.

--jacob
@thedatachef

 
On Tue, 2011-02-22 at 08:21 -0500, Eric Lubow wrote:
> I apologize for the double mailing:
>
> grunt> Y = LOAD 'hdfs:///mnt/test.log.gz' AS (line:chararray);
> grunt> foo = LIMIT Y 5;
> grunt> dump foo
> <0\Mtest.log?]?o?H??}?)
>
> It didn't work out of HDFS.
>
> -e
>
> On Tue, Feb 22, 2011 at 08:18, Eric Lubow <[EMAIL PROTECTED]> wrote:
>
> > I'm not sure what you mean by testing it directly out of a normal HDFS. I
> > have added it to HDFS with 'hadoop fs -copyFromLocal', but then I can't
> > access it via Pig using file:///.  Am I doing something wrong or are you
> > asking me to try something else?
> >
> > -e
> >
> >
> > On Mon, Feb 21, 2011 at 21:36, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> >
> >> He's on 0.6, so the interface is different. And for him even PigStorage
> >> doesn't decompress...
> >>
> >> It occurs to me the problem may be with the underlying fs. Eric, what happens
> >> when you try reading out of a normal HDFS (you can just run a
> >> pseudo-distributed cluster locally to test)?
> >>
> >> D
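
A minimal sketch of the pseudo-distributed check being suggested here, assuming a stock Hadoop 0.20 install already configured for pseudo-distributed operation (the hostname, port, and paths are illustrative, not from the thread):

# start the single-machine pseudo-distributed cluster
bin/start-all.sh

# copy the gzipped log into HDFS (hostname/port and paths are illustrative)
hadoop fs -copyFromLocal test.log.gz /tmp/test.log.gz

# run the same LOAD against real HDFS instead of file:///
pig -e "Y = LOAD 'hdfs://localhost:8020/tmp/test.log.gz' AS (line:chararray); foo = LIMIT Y 5; DUMP foo;"

If PigStorage decompresses the file here, that points at the EMR/S3 filesystem layer rather than at Pig's gzip handling.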
> >>
> >>
> >> On Mon, Feb 21, 2011 at 6:28 PM, Charles Gonçalves <[EMAIL PROTECTED]>wrote:
> >>
> >>> I'm not sure if it is the same problem.
> >>>
> >>> I did a custom loader and I got a problem reading compressed files too.
> >>> So I noticed that in PigStorage the function getInputFormat was:
> >>>
> >>> public InputFormat getInputFormat() throws IOException {
> >>>     if (loadLocation.endsWith(".bz2") || loadLocation.endsWith(".bz")) {
> >>>         return new Bzip2TextInputFormat();
> >>>     } else {
> >>>         return new PigTextInputFormat();
> >>>     }
> >>> }
> >>>
> >>> And in my custom loader was :
> >>>
> >>> public InputFormat getInputFormat() {
> >>>     return new TextInputFormat();
> >>> }
> >>>
> >>>
> >>> I just copied the code from PigStorage and everything worked.
> >>>
> >>>
> >>>
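
For anyone hitting the same thing, a minimal sketch of a custom loader with that check copied in, assuming the Pig 0.7+/0.8 LoadFunc API; the class name GzipFriendlyLoader and the one-field line schema are illustrative, not from the thread:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat;
import org.apache.pig.bzip2r.Bzip2TextInputFormat;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Illustrative class name, not from the thread; the key point is copying
// PigStorage's extension check instead of returning a bare TextInputFormat.
public class GzipFriendlyLoader extends LoadFunc {

    private String loadLocation;
    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        loadLocation = location;  // remembered so getInputFormat() can inspect it
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        // bzip2 needs Pig's own input format; everything else (including .gz)
        // goes through PigTextInputFormat, which inherits codec-based
        // decompression from Hadoop's TextInputFormat.
        if (loadLocation.endsWith(".bz2") || loadLocation.endsWith(".bz")) {
            return new Bzip2TextInputFormat();
        } else {
            return new PigTextInputFormat();
        }
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;  // end of this split
            }
            Text line = (Text) reader.getCurrentValue();
            return tupleFactory.newTuple(line.toString());
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

The branch matters because Pig handles bzip2 with its own Bzip2TextInputFormat, while .gz files rely on Hadoop's configured compression codecs, which only kick in on the PigTextInputFormat/TextInputFormat path.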
> >>> On Mon, Feb 21, 2011 at 8:46 PM, Eric Lubow <[EMAIL PROTECTED]>
> >>> wrote:
> >>>
> >>> > I have been working my way through Pig recently with a lot of help
> >>> > from the folks in #hadoop-pig on Freenode.
> >>> >
> >>> > The problem I am having is with reading any gzip'd files from anywhere
> >>> > (either locally or from s3). This is the case with pig in local mode.
> >>> > I am using Pig 0.6 on an Amazon EMR (Elastic Map Reduce) instance. I
> >>> > have checked my core-site.xml and I have the following line for
> >>> > compression codecs:
> >>> >
> >>> > <property>
> >>> >   <name>io.compression.codecs</name>
> >>> >   <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
> >>> > </property>
> >>> >
> >>> > Gzip is listed there, so I don't know why it won't decode properly.
> >>> > I am trying to do the following as a test:
> >>> >
> >>> > --
> >>> > Y = LOAD 's3://$bucket/$path/log.*.gz' AS (line:chararray);
> >>> > foo = LIMIT Y 5;
> >>> > dump foo
> >>> > (?ks?F?6?)
> >>> >
> >>> > Y = LOAD 'file:///home/hadoop/logs/test.log.gz' AS (line:chararray);
> >>> > foo = LIMIT Y 5;
> >>> > dump foo
> >>> > (?ks?F?6?)
> >>> > --
> >>> >
> >>> > Both yield the same results. What I am actually trying to parse is
> >>> > compressed JSON. And up to this point Dmitriy has helped me, and the
> >>> > JSON loads and the scripts run perfectly as long as the logs are not
> >>> > compressed. Since the logs are compressed, my hands are tied. Any
> >>> > suggestions to