Re: Compressing output using block compression
Most companies handling big data use LZO, and a few have started
exploring/using Snappy as well (which is not any easier to configure). These
are the two fast compression options used where splittability matters: LZO is
splittable once indexed, and Snappy is typically made splittable by using it
inside a container format such as SequenceFiles. Note that Snappy is not
space-efficient compared to gzip and other compression algorithms, but it is a
lot faster, which makes it ideal for compressing intermediate data between Map
and Reduce.
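A minimal sketch of turning Snappy on for intermediate map output from a Pig
script, assuming the Snappy native libraries are installed on the cluster
(property names are the Hadoop 1.x ones):

  -- compress intermediate map output with Snappy
  SET mapred.compress.map.output true;
  SET mapred.map.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;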

Is there any repeated/heavy computation involved on the outputs other than
pushing this data to a database? If not, maybe it's fine to use gzip, but you
have to make sure the individual files are close to the HDFS block size, or
you will have a lot of unnecessary IO transfers taking place. If you read the
outputs to perform further MapReduce computation, gzip is not the best choice,
since a gzipped file is not splittable and must be consumed by a single mapper.
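If you do stick with gzip, a minimal sketch of compressing PigStorage output
using Pig's standard output-compression properties (the alias and output path
here are hypothetical):

  -- gzip the final job output written by PigStorage
  SET output.compression.enabled true;
  SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
  STORE results INTO '/data/out' USING PigStorage();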

-Prashant

On Tue, Apr 3, 2012 at 12:18 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:

> Thanks for your input.
>
> It looks like it's some work to configure LZO. What are the other
> alternatives? We read new sequence files and generate output continuously.
> What are my options? Should I split the output into small pieces and gzip
> them? How do people solve similar problems where there is a continuous flow
> of data that generates tons of output continuously?
>
> After output is generated we again read them and load it in OLAP db or do
> some other analysis.
>
> On Tue, Apr 3, 2012 at 11:48 AM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
>
> > Yes, it is splittable.
> >
> > Bzip2 consumes a lot of CPU in decompression. With Hadoop jobs generally
> > being IO-bound, bzip2 can sometimes become the performance bottleneck due
> > to this slow decompression rate (the algorithm cannot keep up with the
> > disk read rate).
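A minimal sketch of that out-of-the-box route, assuming PigStorage picks
bzip2 when the store location ends in .bz2 (alias and path hypothetical):

  -- an output location ending in .bz2 makes PigStorage write bzip2 part files
  STORE results INTO '/data/out.bz2' USING PigStorage();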
> >
> >
> > On Tue, Apr 3, 2012 at 11:39 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> >
> > > Is bzip2 not advisable? I think it can split too and is supported out
> > > of the box.
> > >
> > > On Thu, Mar 29, 2012 at 8:08 PM, 帝归 <[EMAIL PROTECTED]> wrote:
> > >
> > > > When I use LzoPigStorage, it will load all files under a directory.
> > > > But I want to compress every file under a directory and keep the file
> > > > names unchanged, just with a .lzo extension. How can I do this? Maybe
> > > > I must write a mapreduce job?
> > > >
> > > > 2012/3/30 Jonathan Coveney <[EMAIL PROTECTED]>
> > > >
> > > > > check out:
> > > > >
> > > > > https://github.com/kevinweil/elephant-bird/tree/master/src/java/com/twitter/elephantbird/pig/store
> > > > >
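Going by that link, a minimal sketch of storing LZO output with
elephant-bird's LzoPigStorage, assuming that class name from the linked
package (the jar path and alias are hypothetical, and the hadoop-lzo native
libraries must be installed):

  -- store LZO-compressed output via elephant-bird
  REGISTER /path/to/elephant-bird.jar;
  STORE results INTO '/data/out-lzo' USING com.twitter.elephantbird.pig.store.LzoPigStorage();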
> > > > > 2012/3/29 Mohit Anchlia <[EMAIL PROTECTED]>
> > > > >
> > > > > > Thanks! When I store output, how can I tell Pig to compress it
> > > > > > in LZO format?
> > > > > >
> > > > > > On Thu, Mar 29, 2012 at 4:02 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > > You might find the elephant-bird project helpful for reading
> > > > > > > and creating LZO files, in raw Hadoop or using Pig.
> > > > > > > (disclaimer: I'm a committer on elephant-bird)
> > > > > > >
> > > > > > > D
> > > > > > >
> > > > > > > On Wed, Mar 28, 2012 at 9:49 AM, Prashant Kommireddi
> > > > > > > <[EMAIL PROTECTED]> wrote:
> > > > > > > > Pig supports LZO for splittable compression.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Prashant
> > > > > > > >
> > > > > > > > On Mar 28, 2012, at 9:45 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > > > > > > >
> > > > > > > >> We currently have 100s of GB of uncompressed data which we
> > > > > > > >> would like to zip using some compression that is block
> > > > > > > >> compression, so that we can use multiple input splits. Does
> > > > > > > >> Pig support any such compression?
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > ‘(hello world)
> > > >
> > >
> >
>