Pig >> mail # user >> Compressing output using block compression


Re: Compressing output using block compression
Most companies handling Big Data use LZO, and a few have started
exploring/using Snappy as well (which is not any easier to configure).
These are the two splittable fast-compression options. Note that Snappy is
not space-efficient compared to gzip or other compression algorithms, but
it is a lot faster (ideal for compressing intermediate data between Map
and Reduce).
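As an illustrative sketch (not from the original thread), turning on Snappy
for the intermediate map output from inside a Pig script of that era could
look like the following; the property names are the Hadoop 1.x ones, and the
Snappy native libraries are assumed to be installed on every node:

```pig
-- Compress intermediate map output with Snappy (assumes Snappy native
-- libraries are present on all task nodes).
SET mapred.compress.map.output true;
SET mapred.map.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;

-- Pig's own temporary files between chained MR jobs can be compressed too;
-- pig.tmpfilecompression.codec accepts gz or lzo.
SET pig.tmpfilecompression true;
SET pig.tmpfilecompression.codec lzo;
```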

Is there any repeated/heavy computation involved on the outputs, other
than pushing this data to a database? If not, maybe it's fine to use gzip,
but you have to make sure the individual files are close to the HDFS block
size, or you will have a lot of unnecessary IO transfers taking place. If
you read the outputs to perform further MapReduce computation, gzip is not
the best choice, since a gzip file cannot be split across mappers.
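For the database-load case described above, a minimal sketch of gzipping the
final output from Pig (using Pig's documented output-compression properties;
the relation and paths are hypothetical):

```pig
-- Gzip-compress the final job output. Suitable when the output is only
-- loaded into a database afterwards, not re-read by further MR jobs.
SET output.compression.enabled true;
SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

data = LOAD 'input' AS (f1:chararray, f2:int);  -- hypothetical relation
STORE data INTO 'output_gz';  -- part files are written with a .gz extension
```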

-Prashant

On Tue, Apr 3, 2012 at 12:18 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:

> Thanks for your input.
>
> It looks like it's some work to configure LZO. What are the other
> alternatives? We read new sequence files and generate output continuously.
> What are my options? Should I split the output into small pieces and gzip
> them? How do people solve similar problems where there is a continuous
> flow of data that generates tons of output continuously?
>
> After the output is generated we read it again and load it into an OLAP
> db or do some other analysis.
>
> On Tue, Apr 3, 2012 at 11:48 AM, Prashant Kommireddi <[EMAIL PROTECTED]>
> wrote:
>
> > Yes, it is splittable.
> >
> > Bzip2 consumes a lot of CPU in decompression. With Hadoop jobs generally
> > being IO bound, Bzip2 sometimes can become the bottleneck with respect to
> > performance due to this slow decompression rate (algorithm unable to
> > decompress at disk read rate).
> >
> >
> > On Tue, Apr 3, 2012 at 11:39 AM, Mohit Anchlia <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Is bzip2 not advisable? I think it can split too and is supported out
> > > of the box.
> > >
> > > On Thu, Mar 29, 2012 at 8:08 PM, 帝归 <[EMAIL PROTECTED]> wrote:
> > >
> > > > When I use LzoPigStorage, it will load all files under a directory.
> > > > But I want to compress every file under a directory and keep the
> > > > file name unchanged, just with a .lzo extension. How can I do this?
> > > > Maybe I must write a MapReduce job?
> > > >
> > > > 2012/3/30 Jonathan Coveney <[EMAIL PROTECTED]>
> > > >
> > > > > check out:
> > > > > https://github.com/kevinweil/elephant-bird/tree/master/src/java/com/twitter/elephantbird/pig/store
> > > > >
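For reference, storing LZO output with elephant-bird's Pig storer from the
tree linked above might look like the following sketch; the jar path and
relation names are hypothetical, and hadoop-lzo plus the LZO native
libraries are assumed to be configured on the cluster:

```pig
-- Hypothetical jar path; register elephant-bird and its dependencies first.
REGISTER 'elephant-bird.jar';

data = LOAD 'input' AS (f1:chararray, f2:int);  -- hypothetical relation

-- Writes LZO-compressed output via elephant-bird's storer.
STORE data INTO 'output_lzo'
    USING com.twitter.elephantbird.pig.store.LzoPigStorage();
```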
> > > > > 2012/3/29 Mohit Anchlia <[EMAIL PROTECTED]>
> > > > >
> > > > > > Thanks! When I store output how can I tell Pig to compress it
> > > > > > in LZO format?
> > > > > >
> > > > > > On Thu, Mar 29, 2012 at 4:02 PM, Dmitriy Ryaboy
> > > > > > <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > > You might find the elephant-bird project helpful for reading
> > > > > > > and creating LZO files, in raw hadoop or using Pig.
> > > > > > > (disclaimer: I'm a committer on elephant-bird)
> > > > > > >
> > > > > > > D
> > > > > > >
> > > > > > > On Wed, Mar 28, 2012 at 9:49 AM, Prashant Kommireddi
> > > > > > > <[EMAIL PROTECTED]> wrote:
> > > > > > > > Pig supports LZO for splittable compression.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Prashant
> > > > > > > >
> > > > > > > > On Mar 28, 2012, at 9:45 AM, Mohit Anchlia
> > > > > > > > <[EMAIL PROTECTED]> wrote:
> > > > > > > >
> > > > > > > >> We currently have 100s of GB of uncompressed data which we
> > > > > > > >> would like to compress using some block compression so that
> > > > > > > >> we can use multiple input splits. Does pig support any such
> > > > > > > >> compression?
> > > >
> > > >
> > > > --
> > > > ‘(hello world)