Re: Compressing output using block compression
SequenceFileStorage in elephant-bird lets you load from and store to sequence
files. If your input is text lines, you can store each line as the 'value'.
You can experiment with different codecs.
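For example, a minimal sketch of writing lines into a Snappy block-compressed
sequence file (the class names follow elephant-bird's documented usage, but the
jar name/version and paths are illustrative, and Snappy needs the native
libraries installed on the cluster):

REGISTER 'elephant-bird-pig-4.5.jar';  -- illustrative jar name/version

-- Ask Hadoop for block-compressed Snappy output (Hadoop 1.x property names).
SET mapred.output.compress true;
SET mapred.output.compression.type BLOCK;
SET mapred.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;

lines = LOAD 'input' AS (line:chararray);
-- SequenceFileStorage stores a (key, value) pair; use an empty key here.
kv = FOREACH lines GENERATE '' AS key, line AS value;
STORE kv INTO 'output' USING com.twitter.elephantbird.pig.store.SequenceFileStorage (
    '-c com.twitter.elephantbird.pig.util.TextConverter',
    '-c com.twitter.elephantbird.pig.util.TextConverter');

Swap the codec in mapred.output.compression.codec to try alternatives.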

Depending on your use case, simple bzip2 files may not be a bad choice.
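For instance (paths illustrative), Pig picks the codec from the store path's
extension, so a .bz2 suffix gives splittable bzip2 output with no extra
configuration:

lines = LOAD 'input' AS (line:chararray);
-- The .bz2 extension makes PigStorage write bzip2-compressed part files.
STORE lines INTO 'output.bz2' USING PigStorage();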

On Tue, Apr 3, 2012 at 1:57 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:

> Thanks for the examples. It appears that snappy is not splittable and the
> suggested approach is to write to sequence files.
>
> I know how to load from sequence files, but in Pig I can't find a way to
> write to sequence files using snappy compression.
>
> On Tue, Apr 3, 2012 at 1:30 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
>
> > Does it mean Snappy is splittable?
> > http://www.cloudera.com/blog/2011/09/snappy-and-hadoop/
> >
> > If so then how can I use it in pig?
> > http://hadoopified.wordpress.com/2012/01/24/snappy-compression-with-pig/
> >
> >
> > On Tue, Apr 3, 2012 at 1:02 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> >
> > > I am currently using Snappy in sequence files. I wasn't aware snappy
> > > uses block compression. Does it mean Snappy is splittable? If so then
> > > how can I use it in pig?
> > >
> > > Thanks again
> > >
> > > On Tue, Apr 3, 2012 at 12:42 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
> > >
> > > > Most companies handling BigData use LZO, and a few have started
> > > > exploring/using Snappy as well (which is not any easier to configure).
> > > > These are the two splittable fast-compression algorithms. Note Snappy
> > > > is not space-efficient compared to gzip or other compression
> > > > algorithms, but it is a lot faster (ideal for compression between Map
> > > > and Reduce).
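For illustration, a minimal sketch of enabling Snappy between Map and Reduce
from a Pig script (Hadoop 1.x property names; assumes the Snappy native
libraries are installed on the cluster):

-- Compress intermediate map output with Snappy.
SET mapred.compress.map.output true;
SET mapred.map.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;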
> > > >
> > > > Is there any repeated/heavy computation involved on the outputs other
> > > > than pushing this data to a database? If not, maybe it's fine to use
> > > > gzip, but you have to make sure the individual files are close to the
> > > > block size, or you will have a lot of unnecessary IO transfers taking
> > > > place. If you read the outputs to perform further MapReduce
> > > > computation, gzip is not the best.
> > > >
> > > > -Prashant
> > > >
> > > > On Tue, Apr 3, 2012 at 12:18 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Thanks for your input.
> > > > >
> > > > > It looks like it's some work to configure LZO. What are the other
> > > > > alternatives? We read new sequence files and generate output
> > > > > continuously. What are my options? Should I split the output into
> > > > > small pieces and gzip them? How do people solve similar problems
> > > > > where a continuous flow of data generates tons of output?
> > > > >
> > > > > After the output is generated, we read it again and load it into an
> > > > > OLAP db or do some other analysis.
> > > > >
> > > > > On Tue, Apr 3, 2012 at 11:48 AM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > Yes, it is splittable.
> > > > > >
> > > > > > Bzip2 consumes a lot of CPU during decompression. With Hadoop
> > > > > > jobs generally being IO-bound, Bzip2 can sometimes become the
> > > > > > performance bottleneck due to its slow decompression rate (the
> > > > > > algorithm is unable to decompress at disk read rate).
> > > > > >
> > > > > >
> > > > > > On Tue, Apr 3, 2012 at 11:39 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > > Is bzip2 not advisable? I think it can split too and is
> > > > > > > supported out of the box.
> > > > > > >
> > > > > > > On Thu, Mar 29, 2012 at 8:08 PM, 帝归 <[EMAIL PROTECTED]> wrote:
> > > > > > >
> > > > > > > > When I use LzoPigStorage, it will load all files under a
> > > > > > > > directory. But I want to compress every file under a directory
> > > > > > > > and keep the file name unchanged, just with a .lzo extension.
> > > > > > > > How can I do this?