-
Compressing output using block compression
Mohit Anchlia 2012-03-28, 16:45
We currently have hundreds of GB of uncompressed data that we would like to compress with a block-based (splittable) codec so that jobs can still use multiple input splits. Does Pig support any such compression?
-
Re: Compressing output using block compression
Prashant Kommireddi 2012-03-28, 16:49
Pig supports LZO for splittable compression.

Thanks,
Prashant
-
Re: Compressing output using block compression
Dmitriy Ryaboy 2012-03-29, 23:02
You might find the elephant-bird project helpful for reading and creating LZO files, in raw Hadoop or using Pig.
(Disclaimer: I'm a committer on elephant-bird.)

D
-
Re: Compressing output using block compression
Mohit Anchlia 2012-03-29, 23:07
Thanks! When I store output, how can I tell Pig to compress it in LZO format?
-
Re: Compressing output using block compression
Jonathan Coveney 2012-03-30, 00:48
Check out:
https://github.com/kevinweil/elephant-bird/tree/master/src/java/com/twitter/elephantbird/pig/store
-
Re: Compressing output using block compression
帝归 2012-03-30, 03:08
When I use LzoPigStorage, it loads all files under a directory. But I want to compress every file under a directory and keep each file name unchanged, just with an added .lzo extension. How can I do this? Do I have to write a MapReduce job?
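
The per-file pattern asked about here (one compressed output per input, same name plus an extension) can be sketched outside Hadoop. This is only an illustration under loud assumptions: it runs on a local directory rather than HDFS, it uses bzip2 from the Python standard library as a stand-in because LZO is not available there, and `compress_each_file` is a hypothetical helper name, not part of Pig or elephant-bird.

```python
import bz2
import shutil
import tempfile
from pathlib import Path


def compress_each_file(directory, extension=".bz2"):
    """Compress every regular file in `directory` into `<name><extension>` beside it.

    Assumption: a local directory stands in for an HDFS directory, and bzip2
    stands in for LZO. The original files are left in place.
    """
    created = []
    # sorted() materialises the listing first, so files created below are not revisited.
    for src in sorted(Path(directory).iterdir()):
        if not src.is_file() or src.name.endswith(extension):
            continue  # skip subdirectories and files that are already compressed
        dst = src.with_name(src.name + extension)
        with src.open("rb") as fin, bz2.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)  # stream, so large files are not held in memory
        created.append(dst)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "part-00000").write_bytes(b"a\tb\n" * 1000)
        for path in compress_each_file(tmp):
            print(path.name, path.stat().st_size, "bytes")
```

On a real cluster the same loop would have to go through the HDFS client and an LZO codec, which is why a small MapReduce job or a standalone Java program is the usual route.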
-
Re: Compressing output using block compression
Mohit Anchlia 2012-04-03, 18:39
Is bzip2 not advisable? I think it is splittable too and is supported out of the box.
-
Re: Compressing output using block compression
Prashant Kommireddi 2012-04-03, 18:48
Yes, it is splittable.

Bzip2 consumes a lot of CPU in decompression. Because Hadoop jobs are generally IO-bound, bzip2 can sometimes become the performance bottleneck: the algorithm cannot decompress at the disk read rate.
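
The CPU cost of bzip2 decompression is easy to measure on a small sample before committing to a codec. The sketch below is an assumption-laden illustration, not a Hadoop benchmark: it uses Python's standard `bz2` and `zlib` modules (zlib implements the DEFLATE algorithm that gzip uses), and `sample_data` generates made-up log-like records. Absolute timings depend on the machine and the data, so no expected numbers are shown.

```python
import bz2
import time
import zlib


def measure(name, compress, decompress, data):
    """Compress and decompress `data` once; return size ratio and decompress time."""
    packed = compress(data)
    start = time.perf_counter()
    restored = decompress(packed)
    elapsed = time.perf_counter() - start
    assert restored == data  # the round trip must be lossless
    return {"codec": name, "ratio": len(packed) / len(data), "decompress_s": elapsed}


def sample_data(lines=50_000):
    # Hypothetical tab-separated records, repetitive enough to compress well.
    return b"".join(b"2012-04-03\tuser%d\tpage%d\n" % (i % 977, i % 31) for i in range(lines))


def compare(data):
    return [
        measure("bzip2", bz2.compress, bz2.decompress, data),
        measure("zlib (DEFLATE)", zlib.compress, zlib.decompress, data),
    ]


if __name__ == "__main__":
    for row in compare(sample_data()):
        print(f"{row['codec']:<16} ratio={row['ratio']:.3f} decompress={row['decompress_s']:.4f}s")
```

Running this on a representative slice of the real data gives a rough idea of how much decompression time is traded for the smaller files.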
-
Re: Compressing output using block compression
Mohit Anchlia 2012-04-03, 19:18
Thanks for your input.

It looks like it's some work to configure LZO. What are the other alternatives? We read new sequence files and generate output continuously. What are my options? Should I split the output into small pieces and gzip them? How do people solve similar problems where a continuous flow of data generates tons of output?

After the output is generated, we read it again and load it into an OLAP DB or do some other analysis.
-
Re: Compressing output using block compression
Prashant Kommireddi 2012-04-03, 19:42
Most companies handling big data use LZO; a few have started exploring or using Snappy as well (which is not any easier to configure). These are the two splittable fast-compression algorithms. Note that Snappy is not as space-efficient as gzip or other compression algorithms, but it is a lot faster (ideal for compression between Map and Reduce).

Is there any repeated or heavy computation on the outputs other than pushing this data to a database? If not, it may be fine to use gzip, but you have to make sure the individual files are close to the block size, or you will have a lot of unnecessary IO transfers taking place. If you read the outputs to perform further MapReduce computation, gzip is not the best choice.

-Prashant
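
The advice to keep gzip files close to the block size follows from simple arithmetic: a non-splittable file is read by a single mapper, so every block beyond the first may have to be fetched from another node. The sketch below is a deliberately simplified model, assuming a 64 MB block size (clusters vary) and ignoring the minimum/maximum split size and slop rules that real Hadoop input formats apply; the function names are hypothetical.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumed HDFS block size; check your cluster's setting


def _ceil_div(a, b):
    return -(-a // b)  # integer ceiling division, avoids float rounding


def estimated_splits(file_size, splittable, block_size=BLOCK_SIZE):
    """Rough number of input splits (map tasks) for one file."""
    if file_size == 0:
        return 0
    if not splittable:
        return 1  # e.g. a plain .gz file: one mapper reads the whole file
    return _ceil_div(file_size, block_size)


def blocks_per_split(file_size, splittable, block_size=BLOCK_SIZE):
    """How many HDFS blocks a single mapper has to read for this file."""
    if file_size == 0:
        return 0
    if splittable:
        return 1  # each split covers roughly one block
    return _ceil_div(file_size, block_size)  # one mapper walks every block


if __name__ == "__main__":
    mb = 1024 * 1024
    for size in (60 * mb, 640 * mb):
        print(size // mb, "MB gzip:",
              estimated_splits(size, splittable=False), "split,",
              blocks_per_split(size, splittable=False), "block(s) per split")
```

Under this model a 640 MB gzip file yields one map task that reads ten blocks, most of them likely remote, while a 60 MB gzip file yields one map task reading a single block.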
-
Re: Compressing output using block compression
Mohit Anchlia 2012-04-03, 20:02
I am currently using Snappy in sequence files. I wasn't aware Snappy uses block compression. Does that mean Snappy is splittable? If so, how can I use it in Pig?

Thanks again
-
Re: Compressing output using block compression
Prashant Kommireddi 2012-04-03, 20:30
> Does it mean Snappy is splittable?
http://www.cloudera.com/blog/2011/09/snappy-and-hadoop/

> If so then how can I use it in pig?
http://hadoopified.wordpress.com/2012/01/24/snappy-compression-with-pig/
-
Re: Compressing output using block compression
Mohit Anchlia 2012-04-03, 20:57
Thanks for the examples. It appears that Snappy is not splittable, and the suggested approach is to write to sequence files.

I know how to load from sequence files, but in Pig I can't find a way to write to sequence files using Snappy compression.
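
Why a container format can be split even when its codec cannot is easier to see in a toy model. The sketch below is not the SequenceFile format: it assumes zlib as a stand-in for Snappy (which is not in the Python standard library) and an explicit offset index as a stand-in for SequenceFile sync markers; `write_blocks` and `read_block` are hypothetical names. The point it illustrates is that each block is compressed independently, so a reader can start at any block boundary.

```python
import zlib


def write_blocks(records, records_per_block=1000):
    """Pack records into independently compressed blocks.

    Returns (payload, index), where index holds the (offset, length) of each
    block. Assumption: records are bytes and contain no newline characters.
    """
    payload = bytearray()
    index = []
    for start in range(0, len(records), records_per_block):
        chunk = b"\n".join(records[start:start + records_per_block])
        packed = zlib.compress(chunk)  # each block is compressed on its own
        index.append((len(payload), len(packed)))
        payload += packed
    return bytes(payload), index


def read_block(payload, index, block_no):
    """Decompress one block without touching the bytes of any other block."""
    offset, length = index[block_no]
    return zlib.decompress(payload[offset:offset + length]).split(b"\n")


if __name__ == "__main__":
    records = [b"line-%d" % i for i in range(2500)]
    payload, index = write_blocks(records)
    # A "mapper" assigned the second block reads only that block.
    print(len(index), "blocks; block 1 starts with", read_block(payload, index, 1)[0])
```

Compressing the whole payload as one stream instead would force every reader to start from byte zero, which is the situation with a plain gzip or raw Snappy file.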
-
Re: Compressing output using block compression
Raghu Angadi 2012-04-03, 21:08
SequenceFileStorage in elephant-bird lets you load from and store to sequence files.
If your input is text lines, you can store each line as the 'value'.
You can experiment with different codecs.

Depending on your use case, simple bzip2 files may not be a bad choice.
-
Re: Compressing output using block compression 帝归 2012-04-05, 15:05
I think LZO is a good format for compressing files because it is fast at both compressing and decompressing. Since LZO is not included in Hadoop's built-in compression formats (because of its GPL licence?), I need to write a Java program to compress files on HDFS.
2012/4/4 Raghu Angadi <[EMAIL PROTECTED]>
> SequenceFileStorage in elephant-bird lets you load and store to sequence
> files.
> If your input is text lines, you can store each line as 'value'.
> You can experiment with different codecs.
>
> depending on your use case, simple bzip2 files may not be a bad choice.
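As a rough illustration of the per-file compression described in this last message (compress every file in a directory and keep its name, adding only an extension), here is a minimal sketch. It makes two loud assumptions: it works on the local filesystem rather than HDFS, and it uses the JDK's gzip instead of LZO, because the LZO codec ships separately from the JDK and from stock Hadoop. The class name `DirectoryCompressor` is hypothetical. Note that, as discussed earlier in the thread, plain gzip output is not splittable.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Illustration only (hypothetical class): gzip on the local filesystem stands
// in for an LZO codec on HDFS.
class DirectoryCompressor {

    // Compress every regular file directly under dir into "<name>.gz" beside it.
    // Returns the number of files compressed; originals are left in place.
    static int compressAll(Path dir) throws IOException {
        // Snapshot the listing first so newly written .gz files are not revisited.
        List<Path> inputs = new ArrayList<>();
        try (DirectoryStream<Path> listing = Files.newDirectoryStream(dir)) {
            for (Path file : listing) {
                String name = file.getFileName().toString();
                if (Files.isRegularFile(file) && !name.endsWith(".gz")) {
                    inputs.add(file);
                }
            }
        }
        for (Path file : inputs) {
            Path target = file.resolveSibling(file.getFileName().toString() + ".gz");
            try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(target))) {
                Files.copy(file, out); // stream the original bytes through the compressor
            }
        }
        return inputs.size();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println("Compressed " + compressAll(dir) + " file(s) in " + dir);
    }
}
```

On a cluster the same loop would presumably go through Hadoop's `FileSystem` API and wrap the output stream with a codec from the hadoop-lzo library instead of `GZIPOutputStream`; those details are not shown here and would need checking against the installed versions.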