-
LZO with sequenceFile
Mohit Anchlia, 2012-02-25, 21:38
Is LZO supported as a compression codec for sequence files? I looked at
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.CompressionType.html but LZO is not listed there.
-
Re: LZO with sequenceFile
Shi Yu, 2012-02-26, 02:27
Yes, Hadoop sequence files support it, and the resulting files are splittable
by default. If you have installed and configured LZO correctly, use these calls:

    org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.setCompressOutput(job, true);
    org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.setOutputCompressorClass(job, com.hadoop.compression.lzo.LzoCodec.class);
    org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
    job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.class);

Shi
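To illustrate why the BLOCK setting above is usually the better choice for small records, here is a minimal sketch that compares compressing each record on its own with compressing a batch of records together. It uses only the JDK's Deflate streams, not LZO and not Hadoop's SequenceFile writer; the class name `RecordVsBlock`, its methods, and the sample record layout are all hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DeflaterOutputStream;

// Hypothetical illustration class; not part of Hadoop.
class RecordVsBlock {

    // Compress one byte array with Deflate and return the compressed size in bytes.
    static int deflatedSize(byte[] data) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (DeflaterOutputStream out = new DeflaterOutputStream(sink)) {
                out.write(data);
            }
            return sink.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // "RECORD"-style: every record is compressed separately, so the
    // compressor never sees the redundancy shared between records.
    static int perRecordSize(List<String> records) {
        int total = 0;
        for (String record : records) {
            total += deflatedSize(record.getBytes(StandardCharsets.UTF_8));
        }
        return total;
    }

    // "BLOCK"-style: many records are concatenated and compressed together.
    static int blockSize(List<String> records) {
        StringBuilder batch = new StringBuilder();
        for (String record : records) {
            batch.append(record).append('\n');
        }
        return deflatedSize(batch.toString().getBytes(StandardCharsets.UTF_8));
    }

    // Made-up sample data: short records that share most of their content.
    static List<String> sampleRecords(int count) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            records.add("user=" + i + ";status=active;region=us-east");
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> records = sampleRecords(1000);
        System.out.println("per-record bytes: " + perRecordSize(records));
        System.out.println("block bytes:      " + blockSize(records));
    }
}
```

With short, similar records the batch result should come out much smaller, because the compressor can reuse patterns across records. The exact numbers depend on the data and on the codec in use.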
-
Re: LZO with sequenceFile
Mohit Anchlia, 2012-02-26, 07:14
Thanks. Does it mean LZO is not installed by default? How can I install LZO?
-
Re: LZO with sequenceFile
Shi Yu, 2012-02-26, 15:36
Hi,
There are plenty of documents about this. Try searching Google for "kevinweil-hadoop-lzo".

Shi
-
Re: LZO with sequenceFile
Ioan Eugen Stan, 2012-02-26, 16:25
2012/2/26 Mohit Anchlia <[EMAIL PROTECTED]>:
> Thanks. Does it mean LZO is not installed by default? How can I install LZO?

The LZO library is released under the GPL, and I believe it can't be included in most distributions of Hadoop because of this (GPL code can't be mixed with non-GPL code). It should be easily available, though.

--
Ioan Eugen Stan
http://ieugen.blogspot.com/
-
Re: LZO with sequenceFile
Harsh J, 2012-02-26, 17:09
If you just want to package the hadoop-lzo items quickly instead of
building and managing the deployment on your own, you can reuse Todd Lipcon's script at https://github.com/toddlipcon/hadoop-lzo-packager. It creates both RPMs and DEBs.

--
Harsh J
-
Re: LZO with sequenceFile
Mohit Anchlia, 2012-02-26, 17:12
On Sun, Feb 26, 2012 at 9:09 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> If you want to just quickly package the hadoop-lzo items instead of
> building/managing-deployment on your own, you can reuse Todd Lipcon's
> script at https://github.com/toddlipcon/hadoop-lzo-packager - Creates
> both RPMs and DEBs.

Thanks! I have a few questions:
1. Would it work with sequence files? I am using SequenceFileAsTextInputStream.
2. If I use SequenceFile.CompressionType.RECORD or BLOCK, would it still split the files?
3. I am also using CDH's 20.2 version of hadoop.
-
Re: LZO with sequenceFile
Harsh J, 2012-02-26, 18:49
Hi Mohit,

On Sun, Feb 26, 2012 at 10:42 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> Thanks! Some questions I have is:
> 1. Would it work with sequence files? I am using
> SequenceFileAsTextInputStream

Yes, you just need to set the right codec when you write the file. Reading then works the same way as reading an uncompressed sequence file.

The codec class names are stored as metadata in the sequence file and are read back to load the right codec for the reader. You therefore don't have to specify a 'reader' codec once you have written a file with the codec of your choice.

> 2. If I use SequenceFile.CompressionType.RECORD or BLOCK would it still
> split the files?

Yes. SequenceFiles are a natively splittable file format, designed for HDFS and MapReduce. Compressed sequence files are thus splittable too.

You mostly want block compression, unless your records are large and you feel you would benefit more from compression applied to a single, complete record than to a batch of records.

> 3. I am also using CDH's 20.2 version of hadoop.

http://www.cloudera.com/assets/images/diagrams/whats-in-a-version.png :)

--
Harsh J
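The point about codec class names being stored in the file can be sketched with a toy container format. This is a simplified illustration using only the JDK, not Hadoop's actual SequenceFile layout: the writer records the codec's class name in a header, and the reader loads that class by name, so the caller of `read` never names a codec. The `SelfDescribingFile` class, its `Codec` interface, and both codec implementations are hypothetical names made up for this example (Deflate stands in for LZO).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical toy format; not Hadoop's SequenceFile implementation.
class SelfDescribingFile {

    // Stand-in for a compression codec (assumption: not Hadoop's CompressionCodec).
    public interface Codec {
        OutputStream createOutputStream(OutputStream out) throws IOException;
        InputStream createInputStream(InputStream in) throws IOException;
    }

    public static final class DeflateCodec implements Codec {
        public OutputStream createOutputStream(OutputStream out) { return new DeflaterOutputStream(out); }
        public InputStream createInputStream(InputStream in) { return new InflaterInputStream(in); }
    }

    public static final class IdentityCodec implements Codec {
        public OutputStream createOutputStream(OutputStream out) { return out; }
        public InputStream createInputStream(InputStream in) { return in; }
    }

    // Writer side: the codec is chosen here, and its class name goes into the header.
    static byte[] write(String payload, Codec codec) {
        try {
            ByteArrayOutputStream file = new ByteArrayOutputStream();
            DataOutputStream header = new DataOutputStream(file);
            header.writeUTF(codec.getClass().getName());
            header.flush();
            try (OutputStream body = codec.createOutputStream(file)) {
                body.write(payload.getBytes(StandardCharsets.UTF_8));
            }
            return file.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the codec class name recorded in the header.
    static String codecName(byte[] bytes) {
        try {
            return new DataInputStream(new ByteArrayInputStream(bytes)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reader side: no codec argument; the class named in the header is loaded reflectively.
    static String read(byte[] bytes) {
        try {
            ByteArrayInputStream file = new ByteArrayInputStream(bytes);
            String name = new DataInputStream(file).readUTF();
            Codec codec = (Codec) Class.forName(name).getDeclaredConstructor().newInstance();
            try (InputStream body = codec.createInputStream(file)) {
                return new String(body.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("codec class not available on the reader", e);
        }
    }

    public static void main(String[] args) {
        byte[] file = write("key1\tvalue1\nkey2\tvalue2\n", new DeflateCodec());
        System.out.println("codec in header: " + codecName(file));
        System.out.print(read(file));
    }
}
```

The `IllegalStateException` branch mirrors the practical consequence for LZO: the reader needs no codec setting, but the codec class named in the file still has to be installed wherever the file is read.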
-
Re: LZO with sequenceFile
Edward Capriolo, 2012-02-26, 19:28
LZO confuses most people because of how it was added and then removed. There is also a system for making raw LZO files splittable by indexing them.

I have just patched google-snappy into 0.20.2. Snappy has a performance profile similar to LZO's: good compression with low processor overhead. It does not have the licence issues, and there are not thousands of pages of semi-contradictory, confusing information about it, so it ends up being easier to set up and use.

http://code.google.com/p/snappy/

Recent versions of Hadoop have Snappy built in, so it will just work out of the box.

Edward