How to read LZO compressed files?
edward choi 2012-01-02, 05:34
Hi,
I'm having trouble handling LZO-compressed files. The input files were compressed with the LzopCodec provided by the hadoop-lzo package, and I am using the Cloudera 3 update 2 version of Hadoop.
I don't need to split the input files, so there is no need to tell me to index them and use LzoTextInputFormat, unless that is the only way to handle LZO-compressed files.
I thought all I needed to do was set the job input format to "TextInputFormat" and Hadoop would take care of the rest. When I do that, I don't get any error messages, but the log files tell me that the input files are not decompressed at all. Input files are being handled as raw text files.
Is there a specific way to read files with lzo extension?
Regards, Ed
Re: How to read LZO compressed files?
Shi Yu 2012-01-02, 06:54
You could decompress the LZO files manually into plain text and then use TextInputFormat.
Alternatively, you don't need to index the LZO-compressed files: just use LZOInputFormat on the non-indexed files, and the LZO files will not be split.
Re: How to read LZO compressed files?
edward choi 2012-01-02, 07:22
Hi,
The first solution is my last resort. There are so many LZO files that manual decompression would take quite a while.
As you suggested, I have used LzoTextInputFormat, but I get the following error:
2012-01-02 16:15:16,668 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2012-01-02 16:15:16,765 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId
2012-01-02 16:15:16,858 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
2012-01-02 16:15:16,860 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 8aa060526bc6778c971775b832751d2894c73b5f]
2012-01-02 16:15:16,906 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2012-01-02 16:15:16,908 WARN org.apache.hadoop.mapred.Child: Error running child
java.io.IOException: Codec for file hdfs://lp182:54310/user/hadoop/blog_result/20111106_20111112/part-m-00000.lzo not found, cannot run
    at com.hadoop.mapreduce.LzoLineRecordReader.initialize(LzoLineRecordReader.java:97)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:451)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:646)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
2012-01-02 16:15:16,910 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
which I don't understand, because I do have the LZO codec installed. Could you tell me what I am doing wrong here?
Regards, Ed
Re: How to read LZO compressed files?
Harsh J 2012-01-02, 07:22
Hello Edward,
On Mon, Jan 2, 2012 at 11:04 AM, edward choi <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'm having trouble trying to handle lzo compressed files.
> The input files are compressed by LzopCodec provided by hadoop-lzo package.
> And I am using Cloudera 3 update 2 version Hadoop.
>
> I don't need to split the input file, so there is no need telling me to
> index the input file and to use LzoTextInputFormat, unless that is the only
> way to handle lzo-compressed files.
It's OK to use LZO without splitting. There are no issues in doing that.
> I thought all I needed to do was set the job input format as
> "TextInputFormat" and hadoop will take care of the rest.
> When I do that, I don't get any error messages but log files tell me that
> input files are not decompressed at all. Input files are being handled as
> raw text files.
By 'Input files are being handled as raw text files.' I assume you mean that your mappers are receiving garbage (compressed) input, without being decoded?
Have you ensured that your io.compression.codecs property in core-site.xml carries LzoCodec and LzopCodec canonical classnames, and that your MR cluster was restarted with this change added?
> Is there a specific way to read files with lzo extension?
The above config registers the ".lzo" extension for lookup and auto-detection of LZO files, so you shouldn't need an explicit way.
-- Harsh J
Re: How to read LZO compressed files?
edward choi 2012-01-02, 08:01
Harsh, your comment just saved me from several wasted hours of aimless labor. I had added LzoCodec to io.compression.codecs in core-site.xml, but I forgot to add LzopCodec. Now it all works. Thanks for the reply!!!
Regards, Ed
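For readers wondering why registering LzoCodec alone was not enough, here is a rough sketch of extension-based codec lookup. It is a simplified, hypothetical model written for illustration and does not use or reproduce any Hadoop class: CodecLookupSketch, register and resolve are made-up names. It assumes, based on my reading of the hadoop-lzo sources, that LzoCodec reports ".lzo_deflate" as its default extension while LzopCodec reports ".lzo", which would explain the "Codec for file ... .lzo not found" error above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Simplified, hypothetical model of extension-based codec lookup.
// It does not call into Hadoop; it only mimics the idea of matching
// a file name suffix against the codecs listed in a config value.
class CodecLookupSketch {

    // Assumed default extensions per codec class (see the note above).
    static final Map<String, String> DEFAULT_EXTENSION = Map.of(
        "org.apache.hadoop.io.compress.GzipCodec", ".gz",
        "com.hadoop.compression.lzo.LzoCodec", ".lzo_deflate",
        "com.hadoop.compression.lzo.LzopCodec", ".lzo");

    // Build an extension -> class name table from a comma-separated list,
    // in the style of the io.compression.codecs property value.
    static Map<String, String> register(String codecsProperty) {
        Map<String, String> byExtension = new LinkedHashMap<>();
        for (String name : codecsProperty.split(",")) {
            String className = name.trim();
            String extension = DEFAULT_EXTENSION.get(className);
            if (extension != null) {
                byExtension.put(extension, className);
            }
        }
        return byExtension;
    }

    // Return the codec whose extension matches the end of the file name, if any.
    static Optional<String> resolve(Map<String, String> byExtension, String fileName) {
        for (Map.Entry<String, String> entry : byExtension.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String file = "part-m-00000.lzo";

        // Only LzoCodec listed: nothing claims the ".lzo" suffix.
        Map<String, String> lzoOnly = register("com.hadoop.compression.lzo.LzoCodec");
        System.out.println(resolve(lzoOnly, file).orElse("no codec found"));
        // prints "no codec found"

        // Both codecs listed: LzopCodec claims ".lzo".
        Map<String, String> both = register(
            "com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec");
        System.out.println(resolve(both, file).orElse("no codec found"));
        // prints "com.hadoop.compression.lzo.LzopCodec"
    }
}
```

In this model the fix is the same as in the thread: list both canonical class names in the property value, so that files ending in ".lzo" resolve to LzopCodec.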