Simple map-only job to create Block Sequence Files compressed with Snappy
Peter Cogan 2013-01-11, 19:13
Hi there,
I am trying to create a map-only job which takes as input some log files and simply converts them into sequence files compressed with Snappy. Although the job runs with no error, the output file that is created is pretty much the same size as the file I started with. Really confused!

I've pasted the full script and the Hadoop output below. The output file is just named part-m-00000; this is the resultant map output file that seems to have the same size as the input file.

thanks!
Peter

    public class snappyMapOutput {

        public static class MapFunction extends Mapper<Object, Text, LongWritable, Text> {
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(key, value);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

            conf.setBoolean("mapred.compress.map.output", true);
            conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");
            conf.set("mapred.output.compression.type", "BLOCK");

            Job job = new Job(conf, "Convert to BLOCK Sequence File Snappy Compressed");
            job.setJarByClass(snappyMapOutput.class);
            job.setMapperClass(MapFunction.class);
            job.setNumReduceTasks(0);
            job.setMapOutputKeyClass(LongWritable.class);
            job.setMapOutputValueClass(Text.class);
            job.setOutputFormatClass(SequenceFileOutputFormat.class);

            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

13/01/11 19:19:38 INFO input.FileInputFormat: Total input paths to process : 1
13/01/11 19:19:38 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
13/01/11 19:19:38 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 6bb1b7f8b9044d8df9b4d2b6641db7658aab3cf8]
13/01/11 19:19:38 WARN snappy.LoadSnappy: Snappy native library is available
13/01/11 19:19:38 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/01/11 19:19:38 INFO snappy.LoadSnappy: Snappy native library loaded
13/01/11 19:19:39 INFO mapred.JobClient: Running job: job_201301111838_0006
13/01/11 19:19:40 INFO mapred.JobClient: map 0% reduce 0%
13/01/11 19:19:45 INFO mapred.JobClient: map 100% reduce 0%
13/01/11 19:19:45 INFO mapred.JobClient: Job complete: job_201301111838_0006
13/01/11 19:19:45 INFO mapred.JobClient: Counters: 19
13/01/11 19:19:45 INFO mapred.JobClient: Job Counters
13/01/11 19:19:45 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=4566
13/01/11 19:19:45 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/01/11 19:19:45 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/01/11 19:19:45 INFO mapred.JobClient: Launched map tasks=1
13/01/11 19:19:45 INFO mapred.JobClient: Data-local map tasks=1
13/01/11 19:19:45 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
13/01/11 19:19:45 INFO mapred.JobClient: File Output Format Counters
13/01/11 19:19:45 INFO mapred.JobClient: Bytes Written=72951075
13/01/11 19:19:45 INFO mapred.JobClient: FileSystemCounters
13/01/11 19:19:45 INFO mapred.JobClient: HDFS_BYTES_READ=70983803
13/01/11 19:19:45 INFO mapred.JobClient: FILE_BYTES_WRITTEN=24107
13/01/11 19:19:45 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=72951075
13/01/11 19:19:45 INFO mapred.JobClient: File Input Format Counters
13/01/11 19:19:45 INFO mapred.JobClient: Bytes Read=70983680
13/01/11 19:19:45 INFO mapred.JobClient: Map-Reduce Framework
13/01/11 19:19:45 INFO mapred.JobClient: Map input records=79756
13/01/11 19:19:45 INFO mapred.JobClient: Physical memory (bytes) snapshot=109174784
13/01/11 19:19:45 INFO mapred.JobClient: Spilled Records=0
13/01/11 19:19:45 INFO mapred.JobClient: CPU time spent (ms)=2040
13/01/11 19:19:45 INFO mapred.JobClient: Total committed heap usage (bytes)=187105280
13/01/11 19:19:45 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1084190720
13/01/11 19:19:45 INFO mapred.JobClient: Map output records=79756
13/01/11 19:19:45 INFO mapred.JobClient: SPLIT_RAW_BYTES=123
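
As a quick sanity check on the counters in the job output above, here is a minimal sketch in plain Java (no Hadoop dependency) that compares HDFS_BYTES_READ with HDFS_BYTES_WRITTEN. The class and method names are hypothetical, introduced only for illustration; the three numbers are copied from the log. The counters show the output is slightly larger than the input, which would be consistent with records being written uncompressed plus some per-record overhead, though the log alone cannot confirm the cause.

```java
// Hypothetical helper, not part of the original job: compares the input and
// output byte counters reported in the JobClient log.
final class OutputSizeCheck {

    // Extra bytes written relative to bytes read (negative if the output shrank).
    static long extraBytes(long bytesRead, long bytesWritten) {
        return bytesWritten - bytesRead;
    }

    // Output size as a fraction of input size; values near or above 1.0
    // suggest that little or no compression took place.
    static double sizeRatio(long bytesRead, long bytesWritten) {
        return (double) bytesWritten / (double) bytesRead;
    }

    public static void main(String[] args) {
        long bytesRead = 70983803L;    // HDFS_BYTES_READ from the log
        long bytesWritten = 72951075L; // HDFS_BYTES_WRITTEN from the log
        long records = 79756L;         // Map output records from the log

        long extra = extraBytes(bytesRead, bytesWritten);
        System.out.printf("ratio=%.4f extraBytes=%d extraPerRecord=%.1f%n",
                sizeRatio(bytesRead, bytesWritten), extra, (double) extra / records);
    }
}
```

Running the same comparison after any change to the job's compression settings gives a simple before/after measure; a ratio well below 1.0 would indicate that compression actually took effect.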