Mapreduce using JSONObjects
Hi. I've been trying to use JSONObjects to identify duplicates among JSON strings.
The duplicate strings contain the same data, but not necessarily with the keys in
the same order. For example, the following two lines should be identified as
duplicates (and filtered):

{"ts":1368758947.291035,"isSecure":true,"version":2,"source":"sdk","debug":false
{"ts":1368758947.291035,"version":2,"source":"sdk","isSecure":true,"debug":false}

This is the code:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.json.JSONException;
import org.json.JSONObject;

class DupFilter {

    // Map: parse each input line into a JSONObject and emit it as the key,
    // keeping the original line as the value.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, JSONObject, Text> {

        public void map(LongWritable key, Text value,
                OutputCollector<JSONObject, Text> output, Reporter reporter)
                throws IOException {
            JSONObject jo = null;
            try {
                jo = new JSONObject(value.toString());
            } catch (JSONException e) {
                e.printStackTrace();
            }
            output.collect(jo, value);
        }
    }

    // Reduce: for each key, keep only the first of the grouped lines.
    public static class Reduce extends MapReduceBase
            implements Reducer<JSONObject, Text, NullWritable, Text> {

        public void reduce(JSONObject jo, Iterator<Text> lines,
                OutputCollector<NullWritable, Text> output, Reporter reporter)
                throws IOException {
            output.collect(null, lines.next());
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(DupFilter.class);

        conf.setOutputKeyClass(JSONObject.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

I get the following error:
java.lang.ClassCastException: class org.json.JSONObject
        at java.lang.Class.asSubclass(Class.java:3027)
        at org.apache.hadoop.mapred.JobConf.getOutputKeyComparator(JobConf.java:795)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:817)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:383)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)

It looks like it has something to do with conf.setOutputKeyClass(). Am I
doing something wrong here?
Thanks,

Max Lebedev