HDFS >> mail # user >> Mapreduce using JSONObjects

Mapreduce using JSONObjects
Hi. I've been trying to use JSONObjects to identify duplicate lines in my
input. The duplicate strings contain the same data, but not necessarily in
the same order. For example, the following two lines should be identified
as duplicates (and filtered).


This is the code:

class DupFilter {

        public static class Map extends MapReduceBase implements
                        Mapper<LongWritable, Text, JSONObject, Text> {

                public void map(LongWritable key, Text value,
                                OutputCollector<JSONObject, Text> output,
                                Reporter reporter) throws IOException {
                        JSONObject jo = null;
                        try {
                                jo = new JSONObject(value.toString());
                        } catch (JSONException e) {
                                e.printStackTrace();
                        }
                        if (jo != null) {
                                output.collect(jo, value);
                        }
                }
        }

        public static class Reduce extends MapReduceBase implements
                        Reducer<JSONObject, Text, NullWritable, Text> {

                public void reduce(JSONObject jo, Iterator<Text> lines,
                                OutputCollector<NullWritable, Text> output,
                                Reporter reporter) throws IOException {
                        // Keep only the first line per key, dropping duplicates.
                        output.collect(NullWritable.get(), lines.next());
                }
        }

        public static void main(String[] args) throws Exception {
                JobConf conf = new JobConf(DupFilter.class);

                conf.setOutputKeyClass(JSONObject.class);
                conf.setOutputValueClass(Text.class);
                conf.setMapperClass(Map.class);
                conf.setReducerClass(Reduce.class);

                FileInputFormat.setInputPaths(conf, new Path(args[0]));
                FileOutputFormat.setOutputPath(conf, new Path(args[1]));

                JobClient.runJob(conf);
        }
}

I get the following error:

java.lang.ClassCastException: class org.json.JSONObject
        at java.lang.Class.asSubclass(Class.java:3027)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:383)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)

It looks like it has something to do with conf.setOutputKeyClass(). Am I
doing something wrong here?

Max Lebedev
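
[Editor's note] The asSubclass() failure above occurs because Hadoop map
output keys must implement WritableComparable, and org.json.JSONObject does
not. One common workaround is to keep Text as the key type and emit a
canonical string form of the JSON, so objects with the same fields in any
order produce identical keys. The sketch below shows only the
canonicalization idea in plain Java (no Hadoop dependencies); the helper
name and the flat-JSON regex are illustrative assumptions, and real code
should canonicalize via a proper JSON parser.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CanonicalJsonKey {

        // Hypothetical helper: canonicalizes a *flat* JSON object by
        // extracting its "key":value pairs and sorting them, so field
        // order no longer matters. A regex is used here only to keep the
        // sketch dependency-free; it does not handle nested objects.
        public static String canonicalize(String json) {
                Matcher m = Pattern.compile("\"[^\"]+\"\\s*:\\s*[^,}]+")
                                .matcher(json);
                List<String> pairs = new ArrayList<>();
                while (m.find()) {
                        pairs.add(m.group().replaceAll("\\s+", ""));
                }
                Collections.sort(pairs);
                return "{" + String.join(",", pairs) + "}";
        }

        public static void main(String[] args) {
                String a = canonicalize("{\"a\": 1, \"b\": \"x\"}");
                String b = canonicalize("{\"b\": \"x\", \"a\": 1}");
                // Same canonical key, so the reducer groups them together.
                System.out.println(a.equals(b)); // prints "true"
        }
}

In the mapper, the map output key type would then be Text (e.g.
output.collect(new Text(canonicalize(value.toString())), value)), which
already implements WritableComparable, so setOutputKeyClass(Text.class)
works without a custom key class.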