Mapreduce using JSONObjects (HDFS mail # user)


Max Lebedev 2013-06-04, 22:49
Shahab Yunus 2013-06-04, 23:07
Mischa Tuffield 2013-06-04, 23:39
Re: Mapreduce using JSONObjects
I agree with Shahab; you have to ensure that the keys are WritableComparable
and the values are Writable in order to be used in MR.

You can have a WritableComparable implementation wrapping the actual JSON
object.
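
For illustration, a minimal sketch of such a wrapper (the class name and
canonicalization scheme are illustrative, not from the thread). It serializes
each object with its top-level keys sorted, so two objects that differ only in
field order produce identical keys:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.WritableComparable;
import org.json.JSONException;
import org.json.JSONObject;

// Illustrative wrapper that makes a JSON object usable as an MR key.
public class JsonWritable implements WritableComparable<JsonWritable> {

    private String canonical = "";

    public JsonWritable() {}  // Hadoop needs a no-arg constructor

    public void set(JSONObject jo) throws JSONException {
        // Serialize with top-level keys in sorted order so that logically
        // equal objects yield identical strings regardless of field order.
        // (Nested objects would need the same treatment recursively.)
        String[] names = JSONObject.getNames(jo);
        StringBuilder sb = new StringBuilder("{");
        if (names != null) {
            Arrays.sort(names);
            for (int i = 0; i < names.length; i++) {
                if (i > 0) sb.append(',');
                sb.append(JSONObject.quote(names[i]))
                  .append(':')
                  .append(JSONObject.valueToString(jo.get(names[i])));
            }
        }
        canonical = sb.append('}').toString();
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(canonical);  // note: writeUTF caps a key at 64KB
    }

    public void readFields(DataInput in) throws IOException {
        canonical = in.readUTF();
    }

    public int compareTo(JsonWritable other) {
        return canonical.compareTo(other.canonical);
    }

    public int hashCode() {
        return canonical.hashCode();
    }

    public boolean equals(Object o) {
        return o instanceof JsonWritable
                && canonical.equals(((JsonWritable) o).canonical);
    }
}

The driver would then declare it as the map output key class, e.g.
conf.setMapOutputKeyClass(JsonWritable.class), and the reducer could emit the
first value it sees for each key.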

Thanks,
Rahul
On Wed, Jun 5, 2013 at 5:09 AM, Mischa Tuffield <[EMAIL PROTECTED]> wrote:

> Hello,
>
> On 4 Jun 2013, at 23:49, Max Lebedev <[EMAIL PROTECTED]> wrote:
>
> Hi. I've been trying to use JSONObjects to identify duplicates in JSON
> strings. The duplicate strings contain the same data, but not necessarily
> in the same order. For example, the following two lines should be
> identified as duplicates (and filtered).
>
>
> {"ts":1368758947.291035,"isSecure":true,"version":2,"source":"sdk","debug":false
>
> {"ts":1368758947.291035,"version":2,"source":"sdk","isSecure":true,"debug":false}
>
> Can you not use the timestamp as a URI and emit them as URIs? Then you
> would have your mapper emit the following kv:
>
> output.collect(ts, value);
>
> And you would have a straightforward reducer that can dedup based on the
> timestamps.
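
A minimal sketch of that suggestion against the same (old) mapred API the
original code uses. The class names and counter are illustrative, and it
assumes the "ts" field alone identifies a duplicate, which is the premise of
the suggestion:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.json.JSONException;
import org.json.JSONObject;

public class TsDedup {

    // Key each line by its "ts" field; Text is already WritableComparable.
    public static class TsMap extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            try {
                JSONObject jo = new JSONObject(value.toString());
                output.collect(new Text(jo.get("ts").toString()), value);
            } catch (JSONException e) {
                // Count malformed lines instead of letting the task die.
                reporter.incrCounter("TsDedup", "BAD_JSON", 1);
            }
        }
    }

    // All lines sharing a timestamp arrive together; keep only the first.
    public static class TsReduce extends MapReduceBase
            implements Reducer<Text, Text, NullWritable, Text> {
        public void reduce(Text ts, Iterator<Text> lines,
                OutputCollector<NullWritable, Text> output, Reporter reporter)
                throws IOException {
            output.collect(NullWritable.get(), lines.next());
        }
    }
}

The driver would also need conf.setMapOutputKeyClass(Text.class) and
conf.setMapOutputValueClass(Text.class), since the reduce output key
(NullWritable) differs from the map output key.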
>
> If the above doesn't work for you, I would look at the Jackson library for
> mangling JSON in Java. Its method of using Java beans for JSON is clean
> from a code POV and comes with lots of nice features.
> http://stackoverflow.com/a/2255893
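
For reference, a minimal sketch of that bean-binding style, assuming Jackson
2's com.fasterxml artifacts; the Event class and its fields simply mirror the
example lines above:

import java.io.IOException;

import com.fasterxml.jackson.databind.ObjectMapper;

public class EventParser {

    // Illustrative bean for the example records. Jackson matches JSON
    // fields to these properties by name, so field order never matters.
    public static class Event {
        public double ts;
        public boolean isSecure;
        public int version;
        public String source;
        public boolean debug;
    }

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static Event parse(String line) throws IOException {
        return MAPPER.readValue(line, Event.class);
    }
}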
>
> P.S. In your code you are using the older MapReduce API; I would look at
> using the newer APIs in the org.apache.hadoop.mapreduce package.
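
For comparison, a bare skeleton of the newer-API shape (class names are
illustrative; the keying/dedup logic discussed above would slot into map()):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewApiDupFilter {

    // In the newer API a mapper extends Mapper directly; Context
    // replaces OutputCollector and Reporter.
    public static class LineMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The keying/dedup logic discussed above would go here;
            // as written this just keys each line by itself.
            context.write(value, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "dup-filter");
        job.setJarByClass(NewApiDupFilter.class);
        job.setMapperClass(LineMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}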
>
> Mischa
>
> This is the code:
>
> class DupFilter {
>
>     public static class Map extends MapReduceBase
>             implements Mapper<LongWritable, Text, JSONObject, Text> {
>
>         public void map(LongWritable key, Text value,
>                 OutputCollector<JSONObject, Text> output, Reporter reporter)
>                 throws IOException {
>             JSONObject jo = null;
>             try {
>                 jo = new JSONObject(value.toString());
>             } catch (JSONException e) {
>                 e.printStackTrace();
>             }
>             output.collect(jo, value);
>         }
>     }
>
>     public static class Reduce extends MapReduceBase
>             implements Reducer<JSONObject, Text, NullWritable, Text> {
>
>         public void reduce(JSONObject jo, Iterator<Text> lines,
>                 OutputCollector<NullWritable, Text> output, Reporter reporter)
>                 throws IOException {
>             output.collect(null, lines.next());
>         }
>     }
>
>     public static void main(String[] args) throws Exception {
>         JobConf conf = new JobConf(DupFilter.class);
>         conf.setOutputKeyClass(JSONObject.class);
>         conf.setOutputValueClass(Text.class);
>         conf.setMapperClass(Map.class);
>         conf.setReducerClass(Reduce.class);
>         conf.setInputFormat(TextInputFormat.class);
>         conf.setOutputFormat(TextOutputFormat.class);
>         FileInputFormat.setInputPaths(conf, new Path(args[0]));
>         FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>         JobClient.runJob(conf);
>     }
> }
>
> I get the following error:
>
> java.lang.ClassCastException: class org.json.JSONObject
>         at java.lang.Class.asSubclass(Class.java:3027)
>         at org.apache.hadoop.mapred.JobConf.getOutputKeyComparator(JobConf.java:795)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:817)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:383)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
>
> It looks like it has something to do with conf.setOutputKeyClass(). Am I
> doing something wrong here?
>
>
> Thanks,
>
> Max Lebedev
>
>
>   _______________________________
> Mischa Tuffield PhD
> http://mmt.me.uk/