MapReduce >> mail # user >> Re: Mapreduce using JSONObjects


Re: Mapreduce using JSONObjects
A side point for Hadoop experts: a comparator is used for sorting in the
shuffle. If a comparator always returns -1 for unequal objects, then
sorting will take longer than it should because a certain number of
items will be compared more than once.

Is this true?
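For reference, the contract in question is the `Comparable`/`Comparator` requirement that `sgn(compare(x, y)) == -sgn(compare(y, x))`. A minimal sketch of the pattern under discussion (the `EqualsOnlyComparator` name is mine, not from the thread) shows the violation directly:

```java
import java.util.Comparator;

// A comparator in the style discussed above: 0 when equal, -1 otherwise.
// For any two unequal values a and b it reports both a < b and b < a,
// which breaks the antisymmetry the sort relies on.
class EqualsOnlyComparator implements Comparator<String> {
    public int compare(String a, String b) {
        return a.equals(b) ? 0 : -1;
    }
}
```

Since `compare(a, b)` and `compare(b, a)` both return -1, the sort cannot establish a total order; besides redundant comparisons, equal keys are only grouped together when they happen to end up adjacent, and Java's TimSort-based object sorts may even reject such a comparator with "Comparison method violates its general contract!".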

On 06/05/2013 04:10 PM, Max Lebedev wrote:
>
> I've taken your advice and made a wrapper class which implements
> WritableComparable. Thank you very much for your help. I believe
> everything is working fine on that front. I used Google's Gson for the
> comparison.
>
>
> public int compareTo(Object o) {
>     JsonElement o1 = PARSER.parse(this.json.toString());
>     JsonElement o2 = PARSER.parse(o.toString());
>     if (o2.equals(o1))
>         return 0;
>     else
>         return -1;
> }
>
>
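A side note on the compareTo above: one way to get both order-insensitive equality and a valid total order is to compare a canonical, key-sorted serialization of each record. A rough stdlib-only sketch of the idea (flat objects modeled as a `Map`; the Gson parsing step is left out, and the `JsonOrder` name and helper are my assumptions, not the poster's code):

```java
import java.util.*;

class JsonOrder {
    // Serialize with keys in sorted order, so the key order of the
    // original input no longer affects comparison.
    static String canonicalize(Map<String, String> obj) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : new TreeMap<>(obj).entrySet()) {
            if (!first) sb.append(",");
            first = false;
            sb.append('"').append(e.getKey()).append("\":").append(e.getValue());
        }
        return sb.append('}').toString();
    }

    // Total order: semantically equal records compare 0; everything
    // else falls back to lexicographic order on the canonical form,
    // which is consistent and antisymmetric.
    static int compare(Map<String, String> a, Map<String, String> b) {
        return canonicalize(a).compareTo(canonicalize(b));
    }
}
```

With this ordering, records 1 and 6 in the sample below would compare equal even though they are separated by unequal records, so the shuffle sort can bring them together.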
> The problem I have now is that only consecutive duplicates are
> detected. Given 6 lines:
>
> {"ts":1368758947.291035,"isSecure":true,"version":2,"source":"sdk","debug":false}
> {"ts":1368758947.291035,"version":2,"source":"sdk","isSecure":true,"debug":false}
> {"ts":1368758947.291035,"version":2,"source":"sdk","isSecure":true,"debug":true}
> {"ts":1368758947.291035,"isSecure":false,"version":2,"source":"sdk","debug":false}
> {"ts":1368758947.291035,"source":"sdk","isSecure":false,"version":2,"debug":false}
> {"ts":1368758947.291035,"isSecure":true,"version":2,"source":"sdk","debug":false}
>
>
> I get back 1, 3, 4, and 6. I should be getting 1, 3 and 4, as 6 is
> exactly equal to 1. If I switch 5 and 6, the original line 5 is no
> longer filtered (I get 1, 3, 4, 5, 6). I've noticed that the compareTo
> method is called a total of 13 times. I assume that in order for all 6
> of the keys to be compared, 15 comparisons need to be made. Am I
> missing something here? I've tested the compareTo manually and lines 1
> and 6 are interpreted as equal. My MapReduce code currently looks
> like this:
>
>
> class DupFilter {
>
>     private static final Gson GSON = new Gson();
>     private static final JsonParser PARSER = new JsonParser();
>
>     public static class Map extends MapReduceBase implements
>             Mapper<LongWritable, Text, JSONWrapper, IntWritable> {
>         public void map(LongWritable key, Text value,
>                 OutputCollector<JSONWrapper, IntWritable> output,
>                 Reporter reporter) throws IOException {
>             JsonElement je = PARSER.parse(value.toString());
>             JSONWrapper jow = new JSONWrapper(value.toString());
>             IntWritable one = new IntWritable(1);
>             output.collect(jow, one);
>         }
>     }
>
>     public static class Reduce extends MapReduceBase implements
>             Reducer<JSONWrapper, IntWritable, JSONWrapper, IntWritable> {
>         public void reduce(JSONWrapper jow, Iterator<IntWritable> values,
>                 OutputCollector<JSONWrapper, IntWritable> output,
>                 Reporter reporter) throws IOException {
>             int sum = 0;
>             while (values.hasNext())
>                 sum += values.next().get();
>             output.collect(jow, new IntWritable(sum));
>         }
>     }
>
>     public static void main(String[] args) throws Exception {
>         JobConf conf = new JobConf(DupFilter.class);
>         conf.setJobName("dupfilter");
>         conf.setOutputKeyClass(JSONWrapper.class);
>         conf.setOutputValueClass(IntWritable.class);
>         conf.setMapperClass(Map.class);
>         conf.setReducerClass(Reduce.class);
>         conf.setInputFormat(TextInputFormat.class);
>         conf.setOutputFormat(TextOutputFormat.class);
>         FileInputFormat.setInputPaths(conf, new Path(args[0]));
>         FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>         JobClient.runJob(conf);
>     }
> }
>
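One more constraint a dedup job like this relies on: the default HashPartitioner routes keys by hashCode(), so two records that compareTo as equal must also hash identically, or they can land on different reducers and both survive deduplication. A sketch of a key wrapper that derives everything from one precomputed canonical (key-sorted) string (the `JsonKey` name and design are assumptions on my part, not the poster's code):

```java
// Hypothetical key wrapper: equals, hashCode, and compareTo are all
// derived from a single canonical, key-sorted serialization, so the
// three stay mutually consistent.
class JsonKey implements Comparable<JsonKey> {
    private final String canonical;  // e.g. {"debug":false,"ts":1}

    JsonKey(String canonical) {
        this.canonical = canonical;
    }

    @Override public int compareTo(JsonKey o) {
        return canonical.compareTo(o.canonical);
    }

    @Override public boolean equals(Object o) {
        return o instanceof JsonKey && canonical.equals(((JsonKey) o).canonical);
    }

    @Override public int hashCode() {
        return canonical.hashCode();
    }
}
```

A real Hadoop key would additionally implement Writable's write/readFields for serialization; that part is omitted here.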
> Thanks,
>
> Max Lebedev
>
>
>
> On Tue, Jun 4, 2013 at 10:58 PM, Rahul Bhattacharjee
> <[EMAIL PROTECTED]> wrote:
>
>     I agree with Shahab, you have to ensure that the keys are
>     WritableComparable and the values are Writable in order to be used in MR.