Re: Mapreduce using JSONObjects
A side point for Hadoop experts: a comparator is used for sorting in the
shuffle. If a comparator always returns -1 for unequal objects, then sorting
will take longer than it should, because a certain number of items will be
compared more than once.

Is this true?
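
For what it's worth, a compareTo that returns -1 for every unequal pair also
violates the Comparable contract, which requires that sgn(x.compareTo(y)) ==
-sgn(y.compareTo(x)). That makes the result worse than slow: the sort order
itself becomes undefined, so equal keys are not guaranteed to end up next to
each other. A two-line illustration, assuming the JSONWrapper from the quoted
code below (and that its toString() returns the raw JSON):

    JSONWrapper a = new JSONWrapper("{\"k\":1}");
    JSONWrapper b = new JSONWrapper("{\"k\":2}");
    a.compareTo(b); // -1: a sorts before b
    b.compareTo(a); // also -1, but the contract requires +1 here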

On 06/05/2013 04:10 PM, Max Lebedev wrote:
>
> I've taken your advice and made a wrapper class which implements
> WritableComparable. Thank you very much for your help. I believe
> everything is working fine on that front. I used Google's Gson for the
> comparison.
>
>
> public int compareTo(Object o) {
>     JsonElement o1 = PARSER.parse(this.json.toString());
>     JsonElement o2 = PARSER.parse(o.toString());
>     if (o2.equals(o1))
>         return 0;
>     else
>         return -1;
> }
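
A consistent alternative, sketched under two assumptions (the wrapped values
are flat JSON objects like the log lines below, and toString() returns the
raw JSON): render each object with its keys sorted and compare the canonical
strings, so equal objects compare as 0 and unequal ones get a stable,
symmetric ordering:

    public int compareTo(Object o) {
        return canonical(this.json.toString())
                .compareTo(canonical(o.toString()));
    }

    // Render a flat JSON object with its keys sorted so that field order no
    // longer affects the comparison. Types are fully qualified because the
    // inner mapper class below is also named Map.
    private static String canonical(String json) {
        com.google.gson.JsonObject obj = PARSER.parse(json).getAsJsonObject();
        java.util.TreeMap<String, com.google.gson.JsonElement> sorted =
                new java.util.TreeMap<String, com.google.gson.JsonElement>();
        for (java.util.Map.Entry<String, com.google.gson.JsonElement> e : obj.entrySet())
            sorted.put(e.getKey(), e.getValue());
        return sorted.toString();
    }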
>
>
> The problem I have now is that only consecutive duplicates are
> detected. Given 6 lines:
>
> {"ts":1368758947.291035,"isSecure":true,"version":2,"source":"sdk","debug":false}
>
> {"ts":1368758947.291035,"version":2,"source":"sdk","isSecure":true,"debug":false}
>
> {"ts":1368758947.291035,"version":2,"source":"sdk","isSecure":true,"debug":true}
>
> {"ts":1368758947.291035,"isSecure":false,"version":2,"source":"sdk","debug":false}
>
> {"ts":1368758947.291035,
> "source":"sdk","isSecure":false,"version":2,"debug":false}
>
> {"ts":1368758947.291035,"isSecure":true,"version":2,"source":"sdk","debug":false}
>
>
> I get back 1, 3, 4, and 6. I should be getting 1, 3 and 4 as 6 is
> exactly equal to 1. If I switch 5 and 6, the original line 5 is no
> longer filtered (I get 1,3,4,5,6). I've noticed that the compareTo
> method is called a total of 13 times. I assume that in order for all 6
> of the keys to be compared, 15 comparisons need to be made. Am I
> missing something here? I've tested the compareTo manually and line 1
> and 6 are interpreted as equal. My MapReduce code currently looks
> like this:
>
>
> class DupFilter {
>
>     private static final Gson GSON = new Gson();
>     private static final JsonParser PARSER = new JsonParser();
>
>     public static class Map extends MapReduceBase
>             implements Mapper<LongWritable, Text, JSONWrapper, IntWritable> {
>         public void map(LongWritable key, Text value,
>                 OutputCollector<JSONWrapper, IntWritable> output,
>                 Reporter reporter) throws IOException {
>             JsonElement je = PARSER.parse(value.toString());
>             JSONWrapper jow = new JSONWrapper(value.toString());
>             IntWritable one = new IntWritable(1);
>             output.collect(jow, one);
>         }
>     }
>
>     public static class Reduce extends MapReduceBase
>             implements Reducer<JSONWrapper, IntWritable, JSONWrapper, IntWritable> {
>         public void reduce(JSONWrapper jow, Iterator<IntWritable> values,
>                 OutputCollector<JSONWrapper, IntWritable> output,
>                 Reporter reporter) throws IOException {
>             int sum = 0;
>             while (values.hasNext())
>                 sum += values.next().get();
>             output.collect(jow, new IntWritable(sum));
>         }
>     }
>
>     public static void main(String[] args) throws Exception {
>         JobConf conf = new JobConf(DupFilter.class);
>         conf.setJobName("dupfilter");
>         conf.setOutputKeyClass(JSONWrapper.class);
>         conf.setOutputValueClass(IntWritable.class);
>         conf.setMapperClass(Map.class);
>         conf.setReducerClass(Reduce.class);
>         conf.setInputFormat(TextInputFormat.class);
>         conf.setOutputFormat(TextOutputFormat.class);
>         FileInputFormat.setInputPaths(conf, new Path(args[0]));
>         FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>         JobClient.runJob(conf);
>     }
> }
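
For completeness, a sketch of what the JSONWrapper key class could look like
(the original was not posted, so this is a hypothetical reconstruction). One
thing worth checking: with the default HashPartitioner and more than one
reducer, equal keys must also produce equal hashCode() values, so the hash
should be computed on the same canonical form as the comparison:

    // Assumes the canonical() helper sketched above, plus imports of
    // java.io.DataInput, java.io.DataOutput, java.io.IOException and
    // org.apache.hadoop.io.Text.
    public static class JSONWrapper implements WritableComparable<JSONWrapper> {
        private Text json = new Text();

        public JSONWrapper() {}                       // Hadoop needs a no-arg constructor
        public JSONWrapper(String s) { json.set(s); }

        public void write(DataOutput out) throws IOException { json.write(out); }
        public void readFields(DataInput in) throws IOException { json.readFields(in); }

        public String toString() { return json.toString(); }

        public int compareTo(JSONWrapper o) {
            return canonical(toString()).compareTo(canonical(o.toString()));
        }

        public boolean equals(Object o) {
            return o instanceof JSONWrapper && compareTo((JSONWrapper) o) == 0;
        }

        public int hashCode() { return canonical(toString()).hashCode(); }
    }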
>
> Thanks,
>
> Max Lebedev
>
>
>
> On Tue, Jun 4, 2013 at 10:58 PM, Rahul Bhattacharjee
> <[EMAIL PROTECTED]> wrote:
>
>     I agree with Shahab: you have to ensure that the keys are
>     WritableComparable and the values are Writable in order to be used in MR.
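
On the comparison count: a comparison sort of six keys needs only around
10-13 comparisons, not all 15 pairs, so the 13 calls are expected; most pairs
are never compared directly, and equality is only discovered when the broken
ordering happens to leave duplicates adjacent. A standalone sketch of the
symptom, with hypothetical string keys standing in for the JSON lines:

    import java.util.Arrays;
    import java.util.Comparator;

    public class BrokenSortDemo {
        public static void main(String[] args) {
            String[] keys = {"A", "B", "C", "D", "E", "A"}; // first == last
            // Same shape as the compareTo above: 0 on equality, -1 otherwise.
            Arrays.sort(keys, new Comparator<String>() {
                public int compare(String x, String y) {
                    return x.equals(y) ? 0 : -1;
                }
            });
            // With a contract-violating comparator the order is undefined;
            // the two "A"s are not guaranteed to land next to each other, so
            // a grouping pass over the "sorted" output treats them as
            // distinct keys.
            System.out.println(Arrays.toString(keys));
        }
    }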