Avro >> mail # user >> Implementing a total sort over avro data


Implementing a total sort over avro data
Hi,

I was wondering whether it is possible to implement a total sort using InputSampler.RandomSampler and TotalOrderPartitioner with Avro MapReduce. I tried adding the following lines to my job:

InputSampler.Sampler<AvroKey, AvroValue> sampler = new InputSampler.RandomSampler<AvroKey, AvroValue>(0.1, 10000, 10);
InputSampler.writePartitionFile(jobConf, sampler);
jobConf.setPartitionerClass(TotalOrderPartitioner.class);
DistributedCache.addCacheFile(new URI(TotalOrderPartitioner.getPartitionFile(jobConf)), jobConf);

But that just gives me:

12/08/15 17:23:05 INFO partition.InputSampler: Using 10000 samples
Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.avro.mapred.AvroWrapper
    at org.apache.avro.mapred.AvroKeyComparator.compare(AvroKeyComparator.java:30)
    at java.util.Arrays.mergeSort(Arrays.java:1270)
    at java.util.Arrays.mergeSort(Arrays.java:1281)
    at java.util.Arrays.mergeSort(Arrays.java:1281)
    at java.util.Arrays.mergeSort(Arrays.java:1281)
    at java.util.Arrays.mergeSort(Arrays.java:1281)
    at java.util.Arrays.mergeSort(Arrays.java:1281)
    at java.util.Arrays.mergeSort(Arrays.java:1281)
    at java.util.Arrays.mergeSort(Arrays.java:1281)
    at java.util.Arrays.mergeSort(Arrays.java:1281)
    at java.util.Arrays.mergeSort(Arrays.java:1281)
    at java.util.Arrays.mergeSort(Arrays.java:1281)
    at java.util.Arrays.mergeSort(Arrays.java:1281)
    at java.util.Arrays.sort(Arrays.java:1210)
    at org.apache.hadoop.mapreduce.lib.partition.InputSampler.writePartitionFile(InputSampler.java:324)
    at org.apache.hadoop.mapred.lib.InputSampler.writePartitionFile(InputSampler.java:39)
    at com.compete.avro.ParallelDataPull.run(ParallelDataPull.java:223)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at com.compete.avro.ParallelDataPull.main(ParallelDataPull.java:55)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

-Steven Willis
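
For readers unfamiliar with the technique the question is about, here is a minimal, Hadoop-free sketch of how sample-based total-order partitioning works in general: sort a sample of keys, take evenly spaced split points, then route each key to a partition by binary search so that partition i holds only keys smaller than those in partition i+1. The class TotalOrderSketch and its methods are hypothetical names invented for this illustration; plain long keys stand in for Avro keys, and this is not the code of InputSampler or TotalOrderPartitioner.

```java
import java.util.Arrays;

// Illustrative stand-in only; none of these names exist in Hadoop or Avro.
final class TotalOrderSketch {

    // Choose (numPartitions - 1) evenly spaced split points from a key sample.
    static long[] splitPoints(long[] sample, int numPartitions) {
        long[] sorted = sample.clone();
        Arrays.sort(sorted);
        long[] splits = new long[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            int index = (int) ((long) i * sorted.length / numPartitions);
            splits[i - 1] = sorted[index];
        }
        return splits;
    }

    // Keys below splits[0] go to partition 0, keys in [splits[0], splits[1])
    // go to partition 1, and so on. Assumes the split points are distinct.
    static int partitionFor(long key, long[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        // Found: the key equals a split point and starts the next partition.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        long[] sample = {42, 7, 19, 88, 3, 61, 25, 70, 54};
        long[] splits = splitPoints(sample, 3);
        System.out.println("split points: " + Arrays.toString(splits));
        for (long key : new long[] {10, 25, 60, 61, 100}) {
            System.out.println(key + " -> partition " + partitionFor(key, splits));
        }
    }
}
```

The point relevant to the trace above is that the sampled keys must be sortable with the same comparator the job uses for its map output keys; the sketch sidesteps that by using primitive longs throughout.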
Harsh J 2012-08-23, 14:23