-
Avro Map Reduce Question: GenericRecord, renaming reduce output
snikhil0 2012-06-08, 18:49
My problem: I have an input file in Avro format, but its datums are shuffled (think IDs in mixed order). I need to sort them by a field from the schema (id) and run a mux-demux/shuffle-sort.

My mapper reads from the Avro file (GenericRecord) and outputs a key (id) and a value (GenericRecord).
My reducer gets the list of values for each key (id) and writes just the GenericRecords to a file (part-r-00000).

My expectation is that I can use the same input schema to read the output file. But alas, this is not working. In part-r-00000 I have:

    0<tab>Obj<Avroschema>....datums......

Why is this?

Also, how can I rename the reduce output file to something other than part-r-0000*?

Some snippets of code:
===============
public void map(GenericData.Record datum,
    AvroCollector<Pair<LogKeyWritable, GenericData.Record>> collector,
    Reporter reporter) throws IOException {
  long tstamp = ((Long) datum.get("timestamp")).longValue();
  String keyPath = CollectorUtils.getKeyHour(tstamp, ((String) datum.get("appid")));
  LogKeyWritable key = new LogKeyWritable(keyPath, tstamp);
  Pair<LogKeyWritable, GenericData.Record> pair =
      new Pair<LogKeyWritable, GenericData.Record>(key, datum);
  collector.collect(pair);
}

public void reduce(LogKeyWritable key, Iterable<GenericData.Record> values,
    AvroCollector<GenericData.Record> collector,
    Reporter reporter) throws IOException {
  for (GenericData.Record r : values) {
    collector.collect(r);
  }
}

My job setup:
========
AvroJob.setInputSchema(jobConf, AVRO_SCHEMA);
AvroJob.setOutputSchema(jobConf, AVRO_SCHEMA);

CAN SOMEONE PLEASE HELP!

Nikhil

--
View this message in context: http://apache-avro.679487.n3.nabble.com/Avro-Map-Reduce-Question-GenericRecord-renaming-reduce-output-tp4025105.html
Sent from the Avro - Users mailing list archive at Nabble.com.
-
Re: Avro Map Reduce Question: GenericRecord, renaming reduce output
Doug Cutting 2012-06-08, 19:17
On Fri, Jun 8, 2012 at 11:49 AM, snikhil0 <[EMAIL PROTECTED]> wrote:
> My expectation is that I can use the same input schema to read the output file. But alas this is not working.
> In the part-r-00000 I have a 0<tab>Obj<Avroschema>....datums...... Why is this?
That looks approximately like an Avro data file. How is it not what you expect?
> Also how can rename the reduce output file to something other than part-r-0000*?
That's the standard name for Hadoop mapreduce output files. You could override it in the OutputFormat, but most folks do not. The name of the directory these are in is normally used to identify the result set. The files within the directory are just fragments of that result set.
Doug
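For readers who still want differently named files, here is a minimal sketch of a different technique from the OutputFormat override mentioned above: renaming the part-* files after the job has finished. It assumes the output directory is on the local filesystem and uses only java.nio. The class name RenamePartFiles, the method renameParts, and the "muxdemux" prefix are hypothetical names chosen for illustration. On HDFS the same idea would go through Hadoop's FileSystem.rename instead.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class RenamePartFiles {

    /**
     * Renames every file matching part-* in outputDir to
     * prefix-NNNNN.avro, where NNNNN is the fragment index taken
     * from the original name (for example part-r-00000 -> 00000).
     */
    static List<Path> renameParts(Path outputDir, String prefix) throws IOException {
        // Collect the matches first so the directory is not modified while it is being listed.
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }

        List<Path> renamed = new ArrayList<>();
        for (Path part : parts) {
            String name = part.getFileName().toString();
            String index = name.substring(name.lastIndexOf('-') + 1);
            Path target = part.resolveSibling(prefix + "-" + index + ".avro");
            renamed.add(Files.move(part, target));
        }
        return renamed;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a finished job's output directory with two fragments.
        Path dir = Files.createTempDirectory("muxdemux-out");
        Files.createFile(dir.resolve("part-r-00000"));
        Files.createFile(dir.resolve("part-r-00001"));
        for (Path p : renameParts(dir, "muxdemux")) {
            System.out.println(p.getFileName());
        }
    }
}
```

Keeping the fragment index in the new name preserves the point made above: the files are still just fragments of one result set identified by the directory.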
-
Re: Avro Map Reduce Question: GenericRecord, renaming reduce output
Shirahatti, Nikhil 2012-06-08, 20:35
The reason is: when I try to read the file using a GenericDatumReader, I get the error "Not a data file."

Code snippet:
--------------
DatumReader<GenericData.Record> reader =
    new GenericDatumReader<GenericData.Record>(AVRO_SCHEMA);

String MUXDEMUX_FILE = outpath.concat("part-r-00000");
InputStream in = new BufferedInputStream(new FileInputStream(MUXDEMUX_FILE));
DataFileStream<GenericData.Record> records =
    new DataFileStream<GenericData.Record>(in, reader);
for (GenericData.Record r : records) {
  System.out.println(r.toString());
}
Nikhil
-
Re: Avro Map Reduce Question: GenericRecord, renaming reduce output
Shirahatti, Nikhil 2012-06-08, 21:04
The magic number check is failing, so the top of the file has some junk in it?

    if (!Arrays.equals(DataFileConstants.MAGIC, magic))
      throw new IOException("Not a data file.");

I checked the input file (verified by a read operation), which has the same schema. It starts with:

    Obj^A^B^Vavro.schema<E0>^D

Whereas the reduce output file has 0<tab> before that header:

    0<tab>Obj^A^B^Vavro.schema<E0>^D

This was what I did not expect. Maybe my previous email was unclear.
Thanks, Nikhil
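The failing check can be reproduced without Hadoop or Avro on the classpath. The sketch below is a stand-alone illustration, not Avro's own code: it hard-codes the four magic bytes that open an Avro data file ('O', 'b', 'j', 1) and shows that a stream with a "0<tab>" prefix fails the check even though the header is present two bytes later. The class AvroMagicCheck and its methods are hypothetical names, and the sample byte strings are made up for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class AvroMagicCheck {

    // An Avro object container file begins with the bytes 'O', 'b', 'j', 1.
    static final byte[] MAGIC = {(byte) 'O', (byte) 'b', (byte) 'j', (byte) 1};

    /** Returns true if the first four bytes of the stream equal the Avro magic. */
    static boolean startsWithMagic(InputStream in) throws IOException {
        byte[] head = new byte[MAGIC.length];
        int filled = 0;
        while (filled < head.length) {
            int n = in.read(head, filled, head.length - filled);
            if (n < 0) {
                return false; // stream ended before four bytes were read
            }
            filled += n;
        }
        return Arrays.equals(MAGIC, head);
    }

    /** Returns the offset of the first occurrence of the magic in data, or -1 if absent. */
    static int magicOffset(byte[] data) {
        for (int i = 0; i + MAGIC.length <= data.length; i++) {
            if (Arrays.equals(MAGIC, Arrays.copyOfRange(data, i, i + MAGIC.length))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        byte[] good = "Obj\u0001rest-of-header".getBytes(StandardCharsets.ISO_8859_1);
        byte[] bad = "0\tObj\u0001rest-of-header".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(startsWithMagic(new ByteArrayInputStream(good))); // true
        System.out.println(startsWithMagic(new ByteArrayInputStream(bad)));  // false
        System.out.println(magicOffset(bad));                                // 2
    }
}
```

A non-zero offset from magicOffset on a real part file would suggest that something wrote a key and a tab ahead of the Avro header, which matches the symptom described above.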
-
Re: Avro Map Reduce Question: GenericRecord, renaming reduce output
Doug Cutting 2012-06-08, 21:22
On Fri, Jun 8, 2012 at 2:04 PM, Shirahatti, Nikhil <[EMAIL PROTECTED]> wrote:
> Whereas the reduce output file: has the 0<tab> before the

It sounds like something is writing to the file before AvroOutputFormat.

Can you provide a complete example that illustrates this? E.g., like those in the unit tests?

http://svn.apache.org/repos/asf/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapred/

Thanks,
Doug
-
Re: Avro Map Reduce Question: GenericRecord, renaming reduce output
Shirahatti, Nikhil 2012-06-08, 22:46
Hello,

The code is checked in here: https://github.com/snikhil0/avro-mr

The test class is: MuxDemuxRunnableTest

Nikhil
-
Re: Avro Map Reduce Question: GenericRecord, renaming reduce output
tazan007 2012-06-09, 03:50
Looks like the output format probably isn't being set right; the output looks like it came from TextOutputFormat. You need to set the properties on the Job, not on the JobConf you created. When you create the Job and pass in the JobConf, the Job makes its own copy of the JobConf. So any properties you set on your JobConf after creating the Job are not reflected in the Job's configuration.

-Hiral
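The copy-on-construct behaviour described above can be illustrated without Hadoop. The sketch below is a toy model only: SnapshotJob is a hypothetical stand-in for a job object that snapshots its configuration in the constructor, java.util.Properties stands in for the JobConf, and the key "demo.output.format" and the default value are invented for the demonstration. It shows why a setting applied to the original configuration after construction never reaches the job.

```java
import java.util.Properties;

final class ConfCopyDemo {

    // Hypothetical configuration key, used only in this demonstration.
    static final String OUTPUT_FORMAT_KEY = "demo.output.format";

    /** Toy stand-in for a job that copies its configuration when constructed. */
    static final class SnapshotJob {
        private final Properties conf = new Properties();

        SnapshotJob(Properties source) {
            conf.putAll(source); // snapshot: later changes to source are not seen
        }

        String outputFormat() {
            // Falls back to a text format when nothing was configured.
            return conf.getProperty(OUTPUT_FORMAT_KEY, "TextOutputFormat");
        }
    }

    /** Sets the property after the job exists, so the job never sees it. */
    static String configureAfter() {
        Properties jobConf = new Properties();
        SnapshotJob job = new SnapshotJob(jobConf);
        jobConf.setProperty(OUTPUT_FORMAT_KEY, "AvroOutputFormat");
        return job.outputFormat();
    }

    /** Sets the property first, so the snapshot includes it. */
    static String configureBefore() {
        Properties jobConf = new Properties();
        jobConf.setProperty(OUTPUT_FORMAT_KEY, "AvroOutputFormat");
        SnapshotJob job = new SnapshotJob(jobConf);
        return job.outputFormat();
    }

    public static void main(String[] args) {
        System.out.println("set after construction:  " + configureAfter());  // TextOutputFormat
        System.out.println("set before construction: " + configureBefore()); // AvroOutputFormat
    }
}
```

In the real job the equivalent choice is either to finish all configuration before the Job is constructed, or to apply settings through the Job's own configuration afterwards.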
-
Re: Avro Map Reduce Question: GenericRecord, renaming reduce output
Doug Cutting 2012-06-08, 23:05
There's no Ant or Maven build file. What command line should one use to run the test?

Doug
-
Re: Avro Map Reduce Question: GenericRecord, renaming reduce output
Shirahatti, Nikhil 2012-06-12, 00:22
Sorry for the delay. I am still having the problem.

I added the Ant file: run "ant test" (https://github.com/snikhil0/avro-mr).
It creates the output file under: /logshed/test/<timebased>/part-r-00000

It's not completely kosher code: each time you invoke the test, delete your previous output (/logshed/test/<time-based>) first.

Thanks,
Nikhil
-
Re: Avro Map Reduce Question: GenericRecord, renaming reduce output
Doug Cutting 2012-06-12, 17:05
When I do 'git clone https://github.com/snikhil0/avro-mr.git; cd avro-mr; ant test', I see:

  [junit] Running com.telenav.logshed.collector.muxdemux.MuxDemuxRunnableTest
  [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 12.637 sec

BUILD SUCCESSFUL

Also, Hiral suggested above that your problem is in MuxDemuxJob.java, where you set properties on the JobConf after creating the Job. The AvroJob methods should instead be called before the Job is constructed.

Doug
-
Re: Avro Map Reduce Question: GenericRecord, renaming reduce output
Shirahatti, Nikhil 2012-06-12, 17:24
That's right. The JUnit test did not do any asserts on the output file. I've checked it in, so please try again. However, if you try to open the file in /logshed you'll probably see what I'm talking about.

I also tried calling the AvroJob setters before instantiating the Job, but I got the same error.

Snippet:
JobConf jobConf = new JobConf(LogshedCollectorUtils.getLocalHadoopConfiguartion());

AvroJob.setInputSchema(jobConf, IN_SCHEMA);
AvroJob.setOutputSchema(jobConf, OUT_SCHEMA);

AvroJob.setMapperClass(jobConf, LogshedMapper.class);
AvroJob.setReducerClass(jobConf, LogshedReducer.class);

Job job = new Job(jobConf, "muxdemux_job");

FileInputFormat.setInputPaths(job, new Path(args[0]));
Path outPath = new Path(args[1]);
FileOutputFormat.setOutputPath(job, outPath);
job.setJarByClass(MuxDemuxJob.class);

Thanks,
Nikhil
-
Re: Avro Map Reduce Question: GenericRecord, renaming reduce output
Shirahatti, Nikhil 2012-06-12, 17:41
Another thing: when I call the AvroJob setters before job instantiation, I basically get no reduce output file?

Nikhil
-
Re: Avro Map Reduce Question: GenericRecord, renaming reduce output
snikhil0 2012-06-13, 07:26
OK, this thing looks like a map-reduce API issue: I went back to the old-style map-reduce API, and now I get a good Avro header but no datums. Sheesh! Can someone please help!

The main function:

final static Schema IN_SCHEMA = LogshedCollectorUtils.getResourceSchema();
final static Schema OUT_SCHEMA = LogshedCollectorUtils.getResourceSchema();
final static ReflectData reflectData = ReflectData.get();
final static Schema KEY_SCHEMA = reflectData.getSchema(LogKeyWritable.class);
final static Schema MAP_OUT_SCHEMA = Pair.getPairSchema(KEY_SCHEMA, OUT_SCHEMA);

Configuration conf = LogshedCollectorUtils.getLocalHadoopConfiguartion();
JobConf jobConf = new JobConf(LogshedCollectorUtils.getLocalHadoopConfiguartion(), MuxDemuxJob.class);
jobConf.setJobName("muxdemux");
jobConf.setJarByClass(MuxDemuxJob.class);

jobConf.setInputFormat(AvroInputFormat.class);
jobConf.setOutputFormat(AvroOutputFormat.class);

AvroJob.setInputSchema(jobConf, IN_SCHEMA);
AvroJob.setMapOutputSchema(jobConf, MAP_OUT_SCHEMA);
AvroJob.setOutputSchema(jobConf, OUT_SCHEMA);

AvroJob.setMapperClass(jobConf, LogshedMapper.class);
AvroJob.setReducerClass(jobConf, LogshedReducer.class);

//Job job = new Job(jobConf, "muxdemux");

FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
Path outPath = new Path(args[1]);
FileOutputFormat.setOutputPath(jobConf, outPath);

JobClient.runJob(jobConf);
return 0;

Nikhil

--
View this message in context: http://apache-avro.679487.n3.nabble.com/Avro-Map-Reduce-Question-GenericRecord-renaming-reduce-output-tp4025105p4025126.html
Sent from the Avro - Users mailing list archive at Nabble.com.