Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
Alejandro Abdelnur 2013-02-08, 19:06
Tony, I think the first step would be to verify whether the S3 filesystem
implementation's rename works as expected.

Thx

On Fri, Feb 1, 2013 at 7:12 AM, Tony Burton <[EMAIL PROTECTED]> wrote:

> Thanks for the reply Alejandro. Using a temp output directory was my first
> guess as well. What’s the best way to proceed? I’ve come across
> FileSystem.rename but it’s consistently returning false for whatever Paths
> I provide. Specifically, I need to copy the following:
>
> s3://<path to data>/<tmp folder>/<object type 1>/part-00000
> …
> s3://<path to data>/<tmp folder>/<object type 1>/part-nnnnn
> s3://<path to data>/<tmp folder>/<object type 2>/part-00000
> …
> s3://<path to data>/<tmp folder>/<object type 2>/part-nnnnn
> …
> s3://<path to data>/<tmp folder>/<object type m>/part-nnnnn
>
> to
>
> s3://<path to data>/<object type 1>/part-00000
> …
> s3://<path to data>/<object type 1>/part-nnnnn
> s3://<path to data>/<object type 2>/part-00000
> …
> s3://<path to data>/<object type 2>/part-nnnnn
> …
> s3://<path to data>/<object type m>/part-nnnnn
>
> without doing a copyToLocal.
>
> Any tips? Are there any better alternatives to FileSystem.rename? Or would
> using the AWS Java SDK be a better solution?
>
> Thanks!
>
> Tony
>
> From: Alejandro Abdelnur [mailto:[EMAIL PROTECTED]]
> Sent: 31 January 2013 18:45
> To: [EMAIL PROTECTED]
> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>
> Hi Tony, from what I understand your problem is not with MTOF but with you
> wanting to run 2 jobs using the same output directory; the second job will
> fail because the output dir already exists.
> My take would be tweaking your jobs to use a temp output dir, and moving
> them to the required (final) location upon completion.
>
> thx
>
> On Thu, Jan 31, 2013 at 8:22 AM, Tony Burton <[EMAIL PROTECTED]> wrote:
>
> Hi everyone,
>
> Some of you might recall this topic, which I worked on with the list's
> help back in August last year - see email trail below. Despite initial
> success of the discovery, I had to shelve the approach as I ended up using
> a different solution (for reasons I forget!) with the implementation that
> was ultimately used for that particular project.
>
> I'm now in a position to be working on a similar new task, where I've
> successfully implemented the combination of LazyOutputFormat and
> MultipleOutputs using hadoop 1.0.3 to write out to multiple custom output
> locations. However, I've hit another snag which I'm hoping you might help
> me work through.
>
> I'm going to be running daily tasks to extract data from XML files
> (specifically, the data stored in certain nodes of the XML), stored on AWS
> S3 using object names with the following format:
>
> s3://inputbucket/data/2013/1/13/<list of xml data files.bz2>
>
> I want to extract items from the XML and write out as follows:
>
> s3://outputbucket/path/<xml node name>/20130113/<output from MR job>
>
> For one day of data, this works fine. I pass in s3://inputbucket/data and
> s3://outputbucket/path as input and output arguments, along with my run
> date (20130113) which gets manipulated and appended where appropriate to
> form the precise read and write locations, for example
>
> FileInputFormat.setInputPaths(job, "s3://inputbucket/data");
> FileOutputFormat.setOutputPath(job, new Path("s3://outputbucket/path"));
>
> Then MultipleOutputs adds on my XML node names underneath
> s3://outputbucket/path automatically.
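For readers following along, the run-date manipulation described above might look roughly like the sketch below. It is not Tony's code: the class name RunDatePaths, the method names, and the node name "nodeA" are hypothetical, and it assumes Java 8+ for java.time (newer than the JVMs typical for hadoop 1.0.3). It only builds the path strings and makes no Hadoop calls.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: derives the daily read and write locations from a
// run date given as yyyyMMdd (e.g. 20130113), following the layouts quoted
// in the message above.
class RunDatePaths {
    // BASIC_ISO_DATE parses dates written as yyyyMMdd.
    private static final DateTimeFormatter RUN_DATE = DateTimeFormatter.BASIC_ISO_DATE;

    // Input layout: <inputRoot>/<year>/<month>/<day>, month and day unpadded.
    static String inputPath(String inputRoot, String runDate) {
        LocalDate d = LocalDate.parse(runDate, RUN_DATE);
        return inputRoot + "/" + d.getYear() + "/" + d.getMonthValue() + "/" + d.getDayOfMonth();
    }

    // Output layout: <outputRoot>/<xml node name>/<yyyyMMdd>.
    static String outputPath(String outputRoot, String nodeName, String runDate) {
        LocalDate.parse(runDate, RUN_DATE); // throws if the run date is malformed
        return outputRoot + "/" + nodeName + "/" + runDate;
    }

    public static void main(String[] args) {
        // prints s3://inputbucket/data/2013/1/13
        System.out.println(inputPath("s3://inputbucket/data", "20130113"));
        // prints s3://outputbucket/path/nodeA/20130113
        System.out.println(outputPath("s3://outputbucket/path", "nodeA", "20130113"));
    }
}
```

Keeping the path derivation in plain string functions like this makes it easy to check the daily locations without touching S3 or submitting a job.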
>
> However, for the next day's run, the job gets to
> FileOutputFormat.setOutputPath and sees that the output path
> (s3://outputbucket/path) already exists, and throws a
> FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs() - even