Re: Merging files
Mohit Anchlia 2012-12-23, 06:20
Thanks for the info. I was trying not to use NFS because my data might be
10-20GB for every merge I perform. I'll use Pig instead.

As for distcp, I checked and none of the directories are duplicates. Looking
at the logs, it looks like it's failing because all those directories have
sub-directories of the same name.

On Sat, Dec 22, 2012 at 2:05 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> A pig script should work quite well.
>
> I also note that the file paths have maprfs in them. This implies that
> you are using MapR and could simply use the normal linux command cat to
> concatenate the files if you mount the files using NFS (depending on
> volume, of course). For small amounts of data, this would work very well.
> For large amounts of data, you would be better with some kind of
> map-reduce program. Your Pig script is just the sort of thing.
>
> Keep in mind if you write a map-reduce program (or pig script) that you
> will wind up with as many outputs as you have reducers. If you have only a
> single reducer, you will get one output file, but that will mean that only
> a single process will do all the writing. That would be no faster than
> using the cat + NFS method above. Having multiple reducers will allow you
> to have write parallelism.
>
> The error message that distcp is giving you is a little odd, however,
> since it implies that some of your input files are repeated. Is that
> possible?
>
> On Sat, Dec 22, 2012 at 12:53 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>
>> Tried distcp but it fails. Is there a way to merge them? Or else I could
>> write a pig script to load from multiple paths
>>
>> org.apache.hadoop.tools.DistCp$DuplicationException: Invalid input, there
>> are duplicated files in the sources:
>> maprfs:/user/apuser/web-analytics/flume-output/2012/12/20/22/output/appinfo,
>> maprfs:/user/apuser/web-analytics/flume-output/2012/12/21/00/output/appinfo
>>
>> at org.apache.hadoop.tools.DistCp.checkDuplication(DistCp.java:1419)
>> at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1222)
>> at org.apache.hadoop.tools.DistCp.copy(DistCp.java:675)
>> at org.apache.hadoop.tools.DistCp.run(DistCp.java:910)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>> at org.apache.hadoop.tools.DistCp.main(DistCp.java:937)
>>
>> On Sat, Dec 22, 2012 at 11:24 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>
>>> The technical term for this is "copying". You may have heard of it.
>>>
>>> It is a subject of such long technical standing that many do not
>>> consider it worthy of detailed documentation.
>>>
>>> Distcp effects a similar process and can be modified to combine the
>>> input files into a single file.
>>>
>>> http://hadoop.apache.org/docs/r1.0.4/distcp.html
>>>
>>> On Sat, Dec 22, 2012 at 10:54 AM, Barak Yaish <[EMAIL PROTECTED]> wrote:
>>>
>>>> Can you please attach HOW-TO links for the alternatives you mentioned?
>>>>
>>>> On Sat, Dec 22, 2012 at 10:46 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Yes, via the simple act of opening a target stream and writing all
>>>>> source streams into it. Or to save code time, an identity job with a
>>>>> single reducer (you may not get control over ordering this way).
>>>>>
>>>>> On Sat, Dec 22, 2012 at 12:10 PM, Mohit Anchlia <[EMAIL PROTECTED]>
>>>>> wrote:
>>>>> > Is it possible to merge files from different locations from HDFS
>>>>> > location into one file into HDFS location?
>>>>>
>>>>> --
>>>>> Harsh J
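
For anyone finding this thread later: Harsh J's suggestion quoted above (open one target stream and write every source stream into it) can be sketched roughly as follows. This is only a minimal sketch, assuming the files are reachable as ordinary filesystem paths, for example through the NFS mount Ted Dunning mentions. The name merge_files and its chunk_size parameter are hypothetical, not part of Hadoop or MapR. Like cat over NFS, it has a single writer, so it gives no write parallelism.

```python
import shutil
from pathlib import Path


def merge_files(sources, target, chunk_size=1024 * 1024):
    """Concatenate the source files, in the order given, into target.

    Hypothetical helper: assumes every path is readable through the local
    filesystem (for example an NFS mount of the cluster). Returns the
    number of bytes written to target.
    """
    with Path(target).open("wb") as out:
        for src in sources:
            with Path(src).open("rb") as inp:
                # Copy in fixed-size chunks so multi-GB inputs never
                # have to fit in memory at once.
                shutil.copyfileobj(inp, out, chunk_size)
        # tell() on the buffered writer includes bytes not yet flushed.
        return out.tell()
```

The output order is simply the order of the sources list, so sorting that list before the call fixes the ordering, which is the control Harsh J notes you may not get from an identity job with a single reducer.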