Re: Merging files
Edward Capriolo 2012-12-23, 15:30
https://github.com/edwardcapriolo/filecrush
^ Another option

On Sun, Dec 23, 2012 at 1:20 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:

> Thanks for the info. I was trying not to use nfs because my data size
> might be 10-20GB in size for every merge I perform. I'll use pig instead.
>
> In dstcp I checked and none of the directories are duplicate. Looking at
> the logs it looks like it's failing because all those directories have
> sub-directories of the same name.
>
> On Sat, Dec 22, 2012 at 2:05 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>> A pig script should work quite well.
>>
>> I also note that the file paths have maprfs in them. This implies that
>> you are using MapR and could simply use the normal linux command cat to
>> concatenate the files if you mount the files using NFS (depending on
>> volume, of course). For small amounts of data, this would work very well.
>> For large amounts of data, you would be better with some kind of
>> map-reduce program. Your Pig script is just the sort of thing.
>>
>> Keep in mind if you write a map-reduce program (or pig script) that you
>> will wind up with as many outputs as you have reducers. If you have only a
>> single reducer, you will get one output file, but that will mean that only
>> a single process will do all the writing. That would be no faster than
>> using the cat + NFS method above. Having multiple reducers will allow you
>> to have write parallelism.
>>
>> The error message that distcp is giving you is a little odd, however,
>> since it implies that some of your input files are repeated. Is that
>> possible?
>>
>> On Sat, Dec 22, 2012 at 12:53 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>>
>>> Tried distcp but it fails. Is there a way to merge them? Or else I could
>>> write a pig script to load from multiple paths
>>>
>>> org.apache.hadoop.tools.DistCp$DuplicationException: Invalid input,
>>> there are duplicated files in the sources:
>>> maprfs:/user/apuser/web-analytics/flume-output/2012/12/20/22/output/appinfo,
>>> maprfs:/user/apuser/web-analytics/flume-output/2012/12/21/00/output/appinfo
>>>
>>> at org.apache.hadoop.tools.DistCp.checkDuplication(DistCp.java:1419)
>>> at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1222)
>>> at org.apache.hadoop.tools.DistCp.copy(DistCp.java:675)
>>> at org.apache.hadoop.tools.DistCp.run(DistCp.java:910)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>> at org.apache.hadoop.tools.DistCp.main(DistCp.java:937)
>>>
>>> On Sat, Dec 22, 2012 at 11:24 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>>
>>>> The technical term for this is "copying". You may have heard of it.
>>>>
>>>> It is a subject of such long technical standing that many do not
>>>> consider it worthy of detailed documentation.
>>>>
>>>> Distcp effects a similar process and can be modified to combine the
>>>> input files into a single file.
>>>>
>>>> http://hadoop.apache.org/docs/r1.0.4/distcp.html
>>>>
>>>> On Sat, Dec 22, 2012 at 10:54 AM, Barak Yaish <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Can you please attach HOW-TO links for the alternatives you mentioned?
>>>>>
>>>>> On Sat, Dec 22, 2012 at 10:46 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Yes, via the simple act of opening a target stream and writing all
>>>>>> source streams into it. Or to save code time, an identity job with a
>>>>>> single reducer (you may not get control over ordering this way).
>>>>>>
>>>>>> On Sat, Dec 22, 2012 at 12:10 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>>>>>> > Is it possible to merge files from different locations from HDFS location
>>>>>> > into one file into HDFS location?
>>>>>>
>>>>>> --
>>>>>> Harsh J
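
For anyone landing on this thread later: a minimal sketch of what Harsh J describes (open one target stream and write every source stream into it). It is written in Python against the local filesystem, not HDFS, so it only illustrates the pattern; the helper name `merge_files` and its `chunk_size` parameter are made up for this example. On a cluster the same loop would have to run over streams from the Hadoop FileSystem API, or over an NFS mount as Ted Dunning suggests, and this sketch does not attempt either.

```python
import shutil
from pathlib import Path
from typing import Iterable


def merge_files(sources: Iterable[Path], target: Path, chunk_size: int = 1 << 20) -> int:
    """Concatenate `sources`, in the order given, into `target`.

    Hypothetical helper for illustration only. Returns the number of
    bytes written to `target`.
    """
    with open(target, "wb") as out:
        for src in sources:
            # Stream each source in fixed-size chunks so that a 10-20GB
            # input never has to fit in memory.
            with open(src, "rb") as inp:
                shutil.copyfileobj(inp, out, chunk_size)
        # Position of the write stream equals total bytes written.
        return out.tell()
```

The caller decides the order of `sources`, so output ordering is under control here, unlike the single-reducer identity job. The trade-off is the one already raised in the thread: a single process does all the writing, so there is no write parallelism.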