Re: Merging files
Barak Yaish 2012-12-22, 18:54

Can you please attach HOW-TO links for the alternatives you mentioned?

On Sat, Dec 22, 2012 at 10:46 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> Yes, via the simple act of opening a target stream and writing all
> source streams into it. Or to save code time, an identity job with a
> single reducer (you may not get control over ordering this way).
>
> On Sat, Dec 22, 2012 at 12:10 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > Is it possible to merge files from different locations from HDFS location
> > into one file into HDFS location?
>
> --
> Harsh J

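
For readers looking for the how-to, the first alternative Harsh J describes (open one target stream, write every source stream into it) can be sketched roughly as follows. This is only an illustration under assumptions: the function name `merge_files` and the `chunk_size` parameter are made up here, and the paths are treated as ordinary filesystem paths (for example a mounted view of the cluster). On HDFS itself the same pattern would go through the Hadoop FileSystem API rather than local file handles.

```python
import shutil
from pathlib import Path


def merge_files(sources, target, chunk_size=1024 * 1024):
    """Concatenate `sources`, in the order given, into `target`.

    Hypothetical helper: paths are assumed to be plain filesystem paths,
    not hdfs:// or maprfs:// URIs.
    """
    target = Path(target)
    # One target stream, opened once for the whole merge.
    with target.open("wb") as out:
        for src in sources:
            # Each source stream is copied in chunks, so files larger
            # than memory are fine.
            with Path(src).open("rb") as inp:
                shutil.copyfileobj(inp, out, chunk_size)
    return target
```

Because the caller passes the sources as an ordered list, this approach keeps control over ordering, which is the trade-off Harsh J mentions against the single-reducer job.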
Re: Merging files
Ted Dunning 2012-12-22, 19:24

The technical term for this is "copying". You may have heard of it.

It is a subject of such long technical standing that many do not consider it worthy of detailed documentation.

Distcp effects a similar process and can be modified to combine the input files into a single file.

http://hadoop.apache.org/docs/r1.0.4/distcp.html

Re: Merging files
Mohit Anchlia 2012-12-22, 20:53

Tried distcp but it fails. Is there a way to merge them? Or else I could write a pig script to load from multiple paths.

org.apache.hadoop.tools.DistCp$DuplicationException: Invalid input, there are duplicated files in the sources:
maprfs:/user/apuser/web-analytics/flume-output/2012/12/20/22/output/appinfo,
maprfs:/user/apuser/web-analytics/flume-output/2012/12/21/00/output/appinfo
    at org.apache.hadoop.tools.DistCp.checkDuplication(DistCp.java:1419)
    at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1222)
    at org.apache.hadoop.tools.DistCp.copy(DistCp.java:675)
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:910)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:937)

Re: Merging files
Ted Dunning 2012-12-22, 22:05

A pig script should work quite well.

I also note that the file paths have maprfs in them. This implies that you are using MapR and could simply use the normal Linux command cat to concatenate the files if you mount the files using NFS (depending on volume, of course). For small amounts of data, this would work very well. For large amounts of data, you would be better off with some kind of map-reduce program. Your Pig script is just the sort of thing.

Keep in mind if you write a map-reduce program (or pig script) that you will wind up with as many outputs as you have reducers. If you have only a single reducer, you will get one output file, but that will mean that only a single process will do all the writing. That would be no faster than using the cat + NFS method above. Having multiple reducers will allow you to have write parallelism.

The error message that distcp is giving you is a little odd, however, since it implies that some of your input files are repeated. Is that possible?

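
The point about reducers and output files can be made concrete with a toy model. The sketch below is not Hadoop code: `partition` is a hypothetical function, the CRC32-modulo assignment merely stands in for a hash partitioner, and the `part-r-NNNNN` names are used for illustration only. It shows that the number of output files tracks the number of reducers, and that a single reducer means one file written by one process.

```python
import zlib


def partition(records, num_reducers):
    """Toy model: map each (key, value) record to the output of one reducer.

    Every reducer gets an output entry, even if it receives no records.
    """
    outputs = {f"part-r-{i:05d}": [] for i in range(num_reducers)}
    for key, value in records:
        # Deterministic stand-in for a hash partitioner (an assumption).
        reducer = zlib.crc32(key.encode("utf-8")) % num_reducers
        outputs[f"part-r-{reducer:05d}"].append((key, value))
    return outputs


records = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
single = partition(records, 1)    # one output holding every record
parallel = partition(records, 4)  # four outputs, written independently
```

With one reducer the merge requirement is met directly, at the cost of serial writing; with several reducers the writes are parallel but a further concatenation step is needed to end up with a single file.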
Re: Merging files
Mohit Anchlia 2012-12-23, 06:20

Thanks for the info. I was trying not to use NFS because my data might be 10-20GB in size for every merge I perform. I'll use pig instead.

With distcp, I checked and none of the directories are duplicates. Looking at the logs, it looks like it's failing because all those directories have sub-directories of the same name.

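
Mohit's diagnosis suggests a quick pre-flight check before running distcp: see whether any two source paths end in the same name. The sketch below is a guess at the condition based on the error text and the observation above, not DistCp's actual implementation; `find_name_collisions` is a hypothetical helper.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_name_collisions(sources):
    """Return {base name: [sources]} for base names shared by several sources."""
    by_name = defaultdict(list)
    for src in sources:
        # The final path component is what would land in the target directory.
        by_name[PurePosixPath(src).name].append(src)
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}


sources = [
    "maprfs:/user/apuser/web-analytics/flume-output/2012/12/20/22/output/appinfo",
    "maprfs:/user/apuser/web-analytics/flume-output/2012/12/21/00/output/appinfo",
]
collisions = find_name_collisions(sources)
```

For the two paths in the error message, `collisions` has a single entry keyed by `appinfo`, which is consistent with the directories being distinct while their final sub-directory names are the same.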
Re: Merging files
Edward Capriolo 2012-12-23, 15:30

https://github.com/edwardcapriolo/filecrush

^ Another option
