

Thread:
  Barak Yaish      2012-12-22, 18:54
  Ted Dunning      2012-12-22, 19:24
  Mohit Anchlia    2012-12-22, 20:53
  Ted Dunning      2012-12-22, 22:05
  Mohit Anchlia    2012-12-23, 06:20
Re: Merging files
https://github.com/edwardcapriolo/filecrush

^ Another option

On Sun, Dec 23, 2012 at 1:20 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:

> Thanks for the info. I was trying not to use NFS because my data might be
> 10-20 GB for every merge I perform. I'll use Pig instead.
>
> In distcp I checked and none of the directories are duplicates. Looking at
> the logs, it looks like it's failing because all those directories have
> sub-directories of the same name.
>
> On Sat, Dec 22, 2012 at 2:05 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>> A pig script should work quite well.
>>
>> I also note that the file paths have maprfs in them.  This implies that
>> you are using MapR and could simply use the normal linux command cat to
>> concatenate the files if you mount the files using NFS (depending on
>> volume, of course).  For small amounts of data, this would work very well.
>>  For large amounts of data, you would be better with some kind of
>> map-reduce program.  Your Pig script is just the sort of thing.
>>
>> Keep in mind if you write a map-reduce program (or pig script) that you
>> will wind up with as many outputs as you have reducers.  If you have only a
>> single reducer, you will get one output file, but that will mean that only
>> a single process will do all the writing.  That would be no faster than
>> using the cat + NFS method above.  Having multiple reducers will allow you
>> to have write parallelism.
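[To make the reducer trade-off above concrete, here is a minimal identity-style merge job, sketched against the org.apache.hadoop.mapreduce API; the class name, argument handling, and line-oriented record handling are illustrative assumptions, not something taken from this thread. Every argument but the last is treated as a source directory, the last as the output directory, and setNumReduceTasks() is what chooses between a single output file and parallel writes:]

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class MergeFiles {

    // Pass each input line through unchanged; the value side is unused.
    public static class LineMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        ctx.write(line, NullWritable.get());
      }
    }

    // Identity reducer: emit every occurrence of every line.
    public static class LineReducer
        extends Reducer<Text, NullWritable, Text, NullWritable> {
      @Override
      protected void reduce(Text line, Iterable<NullWritable> occurrences, Context ctx)
          throws IOException, InterruptedException {
        for (NullWritable ignored : occurrences) {
          ctx.write(line, NullWritable.get());
        }
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "merge-files");
      job.setJarByClass(MergeFiles.class);
      job.setMapperClass(LineMapper.class);
      job.setReducerClass(LineReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(NullWritable.class);

      // 1 => one part file, written by a single process;
      // N => N part files, written in parallel.
      job.setNumReduceTasks(1);

      // Every argument but the last is a source directory.
      for (int i = 0; i < args.length - 1; i++) {
        FileInputFormat.addInputPath(job, new Path(args[i]));
      }
      FileOutputFormat.setOutputPath(job, new Path(args[args.length - 1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

[Because the lines pass through the shuffle they come out sorted, so the original record order is not preserved, which is the ordering caveat mentioned further down the thread. Loading from several paths, as a Pig script would, is just a matter of adding each directory as an input path.]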
>>
>> The error message that distcp is giving you is a little odd, however,
>> since it implies that some of your input files are repeated.  Is that
>> possible?
>>
>>
>>
>> On Sat, Dec 22, 2012 at 12:53 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>>
>>> Tried distcp but it fails. Is there a way to merge them? Or else I could
>>> write a pig script to load from multiple paths
>>>
>>>
>>> org.apache.hadoop.tools.DistCp$DuplicationException: Invalid input,
>>> there are duplicated files in the sources:
>>> maprfs:/user/apuser/web-analytics/flume-output/2012/12/20/22/output/appinfo,
>>> maprfs:/user/apuser/web-analytics/flume-output/2012/12/21/00/output/appinfo
>>>     at org.apache.hadoop.tools.DistCp.checkDuplication(DistCp.java:1419)
>>>     at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1222)
>>>     at org.apache.hadoop.tools.DistCp.copy(DistCp.java:675)
>>>     at org.apache.hadoop.tools.DistCp.run(DistCp.java:910)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>     at org.apache.hadoop.tools.DistCp.main(DistCp.java:937)
>>>
>>>
>>> On Sat, Dec 22, 2012 at 11:24 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>>
>>>> The technical term for this is "copying".  You may have heard of it.
>>>>
>>>> It is a subject of such long technical standing that many do not
>>>> consider it worthy of detailed documentation.
>>>>
>>>> Distcp effects a similar process and can be modified to combine the
>>>> input files into a single file.
>>>>
>>>> http://hadoop.apache.org/docs/r1.0.4/distcp.html
>>>>
>>>>
>>>> On Sat, Dec 22, 2012 at 10:54 AM, Barak Yaish <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Can you please attach HOW-TO links for the alternatives you mentioned?
>>>>>
>>>>>
>>>>> On Sat, Dec 22, 2012 at 10:46 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Yes, via the simple act of opening a target stream and writing all
>>>>>> source streams into it. Or to save code time, an identity job with a
>>>>>> single reducer (you may not get control over ordering this way).
>>>>>>
>>>>>> On Sat, Dec 22, 2012 at 12:10 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>>>>>> > Is it possible to merge files from different HDFS locations into one
>>>>>> > file in an HDFS location?
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Harsh J
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
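[As a companion to Harsh's first suggestion above (open one target stream and write all source streams into it), here is a rough stand-alone sketch against the plain FileSystem API; the class name and argument handling are again illustrative, and it only looks one level deep into each source directory:]

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  public class StreamMerge {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // Last argument is the target file; everything before it is a source directory.
      Path target = new Path(args[args.length - 1]);
      FSDataOutputStream out = fs.create(target, true /* overwrite */);
      try {
        for (int i = 0; i < args.length - 1; i++) {
          for (FileStatus status : fs.listStatus(new Path(args[i]))) {
            if (status.isDir()) {
              continue; // this simple sketch does not recurse into sub-directories
            }
            FSDataInputStream in = fs.open(status.getPath());
            try {
              IOUtils.copyBytes(in, out, conf, false /* keep the target stream open */);
            } finally {
              in.close();
            }
          }
        }
      } finally {
        out.close();
      }
    }
  }

[All of the writing goes through one client stream, so as noted earlier in the thread there is no write parallelism; for a 10-20 GB merge that may still be tolerable. Hadoop's FileUtil.copyMerge does much the same thing for the files under a single source directory, and hadoop fs -getmerge produces the merged copy on the local filesystem rather than back in HDFS.]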