HDFS user mailing list: Re: Merging files


Thread:
  Barak Yaish      2012-12-22, 18:54
  Ted Dunning      2012-12-22, 19:24
  Mohit Anchlia    2012-12-22, 20:53
  Ted Dunning      2012-12-22, 22:05
Re: Merging files
Thanks for the info. I was trying not to use NFS because my data might be
10-20GB in size for every merge I perform. I'll use pig instead.

In distcp I checked and none of the directories are duplicated. Looking at
the logs, it looks like it's failing because all those directories have
sub-directories of the same name.
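One way around that name collision is to enumerate the source files by glob
rather than by directory, since each file keeps a unique full path. A minimal
sketch against the Hadoop FileSystem API; the class name is hypothetical and
the glob pattern is an assumption adapted from the distcp error quoted
further down:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListMergeInputs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical glob covering every per-hour appinfo file.
    Path pattern = new Path(
        "maprfs:/user/apuser/web-analytics/flume-output/2012/12/*/*/output/appinfo");
    FileSystem fs = pattern.getFileSystem(conf);
    FileStatus[] matches = fs.globStatus(pattern);
    if (matches != null) {
      for (FileStatus status : matches) {
        // Each match is a distinct file with a unique full path,
        // even though the sub-directory names repeat.
        System.out.println(status.getPath());
      }
    }
  }
}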

On Sat, Dec 22, 2012 at 2:05 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> A pig script should work quite well.
>
> I also note that the file paths have maprfs in them.  This implies that
> you are using MapR and could simply use the normal Linux command cat to
> concatenate the files if you mount the files using NFS (depending on
> volume, of course).  For small amounts of data, this would work very well.
>  For large amounts of data, you would be better with some kind of
> map-reduce program.  Your Pig script is just the sort of thing.
>
> Keep in mind if you write a map-reduce program (or pig script) that you
> will wind up with as many outputs as you have reducers.  If you have only a
> single reducer, you will get one output file, but that will mean that only
> a single process will do all the writing.  That would be no faster than
> using the cat + NFS method above.  Having multiple reducers will allow you
> to have write parallelism.
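A minimal sketch of that trade-off, assuming Hadoop's
org.apache.hadoop.mapreduce API: a pass-through job where setNumReduceTasks
controls how many output files (and how much write parallelism) you get. The
class name and paths are hypothetical, and line order is not preserved
because the framework sorts by key:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeFiles {

  // Pass each line through as the key; the shuffle's sort means the
  // original line order is not preserved.
  public static class PassThroughMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      context.write(line, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "merge-files");
    job.setJarByClass(MergeFiles.class);
    job.setMapperClass(PassThroughMapper.class);
    // The default Reducer is the identity. One reducer yields exactly one
    // output file, but a single process then does all the writing; more
    // reducers buy write parallelism at the cost of more output files.
    job.setNumReduceTasks(1);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // hypothetical input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // hypothetical output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}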
>
> The error message that distcp is giving you is a little odd, however,
> since it implies that some of your input files are repeated.  Is that
> possible?
>
>
>
> On Sat, Dec 22, 2012 at 12:53 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>
>> Tried distcp but it fails. Is there a way to merge them? Or else I could
>> write a pig script to load from multiple paths.
>>
>>
>> org.apache.hadoop.tools.DistCp$DuplicationException: Invalid input, there
>> are duplicated files in the sources:
>> maprfs:/user/apuser/web-analytics/flume-output/2012/12/20/22/output/appinfo,
>> maprfs:/user/apuser/web-analytics/flume-output/2012/12/21/00/output/appinfo
>>
>> at org.apache.hadoop.tools.DistCp.checkDuplication(DistCp.java:1419)
>> at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1222)
>> at org.apache.hadoop.tools.DistCp.copy(DistCp.java:675)
>> at org.apache.hadoop.tools.DistCp.run(DistCp.java:910)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>> at org.apache.hadoop.tools.DistCp.main(DistCp.java:937)
>>
>> On Sat, Dec 22, 2012 at 11:24 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>
>>> The technical term for this is "copying".  You may have heard of it.
>>>
>>> It is a subject of such long technical standing that many do not
>>> consider it worthy of detailed documentation.
>>>
>>> Distcp effects a similar process and can be modified to combine the
>>> input files into a single file.
>>>
>>> http://hadoop.apache.org/docs/r1.0.4/distcp.html
>>>
>>>
>>> On Sat, Dec 22, 2012 at 10:54 AM, Barak Yaish <[EMAIL PROTECTED]> wrote:
>>>
>>>> Can you please attach HOW-TO links for the alternatives you mentioned?
>>>>
>>>>
>>>> On Sat, Dec 22, 2012 at 10:46 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Yes, via the simple act of opening a target stream and writing all
>>>>> source streams into it. Or to save code time, an identity job with a
>>>>> single reducer (you may not get control over ordering this way).
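A minimal sketch of the first approach, assuming the Hadoop FileSystem API;
the target path is a hypothetical placeholder, and sources are taken from
the command line (the identity-job variant is sketched after Ted's note on
reducers, above):

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ConcatFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical target file; each argument names one source file.
    FSDataOutputStream out = fs.create(new Path("/tmp/merged-output"));
    try {
      for (String src : args) {
        InputStream in = fs.open(new Path(src));
        try {
          // 'false' keeps the target stream open for the next source.
          IOUtils.copyBytes(in, out, conf, false);
        } finally {
          in.close();
        }
      }
    } finally {
      out.close();
    }
  }
}

For what it's worth, org.apache.hadoop.fs.FileUtil.copyMerge does essentially
this for all files under a single source directory.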
>>>>>
>>>>> On Sat, Dec 22, 2012 at 12:10 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>>>>> > Is it possible to merge files from different HDFS locations into one
>>>>> > file at a single HDFS location?
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Harsh J
>>>>>
>>>>
>>>>
>>>
>>
>
Further reply (not expanded):
  Edward Capriolo  2012-12-23, 15:30