Re: Combine data from different HDFS FS
I don't think you need a special input format.  I think you just need to
specify your list of input files like this:

hdfs://HOST1/folder-name/file-name,hdfs://HOST2/folder-name/file-name, ...
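
For example, a quick driver sketch (untested; the hosts and paths are
placeholders, substitute your own clusters):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  Configuration conf = new Configuration();
  Job job = new Job(conf, "wordcount");
  // setInputPaths(Job, String) splits the comma-separated list, so each
  // fully qualified URI becomes one input path of the same job.
  FileInputFormat.setInputPaths(job,
      "hdfs://HOST1/folder-name/file-name,hdfs://HOST2/folder-name/file-name");
  FileOutputFormat.setOutputPath(job, new Path("gutenberg-output"));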

HTH,

DR

On 04/09/2013 12:07 AM, Pedro Sá da Costa wrote:
> Maybe there is some FileInputFormat class that allows input files to be
> defined from different locations. What I would like to know is whether it's
> possible to read input data from different HDFS FS. E.g., run the wordcount
> with input files from the HDFS FS on HOST1 and HOST2 (the FS on HOST1 and
> HOST2 are distinct). Any suggestion on which InputFormat I should use?
>
>
>
> On 9 April 2013 00:10, Pedro Sá da Costa <[EMAIL PROTECTED]> wrote:
>
>> I'm invoking the wordcount example on host1 with this command, but I get
>> an error.
>>
>>
>> HOST1:$ bin/hadoop jar hadoop-examples-1.0.4.jar wordcount
>> hdfs://HOST2:54310/gutenberg gutenberg-output
>>
>> 13/04/08 22:02:55 ERROR security.UserGroupInformation:
>> PriviledgedActionException as:ubuntu
>> cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input
>> path does not exist: hdfs://HOST2:54310/gutenberg
>> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
>> does not exist: hdfs://HOST2:54310/gutenberg
>>
>> Can you be more specific about using FileInputFormat? I've configured
>> MapReduce and HDFS to work on HOST1, and I don't know how I can make a
>> wordcount that reads data from the HDFS filesystems on HOST1 and HOST2.
>>
>>
>>
>>
>>
>>
>> On 8 April 2013 19:34, Harsh J <[EMAIL PROTECTED]> wrote:
>>
>>> You should be able to add fully qualified HDFS paths from N clusters
>>> into the same job via FileInputFormat.addInputPath(…) calls. Caveats
>>> may apply for secure environments, but for non-secure mode this should
>>> work just fine. Did you try this and did it not work?
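>>>
>>> For example, a rough sketch (the hosts, port and paths below are just
>>> placeholders, not taken from your setup):
>>>
>>>   // One addInputPath(...) call per cluster; each Path carries its own
>>>   // hdfs:// authority, so the job reads splits from both filesystems.
>>>   FileInputFormat.addInputPath(job, new Path("hdfs://HOST1:54310/gutenberg"));
>>>   FileInputFormat.addInputPath(job, new Path("hdfs://HOST2:54310/gutenberg"));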
>>>
>>> On Mon, Apr 8, 2013 at 9:56 PM, Pedro Sá da Costa <[EMAIL PROTECTED]>
>>> wrote:
>>>> Hi,
>>>>
>>>> I want to combine data that live in different HDFS filesystems so that
>>>> they can be processed in one job. Is it possible to do this with MR, or
>>>> is there another Apache tool that allows me to do this?
>>>>
>>>> Eg.
>>>>
>>>> Hdfs data in Cluster1 ----v
>>>> Hdfs data in Cluster2 -> this job reads the data from Cluster1, 2
>>>>
>>>>
>>>> Thanks,
>>>> --
>>>> Best regards,
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>>
>> --
>> Best regards,
>>