Re: Integrating Lustre and HDFS
Allen Wittenauer 2010-06-15, 17:24
No, I'm saying your MapReduce code needs to explicitly reference every file system that it needs to access. You can't rely upon fs.default.name*. The distcp code could provide some guidance on how to do this.
* Maybe it isn't clear why this is, so let me spell it out a bit: fs.default.name is just that--a default. When you run 'hadoop dfs -ls' with no qualifying file system URL, it uses fs.default.name to figure out which file system to use. Since you need to access two different file systems, you cannot safely make any such assumption. This is also why you can't list two file systems in fs.default.name. If you did, it wouldn't be clear what 'hadoop dfs -ls' should do, especially if the requested paths *conflict*.
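To make the rule above concrete, here is a minimal sketch in plain Java. It is not Hadoop's actual code: the class QualifyPaths and the helper qualify() are hypothetical, and the URIs are assumed values loosely based on the ones quoted in this thread. It only illustrates that a bare path falls back to the single default, while a path carrying its own scheme names its file system explicitly.

```java
import java.net.URI;

public class QualifyPaths {
    // Hypothetical helper (not a Hadoop API): a path that already carries a
    // scheme is returned unchanged; a bare path is resolved against the one
    // default file system.
    public static URI qualify(String path, URI defaultFs) {
        URI u = URI.create(path);
        return (u.getScheme() != null) ? u : defaultFs.resolve(u);
    }

    public static void main(String[] args) {
        // Assumed default, standing in for the value of fs.default.name.
        URI defaultFs = URI.create("hdfs://localhost:9123/");

        // Bare path: lands on whatever the default happens to be.
        System.out.println(qualify("/data/input", defaultFs));

        // Explicit paths: each one states which file system it lives on,
        // so the default never enters into it.
        System.out.println(qualify("file:///my_lustre_mount_point/data/input", defaultFs));
        System.out.println(qualify("hdfs://localhost:9123/data/output", defaultFs));
    }
}
```

In a real job the same idea applies to the paths you hand to Hadoop: give each one its full URI and obtain a FileSystem for that URI (for example with FileSystem.get(URI, Configuration)) rather than relying on the default.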
On Jun 15, 2010, at 3:31 AM, Vikas Ashok Patil wrote:
> Hello Allen,
> Sorry for bugging you regarding the same problem again. If you say "we need
> to be explicit having multiple file-systems" for map reduce jobs, are you
> hinting at code changes to be made to Hadoop? Please provide more details
> on this if possible.
> On Sat, Jun 12, 2010 at 9:05 AM, Vikas Ashok Patil <[EMAIL PROTECTED]> wrote:
>> Hello Allen,
>> Thanks for the reply.
>> You are right about trying to run two distributed filesystems. The reason
>> being, there are certain restrictions (in our cluster environment) to
>> include the local file system into lustre. Please tell me how I would make
>> mapreduce access more than one file system. At least the configs don't seem
>> to allow it.
>> Vikas A Patil
>> On Sat, Jun 12, 2010 at 12:32 AM, Allen Wittenauer <
>> [EMAIL PROTECTED]> wrote:
>>> On Jun 10, 2010, at 8:27 PM, Vikas Ashok Patil wrote:
>>>> Thanks for the replies.
>>>> If I have fs.default.name = file://my_lustre_mount_point , then only
>>>> lustre filesystem will be used. I would like to have something like
>>>> fs.default.name=file://my_lustre_mount_point , hdfs://localhost:9123
>>>> so that both local filesystem and lustre are in use.
>>>> Kindly correct me if I am missing something here.
>>> I guess we're all confused as to your use case. Why do you want to run
>>> two distributed file systems on the same nodes? Why can't you use Lustre
>>> for all your needs?
>>> As to fs.default.name, you can only have one. [That's why it is a
>>> default. *smile*] If you want to access more than one file system from
>>> within MapReduce, you'll need to specify it explicitly.