MapReduce, mail # user - DistributedCache - why not read directly from HDFS?


Re: DistributedCache - why not read directly from HDFS?
Arun C Murthy 2013-03-25, 22:30
More importantly, the second and subsequent accesses to a file in the DistributedCache are guaranteed to be local disk I/O.
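
To make the "copy once, then read locally" idea concrete, here is a rough toy model in plain Java. It is only a sketch and uses no Hadoop classes: a temp directory stands in for HDFS, another stands in for the task node's local disk, and the class name LocalizedFileSketch, the file name lookup.txt and the remoteFetches counter are all made up for illustration. It also copies lazily on first read, which is a simplification of how the real cache localizes files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

/** Toy model of "copy once, then read locally"; no Hadoop classes are used. */
public class LocalizedFileSketch {
    private final Path remoteFile;   // stands in for a file on HDFS (assumption)
    private final Path localDir;     // stands in for the task node's local disk
    private Path localCopy;          // set once the file has been copied locally
    private int remoteFetches = 0;   // how many times the "remote" file was read

    public LocalizedFileSketch(Path remoteFile, Path localDir) {
        this.remoteFile = remoteFile;
        this.localDir = localDir;
    }

    /** The first call copies the file to local disk; later calls read only the local copy. */
    public List<String> readLines() throws IOException {
        if (localCopy == null) {
            Path target = localDir.resolve(remoteFile.getFileName());
            Files.copy(remoteFile, target, StandardCopyOption.REPLACE_EXISTING);
            localCopy = target;
            remoteFetches++;
        }
        return Files.readAllLines(localCopy, StandardCharsets.UTF_8);
    }

    public int remoteFetches() {
        return remoteFetches;
    }

    public static void main(String[] args) throws IOException {
        Path remoteDir = Files.createTempDirectory("fake-hdfs");
        Path localDir = Files.createTempDirectory("task-local");
        Path remote = remoteDir.resolve("lookup.txt");
        Files.write(remote, Arrays.asList("a=1", "b=2"), StandardCharsets.UTF_8);

        LocalizedFileSketch cache = new LocalizedFileSketch(remote, localDir);
        for (int i = 0; i < 3; i++) {
            cache.readLines();   // only the first iteration touches the "remote" file
        }
        System.out.println("remote fetches: " + cache.remoteFetches());
    }
}
```

Reading the same file straight from the "remote" side on every call would instead cost one remote read per access, which is the difference being discussed in this thread.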

On Mar 24, 2013, at 3:00 AM, Alberto Cordioli wrote:

> Thanks for your reply Harsh.
> So if I want to read a simple text file, choosing between
> DistributedCache and HDFS is just a matter of performance.
>
>
> Alberto
>
> On 23 March 2013 16:17, Harsh J <[EMAIL PROTECTED]> wrote:
>> A DistributedCache is used to distribute not just simple files but
>> also native libraries and the like, which cannot be loaded while
>> they sit on HDFS.
>>
>> Also, keeping it on HDFS could be less performant, since non-local
>> reads could happen (depending on the files' replication factor).
>>
>> On Sat, Mar 23, 2013 at 8:23 PM, Alberto Cordioli
>> <[EMAIL PROTECTED]> wrote:
>>> Hi all,
>>>
>>> I was not able to find an answer to the following question. If the
>>> question has already been answered please give me the pointer to the
>>> right thread.
>>>
>>> What are the actual differences between reading a file from HDFS in
>>> a mapper and using DistributedCache?
>>>
>>> I saw that with DistributedCache you can give an HDFS path and the
>>> task nodes will get the data on their local file system. But what
>>> advantages does that have over a simple HDFS read through an
>>> FSDataInputStream (as returned by FileSystem.open())?
>>>
>>> Thank you very much,
>>> Alberto
>>>
>>>
>>> --
>>> Alberto Cordioli
>>
>>
>>
>> --
>> Harsh J
>
>
>
> --
> Alberto Cordioli

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/