Hadoop >> mail # user >> hadoop filesystem cache


Re: hadoop filesystem cache
My intention isn't to make it a mandatory feature, just an option.
Keeping data locally on a filesystem as a form of Lx cache is far better
than fetching it over the network, and a read from the fs buffer cache is
much cheaper than an RPC call.
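
To make the idea concrete, here is a minimal sketch of what such an optional client-side cache could look like. It is not an existing HDFS feature or API: `LocalFileCache`, `fake_remote_read`, and the paths are hypothetical names, the remote read is simulated by a plain function, and the size limit is in bytes rather than the megabytes used in the proposed `dfs.client.cachesize` property. Files are kept in a local directory and the least recently used ones are evicted when the limit would be exceeded.

```python
import hashlib
import os
import tempfile
from collections import OrderedDict


class LocalFileCache:
    """Read-through cache that keeps fetched files in a local directory."""

    def __init__(self, cache_dir, max_bytes, fetch):
        self.cache_dir = cache_dir
        self.max_bytes = max_bytes
        self.fetch = fetch            # hypothetical callable: remote path -> bytes
        self.entries = OrderedDict()  # remote path -> size, least recently used first
        self.used = 0
        os.makedirs(cache_dir, exist_ok=True)

    def _local_path(self, remote_path):
        # Hash the remote path so any path maps to a flat, safe file name.
        digest = hashlib.sha256(remote_path.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, digest)

    def _evict_until(self, budget):
        # Drop least recently used files until at most `budget` bytes are in use.
        while self.used > budget and self.entries:
            victim, size = self.entries.popitem(last=False)
            os.remove(self._local_path(victim))
            self.used -= size

    def get(self, remote_path):
        if remote_path in self.entries:
            # Hit: serve from local disk and mark as most recently used.
            self.entries.move_to_end(remote_path)
            with open(self._local_path(remote_path), "rb") as f:
                return f.read()
        # Miss: go to the (simulated) remote store.
        data = self.fetch(remote_path)
        if len(data) <= self.max_bytes:
            self._evict_until(self.max_bytes - len(data))
            with open(self._local_path(remote_path), "wb") as f:
                f.write(data)
            self.entries[remote_path] = len(data)
            self.used += len(data)
        return data


calls = []  # records every simulated remote read


def fake_remote_read(path):
    # Stand-in for a network read; every "file" is 4 bytes long.
    calls.append(path)
    return b"x" * 4


cache = LocalFileCache(tempfile.mkdtemp(), max_bytes=8, fetch=fake_remote_read)
cache.get("/data/a")  # miss, fetched remotely
cache.get("/data/a")  # hit, served from local disk
cache.get("/data/b")  # miss, cache is now full
cache.get("/data/c")  # miss, evicts /data/a to make room
```

The sketch ignores staleness: if the file changes or is deleted on the cluster, the local copy would have to be invalidated somehow, which is one of the harder parts of a real implementation.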

On Mon, Jan 16, 2012 at 1:07 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote:

> The challenge with this design is that people accessing the same data over
> and over again is the uncommon use case for Hadoop. Hadoop's bread and butter
> is streaming through large datasets that do not fit in memory. Also, your
> shuffle-sort-spill is going to play havoc with any file-system-based cache.
> The distributed cache roughly fits this role, except that it does not
> persist after a job.
>
> Replicating content to N nodes is also not a hard problem to tackle (you
> can hack up a content delivery system with ssh+rsync) and get similar
> results. The approach often taken has been to keep data that is accessed
> repeatedly and fits in memory in some other system
> (hbase/cassandra/mysql/whatever).
>
> Edward
>
>
> On Mon, Jan 16, 2012 at 11:33 AM, Rita <[EMAIL PROTECTED]> wrote:
>
> > Thanks. I believe this is a good feature to have for clients, especially
> > if you are reading the same large file over and over.
> >
> >
> > On Sun, Jan 15, 2012 at 7:33 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote:
> >
> > > There is some work being done in this area by some folks over at UC
> > > Berkeley's AMP Lab in coordination with Facebook. I don't believe it
> > > has been published quite yet, but the title of the project is "PACMan"
> > > -- I expect it will be published soon.
> > >
> > > -Todd
> > >
> > > On Sat, Jan 14, 2012 at 5:30 PM, Rita <[EMAIL PROTECTED]> wrote:
> > > > > After reading this article,
> > > > > http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/ , I
> > > > > was wondering if there is a filesystem cache for HDFS. For example, if
> > > > > a large file (10 gigabytes) keeps getting accessed on the cluster,
> > > > > instead of fetching it from the network every time, why not store the
> > > > > contents of the file locally on the client itself? A use case on the
> > > > > client would look like this:
> > > >
> > > >
> > > >
> > > > <property>
> > > >  <name>dfs.client.cachedirectory</name>
> > > >  <value>/var/cache/hdfs</value>
> > > > </property>
> > > >
> > > >
> > > > <property>
> > > > <name>dfs.client.cachesize</name>
> > > > <description>in megabytes</description>
> > > > <value>100000</value>
> > > > </property>
> > > >
> > > >
> > > > > Any thoughts on a feature like this?
> > > >
> > > >
> > > > --
> > > > --- Get your facts first, then you can distort them as you please.--
> > >
> > >
> > >
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
> > >
> >
> >
> >
> > --
> > --- Get your facts first, then you can distort them as you please.--
> >
>

--
--- Get your facts first, then you can distort them as you please.--