Re: hadoop filesystem cache
The challenge with this design is that accessing the same data over and
over again is an uncommon use case for Hadoop. Hadoop's bread and butter is
streaming through large datasets that do not fit in memory. Also, the
shuffle-sort-spill is going to play havoc on any file-system-based cache.
The distributed cache roughly fits this role, except that it does not
persist after a job.
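
For reference, a minimal sketch of using the distributed cache, assuming
the Hadoop 1.x org.apache.hadoop.filecache.DistributedCache API; the HDFS
path and job setup below are made up:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class DistCacheSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Ship an HDFS file to each task's local disk for this job only.
    // The "#lookup" fragment plus createSymlink() exposes it as a local
    // symlink named "lookup" in every task's working directory.
    DistributedCache.addCacheFile(new URI("/data/lookup.dat#lookup"), conf);
    DistributedCache.createSymlink(conf);

    Job job = new Job(conf, "distcache-sketch");
    // ... set mapper/reducer and input/output paths, then
    // job.waitForCompletion(true) ...
    // The local copies are managed by the framework and are not kept for
    // reuse by later jobs, which is the limitation noted above.
  }
}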

Replicating content to N nodes is also not a hard problem to tackle (you
can hack up a content delivery system with ssh+rsync) and get similar
results. The approach often taken has been to keep data that is accessed
repeatedly, and that fits in memory, in some other system
(hbase/cassandra/mysql/whatever).
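
A crude way to get something like the client-side cache Rita describes
below is to pull the hot file out of HDFS once and have local readers use
that copy afterwards; a minimal sketch with the standard FileSystem API
(the paths, including the /var/cache/hdfs directory from the proposed
config, are illustrative):

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClientCacheSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);

    // Illustrative paths: a large HDFS file and a local cache directory.
    Path remote = new Path("/data/big-lookup-file");
    File local = new File("/var/cache/hdfs/big-lookup-file");

    // Fetch over the network only once; afterwards readers on this client
    // open the local copy instead of re-reading gigabytes from HDFS.
    if (!local.exists()) {
      hdfs.copyToLocalFile(remote, new Path(local.getAbsolutePath()));
    }
  }
}

Of course this does none of the eviction, size limiting, or invalidation
that a real dfs.client.cachedirectory/dfs.client.cachesize feature would
have to handle.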

Edward
On Mon, Jan 16, 2012 at 11:33 AM, Rita <[EMAIL PROTECTED]> wrote:

> Thanks. I believe this is a good feature to have for clients, especially
> if you are reading the same large file over and over.
>
>
> On Sun, Jan 15, 2012 at 7:33 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote:
>
> > There is some work being done in this area by some folks over at UC
> > Berkeley's AMP Lab in coordination with Facebook. I don't believe it
> > has been published quite yet, but the title of the project is "PACMan"
> > -- I expect it will be published soon.
> >
> > -Todd
> >
> > On Sat, Jan 14, 2012 at 5:30 PM, Rita <[EMAIL PROTECTED]> wrote:
> > > After reading this article,
> > > http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/ , I
> > > was wondering if there was a filesystem cache for HDFS. For example,
> > > if a large file (10 gigabytes) kept getting accessed on the cluster,
> > > instead of fetching it from the network every time, why not store the
> > > content of the file locally on the client itself? A use case on the
> > > client would look like this:
> > >
> > >
> > >
> > > <property>
> > >   <name>dfs.client.cachedirectory</name>
> > >   <value>/var/cache/hdfs</value>
> > > </property>
> > >
> > > <property>
> > >   <name>dfs.client.cachesize</name>
> > >   <description>in megabytes</description>
> > >   <value>100000</value>
> > > </property>
> > >
> > >
> > > Any thoughts on a feature like this?
> > >
> > >
> > > --
> > > --- Get your facts first, then you can distort them as you please.--
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>
>
>
> --
> --- Get your facts first, then you can distort them as you please.--
>