-
hadoop filesystem cache (Rita, 2012-01-15, 01:30)
After reading this article,
http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/ , I was wondering whether there is a filesystem cache for HDFS. For example, if a large file (10 gigabytes) keeps getting accessed on the cluster, then instead of fetching it from the network every time, why not store the content of the file locally on the client itself? A use case on the client would look like this:

<property>
  <name>dfs.client.cachedirectory</name>
  <value>/var/cache/hdfs</value>
</property>

<property>
  <name>dfs.client.cachesize</name>
  <description>in megabytes</description>
  <value>100000</value>
</property>

Any thoughts on a feature like this?

--
--- Get your facts first, then you can distort them as you please.--
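To make the proposal concrete, here is a minimal sketch of what a client-side read-through cache driven by those two settings might do. Everything in it is an assumption for illustration: the class name `LocalFileCache`, the `fetch` callable (a stand-in for a real HDFS read), and the least-recently-used eviction policy are not part of Hadoop, and the two property names above are a proposal rather than existing configuration keys.

```python
import hashlib
import os
import tempfile
from collections import OrderedDict


class LocalFileCache:
    """Hypothetical read-through cache that keeps fetched files on local disk.

    cache_dir and max_bytes play the role of the proposed
    dfs.client.cachedirectory and dfs.client.cachesize settings.
    """

    def __init__(self, cache_dir, max_bytes, fetch):
        self.cache_dir = cache_dir
        self.max_bytes = max_bytes
        self.fetch = fetch          # callable: remote path -> bytes (stand-in for an HDFS read)
        self.sizes = OrderedDict()  # remote path -> size, least recently used first
        os.makedirs(cache_dir, exist_ok=True)

    def _local_path(self, remote_path):
        # Hash the remote path so every remote file maps to one flat local file name.
        name = hashlib.sha1(remote_path.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, name)

    def _evict_until_fits(self, incoming):
        # Drop least recently used entries until the new file fits in the budget.
        while self.sizes and sum(self.sizes.values()) + incoming > self.max_bytes:
            victim, _ = self.sizes.popitem(last=False)
            os.remove(self._local_path(victim))

    def read(self, remote_path):
        local = self._local_path(remote_path)
        if remote_path in self.sizes:
            self.sizes.move_to_end(remote_path)  # cache hit: mark as most recently used
            with open(local, "rb") as f:
                return f.read()
        data = self.fetch(remote_path)           # cache miss: go to the network
        if len(data) <= self.max_bytes:          # files larger than the budget are not cached
            self._evict_until_fits(len(data))
            with open(local, "wb") as f:
                f.write(data)
            self.sizes[remote_path] = len(data)
        return data


# Small demonstration with a fake remote read that records each network fetch.
calls = []


def fake_fetch(path):
    calls.append(path)
    return b"x" * 60


cache = LocalFileCache(tempfile.mkdtemp(), max_bytes=100, fetch=fake_fetch)
cache.read("/data/a")  # miss: fetched and stored
cache.read("/data/a")  # hit: served from local disk
cache.read("/data/b")  # miss: does not fit next to /data/a, so /data/a is evicted
cache.read("/data/a")  # miss again
print(calls)
```

The sketch does not check whether the remote file has changed since it was cached; a real implementation would need some staleness check (for example, comparing modification time or length) before serving a local copy.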
-
Re: hadoop filesystem cache (Prashant Kommireddi, 2012-01-15, 01:33)
You mean something different from the DistributedCache?
-
Re: hadoop filesystem cache (Rita, 2012-01-15, 01:57)
Yes, something different from that. To my knowledge, DistributedCache is only for MapReduce.
-
Re: hadoop filesystem cache (Todd Lipcon, 2012-01-16, 00:33)
There is some work being done in this area by some folks over at UC Berkeley's AMP Lab in coordination with Facebook. I don't believe it has been published quite yet, but the title of the project is "PACMan" -- I expect it will be published soon.

-Todd

--
Todd Lipcon
Software Engineer, Cloudera
-
Re: hadoop filesystem cache (Rita, 2012-01-16, 16:33)
Thanks. I believe this is a good feature to have for clients, especially if you are reading the same large file over and over.
-
Re: hadoop filesystem cache (Edward Capriolo, 2012-01-16, 18:07)
The challenge with this design is that people accessing the same data over and over again is the uncommon use case for Hadoop. Hadoop's bread and butter is all about streaming through large datasets that do not fit in memory. Also, your shuffle-sort-spill is going to play havoc on any file-system-based cache. The distributed cache roughly fits this role, except that it does not persist after a job.

Replicating content to N nodes is also not a hard problem to tackle (you can hack up a content delivery system with ssh+rsync) and get similar results. The approach often taken has been to keep data that is accessed repeatedly and fits in memory in some other system (hbase/cassandra/mysql/whatever).

Edward
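As a rough illustration of the ssh+rsync idea, the sketch below only builds the rsync commands that would push one file to a list of nodes; it does not run them. The host names, the source file, and the destination directory are made up for the example, and the helper name `rsync_commands` is hypothetical.

```python
import shlex


def rsync_commands(source, hosts, dest):
    """Build one rsync-over-ssh command per target host (the commands are not executed)."""
    return [
        # -a preserves file attributes, --partial keeps partly transferred files,
        # -e ssh selects ssh as the remote shell.
        ["rsync", "-a", "--partial", "-e", "ssh", source, f"{host}:{dest}"]
        for host in hosts
    ]


# Hypothetical file and node names, for illustration only.
commands = rsync_commands("/data/lookup.dat", ["node1", "node2"], "/var/cache/hdfs/")
for cmd in commands:
    print(shlex.join(cmd))
```

To actually copy the file, each command list could be passed to `subprocess.run(cmd, check=True)`, assuming passwordless ssh access to the target nodes is already set up.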
-
Re: hadoop filesystem cache (Rita, 2012-01-17, 12:27)
My intention isn't to make it a mandatory feature, just an option. Keeping data locally on a filesystem as a form of Lx cache is far better than getting it from the network, and reading from the fs buffer cache is much cheaper than an RPC call.