I did a good chunk of the design and implementation of HDFS-4949, so if you
could post a longer writeup of your envisioned use cases and
implementation, I'd definitely be interested in taking a look.
It's also good to note that HDFS-4949 is only the foundation for a whole
slew of potential enhancements. We're planning to add some form of
automatic cache replacement, which as a first step could just be an
external policy that manages your static caching directives. It should also
already be possible to integrate a job scheduler with HDFS-4949, since it
both exposes the cache state of the cluster and allows a scheduler to
prefetch data into RAM. Finally, we're also thinking about caching at finer
granularities, e.g. block or sub-block rather than file-level caching,
which is nice for apps that only read regions of a file.
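To make the scheduler-integration idea above concrete, here is a rough, self-contained Java sketch of a scheduler that orders jobs by how much of their input is already resident in cache. The CacheAwareScheduler and Job names, and the cached-fraction map, are hypothetical stand-ins for illustration, not actual HDFS-4949 APIs:

```java
import java.util.*;

public class CacheAwareScheduler {
    // Hypothetical stand-in for the cache state HDFS-4949 exposes:
    // maps a file path to the fraction of its blocks currently cached.
    private final Map<String, Double> cachedFraction;

    public CacheAwareScheduler(Map<String, Double> cachedFraction) {
        this.cachedFraction = cachedFraction;
    }

    // A job is just a name plus the input files it will read.
    public static final class Job {
        final String name;
        final List<String> inputs;
        public Job(String name, List<String> inputs) {
            this.name = name;
            this.inputs = inputs;
        }
    }

    // Average cached fraction across a job's inputs; higher runs first.
    double score(Job job) {
        return job.inputs.stream()
                .mapToDouble(p -> cachedFraction.getOrDefault(p, 0.0))
                .average().orElse(0.0);
    }

    public List<Job> order(List<Job> jobs) {
        List<Job> sorted = new ArrayList<>(jobs);
        sorted.sort(Comparator.comparingDouble((Job j) -> score(j)).reversed());
        return sorted;
    }
}
```

A real integration would refresh the cached-fraction map from the cluster's reported cache state and could also issue prefetch directives for upcoming jobs.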
On Mon, Dec 23, 2013 at 9:57 PM, Dhaivat Pandya <[EMAIL PROTECTED]> wrote:
> Hi Harsh,
> Thanks a lot for the response. As it turns out, I figured out the
> registration mechanism this evening and how the sourceId is relayed to the
> As for your question about the cache layer: it is similar in basic concept
> to the plan mentioned, but the technical details differ significantly. First
> of all, instead of having the user tell the namenode to perform caching (as
> it seems from the proposal on JIRA), there is a distributed caching
> algorithm that decides what files will be cached. Secondly, I am
> implementing a hook-in with the job scheduler that arranges jobs according
> to what files are cached at a given point in time (and also allows files to
> be cached based on what jobs are to be run).
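As a toy illustration of what an automatic file-selection policy could look like (this is not the distributed algorithm described above, whose details aren't given in the thread): greedily cache the files with the highest reads-per-byte until a byte budget is exhausted. All names here are made up:

```java
import java.util.*;

public class GreedyCachePolicy {
    // Per-file statistics a policy might track; hypothetical names.
    public static final class FileStat {
        final String path;
        final long bytes;
        final long reads;
        public FileStat(String path, long bytes, long reads) {
            this.path = path;
            this.bytes = bytes;
            this.reads = reads;
        }
    }

    // Returns the paths to cache, preferring high reads-per-byte,
    // subject to a total byte budget.
    public static List<String> select(List<FileStat> files, long budgetBytes) {
        List<FileStat> ranked = new ArrayList<>(files);
        ranked.sort(Comparator.comparingDouble(
                (FileStat f) -> (double) f.reads / f.bytes).reversed());
        List<String> chosen = new ArrayList<>();
        long used = 0;
        for (FileStat f : ranked) {
            if (used + f.bytes <= budgetBytes) {
                chosen.add(f.path);
                used += f.bytes;
            }
        }
        return chosen;
    }
}
```

A distributed version would additionally have to agree on the statistics and partition the budget across nodes; this sketch only shows the selection step.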
> Also, the cache layer does a bit of metadata caching; the numbers on it are
> not all in, but thus far the *metadata* caching alone surprisingly gives a
> pretty nice reduction in response time.
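For illustration, the metadata-caching idea can be sketched as a small TTL memoization layer in front of the metadata lookup. The loader below stands in for the real namenode RPC, and all names are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Memoize metadata lookups (e.g. file length, block locations) for a short
// TTL so repeated stat calls skip the round trip to the metadata source.
public class MetadataCache<V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;
        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<String, V> loader;  // stands in for the real RPC
    private final long ttlMillis;

    public MetadataCache(Function<String, V> loader, long ttlMillis) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
    }

    public V get(String path) {
        long now = System.currentTimeMillis();
        Entry<V> e = entries.get(path);
        if (e == null || e.expiresAtMillis < now) {
            // Benign race: two threads may both load; last write wins.
            e = new Entry<>(loader.apply(path), now + ttlMillis);
            entries.put(path, e);
        }
        return e.value;
    }
}
```

The TTL bounds staleness; a production cache would also need invalidation on writes, which this sketch omits.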
> Any thoughts on the cache layer would be greatly appreciated.
> On Mon, Dec 23, 2013 at 11:46 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> > Hi,
> > On Mon, Dec 23, 2013 at 9:41 AM, Dhaivat Pandya <[EMAIL PROTECTED]
> > wrote:
> > > Hi,
> > >
> > > I'm currently trying to build a cache layer that should sit "on top" of
> > > the datanode. Essentially, the namenode should know the port number of
> > > the cache layer instead of that of the datanode (since the namenode then
> > > relays this information to the default HDFS client). All of the
> > > communication between the datanode and the namenode currently flows
> > > through my cache layer (including heartbeats, etc.).
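The interposition described above can be sketched as a plain TCP relay: the cache layer listens on its own port and forwards bytes to the real datanode, so any peer told the relay's port reaches the datanode transparently. The ports and host below are made up for illustration; a real layer would parse the data-transfer protocol to intercept requests rather than blindly relaying:

```java
import java.io.*;
import java.net.*;

public class DatanodePortProxy {
    public static void main(String[] args) throws IOException {
        int listenPort = 51010;     // port advertised upstream (hypothetical)
        String dnHost = "localhost";
        int dnPort = 50010;         // the actual datanode port (hypothetical)
        try (ServerSocket server = new ServerSocket(listenPort)) {
            while (true) {
                Socket client = server.accept();
                Socket backend = new Socket(dnHost, dnPort);
                pump(client, backend);  // client -> datanode
                pump(backend, client);  // datanode -> client
            }
        }
    }

    // Copy bytes one way on a daemon thread; two pumps form a duplex relay.
    static void pump(Socket from, Socket to) {
        Thread t = new Thread(() -> {
            try (InputStream in = from.getInputStream();
                 OutputStream out = to.getOutputStream()) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                    out.flush();
                }
            } catch (IOException ignored) {
                // peer closed; let the thread exit
            }
        });
        t.setDaemon(true);
        t.start();
    }
}
```

The harder half of the problem, as the question below notes, is getting the namenode to hand out the relay's port instead of the datanode's; the relay itself is the easy part.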
> > Curious Q: what does your cache layer aim to do, btw? If it's a data
> > cache, have you checked out the design currently being implemented in
> > https://issues.apache.org/jira/browse/HDFS-4949?
> > > *First question*: is there a way to tell the namenode where a datanode
> > > should be? Any way to trick it into thinking that the datanode is on a
> > > port number where it actually isn't? As far as I can tell, the port
> > > number is obtained from the DatanodeID object; can this be set so that
> > > the port number derived is that of the cache layer?
> > The NN receives a DN host and port from the DN directly. The DN sends
> > it whatever it's running on. See
> > > I spent quite a bit of time on the above question and I could not find
> > > any sort of configuration option that would let me do that. So, I delved
> > > into the HDFS source code and tracked down the DatanodeRegistration
> > > class. However, I can't seem to find out *how* the NameNode figures out
> > > the Datanode's port number or if I could somehow change the packets to
> > > reflect the port number of the cache layer?
> > See