Hadoop >> mail # dev >> Datanode registration, port number


Re: Datanode registration, port number
Hi Dhaivat,

I did a good chunk of the design and implementation of HDFS-4949, so if you
could post a longer writeup of your envisioned use cases and
implementation, I'd definitely be interested in taking a look.

It's also good to note that HDFS-4949 is only the foundation for a whole
slew of potential enhancements. We're planning to add some form of
automatic cache replacement, which as a first step could just be an
external policy that manages your static caching directives. It should also
already be possible to integrate a job scheduler with HDFS-4949, since it
both exposes the cache state of the cluster and allows a scheduler to
prefetch data into RAM. Finally, we're also thinking about caching at finer
granularities, e.g. block or sub-block rather than file-level caching,
which is nice for apps that only read regions of a file.
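As a rough illustration of the "external policy that manages your static caching directives" idea, here is a toy LRU policy. `CachePolicy` and its methods are made-up names for this sketch, not the HDFS-4949 API; a real policy would add and remove cache directives through the NameNode rather than track a local map:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy LRU replacement policy over static cache directives (illustrative
// only; not the HDFS-4949 API). Tracks which paths currently hold a
// directive and evicts the least-recently-used path when over capacity.
class CachePolicy {
    private final LinkedHashMap<String, Long> cached; // path -> cached bytes

    CachePolicy(final int capacity) {
        // Access-ordered map: iteration runs least- to most-recently used.
        this.cached = new LinkedHashMap<String, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                // In a real policy, eviction would drop the eldest path's
                // cache directive on the NameNode.
                return size() > capacity;
            }
        };
    }

    // Record a read of `path`; a real policy would also issue an
    // "add directive" call for newly hot paths.
    void access(String path, long bytes) {
        cached.put(path, bytes);
    }

    boolean isCached(String path) {
        return cached.containsKey(path);
    }
}
```

With capacity 2, touching a third path evicts the least recently used one, which is exactly the behavior an external policy would translate into directive removals.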

Best,
Andrew
On Mon, Dec 23, 2013 at 9:57 PM, Dhaivat Pandya <[EMAIL PROTECTED]> wrote:

> Hi Harsh,
>
> Thanks a lot for the response. As it turns out, I figured out the
> registration mechanism this evening and how the sourceId is relayed to the
> NN.
>
> As for your question about the cache layer: it is similar in basic concept
> to the plan mentioned, but the technical details differ significantly. First
> of all, instead of having the user tell the namenode to perform caching (as
> it seems from the proposal on JIRA), there is a distributed caching
> algorithm that decides what files will be cached. Secondly, I am
> implementing a hook-in with the job scheduler that arranges jobs according
> to what files are cached at a given point in time (and also allows files to
> be cached based on what jobs are to be run).
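The scheduler hook-in described above could look something like the following sketch: order pending jobs by the fraction of their input files currently cached. All names here are hypothetical; a real hook would plug into the Hadoop job scheduler and query cache state from the NameNode:

```java
import java.util.*;

// Illustrative sketch (not a Hadoop scheduler API): run the jobs whose
// inputs are already cached before jobs that would read from disk.
class CacheAwareScheduler {
    private final Set<String> cachedFiles;

    CacheAwareScheduler(Set<String> cachedFiles) {
        this.cachedFiles = cachedFiles;
    }

    // Fraction of a job's input files that are currently cached.
    double cachedFraction(List<String> inputs) {
        if (inputs.isEmpty()) return 0.0;
        long hits = inputs.stream().filter(cachedFiles::contains).count();
        return (double) hits / inputs.size();
    }

    // jobs: job name -> input files; returns job names, best-first.
    List<String> order(Map<String, List<String>> jobs) {
        List<String> names = new ArrayList<>(jobs.keySet());
        names.sort(Comparator.comparingDouble(
                (String n) -> cachedFraction(jobs.get(n))).reversed());
        return names;
    }
}
```

The same scoring could run in reverse, as the message suggests: given the queued jobs, the highest-scoring uncached inputs become prefetch candidates.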
>
> Also, the cache layer does a bit of metadata caching; the numbers on it are
> not all in, but thus far, some of the *metadata* caching surprisingly gives
> a pretty nice reduction in response time.
>
> Any thoughts on the cache layer would be greatly appreciated.
>
> Thanks,
>
> Dhaivat
>
>
> On Mon, Dec 23, 2013 at 11:46 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > On Mon, Dec 23, 2013 at 9:41 AM, Dhaivat Pandya <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > I'm currently trying to build a cache layer that should sit "on top" of
> > the
> > > datanode. Essentially, the namenode should know the port number of the
> > > cache layer instead of that of the datanode (since the namenode then
> > relays
> > > this information to the default HDFS client). All of the communication
> > > between the datanode and the namenode currently flows through my cache
> > > layer (including heartbeats, etc.)
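The interposition described above amounts to rewriting the registration in flight: the cache layer forwards the DN's registration to the NN but substitutes its own port, so the NN (and hence clients it redirects) see the cache layer's address. A toy model of that rewrite, where `Registration` and `CacheLayerProxy` are made-up names, not Hadoop's protobuf-backed `DatanodeRegistration`:

```java
// Toy model of a proxying cache layer rewriting a datanode registration
// (illustrative only; Hadoop's real DatanodeRegistration differs).
final class Registration {
    final String host;
    final int xferPort; // port clients use for block transfer

    Registration(String host, int xferPort) {
        this.host = host;
        this.xferPort = xferPort;
    }
}

final class CacheLayerProxy {
    private final int proxyPort;

    CacheLayerProxy(int proxyPort) {
        this.proxyPort = proxyPort;
    }

    // Forward a DN registration to the NN, but advertise the proxy's own
    // port so the NN redirects clients to the cache layer instead.
    Registration rewrite(Registration fromDatanode) {
        return new Registration(fromDatanode.host, proxyPort);
    }
}
```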
> >
> > Curious Q: What does your cache layer aim to do btw? If it's a data
> > cache, have you checked out the design being implemented currently by
> > https://issues.apache.org/jira/browse/HDFS-4949?
> >
> > > *First question*: is there a way to tell the namenode where a datanode
> > > should be? Any way to trick it into thinking that the datanode is on a
> > > port number where it actually isn't? As far as I can tell, the port
> > > number is obtained from the DatanodeID object; can this be set in the
> > > configuration so that the port number derived is that of the cache
> > > layer?
> >
> > The NN receives a DN host and port from the DN directly. The DN sends
> > it whatever it's running on. See
> > https://github.com/apache/hadoop-common/blob/release-2.2.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java#L690
> >
> > > I spent quite a bit of time on the above question and I could not find
> > > any sort of configuration option that would let me do that. So, I
> > > delved into the HDFS source code and tracked down the
> > > DatanodeRegistration class. However, I can't seem to find out *how*
> > > the NameNode figures out the Datanode's port number, or if I could
> > > somehow change the packets to reflect the port number of the cache
> > > layer.
> >
> > See
> > https://github.com/apache/hadoop-common/blob/release-2.2.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java#L690
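For reference, the DN-side settings that determine the ports a datanode binds, and therefore the values it reports when registering with the NN, live in hdfs-site.xml. The values shown below are the Hadoop 2.x defaults from hdfs-default.xml; moving the real DN off a default port is one way to free that port for a layer interposed in front of it (the NN will still only learn the cache layer's port if the registration itself is rewritten, as discussed above):

```xml
<!-- hdfs-site.xml: ports the datanode binds and hence registers with
     the NN. Values shown are the Hadoop 2.x defaults. -->
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:50010</value> <!-- block data transfer -->
</property>
<property>
  <name>dfs.datanode.ipc.address</name>
  <value>0.0.0.0:50020</value> <!-- DN IPC server -->
</property>
```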