|
Jun Rao
2009-01-07, 17:19
Doug Cutting
2009-01-07, 18:18
Raghu Angadi
2009-01-07, 18:18
George Porter
2009-01-08, 18:13
Chris K Wensel
2009-01-08, 18:25
Doug Cutting
2009-01-08, 18:39
stack
2009-01-08, 22:23
Jun Rao
2009-01-12, 04:16
George Porter
2009-01-12, 18:11
Jun Rao
2009-01-13, 03:30
Sanjay Radia
2009-02-13, 22:39
Dhruba Borthakur
2009-02-14, 05:45
|
-
short-circuiting HDFS readsJun Rao 2009-01-07, 17:19
Hi, Today, HDFS always reads through a socket even when the data is local to the client. This adds a lot of overhead, especially for warm reads. It should be possible for a dfs client to test if a block to be read is local and if so, bypass socket and read through local FS api directly. This should improve random access performance significantly (e.g., for HBase). Has this been considered in HDFS? Thanks, Jun
-
Re: short-circuiting HDFS readsDoug Cutting 2009-01-07, 18:18
Please see https://issues.apache.org/jira/browse/HADOOP-4801.
Doug Jun Rao wrote: > > Hi, > > Today, HDFS always reads through a socket even when the data is local to > the client. This adds a lot of overhead, especially for warm reads. It > should be possible for a dfs client to test if a block to be read is local > and if so, bypass socket and read through local FS api directly. This > should improve random access performance significantly (e.g., for HBase). > Has this been considered in HDFS? Thanks, > > Jun
-
Re: short-circuiting HDFS readsRaghu Angadi 2009-01-07, 18:18
https://issues.apache.org/jira/browse/HADOOP-4801 Jun Rao wrote: > > Hi, > > Today, HDFS always reads through a socket even when the data is local to > the client. This adds a lot of overhead, especially for warm reads. It > should be possible for a dfs client to test if a block to be read is local > and if so, bypass socket and read through local FS api directly. This > should improve random access performance significantly (e.g., for HBase). > Has this been considered in HDFS? Thanks, > > Jun
-
Re: short-circuiting HDFS readsGeorge Porter 2009-01-08, 18:13
Hi Jun,
The earlier responses to your email reference the JIRA that I opened about this issue. Short-circuiting the primary HDFS datapath does improve throughput, and the amount depends on your workload (random reads especially). Some initial experimental results are posted to that JIRA. A second advantage is that since the JVM hosting the HDFS client is doing the reading, the O/S will satisfy future disk requests from the cache, which isn't really possible when you read over the network (even to another JVM on the same host). There are several real disadvantages, the largest of which include 1) it adds a new datapath, and 2) bypasses various security and auditing features of HDFS. I would certainly like to think through a more clean interface for achieving this goal, especially since reading local data should be the common case. Any thoughts you might have would be appreciated. Thanks, George Jun Rao wrote: > Hi, > > Today, HDFS always reads through a socket even when the data is local to > the client. This adds a lot of overhead, especially for warm reads. It > should be possible for a dfs client to test if a block to be read is local > and if so, bypass socket and read through local FS api directly. This > should improve random access performance significantly (e.g., for HBase). > Has this been considered in HDFS? Thanks, > > Jun > >
-
Re: short-circuiting HDFS readsChris K Wensel 2009-01-08, 18:25
Hey George
Any comments on the probability (currently) that reads by a Task are over the network vs. being "local", as seen in your tests? That is, are 10% of block reads over the network, or 90% of reads? I haven't looked, but am wondering if this metrics is stuffed somewhere by Hadoop... ckw On Jan 8, 2009, at 10:13 AM, George Porter wrote: > Hi Jun, > > The earlier responses to your email reference the JIRA that I opened > about this issue. Short-circuiting the primary HDFS datapath does > improve throughput, and the amount depends on your workload (random > reads especially). Some initial experimental results are posted to > that JIRA. A second advantage is that since the JVM hosting the > HDFS client is doing the reading, the O/S will satisfy future disk > requests from the cache, which isn't really possible when you read > over the network (even to another JVM on the same host). > > There are several real disadvantages, the largest of which include > 1) it adds a new datapath, and 2) bypasses various security and > auditing features of HDFS. I would certainly like to think through > a more clean interface for achieving this goal, especially since > reading local data should be the common case. Any thoughts you > might have would be appreciated. > > Thanks, > George > > Jun Rao wrote: >> Hi, >> >> Today, HDFS always reads through a socket even when the data is >> local to >> the client. This adds a lot of overhead, especially for warm reads. >> It >> should be possible for a dfs client to test if a block to be read >> is local >> and if so, bypass socket and read through local FS api directly. This >> should improve random access performance significantly (e.g., for >> HBase). >> Has this been considered in HDFS? Thanks, >> >> Jun >> >> -- Chris K Wensel [EMAIL PROTECTED] http://www.cascading.org/ http://www.scaleunlimited.com/
-
Re: short-circuiting HDFS readsDoug Cutting 2009-01-08, 18:39
Chris K Wensel wrote:
> Any comments on the probability (currently) that reads by a Task are > over the network vs. being "local", as seen in your tests? That is, are > 10% of block reads over the network, or 90% of reads? Greater than 90% of map reads are typically local in a sort job, like 98-99%. But map input is not the bottleneck in sort. Shuffle and reduce output are both considerably slower. So this sort of optimization may only have a significant impact for jobs whose maps do not produce much output. Doug
-
Re: short-circuiting HDFS readsstack 2009-01-08, 22:23
On Thu, Jan 8, 2009 at 10:39 AM, Doug Cutting <[EMAIL PROTECTED]> wrote:
> So this sort of optimization may only have a significant impact for jobs > whose maps do not produce much output. Don't forget the lowly non-MR users of HDFS. The short-circuit looks promising improving HDFS random access times. St.Ack
-
Re: short-circuiting HDFS readsJun Rao 2009-01-12, 04:16
Hi, George, I read the results in your JIRA. Very encouraging. It would be useful to test the improvement on both cold and warm data (warm data likely has larger improvment). There is a simple way to clear the file cache on Linux (http://www.linuxinsight.com/proc_sys_vm_drop_caches.html). An alternative approach is to build an in-memory caching layer on top of a DFS Client. The advantages are (1) less security issues; (2) probably even better performance since checksum can be avoided once the data is cached in memory; (3) the caching layer can be used anywhere, no just nodes owning a block locally. The disadvantage is that data is buffered twice in memory: once in the caching layer and once in the OS file cache. One can probably limit the OS file cache size (not sure if there is an easy way in Linux). What's your thought on this? Jun [EMAIL PROTECTED] wrote on 01/08/2009 10:13:25 AM: > Hi Jun, > > The earlier responses to your email reference the JIRA that I opened > about this issue. Short-circuiting the primary HDFS datapath does > improve throughput, and the amount depends on your workload (random > reads especially). Some initial experimental results are posted to that > JIRA. A second advantage is that since the JVM hosting the HDFS client > is doing the reading, the O/S will satisfy future disk requests from the > cache, which isn't really possible when you read over the network (even > to another JVM on the same host). > > There are several real disadvantages, the largest of which include 1) it > adds a new datapath, and 2) bypasses various security and auditing > features of HDFS. I would certainly like to think through a more clean > interface for achieving this goal, especially since reading local data > should be the common case. Any thoughts you might have would be > appreciated. > > Thanks, > George > > Jun Rao wrote: > > Hi, > > > > Today, HDFS always reads through a socket even when the data is local to > > the client. This adds a lot of overhead, especially for warm reads. It > > should be possible for a dfs client to test if a block to be read is local > > and if so, bypass socket and read through local FS api directly. This > > should improve random access performance significantly (e.g., for HBase). > > Has this been considered in HDFS? Thanks, > > > > Jun > > > >
-
Re: short-circuiting HDFS readsGeorge Porter 2009-01-12, 18:11
Hi Jun,
Thanks for the pointer to clear the disk cache, as well as the suggestion for creating a DFS Client cache layer. As for the double buffering overhead, I think that there is not going to be a large benefit to buffering in the DataNode, since the DataNode itself does not use the data in the buffer (it just forwards that data to HDFS clients). With the ability to perform zero-copy I/O operations, it probably shouldn't buffer any data at all, since it could just sendfile() the data directly from the disk to the network client via DMA, rather than copying it from disk into its memory space, then from memory to the socket. The downside of a DFS client cache is that it would need to be kept consistent, which would probably add a lot of complexity to the client. It is an interesting idea, though, and I think we should keep thinking about it. Thanks, George Jun Rao wrote: > > Hi, George, > > I read the results in your JIRA. Very encouraging. It would be useful > to test the improvement on both cold and warm data (warm data likely > has larger improvment). There is a simple way to clear the file cache > on Linux (http://www.linuxinsight.com/proc_sys_vm_drop_caches.html). > > An alternative approach is to build an in-memory caching layer on top > of a DFS Client. The advantages are (1) less security issues; (2) > probably even better performance since checksum can be avoided once > the data is cached in memory; (3) the caching layer can be used > anywhere, no just nodes owning a block locally. The disadvantage is > that data is buffered twice in memory: once in the caching layer and > once in the OS file cache. One can probably limit the OS file cache > size (not sure if there is an easy way in Linux). What's your thought > on this? > > Jun > > [EMAIL PROTECTED] wrote on 01/08/2009 10:13:25 AM: > > > Hi Jun, > > > > The earlier responses to your email reference the JIRA that I opened > > about this issue. Short-circuiting the primary HDFS datapath does > > improve throughput, and the amount depends on your workload (random > > reads especially). Some initial experimental results are posted to > that > > JIRA. A second advantage is that since the JVM hosting the HDFS client > > is doing the reading, the O/S will satisfy future disk requests from > the > > cache, which isn't really possible when you read over the network (even > > to another JVM on the same host). > > > > There are several real disadvantages, the largest of which include > 1) it > > adds a new datapath, and 2) bypasses various security and auditing > > features of HDFS. I would certainly like to think through a more clean > > interface for achieving this goal, especially since reading local data > > should be the common case. Any thoughts you might have would be > > appreciated. > > > > Thanks, > > George > > > > Jun Rao wrote: > > > Hi, > > > > > > Today, HDFS always reads through a socket even when the data is > local to > > > the client. This adds a lot of overhead, especially for warm reads. It > > > should be possible for a dfs client to test if a block to be read > is local > > > and if so, bypass socket and read through local FS api directly. This > > > should improve random access performance significantly (e.g., for > HBase). > > > Has this been considered in HDFS? Thanks, > > > > > > Jun > > > > > > >
-
Re: short-circuiting HDFS readsJun Rao 2009-01-13, 03:30
George,
Thanks for the comment. Note that HDFS files are never updated in-place. This should make it easy to deal with cache consistency. Jun IBM Almaden Research Center K55/B1, 650 Harry Road, San Jose, CA 95120-6099 [EMAIL PROTECTED] (408)927-1886 (phone) (408)927-3215 (fax) [EMAIL PROTECTED] wrote on 01/12/2009 10:11:08 AM: > Hi Jun, > > Thanks for the pointer to clear the disk cache, as well as the > suggestion for creating a DFS Client cache layer. As for the double > buffering overhead, I think that there is not going to be a large > benefit to buffering in the DataNode, since the DataNode itself does not > use the data in the buffer (it just forwards that data to HDFS > clients). With the ability to perform zero-copy I/O operations, it > probably shouldn't buffer any data at all, since it could just > sendfile() the data directly from the disk to the network client via > DMA, rather than copying it from disk into its memory space, then from > memory to the socket. The downside of a DFS client cache is that it > would need to be kept consistent, which would probably add a lot of > complexity to the client. It is an interesting idea, though, and I > think we should keep thinking about it. > > Thanks, > George > > Jun Rao wrote: > > > > Hi, George, > > > > I read the results in your JIRA. Very encouraging. It would be useful > > to test the improvement on both cold and warm data (warm data likely > > has larger improvment). There is a simple way to clear the file cache > > on Linux (http://www.linuxinsight.com/proc_sys_vm_drop_caches.html). > > > > An alternative approach is to build an in-memory caching layer on top > > of a DFS Client. The advantages are (1) less security issues; (2) > > probably even better performance since checksum can be avoided once > > the data is cached in memory; (3) the caching layer can be used > > anywhere, no just nodes owning a block locally. The disadvantage is > > that data is buffered twice in memory: once in the caching layer and > > once in the OS file cache. One can probably limit the OS file cache > > size (not sure if there is an easy way in Linux). What's your thought > > on this? > > > > Jun > > > > [EMAIL PROTECTED] wrote on 01/08/2009 10:13:25 AM: > > > > > Hi Jun, > > > > > > The earlier responses to your email reference the JIRA that I opened > > > about this issue. Short-circuiting the primary HDFS datapath does > > > improve throughput, and the amount depends on your workload (random > > > reads especially). Some initial experimental results are posted to > > that > > > JIRA. A second advantage is that since the JVM hosting the HDFS client > > > is doing the reading, the O/S will satisfy future disk requests from > > the > > > cache, which isn't really possible when you read over the network (even > > > to another JVM on the same host). > > > > > > There are several real disadvantages, the largest of which include > > 1) it > > > adds a new datapath, and 2) bypasses various security and auditing > > > features of HDFS. I would certainly like to think through a more clean > > > interface for achieving this goal, especially since reading local data > > > should be the common case. Any thoughts you might have would be > > > appreciated. > > > > > > Thanks, > > > George > > > > > > Jun Rao wrote: > > > > Hi, > > > > > > > > Today, HDFS always reads through a socket even when the data is > > local to > > > > the client. This adds a lot of overhead, especially for warm reads. It > > > > should be possible for a dfs client to test if a block to be read > > is local > > > > and if so, bypass socket and read through local FS api directly. This > > > > should improve random access performance significantly (e.g., for > > HBase). > > > > Has this been considered in HDFS? Thanks, > > > > > > > > Jun > > > > > > > > > >
-
Re: short-circuiting HDFS readsSanjay Radia 2009-02-13, 22:39
On Jan 8, 2009, at 10:13 AM, George Porter wrote: > Hi Jun, > > The earlier responses to your email reference the JIRA that I opened > about this issue. Short-circuiting the primary HDFS datapath does > improve throughput, and the amount depends on your workload (random > reads especially). Some initial experimental results are posted to > that > JIRA. A second advantage is that since the JVM hosting the HDFS > client > is doing the reading, the O/S will satisfy future disk requests from > the > cache, which isn't really possible when you read over the network > (even > to another JVM on the same host). > > There are several real disadvantages, the largest of which include > 1) it > adds a new datapath, and 2) bypasses various security and auditing > features of HDFS. > We are in middle of adding security to HDFS. Having the client read the blocks directly would violate security. Security is a specially thorny problem to solve in this case. Further the internal structure and hence the path name of the file are not visible outside. One could consider hacking this (ignoring security) but even this gets tricky as the directory in which the block is saved may change if some one starts to write to the file (which can happen with the recent append work), Interesting optimization but tricky to do in a clean way (at least not obvious to me). sanjay > I would certainly like to think through a more clean > interface for achieving this goal, especially since reading local data > should be the common case. Any thoughts you might have would be > appreciated. > > Thanks, > George > > Jun Rao wrote: > > Hi, > > > > Today, HDFS always reads through a socket even when the data is > local to > > the client. This adds a lot of overhead, especially for warm > reads. It > > should be possible for a dfs client to test if a block to be read > is local > > and if so, bypass socket and read through local FS api directly. > This > > should improve random access performance significantly (e.g., for > HBase). > > Has this been considered in HDFS? Thanks, > > > > Jun > > > > >
-
Re: short-circuiting HDFS readsDhruba Borthakur 2009-02-14, 05:45
Hi folks,
This is a very interesting discussion. We have been considering divvy-ing up a set of "old" netapp storage among a few number of nodes that run HDFS. This HDFS could be used to archive rarely used data from our production cluster. These netapp-storage could actually be mounted on all HDFS client machines, and it would be nice if there was a short-circuit protocol to make HDFS clients write directory to the block file. This is helful in the scenario when there is shared storage that can be accessed directly by datanodes as well as hdfs client. I understand this exposes problems with security and such. -dhruba On Fri, Feb 13, 2009 at 2:39 PM, Sanjay Radia <[EMAIL PROTECTED]> wrote: > > On Jan 8, 2009, at 10:13 AM, George Porter wrote: > > Hi Jun, >> >> The earlier responses to your email reference the JIRA that I opened >> about this issue. Short-circuiting the primary HDFS datapath does >> improve throughput, and the amount depends on your workload (random >> reads especially). Some initial experimental results are posted to that >> JIRA. A second advantage is that since the JVM hosting the HDFS client >> is doing the reading, the O/S will satisfy future disk requests from the >> cache, which isn't really possible when you read over the network (even >> to another JVM on the same host). >> >> There are several real disadvantages, the largest of which include 1) it >> adds a new datapath, and 2) bypasses various security and auditing >> features of HDFS. >> >> We are in middle of adding security to HDFS. > Having the client read the blocks directly would violate security. Security > is a specially thorny problem to solve in this case. > Further the internal structure and hence the path name of the file are not > visible outside. > One could consider hacking this (ignoring security) but even this gets > tricky as the directory in which the block is saved may change if > some one starts to write to the file (which can happen with the recent > append work), > > Interesting optimization but tricky to do in a clean way (at least not > obvious to me). > > > sanjay > > > > I would certainly like to think through a more clean >> interface for achieving this goal, especially since reading local data >> should be the common case. Any thoughts you might have would be >> appreciated. >> >> Thanks, >> George >> >> Jun Rao wrote: >> > Hi, >> > >> > Today, HDFS always reads through a socket even when the data is local to >> > the client. This adds a lot of overhead, especially for warm reads. It >> > should be possible for a dfs client to test if a block to be read is >> local >> > and if so, bypass socket and read through local FS api directly. This >> > should improve random access performance significantly (e.g., for >> HBase). >> > Has this been considered in HDFS? Thanks, >> > >> > Jun >> > >> > >> >> > |