Pankaj Gupta
2013-03-20, 07:34
Jean-Daniel Cryans
2013-03-20, 17:30
Enis Söztutar
2013-03-21, 20:28
Pankaj Gupta
2013-03-24, 00:31
Ted Yu
2013-03-24, 01:13
Liyin Tang
2013-03-24, 04:44
lars hofhansl
2013-03-24, 05:20
Liyin Tang
2013-03-25, 05:15
Enis Söztutar
2013-03-25, 18:24
Andrew Purtell
2013-03-25, 18:42
Liyin Tang
2013-03-25, 19:18
Enis Söztutar
2013-03-25, 20:26
谢良
2013-03-26, 03:01
-
Does HBase RegionServer benefit from OS Page Cache
Pankaj Gupta 2013-03-20, 07:34
Given that HBase has its own cache (block cache and bloom filters) and that all the table data is stored in HDFS, I'm wondering if HBase benefits from the OS page cache at all. In the setup I'm using, HBase Region Servers run on the same boxes as the HDFS data nodes. In such a scenario, if the underlying HLog files live on the same machine, then having a healthy memory surplus may mean that the data node can serve the underlying files from the page cache, thus improving HBase performance. Is this really the case? (I guess the page cache should also help when the HLog file lives on a different machine, but in that case network I/O will probably drown out the speedup gained from not hitting the disk.)

I'm asking because if the page cache were useful, then in an HBase setup not giving all the memory on the machine to the region server may not be that bad. The reason one would not want to use all the memory for the region server is the long garbage collection pauses that a large heap may induce. I understand that work has been done to fix the long pauses caused by memory fragmentation in the old generation of the mostly-concurrent garbage collector, by using a slab allocator for the memstore, but that feature is marked experimental and we're not ready to take risks yet. So if the page cache is useful in any way on Region Servers, we could give less memory to the RegionServer process, knowing that the free memory on the machine is not going completely to waste. Hence my curiosity about how useful the OS page cache is to HBase performance.

Thanks in advance,
Pankaj
-
Re: Does HBase RegionServer benefit from OS Page Cache
Jean-Daniel Cryans 2013-03-20, 17:30
First, MSLAB has been enabled by default since 0.92.0, as it was deemed stable enough. So, unless you are on 0.90, you are already using it.

Also, I'm not sure why you are referencing the HLog in your first paragraph in the context of reading from disk, because the HLogs are rarely read (only on recovery). Maybe you meant HFile?

In any case, your email covers most arguments except for one: checksumming. Retrieving a block from HDFS, even when using short-circuit reads to go directly to the OS instead of passing through the DN, will take quite a bit more time than reading directly from the block cache. This is why, even if you disable block caching on a family, the index and root blocks will still be block cached: reading those very hot blocks from disk would take way too long.

Regarding your main question (how does the OS buffer help?), I don't have a good answer. It kind of depends on the amount of RAM you have and what your workload is like. As a data point, I've been successful running with 24GB of heap (50% dedicated to the block cache) with a workload consisting mainly of small writes, short scans, and a typical random read distribution for a website. I can't remember the last time I saw a full GC, and it's been running for more than a year like this.

Hope this somehow helps,

J-D
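
To make the data point above concrete, here is a minimal sketch of how a 24GB heap with 50% block cache splits up a machine's memory. The 24GB and 50% figures come from the message; the 48GB machine size and the 4GB reserved for other processes are assumptions chosen only for illustration, and `memory_budget` is a hypothetical helper, not part of HBase.

```python
def memory_budget(total_ram_gib, heap_gib, block_cache_fraction, other_gib=0.0):
    """Split a machine's RAM into block cache, rest of heap, and page-cache headroom.

    `other_gib` stands for the DataNode JVM, the OS itself, and anything
    else resident on the box (an illustrative assumption).
    """
    block_cache_gib = heap_gib * block_cache_fraction
    page_cache_gib = total_ram_gib - heap_gib - other_gib
    if page_cache_gib < 0:
        raise ValueError("heap plus other processes exceed physical RAM")
    return {
        "block_cache_gib": block_cache_gib,
        "rest_of_heap_gib": heap_gib - block_cache_gib,
        "page_cache_headroom_gib": page_cache_gib,
    }


# 24 GB heap and 50% block cache are from the thread;
# the 48 GB machine and 4 GB for other processes are assumed.
budget = memory_budget(total_ram_gib=48, heap_gib=24,
                       block_cache_fraction=0.5, other_gib=4)
print(budget)
```

Whatever is left in `page_cache_headroom_gib` is what the kernel can use to cache HFile data on the local disks, so it is not wasted even though the RegionServer never addresses it directly.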
-
Re: Does HBase RegionServer benefit from OS Page Cache
Enis Söztutar 2013-03-21, 20:28
I think the page cache is not totally useless, but as long as you can control the GC, you should prefer the block cache. Some of the reasons off the top of my head:
- In case of a cache hit in the OS cache, you have to go through the DN layer (an RPC if short-circuit reads are disabled), do a kernel jump, and read using the read() libc call, whereas for reading a block from the block cache only the HBase process is involved. There is no process switch and no kernel jump.
- The read access path is optimized per HFile block. FS page boundaries and HFile block boundaries are not aligned at all.
- There is very little control over the page cache to cache or not cache based on expected access patterns. For example, we can mark META region blocks, some column families, and HFile index blocks as always cached or cached with high priority. Also, for full table scans, we can explicitly disable block caching so as not to trash the current working set. With the OS page cache, you do not have this control.

Enis
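
The alignment point in the second bullet can be illustrated with a little arithmetic. This is only a sketch: the 4 KiB page size and the 64 KiB block size are assumed typical values, and on-disk blocks are usually compressed and therefore of varying length, which makes alignment even less likely.

```python
def pages_touched(offset, length, page_size=4096):
    """Return how many OS pages the byte range [offset, offset + length) overlaps."""
    if length <= 0:
        return 0
    first_page = offset // page_size
    last_page = (offset + length - 1) // page_size
    return last_page - first_page + 1


block_size = 64 * 1024  # assumed HFile block size
aligned = pages_touched(offset=0, length=block_size)
unaligned = pages_touched(offset=100, length=block_size)
print(aligned, unaligned)
```

A block that starts on a page boundary overlaps exactly `block_size / page_size` pages; one that starts anywhere else overlaps one more, and the first and last of those pages also hold bytes belonging to neighbouring blocks.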
-
Re: Does HBase RegionServer benefit from OS Page Cache
Pankaj Gupta 2013-03-24, 00:31
Thanks a lot for the explanation. It's good to know that MSLAB is stable and safe to enable (we don't have it enabled right now; we're using 0.92). This would allow us to allocate memory to HBase more freely. I really enjoyed the depth of explanation from both Enis and J-D. I was indeed mistakenly referring to HFile as HLog; fortunately you were still able to understand my question.

Thanks,
Pankaj
-
Re: Does HBase RegionServer benefit from OS Page Cache
Ted Yu 2013-03-24, 01:13
Coming up is the following enhancement, which would make MSLAB even better:

HBASE-8163 MemStoreChunkPool: An improvement for JAVA GC when using MSLAB

FYI
-
Re: Does HBase RegionServer benefit from OS Page Cache
Liyin Tang 2013-03-24, 04:44
We (Facebook) are closely monitoring the OS page cache hit ratio in our production environments. My experience is that if your data access pattern is very random, the OS page cache won't help you much even when data locality is very high. On the other hand, if the requests are always against recent data points, the page cache hit ratio can be much higher.

Actually, there are lots of optimizations that could be done in HDFS. For example, we are working on using fadvise to drop the 2nd/3rd replicas from the OS page cache, which could potentially improve the effective page cache by 3X. Also, by taking advantage of tier-based compaction plus fadvise in HDFS, the region server could keep more hot data in the OS page cache based on the read access pattern.

Another, separate point is that we probably should NOT rely on the memstore/block cache to keep hot data. 1) The more data in the memstore, the more data the region server needs to recover after a server failure, so the tradeoff is recovery time. 2) The blocks in the block cache naturally become invalid soon after compactions. So the region server probably won't benefit from a large JVM heap at all.

Thanks a lot
Liyin
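
The fadvise mechanism mentioned above is an ordinary system call, and a minimal sketch of it looks like the following. This is not the HDFS change being described, only the underlying primitive; `drop_from_page_cache` is a hypothetical helper name, and `os.posix_fadvise` is only available on some Unix platforms, so the sketch checks for it first.

```python
import os
import tempfile


def drop_from_page_cache(path):
    """Advise the kernel that cached pages of `path` are no longer needed.

    Returns False when the platform has no posix_fadvise. The call is
    advisory, and dirty pages are not dropped until they are written back.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 with len=0 covers the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return True


# Demo on a throwaway file standing in for a non-primary block replica.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 8192)
    tmp.flush()
    os.fsync(tmp.fileno())  # make the pages clean so they can be dropped
    demo_path = tmp.name

advised = drop_from_page_cache(demo_path)
with open(demo_path, "rb") as f:
    still_readable = len(f.read()) == 8192  # data itself is untouched
os.remove(demo_path)
print(advised, still_readable)
```

With three replicas, keeping only the primary copy of each block in the page cache would leave roughly three times as much room for distinct data, which is where the 3X figure in the message comes from.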
-
Re: Does HBase RegionServer benefit from OS Page Cache
lars hofhansl 2013-03-24, 05:20
Interesting.

> 2) The blocks in the block cache will be naturally invalid quickly after the compactions.

Should one keep the block cache small in order to increase the OS page cache?

Does your data suggest we should not use the block cache at all?

Thanks.

-- Lars
-
RE: Does HBase RegionServer benefit from OS Page Cache
Liyin Tang 2013-03-25, 05:15
The block cache holds uncompressed data, while the OS page cache contains the compressed data. Unless the request pattern is a full-table sequential scan, the block cache is still quite useful. I think the size of the block cache should be the amount of hot data we want to retain within a compaction cycle, which is quite hard to estimate in some use cases.

Thanks a lot
Liyin
-
Re: Does HBase RegionServer benefit from OS Page CacheEnis Söztutar 2013-03-25, 18:24
Thanks Liyin for sharing your use cases.
Related to those, I was thinking of two improvements:
- AFAIK, MySQL keeps both the compressed and uncompressed versions of the blocks in its block cache, falling back to the compressed one if the decompressed one gets evicted. With very large heaps, maybe keeping the compressed blocks around in a secondary cache makes sense?
- A compaction will trash the cache. But maybe we can track the keyvalues that are cached (those inside cached blocks) for the files in the compaction, and mark the blocks of the resulting compacted file that contain previously cached keyvalues to be cached after the compaction. I have to research the feasibility of this approach.

Enis

On Sun, Mar 24, 2013 at 10:15 PM, Liyin Tang <[EMAIL PROTECTED]> wrote:
> Block cache is for uncompressed data while the OS page cache contains the compressed data. Unless the request pattern is a full-table sequential scan, the block cache is still quite useful. I think the size of the block cache should be the amount of hot data we want to retain within a compaction cycle, which is quite hard to estimate in some use cases.
>
> Thanks a lot
> Liyin
> ________________________________________
> From: lars hofhansl [[EMAIL PROTECTED]]
> Sent: Saturday, March 23, 2013 10:20 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Does HBase RegionServer benefit from OS Page Cache
>
> Interesting.
>
> > 2) The blocks in the block cache will be naturally invalid quickly after the compactions.
>
> Should one keep the block cache small in order to increase the OS page cache?
>
> Does your data suggest we should not use the block cache at all?
>
> Thanks.
>
> -- Lars
>
> ________________________________
> From: Liyin Tang <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Saturday, March 23, 2013 9:44 PM
> Subject: Re: Does HBase RegionServer benefit from OS Page Cache
>
> We (Facebook) are closely monitoring the OS page cache hit ratio in the production environments. My experience is that if your data access pattern is very random, then the OS page cache won't help you much even though the data locality is very high. On the other hand, if the requests are always against the recent data points, then the page cache hit ratio could be much higher.
>
> Actually, there are lots of optimizations that could be done in HDFS. For example, we are working on using fadvise to drop the 2nd/3rd replicas from the OS page cache, so that it could potentially improve your OS page cache by 3X. Also, by taking advantage of tier-based compaction plus fadvise in HDFS, the region server could keep more hot data in the OS page cache based on the read access pattern.
>
> Another separate point is that we probably should NOT rely on the memstore/block cache to keep hot data. 1) The more data in the memstore, the more data the region server needs to recover after server failures. So the tradeoff is the recovery time. 2) The blocks in the block cache will be naturally invalid quickly after the compactions. So the region server probably won't benefit from a large JVM size at all.
>
> Thanks a lot
> Liyin
>
> On Sat, Mar 23, 2013 at 6:13 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > Coming up is the following enhancement which would make MSLAB even better:
> >
> > HBASE-8163 MemStoreChunkPool: An improvement for JAVA GC when using MSLAB
> >
> > FYI
> >
> > On Sat, Mar 23, 2013 at 5:31 PM, Pankaj Gupta <[EMAIL PROTECTED]> wrote:
> > > Thanks a lot for the explanation. It's good to know that MSLAB is stable and safe to enable (we don't have it enabled right now, we're using 0.92). This would allow us to more freely allocate memory to HBase. I really enjoyed the depth of explanation from both Enis and J-D. I was indeed mistakenly referring to HFile as HLog; fortunately you were still able to understand my question.
> > >
> > > Thanks,
> > > Pankaj
> > > On Mar 21, 2013, at 1:28 PM, Enis Söztutar <[EMAIL PROTECTED]> wrote:
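The secondary-cache idea raised in this message (keep hot blocks uncompressed, and fall back to a compressed copy once the uncompressed one is evicted) can be illustrated with a minimal Java sketch. This is only an illustration under stated assumptions, not HBase's or MySQL's implementation: the class name `TwoTierBlockCache`, the string block keys, the entry-count capacities, and the use of `java.util.zip` for compression are all hypothetical choices made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical two-tier cache: L1 holds uncompressed blocks, L2 holds compressed ones. */
final class TwoTierBlockCache {
    private final int l1Capacity; // max uncompressed blocks (assumed unit: entries)
    private final int l2Capacity; // max compressed blocks (assumed unit: entries)
    // Access-ordered maps: iteration starts at the least recently used entry.
    private final LinkedHashMap<String, byte[]> l1 = new LinkedHashMap<>(16, 0.75f, true);
    private final LinkedHashMap<String, byte[]> l2 = new LinkedHashMap<>(16, 0.75f, true);

    TwoTierBlockCache(int l1Capacity, int l2Capacity) {
        this.l1Capacity = l1Capacity;
        this.l2Capacity = l2Capacity;
    }

    /** Caches an uncompressed block in L1, demoting older blocks to L2 if needed. */
    synchronized void put(String blockKey, byte[] uncompressed) {
        l2.remove(blockKey); // avoid keeping a stale compressed copy
        l1.put(blockKey, uncompressed);
        demoteOverflow();
    }

    /** Returns the block, or null on a full miss (the caller would then read from disk). */
    synchronized byte[] get(String blockKey) {
        byte[] hit = l1.get(blockKey);
        if (hit != null) {
            return hit;
        }
        byte[] packed = l2.remove(blockKey);
        if (packed == null) {
            return null;
        }
        byte[] block = decompress(packed); // costs CPU, but avoids an I/O
        l1.put(blockKey, block);           // promote back to L1
        demoteOverflow();
        return block;
    }

    synchronized boolean inL1(String blockKey) { return l1.containsKey(blockKey); }
    synchronized boolean inL2(String blockKey) { return l2.containsKey(blockKey); }

    /** Moves least recently used L1 blocks to L2 (compressed), then trims L2. */
    private void demoteOverflow() {
        while (l1.size() > l1Capacity) {
            Map.Entry<String, byte[]> eldest = l1.entrySet().iterator().next();
            String key = eldest.getKey();
            byte[] value = eldest.getValue();
            l1.remove(key);
            l2.put(key, compress(value));
        }
        while (l2.size() > l2Capacity) {
            String eldestKey = l2.keySet().iterator().next();
            l2.remove(eldestKey); // dropped entirely; next read is a real miss
        }
    }

    static byte[] compress(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated compressed block");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed block", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

In this sketch a block evicted from L1 is recompressed on demotion; a real implementation would more likely keep the on-disk compressed bytes it already read, and would size both tiers in bytes rather than entries.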
-
Re: Does HBase RegionServer benefit from OS Page Cache
Andrew Purtell
2013-03-25, 18:42
> With very large heaps, maybe keeping around the compressed blocks in a secondary cache makes sense?

That's an interesting idea.

> A compaction will trash the cache. But maybe we can track keyvalues (inside cached blocks are cached) for the files in the compaction, and mark the blocks of the resulting compacted file which contain previously cached keyvalues to be cached after the compaction.

With very large heaps and a GC that can handle them (perhaps the G1 GC), another option which might be worth experimenting with is a value (KV) cache independent of the block cache which could be enabled on a per-table basis. This would not be trashed by compaction, though we'd need to do some additional housekeeping to evict deleted cells from the value cache, and could be useful if collectively RAM on the cluster is sufficient to hold the whole working set in memory (for the selected tables).

Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
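A rough Java sketch of the value (KV) cache described in this message follows. It is a sketch under assumptions, not an existing HBase API: the class `PerTableValueCache`, its method names, the string cell coordinates, and the entry-count limit are hypothetical. The point it illustrates is that entries are keyed by cell coordinates rather than by file or block, so rewriting files during a compaction does not invalidate them, while deletes must be propagated explicitly.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-table cell cache, keyed by cell coordinates instead of HFile blocks. */
final class PerTableValueCache {
    private final Set<String> enabledTables = new HashSet<>();
    private final Map<String, byte[]> cells;

    PerTableValueCache(final int maxEntries) {
        // Access-ordered LinkedHashMap that drops the least recently used cell when full.
        this.cells = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Opts a table in; cells of other tables are never cached. */
    synchronized void enableTable(String table) {
        enabledTables.add(table);
    }

    synchronized void put(String table, String row, String column, byte[] value) {
        if (enabledTables.contains(table)) {
            cells.put(key(table, row, column), value);
        }
    }

    /** Returns the cached value, or null if the cell is not cached. */
    synchronized byte[] get(String table, String row, String column) {
        return cells.get(key(table, row, column));
    }

    /** Housekeeping hook: must be called when a cell is deleted so stale data is not served. */
    synchronized void onDelete(String table, String row, String column) {
        cells.remove(key(table, row, column));
    }

    // Assumes table, row and column names do not contain the NUL separator.
    private static String key(String table, String row, String column) {
        return table + '\0' + row + '\0' + column;
    }
}
```

Note that nothing in this sketch needs to run at compaction time, which is the property the message is after; the cost is the extra bookkeeping on deletes (and, in a real system, on overwrites and TTL expiry, which are not modeled here).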
-
RE: Does HBase RegionServer benefit from OS Page Cache
Liyin Tang
2013-03-25, 19:18
Hi Enis,
Good ideas! The HBase community is already working on both items:
1) [HBASE-7404]: L1/L2 block cache
2) [HBASE-5263]: Preserving cached data on compactions through cache-on-write

Thanks a lot
Liyin
-
Re: Does HBase RegionServer benefit from OS Page Cache
Enis Söztutar
2013-03-25, 20:26
> With very large heaps and a GC that can handle them (perhaps the G1 GC), another option which might be worth experimenting with is a value (KV) cache independent of the block cache which could be enabled on a per-table basis

Thanks Andy for bringing this up. We had some discussions a while ago about a row cache (or KV cache):
http://search-hadoop.com/m/XTlxT1xRtYw/hbase+key+value+cache+from%253Aenis&subj=RE+keyvalue+cache

The takeaway was that if you are mostly doing point gets, rather than scans, this cache might be better.

> 1) [HBASE-7404]: L1/L2 block cache

I knew about the bucket cache, but not that it could hold compressed blocks. Is that the case, or are you suggesting we could add that to this L2 cache?

> 2) [HBASE-5263] Preserving cached data on compactions through cache-on-write

Thanks, this is the same idea. I'll track the ticket.

Enis
-
Re: Does HBase RegionServer benefit from OS Page Cache
谢良
2013-03-26, 03:01
Maybe we should adopt some ideas from the RDBMS world?

In the MySQL area: the InnoDB storage engine has a buffer pool (just like the current block cache) which, in the latest InnoDB version, caches both compressed and uncompressed pages and introduces an adaptive LRU algorithm; see http://dev.mysql.com/doc/innodb/1.1/en/innodb-compression-internals.html. In short, in my view it handles this detail more subtly than the LevelDB and HBase implementations. Indeed, we (Xiaomi) already had a plan to develop and evaluate this (we logged it in our internal Phabricator system earlier); hopefully we can contribute it to the community in the future.

Another storage engine, Falcon, has a "Row Cache" feature, which is similar to what Enis mentioned. It is friendlier to random-read scenarios.

In MySQL, every user table can choose a preferred storage engine, so my point is: maybe we need to consider supporting a more configurable cache strategy at per-table granularity.

Regards,
Liang
|