-
Items to contribute (plan) Tatsuya Kawano 2011-01-23, 01:18
Hi,

I wanted to let you know that I'm planning to contribute the following items to the HBase community. These are spare-time projects and I can only spend about 7 hours a week on them, so progress will be very slow. I'd like some feedback from you to help prioritize them. Also, if someone or some team wants to work on them (with me or alone), I'll be happy to provide more details.

1. RADOS integration

Run HBase not only on HDFS but also on the RADOS distributed object store (the lower layer of Ceph), so that the following options become available to HBase users:

-- No SPOF (RADOS has no name node, only ZK-like monitors and data nodes)
-- Instant backup of HBase tables (RADOS provides copy-on-write snapshots per object pool)
-- Extra durability option for the WAL (RADOS can do both synchronous and asynchronous disk flush. HDFS doesn't have the former option)

Note:
RADOS object = HFile, WAL
object pool = group of HFiles or WAL

Current status: Design phase

2. mapreduce.HFileInputFormat

MR library to read data directly from HFiles. (Roughly 2.5 times faster than TableInputFormat in my tests)

Current status: Completed a proof-of-concept prototype and measured performance.

3. Enhance Get/Scan performance of RS

Add a hash code and a couple of flags to each HFile at flush time and change the scanner implementation so that:

-- Get/Scan operations get faster. (Fewer key comparisons for reconstructing a row: O(h * c) -> O(h). [h = number of HFiles for the row, c = number of columns in an HFile])
-- HFiles become a bit smaller. (The flags eliminate duplicate bytes in keys (row, column family and qualifier) from HFiles.)

Current status: Completed a proof-of-concept prototype and measured performance.

Details:
https://github.com/tatsuya6502/hbase-mr-pof/
(I meant "poc" not "pof"...)

4. Writing Japanese books and documents

-- Currently I'm authoring a book chapter about HBase for a Japanese NOSQL book
-- I'll translate The Apache HBase Book to Japanese

Thank you,

--
Tatsuya Kawano (Mr.)
Tokyo, Japan

http://twitter.com/#!/tatsuya6502
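
To make the O(h * c) -> O(h) claim in item 3 concrete, here is a toy count of row-key comparisons. It is not HBase code and not taken from the prototype: the `same_row_as_prev` flag, the `KeyValue` shape, and both counting functions are hypothetical, and they assume the flags work by marking a cell as belonging to the same row as the cell before it.

```python
from dataclasses import dataclass


@dataclass
class KeyValue:
    row: bytes
    qualifier: bytes
    same_row_as_prev: bool  # hypothetical flag, assumed to be written at flush time


def build_hfile(row, num_columns):
    """Toy 'HFile': num_columns cells, all belonging to the same row."""
    return [
        KeyValue(row, b"q%d" % i, same_row_as_prev=(i > 0))
        for i in range(num_columns)
    ]


def comparisons_without_flags(hfiles, row):
    """Compare the row key of every cell: h * c comparisons."""
    count = 0
    for hfile in hfiles:
        for kv in hfile:
            count += 1  # one full row-key comparison per cell
            if kv.row != row:
                break
    return count


def comparisons_with_flags(hfiles, row):
    """Compare only when the flag does not already say 'same row': h comparisons."""
    count = 0
    for hfile in hfiles:
        in_row = False
        for kv in hfile:
            if in_row and kv.same_row_as_prev:
                continue  # the flag answers the question, no key comparison
            count += 1
            in_row = kv.row == row
            if not in_row:
                break
    return count


if __name__ == "__main__":
    hfiles = [build_hfile(b"row1", 100) for _ in range(4)]  # h = 4, c = 100
    print(comparisons_without_flags(hfiles, b"row1"))
    print(comparisons_with_flags(hfiles, b"row1"))
```

With h = 4 files and c = 100 columns per file, the first function performs one comparison per cell and the second one comparison per file, which matches the orders of growth quoted above under this assumed flag design.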
-
Re: Items to contribute (plan) Ted Yu 2011-01-23, 01:45
#1 looks similar to what MapR has done.
-
Re: Items to contribute (plan) Ryan Rawson 2011-01-23, 01:48
Hopefully to do #1, you would not require many/any changes in HFile or HBase. Implementing the HDFS stream API should be enough.

#2 is interesting, what is the benefit? How did you measure said benefit?

-ryan
-
Re: Items to contribute (plan) Stack 2011-01-24, 17:54
On Sat, Jan 22, 2011 at 5:18 PM, Tatsuya Kawano <[EMAIL PROTECTED]> wrote:
> 1. RADOS integration
>
> Run HBase not only on HDFS but also RADOS distributed object store (the lower layer of Ceph), so that the following options will become available to HBase users:
>
> -- No SPOF (RADOS doesn't have the name node(s), but only ZK-like monitors and data nodes)
> -- Instant backup of HBase tables (RADOS provides copy-on-write snapshot per object pool)
> -- Extra durability option on WAL (RADOS can do both synchronous and asynchronous disk flush. HDFS doesn't have the earlier option)
>
> Note:
> RADOS object = HFile, WAL
> object pool = group of HFiles or WAL
>
> Current status: Design phase
>

I know a few people are interested in this, Tatsuya, so I would suggest that you open an issue now and work publicly.

> 2. mapreduce.HFileInputFormat
>
> MR library to read data directly from HFiles. (Roughly 2.5 times faster than TableInputFormat in my tests)
>
> Current status: Completed a proof-of-concept prototype and measured performance.
>

What about the in-memory edits? Or are you thinking of reading the WALs too?

> 3. Enhance Get/Scan performance of RS
>
> Add an hash code and a couple of flags to HFile at the flush time and change scanner implementation so that:
>
> -- Get/Scan operations will get faster. (less key comparisons for reconstructing a row: O(h * c) -> O(h). [h = number of HFiles for the row, c = number of columns in an HFile])
> -- The size of HFiles will become a bit smaller. (The flags will eliminate duplicate bytes in keys (row, column family and qualifier) from HFiles.)
>
> Current status: Completed a proof-of-concept prototype and measured performance.
>
> Details:
> https://github.com/tatsuya6502/hbase-mr-pof/
> (I meant "poc" not "pof"...)
>

Sounds great.

> 4. Writing Japanese books and documents
>
> -- Currently I'm authoring a book chapter about HBase for a Japanese NOSQL book
> -- I'll translate The Apache HBase Book to Japanese
>

All of the above sound great, Tatsuya.

Thanks,
St.Ack
-
Re: Items to contribute (plan) Yifeng Jiang 2011-01-25, 02:03
#4. Writing Japanese books and documents
I would be glad to work on this one with you.

--
Yifeng Jiang
-
Re: Items to contribute (plan) Tatsuya Kawano 2011-01-25, 04:06
Thanks all for your replies.

So besides the Japanese book chapter, which has a strict deadline, item 1 "RADOS (Ceph) integration" will be the first thing to work on. I'll open an issue and work publicly as Stack suggested. Hopefully, I'll start to play with the API soon and write a design proposal for critique.

Item 2 "mapreduce.HFileInputFormat" will come next, but it seems we need more discussion of the features and benefits so we can be sure it's worth working on. I'll post a separate reply later.

For now, item 3 "Enhance Get/Scan performance on RS" gets the lowest priority, and I'll work on it as a personal project.

- Tatsuya

--
Tatsuya Kawano
Tokyo, Japan
-
Re: Items to contribute (plan) Andrew Purtell 2011-01-25, 04:45
Count me as interested in #1 also.

Best regards,

- Andy
-
Re: Items to contribute (plan) Amandeep Khurana 2011-01-25, 05:27
I'd be interested in #1 too... I did some work on getting HBase to run on Ceph some time back and it seemed promising, but the Ceph client wasn't stable enough to take the load.
-
Re: Items to contribute (plan) Tatsuya Kawano 2011-01-26, 03:34
Hi Yifeng,

> #4. Writing Japanese books and documents
> I am glad if I can work on this one with you.

Thanks for your offer. Let me explain a bit more about them.

>> -- Currently I'm authoring a book chapter about HBase for a Japanese NOSQL book

This one is a commercial book from a Japanese publisher, so I'll do it by myself.

>> -- I'll translate The Apache HBase Book to Japanese

This one comes with HBase, and I'm looking for people (like you) to work with.

http://hbase.apache.org/book.html

I created a Jira entry to track this task:
https://issues.apache.org/jira/browse/HBASE-3391

Are you working at Rakuten in Tokyo? Maybe we can meet at the next Hadoop Source Code Reading at Rakuten Tower. Do you know this event?

Thanks,
Tatsuya

--
Tatsuya Kawano (Mr.)
Tokyo, Japan
-
Re: Items to contribute (plan) Yifeng Jiang 2011-01-26, 05:14
Hi Tatsuya,

> This one is a commercial book from a Japanese publisher, so I'll do this by myself.

I see. I'll be looking forward to reading your book.

> I created a Jira entry to track this task:
> https://issues.apache.org/jira/browse/HBASE-3391

Good job. I'm glad to work with you on the translation.
Maybe I can translate the book to Chinese, too.

> Are you working at Rakuten in Tokyo? Maybe we can meet at next Hadoop Source Code Reading at Rakuten Tower. Do you know this event?

I work for Rakuten at the Osaka branch, so I don't know if I can go to Tokyo to meet you at the next Hadoop Source Code Reading event.

May I send you a greeting email from my personal address, and talk about your plan via email first?

Thanks,

Yifeng Jiang
-
Re: Items to contribute (plan) Tatsuya Kawano 2011-01-26, 08:11
Hi Yifeng

> Maybe I can translate the book to Chinese, too.

> May I send you a greeting email from my personal address, and talk about your plan via email at first?

Sure,

--
Tatsuya Kawano (Mr.)
Tokyo, Japan
-
Re: Items to contribute (plan) Tatsuya Kawano 2011-01-26, 08:14
Hi Yifeng,

Sorry for the last email. I hit the Send button on my iPhone by accident.

> Maybe I can translate the book to Chinese, too.

It sounds great!

> May I send you a greeting email from my personal address, and talk about your plan via email at first?

Sure, let's take this discussion offline.

- Tatsuya

--
Tatsuya Kawano (Mr.)
Tokyo, Japan
-
Re: Items to contribute (plan) Tatsuya Kawano 2011-01-26, 11:32
Hi Ryan,

>> 2. mapreduce.HFileInputFormat
>>
>> MR library to read data directly from HFiles. (Roughly 2.5 times faster than TableInputFormat in my tests)
>>
>> Current status: Completed a proof-of-concept prototype and measured performance.

> On Jan 23, 2011, Ryan Rawson wrote:
>> #2 is interesting, what is the benefit? How did you measure said benefit?

I have only performed simplified tests: a single test thread on a single server. It was not even an MR job, just a simple program that scans through all the rows in the table. I'll definitely need deeper tests in a clustered environment to get more realistic results.

The related test programs can be found here (V1 is the one):
https://github.com/tatsuya6502/hbase-mr-pof

And the chart comparing throughput on RS, HFileInputFormat and HDFS SequenceFile:
http://github.com/tatsuya6502/hbase-mr-pof/raw/master/docs/performance_comparison_0821_2010.pdf

Please note: The disk drive attached to the EC2 instance was slow, so for this particular test I used a table small enough for the whole contents of the files to fit in Linux's disk read cache, ran each test twice, and recorded only the second result. (I restarted the RS between the first and second runs to clear its block cache.)

One interesting thing I saw in the results was that HDFS SequenceFile didn't scale well in my environment. SequenceFile needed more processor power than HFile and suffered from a processor bottleneck. CPU utilization was about 100% for SequenceFile and about 30% for HFile throughout the tests.

- Tatsuya

--
Tatsuya Kawano
Tokyo, Japan
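
The measurement procedure described in the message above (run the scan twice, clear the RS block cache in between, record only the second run) could be sketched roughly as follows. This is not the code from the linked repository; `WarmCacheBenchmark`, `measureSecondRun` and the `betweenRuns` hook are hypothetical names, and the scan is passed in as a callback rather than being a real HBase call.

```java
import java.util.function.LongSupplier;

/** Hypothetical harness: time only the second of two runs, as described in the message. */
final class WarmCacheBenchmark {

    /** Row count and elapsed time of the recorded (second) run. */
    static final class Result {
        final long rows;
        final long elapsedNanos;

        Result(long rows, long elapsedNanos) {
            this.rows = rows;
            this.elapsedNanos = elapsedNanos;
        }

        double rowsPerSecond() {
            return rows / (elapsedNanos / 1e9);
        }
    }

    /**
     * @param scan        reads the whole table and returns the number of rows read
     * @param betweenRuns assumed hook, e.g. restarting the RS to clear its block cache
     */
    static Result measureSecondRun(LongSupplier scan, Runnable betweenRuns) {
        scan.getAsLong();      // first run only warms the OS disk read cache; result discarded
        betweenRuns.run();     // clear application-level caches before the timed run
        long start = System.nanoTime();
        long rows = scan.getAsLong();
        long elapsed = Math.max(1L, System.nanoTime() - start); // avoid a zero divisor
        return new Result(rows, elapsed);
    }

    public static void main(String[] args) {
        long[] fakeTable = new long[1_000_000]; // stand-in for a table scan
        Result r = measureSecondRun(() -> {
            long n = 0;
            for (long ignored : fakeTable) {
                n++;
            }
            return n;
        }, () -> { /* nothing to restart in this stand-in */ });
        System.out.printf("%d rows, %.0f rows/s%n", r.rows, r.rowsPerSecond());
    }
}
```

Because the first run is discarded, the recorded figure reflects CPU and in-process overhead rather than disk speed, which is consistent with the CPU-bound behaviour reported for SequenceFile.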
-
Re: Items to contribute (plan) Tatsuya Kawano 2011-01-26, 11:42
Hi Stack,

On Jan 25, 2011, Stack wrote:
>> 2. mapreduce.HFileInputFormat
>>
>> MR library to read data directly from HFiles. (Roughly 2.5 times faster than TableInputFormat in my tests)
>>
>> Current status: Completed a proof-of-concept prototype and measured performance.
>>
>
> What about the in-memory edits? Or you thinking of reading the WALs too?

My prototype doesn't read in-memory edits, so you have to flush the table before running your MR job. To read in-memory edits, I would create a special scanner in the RS that reads KeyValues only from the MemTable. I'll also add an observer to the RS to watch for region flush events.

Also, my prototype doesn't deal with region compactions, so the MR job will fail if the compaction threads delete old HFiles after a minor/major compaction. I need to find a solution for this too.

- Tatsuya

--
Tatsuya Kawano (Mr.)
Tokyo, Japan
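
The compaction problem mentioned above can be illustrated with a small stand-alone sketch. It is an assumption-laden toy, not the prototype's code: the local filesystem stands in for HDFS, empty files stand in for HFiles, and `StaleHFileListDemo` with all of its methods is hypothetical. It only shows why a file list captured at job-planning time can go stale; it does not propose a fix.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Toy model: a job plans its inputs, then a compaction replaces them. */
final class StaleHFileListDemo {

    /** Creates a temporary "store directory" holding empty stand-in files. */
    static Path newStoreDir(int fileCount) {
        try {
            Path dir = Files.createTempDirectory("store-");
            for (int i = 0; i < fileCount; i++) {
                Files.createFile(dir.resolve("hfile-" + i));
            }
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Job planning: snapshot the files currently in the store directory. */
    static List<Path> planInputs(Path storeDir) {
        try (Stream<Path> files = Files.list(storeDir)) {
            return files.sorted().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Simulated compaction: write one merged file, then delete the inputs. */
    static Path compact(Path storeDir, List<Path> inputs) {
        try {
            Path merged = Files.createFile(storeDir.resolve("hfile-compacted"));
            for (Path input : inputs) {
                Files.delete(input);
            }
            return merged;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the planned inputs that no longer exist when tasks start reading. */
    static List<Path> missingInputs(List<Path> planned) {
        List<Path> missing = new ArrayList<>();
        for (Path p : planned) {
            if (!Files.exists(p)) {
                missing.add(p);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Path store = newStoreDir(3);
        List<Path> planned = planInputs(store);   // the job decides what to read
        compact(store, planned);                  // compaction runs before the tasks do
        System.out.println("planned: " + planned.size()
                + ", missing at read time: " + missingInputs(planned).size());
    }
}
```

In this toy every planned input is gone by the time the tasks would open it, which corresponds to the job failure described in the message.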