-
[DISCUSSION] Making HBaseWriter default
Eric Yang 2010-11-20, 20:15
Hi all,
To use the full feature set of Chukwa in trunk, HBase is required to display data on HICC. Has anyone had good success using HBase+HICC? I am leaning toward making HBase the default data storage for Chukwa, so that the default Chukwa collector configuration would use HBaseWriter. How does the community feel about changing the default writer config?

regards,
Eric
-
Re: [DISCUSSION] Making HBaseWriter default
James Seigel 2010-11-20, 20:20
Hello!
As a high-volume user, I was just wondering how the HBaseWriter compares with the current one under load. Better or worse, and by how much?

Cheers
James.
-
Re: [DISCUSSION] Making HBaseWriter default
Eric Yang 2010-11-21, 05:02
Hi James,
In my 10-node cluster, it used to take 7 minutes (3 minutes of M/R + 4 minutes of loading into MySQL) to process data and be able to visualize it on the HICC UI. Now it takes 50 milliseconds. For data aggregation, it used to take 15-20 minutes to roll up a day's data for 2000 nodes; now it takes <5 minutes. The improvement is 2100 times better for data load latency, and 3 times better for data analytics throughput with Pig+HBase.

regards,
Eric
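
The ratios behind the figures in Eric's message can be checked with a short calculation. This is only a sketch that takes the quoted numbers at face value; the variable names are illustrative. Note that 7 minutes against 50 ms works out to roughly 8400x, so the 2100x figure presumably rests on a different baseline or a higher typical latency (an assumption; the thread does not say).

```python
# Figures quoted in the message, normalised to milliseconds.
old_latency_ms = 7 * 60 * 1000   # 3 min M/R + 4 min MySQL load
new_latency_ms = 50              # HBaseWriter path, as reported

latency_speedup = old_latency_ms / new_latency_ms

# Daily roll-up for 2000 nodes: 15-20 minutes before, under 5 minutes now.
old_rollup_min = (15, 20)
new_rollup_min = 5               # upper bound, so the ratios are lower bounds

rollup_speedup = tuple(old / new_rollup_min for old in old_rollup_min)

print(f"load latency speedup: {latency_speedup:.0f}x")
print(f"roll-up speedup: at least {rollup_speedup[0]:.0f}x to {rollup_speedup[1]:.0f}x")
```

The roll-up ratio of at least 3x to 4x is consistent with the "3 times better" claim.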
-
Re: [DISCUSSION] Making HBaseWriter default
Bill Graham 2010-11-22, 18:22
Hi Eric,
I think we should have a default config that is easy to tweak to work with or without HBase. My inclination would be to not have HBase enabled by default, since it raises the barrier to entry for a basic set-up that might not otherwise need HBase.

When I first installed Chukwa 0.3.0 for evaluation I spent a lot of time setting up MySQL and HICC because I thought I had to, only to realize later that those components weren't needed for my use cases (this wasn't and still isn't clearly reflected in the quick start documentation). Hence I think it's better to require a few extra steps for people who have HBase than to risk losing users to the extra steps required to get a basic setup running without HBase.

thanks,
Bill
-
RE: [DISCUSSION] Making HBaseWriter default
Deshpande, Deepak 2010-11-22, 18:47
I agree. Making HBase the default would make life difficult for some Chukwa users. In my setup, I don't need HDFS. I am using Chukwa merely as a log streaming framework. I have plugged in my own writer to write log files to the local file system (instead of HDFS). I evaluated Chukwa against other frameworks, and Chukwa had better fault tolerance built in than the others. This made me recommend Chukwa over other frameworks.

Making HBase the default option would definitely make my life difficult :).

Thanks,
Deepak Deshpande
-
Re: [DISCUSSION] Making HBaseWriter default
Ariel Rabkin 2010-11-22, 18:50
I agree with Bill and Deshpande that we ought to make clear to users that they don't need HICC, and therefore don't need either MySQL or HBase.

But I think what Eric meant to ask was which of MySQL and HBase ought to be the default *for HICC*. My sense is that the HBase support isn't quite mature enough, but it's getting there.

I think HBase is ultimately the way to go. I think we might benefit as a community by doing a 0.5 release first, while waiting for the Pig-based aggregation support that's blocking HBase.

--Ari

--
Ari Rabkin [EMAIL PROTECTED]
UC Berkeley Computer Science Department
-
Re: [DISCUSSION] Making HBaseWriter default
James Seigel 2010-11-22, 19:16
+1

On 2010-11-22, at 11:22 AM, Bill Graham wrote:
> I think we should have a default config that is easy to tweak to work
> with or without HBase. My inclination would be to not have HBase
> enabled by default, since it raises the barrier to entry for a basic
> set-up that might not otherwise need HBase.
-
Re: [DISCUSSION] Making HBaseWriter default
Eric Yang 2010-11-22, 19:41
MySQL support has been removed from Chukwa 0.5. My concern is that the demux process is going to become two parallel tracks: one that works in MapReduce, and another that works in the collector. It becomes difficult to have clean, efficient parsers that work in both places. From an architecture perspective, incremental updates to data are better than batch processing for near-real-time monitoring. I would like to ensure the Chukwa framework can deliver on Chukwa's mission statement, hence I stand by HBase as the default. I was playing with the HBase 0.20.6 + Pig 0.8 branch last weekend, and I was very impressed by both the speed and performance of this combination. I encourage people to try it out.

Regards,
Eric
-
Re: [DISCUSSION] Making HBaseWriter default
Bill Graham 2010-11-22, 21:19
We are going to continue to have use cases where we want log data rolled up into 5-minute, hourly and daily increments in HDFS to run MapReduce jobs on them. How will this model work with the HBase approach? What process will aggregate the HBase data into time increments like the current demux and hourly/daily rolling processes do? Basically, what does the time partitioning look like in the HBase storage scheme?

> My concern is that the demux process is going to become two parallel
> tracks, one works in mapreduce, and another one works in collector. It
> becomes difficult to have clean efficient parsers which works in both

This statement makes me concerned that you're implying the need to deprecate the current demux model, which is very different from making one or the other the default in the configs. Is that the case?
-
Re: [DISCUSSION] Making HBaseWriter default
Ahmed Fathalla 2010-11-22, 23:05
I think what we need to do is create some kind of comparison table contrasting the merits of each approach (HBase vs. normal demux processing). This exercise will be useful both for deciding on the default and for documentation purposes, to illustrate the difference for new users.

--
Ahmed Fathalla
-
Re: [DISCUSSION] Making HBaseWriter default
Eric Yang 2010-11-23, 00:22
HBase makes file management on HDFS easier. HBase rolls the data up into large file sets, which is more efficient for scanning and random access. HBase supports MapReduce on tables instead of on files. Data analytics on HBase is therefore a great improvement with no drawback. The data analytics jobs continue to run every n minutes, but you don't need to wait 5 minutes for data to arrive before starting data processing.

Another limitation that goes away is the daily and hourly rolling. Chukwa used to produce files periodically, and those files needed to be rolled up into bigger files; regular append doesn't work because late-arriving data needs to be re-sorted in the sequence file. Hence we run hourly and daily jobs that do nothing but sort and merge data. This burns CPU cycles without much actual benefit.

Data looks like this in a Chukwa Record:

Time Partition/Primary Key/Actual Timestamp - [small hashmap]

Data looks like this in HBase:

Timestamp/Primary Key - [big hashmap]

So the data is identical; the only difference is that scanning for data is a lot faster and no CPU cycles are burned sorting and merging data. HBase handles the merging and indexing of data much more elegantly. We don't need to split data into different partitions ourselves, because HBase handles this for us. We can continue to insert data, and the HBase region servers will partition the data for us and provide fast scanning. If the number of records grows beyond trillions, it is still possible to partition by date in the table name, if the user chooses to do so.

Bill, you are reading my mind. I do mean to deprecate the current hybrid model and make a cleaner solution that works in the collector. It would be easier for newcomers to adopt.

Regards,
Eric
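
The point about late-arriving data can be illustrated with a toy model. This is a minimal sketch, not Chukwa or HBase code: a sorted Python list stands in for a table that keeps rows ordered by key, `hbase_row_key` is a hypothetical helper, and the "-" separator is an assumption.

```python
import bisect

def hbase_row_key(timestamp: int, primary_key: str) -> str:
    """Hypothetical helper: 'Timestamp/Primary Key' packed into one string key."""
    return f"{timestamp}-{primary_key}"

# A sorted list stands in for a table that keeps rows ordered by row key.
table = []
for ts, host in [(1290470400, "host-a"), (1290470460, "host-a"), (1290470520, "host-b")]:
    bisect.insort(table, hbase_row_key(ts, host))

# A late record lands in its sorted position on insert; no separate
# sort-and-merge pass over whole files is needed afterwards.
bisect.insort(table, hbase_row_key(1290470430, "host-b"))

print(table)
```

In the file-based model the same late record would sit in a later file, which is why the hourly and daily roll-up jobs have to re-sort and merge.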
-
Re: [DISCUSSION] Making HBaseWriter default
Eric Yang 2010-11-23, 01:00
Comparison chart:

---------------------------------------------------------------------------
| Chukwa Types         | Chukwa classic         | Chukwa on Hbase         |
---------------------------------------------------------------------------
| Installation cost    | Hadoop + Chukwa        | Hadoop + Hbase + Chukwa |
---------------------------------------------------------------------------
| Data latency         | fixed n Minutes        | 50-100 ms               |
---------------------------------------------------------------------------
| File Management      | Hourly/Daily Roll Up   | Hbase periodically      |
| Cost                 | Mapreduce Job          | spill data to disk      |
---------------------------------------------------------------------------
| Record Size          | Small needs to fit     | Data node block         |
|                      | in java HashMap        | size. (64MB)            |
---------------------------------------------------------------------------
| GUI friendly view    | Data needs to be       | drill down to raw       |
|                      | aggregated first       | data or aggregated      |
---------------------------------------------------------------------------
| Demux                | Single reducer         | Write to hbase in       |
|                      | or creates multiple    | parallel                |
|                      | part-nnn files, and    |                         |
|                      | unsorted between files |                         |
---------------------------------------------------------------------------
| Demux Output         | Sequence file          | Hbase Table             |
---------------------------------------------------------------------------
| Data analytics tools | Mapreduce/Pig          | MR/Pig/Hive/Cascading   |
---------------------------------------------------------------------------

Regards,
Eric
-
Re: [DISCUSSION] Making HBaseWriter default
Bill Graham 2010-11-23, 06:38
I see plenty of value in the HBase approach, but I'm still not clear on how the time and data type partitioning would be done more efficiently within HBase when running a job on a specific 5-minute interval for a given data type. I've only used HBase briefly so I could certainly be missing something, but I thought the sort for range scans is by byte order, which works for string types, but not numbers.

So if your row ids are <timestamp>/<data_type>, how do you fetch all the data for a given data_type for a given time period without potentially scanning many unnecessary rows? The timestamps will be in alphabetical order, not numeric, and data_types would be mixed.

Under the current scheme, since partitioning is done in HDFS, you could just get <data_type>/<time>/part-* to get exactly the records you're looking for.
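
The byte-order concern in Bill's message can be made concrete. The sketch below is only an illustration with made-up values: decimal strings of different widths sort differently by bytes than by value, while fixed-width strings do not. Epoch timestamps in seconds have exactly 10 digits from 2001-09-09 to 2286-11-20, so they are fixed-width without any padding.

```python
# Variable-width decimal strings: byte order and numeric order disagree.
mixed = [9, 10, 100]
mixed_by_bytes = sorted(str(n).encode("ascii") for n in mixed)

# Zero-padding to a fixed width makes byte order match numeric order.
padded_by_bytes = sorted(f"{n:010d}".encode("ascii") for n in mixed)

# Epoch seconds in the 10-digit range need no padding at all.
epoch = [1290470520, 1290470400, 1290470460]
epoch_by_bytes = sorted(str(t).encode("ascii") for t in epoch)

print(mixed_by_bytes)
print([int(k.decode()) for k in padded_by_bytes])
print([int(k.decode()) for k in epoch_by_bytes])
```

Millisecond timestamps (13 digits today) behave the same way, as long as every key in a table uses the same unit and width.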
-
Re: [DISCUSSION] Making HBaseWriter defaultEric Yang 2010-11-23, 21:54
It is more efficient because there is no need to wait for the file to be closed before the map reduce job can be launched. Data type is grouped into a hbase table or column families. The choice is in the hand of parser developer. Rowkey is a combination of timestamp+primary key as string. I.e 1234567890-hostname. Therefore, the byte order of string sorting works fine.
There are two ways to deal with this problem, it can be scanned using StartRow feature in Hbase to narrow down the row range, or use Hbase timestamp field to control the scanning range. Hbase timestamp is a special numeric field. To translate your query to hbase: Scan "<data_type>", { STARTROW => 'timestamp' }; Or Scan "user_table", { COLUMNS => "<data_type>", timestamp => 1234567890 }; The design is up to the parser designer. FYI, Hbase shell doesn't support timestamp range query, but the java api does. Regards, Eric On 11/22/10 10:38 PM, "Bill Graham" <[EMAIL PROTECTED]> wrote: I see plenty of value in the HBase approach, but I'm still not clear on how the time and data type partitioning would be done more efficiently within HBase when running a job on a specific 5 minute interval for a given data type. I've only used HBase briefly so I could certainly be missing something, but I thought the sort for range scans is by byte order, which works for string types, but not numbers. So if your row ids are are <timestamp>/<data_type>, how do you fetch all the data for a given data_type for a given time period without potentially scanning many unnecessary rows? The timestamps will be in alphabetical order, not numeric and data_types would be mixed. Under the current scheme, since partitioning is done in HDFS you could just get <data_type>/<time>/part-* to get exactly the records you're looking for. 
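As a rough illustration of the StartRow idea in the message above, the sketch below uses a plain Java TreeMap as a stand-in for a table whose rows are kept sorted by key. It is not HBase client code: the class name, the sample rows, and the scan helper are all assumptions made for the example. It only shows how a start row and a stop row bound a five-minute window when keys look like 1234567890-hostname.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical stand-in for a sorted table; not part of Chukwa or HBase.
final class RowRangeScanSketch {

    // Rows are kept sorted by key. For ASCII keys, String ordering matches byte ordering.
    static NavigableMap<String, String> sampleTable() {
        NavigableMap<String, String> rows = new TreeMap<>();
        rows.put("1234567890-host1", "a");
        rows.put("1234567895-host2", "b");
        rows.put("1234568190-host1", "c");
        rows.put("1234568200-host3", "d");
        return rows;
    }

    // Returns the row keys in [startRow, stopRow): start inclusive, stop exclusive.
    static List<String> scan(NavigableMap<String, String> table, String startRow, String stopRow) {
        return new ArrayList<>(table.subMap(startRow, true, stopRow, false).keySet());
    }

    public static void main(String[] args) {
        // Five-minute window: 1234567890 + 300 seconds = 1234568190.
        System.out.println(scan(sampleTable(), "1234567890", "1234568190"));
        // prints [1234567890-host1, 1234567895-host2]
    }
}
```

Only rows inside the window are touched; rows from other time periods are never visited, which is the property Bill asks about.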
-
Re: [DISCUSSION] Making HBaseWriter default Bill Graham 2010-11-24, 18:04
> Rowkey is a combination of timestamp+primary key as string. I.e 1234567890-hostname. Therefore, the byte order of string sorting works fine.
I don't think this is correct. If your row keys are strings, you'd get an ordering like this:

1000-hostname
200-hostname
3000-hostname

For the use case I was concerned about, I think it would be solved by making the row key a long timestamp and the data type a column family. Then you could do something similar to what you described:

scan "user_table", { COLUMNS => "<data_type>", STARTROW => 1234567890, STOPROW => 1234597890 };

I'm not sure how to do the same thing though if you want to partition by both hostname and data type.

On Tue, Nov 23, 2010 at 1:54 PM, Eric Yang <[EMAIL PROTECTED]> wrote:
> It is more efficient because there is no need to wait for the file to be
> closed before the map reduce job can be launched. Data type is grouped into
> a hbase table or column families. The choice is in the hand of parser
> developer. Rowkey is a combination of timestamp+primary key as string. I.e
> 1234567890-hostname. Therefore, the byte order of string sorting works
> fine.
>
> There are two ways to deal with this problem, it can be scanned using
> StartRow feature in Hbase to narrow down the row range, or use Hbase
> timestamp field to control the scanning range. Hbase timestamp is a special
> numeric field.
>
> To translate your query to hbase:
>
> scan "<data_type>", { STARTROW => 'timestamp' };
>
> Or
>
> scan "user_table", { COLUMNS => "<data_type>", TIMESTAMP => 1234567890 };
>
> The design is up to the parser designer. FYI, Hbase shell doesn't support
> timestamp range query, but the java api does.
>
> Regards,
> Eric
>
> On 11/22/10 10:38 PM, "Bill Graham" <[EMAIL PROTECTED]> wrote:
>
> I see plenty of value in the HBase approach, but I'm still not clear
> on how the time and data type partitioning would be done more
> efficiently within HBase when running a job on a specific 5 minute
> interval for a given data type.
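The ordering concern raised in this message can be reproduced with plain Java strings. The sketch below is only an illustration: the class and method names are made up, and zero-padding to ten digits is one possible workaround that the thread does not itself propose. It sorts the three example keys as strings, then sorts fixed-width versions of the same keys.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper names; this only demonstrates string ordering of row keys.
final class RowKeyOrderSketch {

    // Key as described in the thread: decimal timestamp, a dash, then the host name.
    static String plainKey(long timestamp, String host) {
        return timestamp + "-" + host;
    }

    // Assumed workaround: left-pad the timestamp to a fixed width of 10 digits.
    static String paddedKey(long timestamp, String host) {
        return String.format(Locale.ROOT, "%010d-%s", timestamp, host);
    }

    // Sorts a copy of the keys the way a byte-ordered store would order ASCII strings.
    static List<String> sorted(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        long[] timestamps = {200L, 1000L, 3000L};
        List<String> plain = new ArrayList<>();
        List<String> padded = new ArrayList<>();
        for (long ts : timestamps) {
            plain.add(plainKey(ts, "hostname"));
            padded.add(paddedKey(ts, "hostname"));
        }
        System.out.println(sorted(plain));
        // prints [1000-hostname, 200-hostname, 3000-hostname]
        System.out.println(sorted(padded));
        // prints [0000000200-hostname, 0000001000-hostname, 0000003000-hostname]
    }
}
```

String keys sort numerically only when every timestamp has the same number of digits, which is why keys of mixed width come out in the wrong order.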
-
Re: [DISCUSSION] Making HBaseWriter default Eric Yang 2010-11-24, 19:19
Hi Bill,
I was assuming that Chukwa would only process data with epoch timestamps from 1234567890 onward, and string keys will sort correctly up to 9999999999, because every timestamp in that range has ten digits. That means any log file produced between Feb 13, 2009 23:31:30 UTC and Nov 20, 2286 17:46:39 UTC would work. Then again, it might be shortsighted on my part. We will probably want to store a binary epoch: long (8 bytes)-hostname. This will ensure the data has a good range to work with.

Partitioning by time, host, and data type can be done in two ways, with a third option that avoids partitioning:

1. Use (8 bytes)-hostname as the row key, which partitions by time, then by host, and by data type (column family). (Tall table; the HBase developers recommend this approach.)
2. Use hostname as the row key and partition by data type (column family), with the HBase timestamp and table name for time partitioning. (Thick row.)
3. No partitioning: use a bloom filter on HBase to filter all regions in parallel and return the results in chunks.

I also got stuck on this partition problem when I started down the HBase path. After studying it for 8 months, it suddenly became clear after I implemented the first prototype. Hope this helps.

Regards,
Eric

On 11/24/10 10:04 AM, "Bill Graham" <[EMAIL PROTECTED]> wrote:

> Rowkey is a combination of timestamp+primary key as string. I.e 1234567890-hostname. Therefore, the byte order of string sorting works fine.

I don't think this is correct. If your row keys are strings, you'd get an ordering like this:

1000-hostname
200-hostname
3000-hostname

For the use case I was concerned about, I think it would be solved by making the row key a long timestamp and the data type a column family. Then you could do something similar to what you described:

scan "user_table", { COLUMNS => "<data_type>", STARTROW => 1234567890, STOPROW => 1234597890 };

I'm not sure how to do the same thing though if you want to partition by both hostname and data type.
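A minimal sketch of the binary row key mentioned in this message, assuming an 8-byte big-endian epoch, a dash separator, and UTF-8 host bytes. The class name, the separator byte, and the unsigned comparison helper are assumptions for the example, not Chukwa code. It also checks the two dates given for the ten-digit timestamp range.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.Arrays;

// Hypothetical sketch of an "8-byte epoch + hostname" row key; not Chukwa's implementation.
final class BinaryRowKeySketch {

    // Builds: 8 bytes of big-endian epoch seconds, '-', then the host name in UTF-8.
    static byte[] rowKey(long epochSeconds, String host) {
        byte[] hostBytes = host.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(Long.BYTES + 1 + hostBytes.length)
                .putLong(epochSeconds) // ByteBuffer writes big-endian by default
                .put((byte) '-')
                .put(hostBytes)
                .array();
    }

    // Unsigned lexicographic comparison, i.e. byte order.
    static int compare(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    // Renders epoch seconds as an ISO-8601 UTC instant.
    static String utc(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds).toString();
    }

    public static void main(String[] args) {
        // Non-negative timestamps of any magnitude keep their numeric order.
        System.out.println(compare(rowKey(200L, "hostname"), rowKey(1000L, "hostname")) < 0); // prints true
        System.out.println(utc(1234567890L)); // prints 2009-02-13T23:31:30Z
        System.out.println(utc(9999999999L)); // prints 2286-11-20T17:46:39Z
    }
}
```

With a fixed-width binary prefix, the number of decimal digits in the timestamp no longer matters, so the key also orders correctly beyond 9999999999.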
-
Re: [DISCUSSION] Making HBaseWriter default Bill Graham 2010-11-24, 20:15
Thanks, that helps. I'm still learning the best partitioning schemes for HBase as well.

> Meaning, any log file produced between: Feb 13, 2009 23:31:30UTC to Nov 20,
> 2286 17:46:39UTC would work.

I think we're ok then. :) I don't know why I thought the turnover happened more frequently than this...

I think one take-away from this is that the partitioning scheme needs to be pluggable based on the use cases. For example, a hostname scan isn't desired for my current use cases, so <ts>-<hostname>-<data_type> wouldn't be ideal. Instead I'd look to use something like the current TSProcessor, only with TS row keys and data-type column families. That would allow the ability to just get the rows in a given range for a given data type.

Going forward I think we'd want a way to decouple the Chukwa record-parsing code from the HBase row assembly code in the processors, since common record types can be stored in multiple ways in HBase depending on what the data access patterns will be.

On Wed, Nov 24, 2010 at 11:19 AM, Eric Yang <[EMAIL PROTECTED]> wrote:
> Hi Bill,
>
> I was assuming that data are going to use chukwa to process data after
> epoch timestamp: 1234567890, and it will work up to 9999999999.
> Meaning, any log file produced between: Feb 13, 2009 23:31:30UTC to Nov 20,
> 2286 17:46:39UTC would work.
> Then again, it might be short sighted on my part. We will probably want to
> store binary epoch, long (8 bytes)-hostname. This will ensure the data has
> a good range to work with.
>
> Partition by time, host, and data type can be done two ways:
>
> 1. you can use (8bytes)-hostname for row key which will partition by time,
> then by host, and by data type (column family). (Tall table, Hbase guys
> recommend this approach)
> 2. Use hostname as row key and partition by data type (column family), hbase
> timestamp and table name for time partition. (Thick row)
> 3. No partition, use bloom filter on hbase to filter all regions in parallel
> and return the results in chunks.
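One way to picture the decoupling suggested in this message is to keep the parsed record separate from the rule that turns it into a row key. The sketch below is purely hypothetical: ParsedRecord, RowKeyStrategy, and the two strategies are invented names, not Chukwa classes. It shows the same record assembled under two different layouts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical types for illustration only; none of these exist in Chukwa.
final class RowAssemblySketch {

    // Result of record parsing, independent of how the record will be stored.
    static final class ParsedRecord {
        final long timestamp;
        final String host;
        final String dataType;

        ParsedRecord(long timestamp, String host, String dataType) {
            this.timestamp = timestamp;
            this.host = host;
            this.dataType = dataType;
        }
    }

    // Plug-in point: decides the row key layout and nothing else.
    interface RowKeyStrategy {
        String rowKey(ParsedRecord record);
    }

    // Layout discussed in the thread: timestamp, then host.
    static final RowKeyStrategy TIME_THEN_HOST = r -> r.timestamp + "-" + r.host;

    // Layout for use cases that never scan by host: timestamp only.
    static final RowKeyStrategy TIME_ONLY = r -> Long.toString(r.timestamp);

    // Assembles the storage coordinates; the data type becomes the column family.
    static Map<String, String> assemble(ParsedRecord record, RowKeyStrategy strategy) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("rowKey", strategy.rowKey(record));
        row.put("columnFamily", record.dataType);
        return row;
    }

    public static void main(String[] args) {
        ParsedRecord record = new ParsedRecord(1234567890L, "host1", "SystemMetrics");
        System.out.println(assemble(record, TIME_THEN_HOST));
        // prints {rowKey=1234567890-host1, columnFamily=SystemMetrics}
        System.out.println(assemble(record, TIME_ONLY));
        // prints {rowKey=1234567890, columnFamily=SystemMetrics}
    }
}
```

The parsing step produces ParsedRecord once; only the strategy changes when the access pattern changes.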
-
Re: [DISCUSSION] Making HBaseWriter default Eric Yang 2010-11-24, 21:18
The partition scheme is pluggable through annotations on the demux parser. Extending TSProcessor and modifying the annotation should meet your needs.
While I agree with you on restructuring the HBase row assembly code, I intend to keep using annotations to control it. It could be done in a more elegant way if demux weren't bogged down in the MapReduce framework. That is something I will revisit after the community is sold on HBase. :)

Regards,
Eric

On 11/24/10 12:15 PM, "Bill Graham" <[EMAIL PROTECTED]> wrote:

Thanks, that helps. I'm still learning the best partitioning schemes for HBase as well.

> Meaning, any log file produced between: Feb 13, 2009 23:31:30UTC to Nov 20,
> 2286 17:46:39UTC would work.

I think we're ok then. :) I don't know why I thought the turnover happened more frequently than this...

I think one take-away from this is that the partitioning scheme needs to be pluggable based on the use cases. For example, a hostname scan isn't desired for my current use cases, so <ts>-<hostname>-<data_type> wouldn't be ideal. Instead I'd look to use something like the current TSProcessor, only with TS row keys and data-type column families. That would allow the ability to just get the rows in a given range for a given data type.

Going forward I think we'd want a way to decouple the Chukwa record-parsing code from the HBase row assembly code in the processors, since common record types can be stored in multiple ways in HBase depending on what the data access patterns will be.
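To make the annotation-driven approach in this message more concrete, here is a small sketch of how a class-level annotation can carry the storage target and be read back by reflection. Everything in it is an assumption for illustration: StorageTarget, BaseProcessor, CustomProcessor, and the table and column family values are invented, and Chukwa's real annotation and processor classes may look different.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotation and processors; not Chukwa's actual API.
final class AnnotatedProcessorSketch {

    // Declares where a processor's output should be stored.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface StorageTarget {
        String table();
        String columnFamily();
    }

    // Stand-in for an existing processor with a default storage target.
    @StorageTarget(table = "default_table", columnFamily = "log")
    static class BaseProcessor {
    }

    // Extending the processor and changing only the annotation changes the layout.
    @StorageTarget(table = "user_table", columnFamily = "SysLog")
    static class CustomProcessor extends BaseProcessor {
    }

    // Reads the annotation declared on the given processor class.
    static String describe(Class<?> processor) {
        StorageTarget target = processor.getAnnotation(StorageTarget.class);
        return target == null ? "unannotated" : target.table() + ":" + target.columnFamily();
    }

    public static void main(String[] args) {
        System.out.println(describe(BaseProcessor.class));   // prints default_table:log
        System.out.println(describe(CustomProcessor.class)); // prints user_table:SysLog
    }
}
```

The parsing logic is inherited unchanged; only the declared target differs between the two classes.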