Chukwa, mail # dev - [DISCUSSION] Making HBaseWriter default


Re: [DISCUSSION] Making HBaseWriter default
Bill Graham 2010-11-24, 20:15
Thanks, that helps. I'm still learning the best partitioning schemes
for HBase as well.

> Meaning, any log file produced between: Feb 13, 2009 23:31:30UTC to Nov 20,
> 2286 17:46:39UTC would work.

I think we're ok then. :) I don't know why I thought the turnover
happened more frequently than this...

I think one take-away from this is that the partitioning scheme needs
to be pluggable based on the use case. For example, a hostname scan
isn't desired for my current use cases, so <ts>-<hostname>-<data_type>
wouldn't be ideal. Instead I'd look to use something like the current
TSProcessor, only with timestamp rowKeys and data-type column families. That
would make it possible to fetch just the rows in a given range for a
given data type.

Going forward I think we'd want a way to decouple the Chukwa
record-parsing code from the HBase row assembly code in the
processors, since common record types can be stored in multiple ways
in HBase depending on what the data access patterns will be.
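One way to picture that decoupling is a row-key strategy object that the processor delegates to. This is a hypothetical Python sketch of the idea, not Chukwa's actual API; the class and function names are invented for illustration:

```python
import struct

class TimestampRowKey:
    """Hypothetical strategy: row key = 8-byte big-endian epoch only;
    the data type would become a column family."""
    def build(self, ts, hostname, data_type):
        return struct.pack(">q", ts)

class TimeHostRowKey:
    """Hypothetical strategy: epoch + hostname (the tall-table scheme
    discussed in this thread)."""
    def build(self, ts, hostname, data_type):
        return struct.pack(">q", ts) + hostname.encode("utf-8")

def assemble_row(key_strategy, record):
    # The record parser stays the same; only the key strategy varies
    # with the access pattern.
    return key_strategy.build(record["ts"], record["host"], record["type"])
```

With a strategy passed in, the same parsed record can feed either layout without touching the parsing code.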
On Wed, Nov 24, 2010 at 11:19 AM, Eric Yang <[EMAIL PROTECTED]> wrote:
> Hi Bill,
>
> I was assuming that users are going to use Chukwa to process data after
> epoch timestamp: 1234567890, and that it will work up to 9999999999.
> Meaning, any log file produced between: Feb 13, 2009 23:31:30UTC to Nov 20,
> 2286 17:46:39UTC would work.
> Then again, that might be short-sighted on my part.  We will probably want to
> store the epoch as a binary long (8 bytes)-hostname.  This will ensure the data has
> a good range to work with.
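Eric's suggestion can be sanity-checked with a few lines of Python: an 8-byte big-endian epoch sorts byte-wise in the same order as the numbers themselves, which is exactly the property HBase's lexicographic row-key ordering needs. This is a sketch of the idea, not Chukwa code:

```python
import struct

def row_key(ts, hostname):
    # 8-byte big-endian epoch followed by the hostname; big-endian is what
    # makes byte-wise comparison agree with numeric timestamp order.
    return struct.pack(">q", ts) + b"-" + hostname.encode("utf-8")

keys = sorted(row_key(t, "hostname") for t in (3000, 200, 1000))
# Byte order now matches numeric order: 200, 1000, 3000.
print(keys == [row_key(t, "hostname") for t in (200, 1000, 3000)])  # True
```

Signed big-endian longs also sort correctly here because all epoch timestamps in the range under discussion are non-negative.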
>
> Partitioning by time, host, and data type can be done three ways:
>
> 1. You can use epoch (8 bytes)-hostname for the row key, which will partition by time,
> then by host, and by data type (column family).  (Tall table; the HBase guys
> recommend this approach)
> 2. Use hostname as the row key and partition by data type (column family), HBase
> timestamp, and table name for the time partition. (Thick row)
> 3. No partitioning; use a bloom filter on HBase to filter all regions in parallel
> and return the results in chunks.
>
> I also got stuck on this partition problem when I started down the HBase path.
> After studying it for 8 months, it suddenly became clear after I
> implemented the first prototype.  Hope this helps.
>
> Regards,
> Eric
>
> On 11/24/10 10:04 AM, "Bill Graham" <[EMAIL PROTECTED]> wrote:
>
>> Rowkey is a combination of timestamp+primary key as string. I.e
>> 1234567890-hostname. Therefore, the byte order of string sorting works fine.
>
> I don't think this is correct. If your row keys are strings, you'd get
> an ordering like this:
>
> 1000-hostname
> 200-hostname
> 3000-hostname
>
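The mis-ordering above is easy to reproduce: string comparison goes character by character, so "1000" sorts before "200". Zero-padding the decimal timestamp to a fixed width (shown here as a generic workaround, not something Chukwa does) restores numeric order:

```python
keys = ["3000-hostname", "200-hostname", "1000-hostname"]
# Plain string sort compares characters: "1000..." < "200..." < "3000...".
print(sorted(keys))

# Zero-padding to a fixed width makes string order match numeric order.
padded = [f"{int(k.split('-')[0]):010d}-hostname" for k in keys]
print(sorted(padded))
```

Ten digits is enough here because the thread's upper bound, 9999999999, is exactly ten digits.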
> For the use case I was concerned about, I think it would be solved by
> making the row key a long timestamp and the data type a column family.
> Then you could do something similar to what you described:
>
> Scan “user_table”, { COLUMNS => “<data_type>”, STARTROW => 1234567890,
> STOPROW => 1234597890 };
>
> I'm not sure how to do the same thing though if you want to partition
> by both hostname and datatype.
>
>
> On Tue, Nov 23, 2010 at 1:54 PM, Eric Yang <[EMAIL PROTECTED]> wrote:
>> It is more efficient because there is no need to wait for the file to be
>> closed before the MapReduce job can be launched.  Data types are grouped into
>> an HBase table or column families.  The choice is in the hands of the parser
>> developer.  Rowkey is a combination of timestamp+primary key as a string,
>> i.e. 1234567890-hostname.  Therefore, the byte order of string sorting works
>> fine.
>>
>> There are two ways to deal with this problem: it can be scanned using the
>> StartRow feature in HBase to narrow down the row range, or the HBase
>> timestamp field can be used to control the scanning range.  The HBase
>> timestamp is a special numeric field.
>>
>> To translate your query to HBase:
>>
>> Scan “<data_type>”, { STARTROW => ‘timestamp’ };
>>
>> Or
>>
>> Scan “user_table”, { COLUMNS => “<data_type>”, timestamp => 1234567890 };
>>
>> The design is up to the parser designer.  FYI, Hbase shell doesn’t support