Re: [DISCUSSION] Making HBaseWriter default
> Rowkey is a combination of timestamp+primary key as a string, e.g. 1234567890-hostname. Therefore, the byte order of string sorting works fine.

I don't think this is correct. If your row keys are strings, you'd get
an ordering like this:

1000-hostname
200-hostname
3000-hostname
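
A quick way to see this, and the usual fix: zero-pad the timestamp to a fixed width so byte order matches numeric order. A minimal Java sketch (mine, not from the thread):

import java.util.Arrays;

public class RowKeyOrder {
    public static void main(String[] args) {
        // Unpadded string timestamps sort lexicographically, not numerically.
        String[] raw = {"1000-hostname", "200-hostname", "3000-hostname"};
        Arrays.sort(raw);
        System.out.println(Arrays.toString(raw));
        // -> [1000-hostname, 200-hostname, 3000-hostname]

        // Zero-padding to a fixed width makes byte order match numeric order.
        String[] padded = {"0000001000-hostname", "0000000200-hostname",
                           "0000003000-hostname"};
        Arrays.sort(padded);
        System.out.println(Arrays.toString(padded));
        // -> [0000000200-hostname, 0000001000-hostname, 0000003000-hostname]
    }
}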

For the use case I was concerned about, I think it would be solved by
making the row key a long timestamp and the data type a column family.
Then you could do something similar to what you described:

Scan "user_table", { COLUMNS => "<data_type>", STARTROW => 1234567890,
STOPROW => 1234597890 };

I'm not sure how to do the same thing, though, if you want to partition
by both hostname and datatype.
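
Roughly, that scan through the HBase Java client might look like the sketch
below (assuming the 0.90-era client API). "user_table" and the "<data_type>"
family are the placeholders from above; Bytes.toBytes(long) produces a
big-endian 8-byte key, so byte order matches numeric order for positive
timestamps:

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeScanByRowKey {
    public static void main(String[] args) throws IOException {
        // Table and family names are placeholders from the thread.
        HTable table = new HTable(HBaseConfiguration.create(), "user_table");
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("<data_type>"));  // one family per data type
        scan.setStartRow(Bytes.toBytes(1234567890L));  // inclusive start of range
        scan.setStopRow(Bytes.toBytes(1234597890L));   // exclusive end of range
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result row : scanner) {
                System.out.println(row);
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}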
On Tue, Nov 23, 2010 at 1:54 PM, Eric Yang <[EMAIL PROTECTED]> wrote:
> It is more efficient because there is no need to wait for the file to be
> closed before the MapReduce job can be launched.  Each data type is grouped
> into an HBase table or column families.  The choice is in the hands of the
> parser developer.  Rowkey is a combination of timestamp+primary key as a
> string, e.g. 1234567890-hostname.  Therefore, the byte order of string
> sorting works fine.
>
> There are two ways to deal with this problem: the scan can use the
> StartRow feature in HBase to narrow down the row range, or use the HBase
> timestamp field to control the scanning range.  The HBase timestamp is a
> special numeric field.
>
> To translate your query to HBase:
>
> Scan "<data_type>", { STARTROW => 'timestamp' };
>
> Or
>
> Scan "user_table", { COLUMNS => "<data_type>", TIMESTAMP => 1234567890 };
>
> The design is up to the parser designer.  FYI, the HBase shell doesn't
> support timestamp range queries, but the Java API does.
>
> Regards,
> Eric
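
For reference, a minimal sketch of that timestamp-range query through the
Java API, which the shell can't express. Table and family names are the same
placeholders, and the millisecond values are an assumption (HBase timestamps
are milliseconds since the epoch):

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampRangeScan {
    public static void main(String[] args) throws IOException {
        HTable table = new HTable(HBaseConfiguration.create(), "user_table");
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("<data_type>"));
        // Restrict the scan to cells whose HBase timestamp falls in
        // [minStamp, maxStamp) -- the range query the shell cannot express.
        scan.setTimeRange(1234567890000L, 1234597890000L);
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result row : scanner) {
                System.out.println(row);
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}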
>
> On 11/22/10 10:38 PM, "Bill Graham" <[EMAIL PROTECTED]> wrote:
>
> I see plenty of value in the HBase approach, but I'm still not clear
> on how the time and data type partitioning would be done more
> efficiently within HBase when running a job on a specific 5-minute
> interval for a given data type. I've only used HBase briefly, so I
> could certainly be missing something, but I thought the sort for range
> scans is by byte order, which works for string types but not numbers.
>
> So if your row ids are <timestamp>/<data_type>, how do you fetch
> all the data for a given data_type for a given time period without
> potentially scanning many unnecessary rows? The timestamps will be in
> alphabetical order, not numeric, and data_types would be mixed.
>
> Under the current scheme, since partitioning is done in HDFS, you could
> just get <data_type>/<time>/part-* to get exactly the records you're
> looking for.
>
>
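
For comparison, that HDFS fetch is a simple glob with the Hadoop FileSystem
API. A minimal sketch; the /chukwa root directory is hypothetical, standing in
for wherever the <data_type>/<time>/part-* layout lives:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListPartition {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // <data_type> and <time> stand for a concrete type and time bucket.
        FileStatus[] parts =
            fs.globStatus(new Path("/chukwa/<data_type>/<time>/part-*"));
        for (FileStatus part : parts) {
            System.out.println(part.getPath());
        }
    }
}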
> On Mon, Nov 22, 2010 at 5:00 PM, Eric Yang <[EMAIL PROTECTED]> wrote:
>> Comparison chart:
>>
>> ----------------------------------------------------------------------------
>> | Chukwa Types         | Chukwa classic         | Chukwa on HBase          |
>> ----------------------------------------------------------------------------
>> | Installation cost    | Hadoop + Chukwa        | Hadoop + HBase + Chukwa  |
>> ----------------------------------------------------------------------------
>> | Data latency         | fixed n minutes        | 50-100 ms                |
>> ----------------------------------------------------------------------------
>> | File management      | Hourly/daily roll-up   | HBase periodically       |
>> | cost                 | MapReduce job          | spills data to disk      |
>> ----------------------------------------------------------------------------
>> | Record size          | Small; needs to fit    | Data node block          |
>> |                      | in a Java HashMap      | size (64 MB)             |
>> ----------------------------------------------------------------------------
>> | GUI-friendly view    | Data needs to be       | Drill down to raw        |
>> |                      | aggregated first       | data or aggregated       |
>> ----------------------------------------------------------------------------
>> | Demux                | Single reducer         | Write to HBase in