Chukwa mail # dev: [DISCUSSION] Making HBaseWriter default


Re: [DISCUSSION] Making HBaseWriter default
> Rowkey is a combination of timestamp+primary key as string, i.e. 1234567890-hostname. Therefore, the byte order of string sorting works fine.

I don't think this is correct. If your row keys are strings, you'd get
an ordering like this:

1000-hostname
200-hostname
3000-hostname
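
You can see the difference with the HBase Bytes utility (a rough sketch,
purely illustrative):

import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyOrder {
  public static void main(String[] args) {
    // String-encoded timestamps compare lexicographically:
    // "1000-hostname" sorts before "200-hostname" because '1' < '2'.
    System.out.println(Bytes.compareTo(
        Bytes.toBytes("1000-hostname"),
        Bytes.toBytes("200-hostname")));   // negative -> "1000-..." sorts first

    // Fixed-width 8-byte longs compare numerically under the same byte order:
    System.out.println(Bytes.compareTo(
        Bytes.toBytes(1000L),
        Bytes.toBytes(200L)));             // positive -> 200 sorts before 1000
  }
}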

For the use case I was concerned about, I think it would be solved by
making the row key a long timestamp and the data-type a column family.
Then you could do something similar to what you described:

Scan "user_table", { COLUMNS => "<data_type>", STARTROW => 1234567890,
STOPROW => 1234597890 };
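
Roughly the same thing with the Java client API would look something like
this (a sketch only; table and family names are placeholders, and it assumes
the row key is the raw 8-byte encoding of the long timestamp):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "user_table");      // placeholder table name

    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("data_type"));         // one family per data type
    scan.setStartRow(Bytes.toBytes(1234567890L));       // start row, inclusive
    scan.setStopRow(Bytes.toBytes(1234597890L));        // stop row, exclusive

    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      // process each row in the range ...
    }
    scanner.close();
    table.close();
  }
}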

I'm not sure how to do the same thing though if you want to partition
by both hostname and datatype.
On Tue, Nov 23, 2010 at 1:54 PM, Eric Yang <[EMAIL PROTECTED]> wrote:
> It is more efficient because there is no need to wait for the file to be
> closed before the MapReduce job can be launched.  Each data type is grouped
> into an HBase table or column family.  The choice is in the hands of the
> parser developer.  The row key is a combination of timestamp+primary key as
> a string, i.e. 1234567890-hostname.  Therefore, the byte order of string
> sorting works fine.
>
> There are two ways to deal with this problem: the scan can use the STARTROW
> feature in HBase to narrow down the row range, or the HBase timestamp field
> can be used to control the scanning range.  The HBase timestamp is a special
> numeric field.
>
> To translate your query to HBase:
>
> Scan "<data_type>", { STARTROW => 'timestamp' };
>
> Or
>
> Scan "user_table", { COLUMNS => "<data_type>", timestamp => 1234567890 };
>
> The design is up to the parser designer.  FYI, the HBase shell doesn't
> support timestamp range queries, but the Java API does.
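
Presumably something along these lines (a rough sketch; the family name is a
placeholder, and note HBase cell timestamps are milliseconds by default):

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampRangeQuery {
  // Uses the HBase cell timestamp, rather than the row key, to bound the scan.
  static ResultScanner scanRange(HTable table, long minTs, long maxTs)
      throws IOException {
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("data_type"));   // placeholder family name
    scan.setTimeRange(minTs, maxTs);              // [minTs, maxTs), milliseconds
    return table.getScanner(scan);
  }
}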
>
> Regards,
> Eric
>
> On 11/22/10 10:38 PM, "Bill Graham" <[EMAIL PROTECTED]> wrote:
>
> I see plenty of value in the HBase approach, but I'm still not clear
> on how the time and data type partitioning would be done more
> efficiently within HBase when running a job on a specific 5-minute
> interval for a given data type.  I've only used HBase briefly, so I
> could certainly be missing something, but I thought range scans sort
> by byte order, which works for string types but not for numbers.
>
> So if your row ids are <timestamp>/<data_type>, how do you fetch
> all the data for a given data_type for a given time period without
> potentially scanning many unnecessary rows?  The timestamps will be in
> alphabetical order, not numeric order, and the data_types would be mixed.
>
> Under the current scheme, since partitioning is done in HDFS you could
> just get <data_type>/<time>/part-* to get exactly the records you're
> looking for.
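
That is, with the classic layout the selection is just the job's input path
(a rough sketch; the <data_type>/<time> placeholders are the same as above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ClassicPartitionJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "classic-partition-scan");  // mapper/reducer setup omitted
    // The part-* glob picks up exactly the files for that type and interval.
    FileInputFormat.addInputPath(job, new Path("<data_type>/<time>/part-*"));
    job.submit();
  }
}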
>
>
> On Mon, Nov 22, 2010 at 5:00 PM, Eric Yang <[EMAIL PROTECTED]> wrote:
>> Comparison chart:
>>
>> --------------------------------------------------------------------------
>> | Chukwa Types       | Chukwa classic         | Chukwa on Hbase          |
>> --------------------------------------------------------------------------
>> | Installation cost  | Hadoop + Chukwa        | Hadoop + Hbase + Chukwa  |
>> --------------------------------------------------------------------------
>> | Data latency       | fixed n Minutes        | 50-100 ms                |
>> --------------------------------------------------------------------------
>> | File Management    | Hourly/Daily Roll Up   | Hbase periodically       |
>> | Cost               | Mapreduce Job          | spill data to disk       |
>> --------------------------------------------------------------------------
>> | Record Size        | Small, needs to fit    | Data node block          |
>> |                    | in java HashMap        | size (64MB)              |
>> --------------------------------------------------------------------------
>> | GUI friendly view  | Data needs to be       | drill down to raw        |
>> |                    | aggregated first       | data or aggregated       |
>> --------------------------------------------------------------------------
>> | Demux              | Single reducer         | Write to hbase in