Chukwa, mail # dev - [DISCUSSION] Making HBaseWriter default


Re: [DISCUSSION] Making HBaseWriter default
Bill Graham 2010-11-23, 06:38
I see plenty of value in the HBase approach, but I'm still not clear
on how the time and data type partitioning would be done more
efficiently within HBase when running a job on a specific 5-minute
interval for a given data type. I've only used HBase briefly, so I
could certainly be missing something, but I thought the sort order for
range scans is byte order, which works for string types but not for numbers.
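
To make that concern concrete, here is a minimal sketch (plain Java, not
Chukwa or HBase code; the "HadoopLog" data type and the epoch values are
made up) showing that byte/lexicographic order of ASCII-encoded timestamps
does not match numeric order unless the keys are fixed-width:

    import java.util.Arrays;

    public class KeyOrderDemo {
        public static void main(String[] args) {
            // ASCII timestamps of different widths sort lexicographically,
            // not numerically: "9..." sorts after "1...", even though 9 < 10.
            String[] keys = {
                "999999999/HadoopLog",
                "1290000000/HadoopLog",
                "1290000300/HadoopLog"
            };
            Arrays.sort(keys);  // byte order, as in an HBase row key scan
            System.out.println(Arrays.toString(keys));
            // => [1290000000/HadoopLog, 1290000300/HadoopLog, 999999999/HadoopLog]

            // Zero-padding (or a fixed-width binary encoding of the long)
            // would restore numeric ordering.
            String[] padded = {
                "0999999999/HadoopLog",
                "1290000000/HadoopLog",
                "1290000300/HadoopLog"
            };
            Arrays.sort(padded);
            System.out.println(Arrays.toString(padded));
            // => [0999999999/HadoopLog, 1290000000/HadoopLog, 1290000300/HadoopLog]
        }
    }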

So if your row ids are <timestamp>/<data_type>, how do you fetch
all the data for a given data_type for a given time period without
potentially scanning many unnecessary rows? The timestamps will be in
alphabetical order, not numeric order, and data_types would be mixed.

Under the current scheme, since partitioning is done in HDFS, you could
just read <data_type>/<time>/part-* to get exactly the records you're
looking for.
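
For illustration, a rough sketch (not Chukwa code; the "chukwa_records"
table name and the key values are invented) of what a 5-minute window scan
over <timestamp>/<data_type> row keys would look like with the HBase client
API, compared to simply globbing <data_type>/<time>/part-* in HDFS:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimestampFirstScan {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "chukwa_records");

            // Bracket a 5-minute window with start/stop row keys. Because the
            // key begins with the timestamp, the scan returns rows for *every*
            // data type that landed in that window...
            Scan scan = new Scan(Bytes.toBytes("1290000000/"),
                                 Bytes.toBytes("1290000300/"));
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    String rowKey = Bytes.toString(r.getRow());
                    // ...so the client has to skip the data types it doesn't want.
                    if (rowKey.endsWith("/HadoopLog")) {
                        // process the record
                    }
                }
            } finally {
                scanner.close();
                table.close();
            }
        }
    }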
On Mon, Nov 22, 2010 at 5:00 PM, Eric Yang <[EMAIL PROTECTED]> wrote:
> Comparison chart:
>
> ---------------------------------------------------------------------------
> | Chukwa Types         | Chukwa classic         | Chukwa on Hbase         |
> ---------------------------------------------------------------------------
> | Installation cost    | Hadoop + Chukwa        | Hadoop + Hbase + Chukwa |
> ---------------------------------------------------------------------------
> | Data latency         | fixed n Minutes        | 50-100 ms               |
> ---------------------------------------------------------------------------
> | File Management      | Hourly/Daily Roll Up   | Hbase periodically      |
> | Cost                 | Mapreduce Job          | spill data to disk      |
> ---------------------------------------------------------------------------
> | Record Size          | Small needs to fit     | Data node block         |
> |                      | in java HashMap        | size. (64MB)            |
> ---------------------------------------------------------------------------
> | GUI friendly view    | Data needs to be       | drill down to raw       |
> |                      | aggregated first       | data or aggregated      |
> ---------------------------------------------------------------------------
> | Demux                | Single reducer         | Write to hbase in       |
> |                      | or creates multiple    | parallel                |
> |                      | part-nnn files, and    |                         |
> |                      | unsorted between files |                         |
> ---------------------------------------------------------------------------
> | Demux Output         | Sequence file          | Hbase Table             |
> ---------------------------------------------------------------------------
> | Data analytics tools | Mapreduce/Pig          | MR/Pig/Hive/Cascading   |
> ---------------------------------------------------------------------------
>
> Regards,
> Eric
>
> On 11/22/10 3:05 PM, "Ahmed Fathalla" <[EMAIL PROTECTED]> wrote:
>
>> I think what we need to do is create some kind of comparison table
>> contrasting the merits of each approach (HBase vs Normal Demux processing).
>> This exercise will be both useful in making the decision of choosing the
>> default and for documentation purposes to illustrate the difference for new
>> users.
>>
>>
>> On Mon, Nov 22, 2010 at 11:19 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
>>
>>> We are going to continue to have use cases where we want log data
>>> rolled up into 5 minute, hourly and daily increments in HDFS to run
>>> map reduce jobs on them. How will this model work with the HBase
>>> approach? What process will aggregate the HBase data into time
>>> increments like the current demux and hourly/daily rolling processes
>>> do? Basically, what does the time partitioning look like in the HBase
>>> storage scheme?
>>>
>>>> My concern is that the demux process is going to become two parallel
>>>> tracks, one works in mapreduce, and another one works in collector.  It
>>>> becomes difficult to have clean efficient parsers which works in both