Chukwa >> mail # dev >> [DISCUSSION] Making HBaseWriter default


Re: [DISCUSSION] Making HBaseWriter default
I see plenty of value in the HBase approach, but I'm still not clear
on how the time and data type partitioning would be done more
efficiently within HBase when running a job on a specific 5 minute
interval for a given data type. I've only used HBase briefly so I
could certainly be missing something, but I thought the sort for range
scans is by byte order, which works for string types, but not numbers.
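The byte-order concern can be illustrated with a small sketch (plain Python with made-up timestamps, not Chukwa code): HBase compares row keys as raw bytes, so decimal timestamps of different widths sort out of numeric order unless they are zero-padded to a fixed width.

```python
# HBase sorts row keys by raw byte order, which for ASCII digits is
# lexicographic. Variable-width decimal timestamps therefore lose
# numeric order; zero-padding to a fixed width restores it.
timestamps = [999, 1000, 1001]

as_strings = sorted(str(t) for t in timestamps)
# "1000" < "1001" < "999" -- numeric order is lost

zero_padded = sorted(str(t).zfill(13) for t in timestamps)
# "0000000000999" < "0000000001000" < "0000000001001" -- order preserved
```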

So if your row ids are <timestamp>/<data_type>, how do you fetch
all the data for a given data_type for a given time period without
potentially scanning many unnecessary rows? The timestamps will be in
lexicographic (byte) order, not numeric order, and the data_types would
be mixed together within any time range.
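To make the scanning concern concrete, here is a toy model of a range scan over sorted row keys (plain Python with hypothetical keys and timestamps, not HBase client code). With timestamp-first keys, a scan over one 5-minute window returns every data_type interleaved; with type-first keys, mirroring the HDFS layout, one range covers exactly the wanted rows.

```python
# Toy model: an HBase scanner walks keys in sorted (byte) order.
def range_scan(keys, start, stop):
    """Return every key in [start, stop) -- what a scanner touches."""
    return [k for k in keys if start <= k < stop]

# Keys as <timestamp>/<data_type>: types are interleaved in time order.
rows = sorted(
    f"{ts:013d}/{dtype}"
    for ts in (1290466800000, 1290467100000, 1290467400000)
    for dtype in ("Df", "HadoopMetrics", "SysLog")
)
# Scanning one 5-minute window returns all three data_types mixed:
window = range_scan(rows, "1290466800000/", "1290467100000/")

# Keys as <data_type>/<timestamp>: one range isolates a single type.
rows_by_type = sorted(
    f"{dtype}/{ts:013d}"
    for ts in (1290466800000, 1290467100000, 1290467400000)
    for dtype in ("Df", "HadoopMetrics", "SysLog")
)
df_window = range_scan(rows_by_type, "Df/1290466800000", "Df/1290467100000")
```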

Under the current scheme, since partitioning is done in HDFS you could
just get <data_type>/<time>/part-* to get exactly the records you're
looking for.
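For comparison, the current HDFS lookup can be sketched as a simple glob over partitioned paths (hypothetical paths following the <data_type>/<time>/part-* scheme described above):

```python
import fnmatch

# Hypothetical demux output, partitioned as <data_type>/<time>/part-*
paths = [
    "Df/20101122_1700/part-00000",
    "Df/20101122_1705/part-00000",
    "SysLog/20101122_1700/part-00000",
]

# One glob selects exactly the records for a type and 5-minute interval:
matches = fnmatch.filter(paths, "Df/20101122_1700/part-*")
```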
On Mon, Nov 22, 2010 at 5:00 PM, Eric Yang <[EMAIL PROTECTED]> wrote:
> Comparison chart:
>
> ---------------------------------------------------------------------------
> | Chukwa Types         | Chukwa classic         | Chukwa on HBase         |
> ---------------------------------------------------------------------------
> | Installation cost    | Hadoop + Chukwa        | Hadoop + HBase + Chukwa |
> ---------------------------------------------------------------------------
> | Data latency         | fixed N minutes        | 50-100 ms               |
> ---------------------------------------------------------------------------
> | File Management      | Hourly/Daily Roll Up   | HBase periodically      |
> | Cost                 | MapReduce Job          | spills data to disk     |
> ---------------------------------------------------------------------------
> | Record Size          | Small; needs to fit    | Data node block         |
> |                      | in Java HashMap        | size (64 MB)            |
> ---------------------------------------------------------------------------
> | GUI friendly view    | Data needs to be       | Drill down to raw       |
> |                      | aggregated first       | or aggregated data      |
> ---------------------------------------------------------------------------
> | Demux                | Single reducer, or     | Writes to HBase in      |
> |                      | creates multiple       | parallel                |
> |                      | part-nnn files,        |                         |
> |                      | unsorted between files |                         |
> ---------------------------------------------------------------------------
> | Demux Output         | Sequence file          | HBase table             |
> ---------------------------------------------------------------------------
> | Data analytics tools | MapReduce/Pig          | MR/Pig/Hive/Cascading   |
> ---------------------------------------------------------------------------
>
> Regards,
> Eric
>
> On 11/22/10 3:05 PM, "Ahmed Fathalla" <[EMAIL PROTECTED]> wrote:
>
>> I think what we need to do is create some kind of comparison table
>> contrasting the merits of each approach (HBase vs. normal Demux
>> processing). This exercise will be useful both for deciding on the
>> default and for documentation, to illustrate the differences for new
>> users.
>>
>>
>> On Mon, Nov 22, 2010 at 11:19 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
>>
>>> We are going to continue to have use cases where we want log data
>>> rolled up into 5 minute, hourly and daily increments in HDFS to run
>>> map reduce jobs on them. How will this model work with the HBase
>>> approach? What process will aggregate the HBase data into time
>>> increments like the current demux and hourly/daily rolling processes
>>> do? Basically, what does the time partitioning look like in the HBase
>>> storage scheme?
>>>
>>>> My concern is that the demux process is going to become two parallel
>>>> tracks, one works in mapreduce, and another one works in collector.  It
>>>> becomes difficult to have clean efficient parsers which works in both