When the HBase counter is removed from the data pipeline, does that
mean the agent was unchanged and only the collector changed?
It sounds like the problem is the collector writing too slowly to HBase
rather than the agent sending data too slowly to the collector.

What does the demux parser look like? The slowdown may be occurring in
the extraction process: with HBaseWriter, the ETL step consumes more
CPU on the collector. You might need to make sure the demux parser is
optimized.

Have more data streams been added to the Chukwa agent? What adapters
are used to collect data?

There are two reasons that we don't split the data stream at the agent
level.

1. Checkpointing at the agent level is done by stream offset for
efficiency. If the stream were split into multiple chunks with a
checkpoint for every chunk, there would be too many write operations
on the source node, which could impact the source system.
2. For log collection, we want to send data in sequence so that the
log stream is processed in order.
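
To illustrate the first point, here is a minimal sketch of checkpointing by stream offset. It is not Chukwa's actual code: the class name, method names, and the checkpoint text format are all made up for illustration. The idea it shows is that the agent keeps one committed offset per stream, so the checkpoint stays one line per stream no matter how many chunks were sent.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch of offset-based checkpointing.
 * Names and file format are assumptions, not Chukwa's API.
 */
final class StreamOffsetCheckpoint {
    // One committed offset per stream, regardless of how many chunks were sent.
    private final Map<String, Long> committed = new LinkedHashMap<>();

    /** Record that the collector acknowledged the stream up to endOffset. */
    void ack(String stream, long endOffset) {
        // Chunks are sent in sequence, so offsets normally only grow;
        // Math.max guards against a stale acknowledgement moving it backwards.
        committed.merge(stream, endOffset, Math::max);
    }

    /** Last acknowledged offset for a stream, or 0 if nothing was acknowledged. */
    long committedOffset(String stream) {
        return committed.getOrDefault(stream, 0L);
    }

    /** Render the checkpoint: one "stream offset" line per stream. */
    String checkpoint() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : committed.entrySet()) {
            sb.append(e.getKey()).append(' ').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }
}
```

Because only the highest acknowledged offset is stored, a restart resumes each stream from a single position. That only works if chunks of a stream are delivered in order, which is the second reason above.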
On Thu, Jan 26, 2012 at 1:27 PM, Corbin Hoenes <[EMAIL PROTECTED]> wrote:
> We use Chukwa for log aggregation of web servers and it powers our analytics pipeline. It's been super useful and solid, but we are running into a bit of a problem. I was hoping to split my data stream and create a realtime pipeline with HBase but also still stream into HDFS for batch MR processing.
> I am running some simple calculations on pageviews coming in and wanted to update HBase using counters. This is slow right now since I only really have 1 servlet processing my chunks in my demo environment. Without the realtime HBase counters in the pipeline, data flows a couple of orders of magnitude quicker. I was hoping that with smaller chunks and many more collector servlets I could make it scale better, but right now it slows down the data stream too much.
> We use only 3 collectors in production and they handle the traffic well, but adding more would give us more concurrent HBase writer capability. I was hoping there was a knob to allow for more concurrent chunk writing.
> On Jan 26, 2012, at 1:03 PM, Eric Yang wrote:
>> Hi Corbin,
>> This is by design. We concatenate all data streams into an
>> in-memory queue on the agent, and establish only one HTTP connection
>> to the collector. This is for horizontal scalability, so that we can
>> support more machines. At the same time, it also ensures that the
>> agent can write more data per HTTP POST, which reduces the overhead
>> of HTTP headers and connection handshakes.
>> On Thu, Jan 26, 2012 at 11:51 AM, Corbin Hoenes <[EMAIL PROTECTED]> wrote:
>>> I am trying to do some real-time processing of the data coming into my Chukwa pipeline, and I notice that with a single agent I don't seem to be getting very many servlets handling the requests. Peeking at the ChukwaAgent code, it looks like the agents are limited to a single HttpConnector.
>>> Is this by design or am I off-base in my analysis of how it works?