Subject: RE: Deal with duplicates in Flume with a crash.
From our view, a single FlumeEvent is atomic: it either writes to HDFS or it does not. At peak I'm fluming approximately 2 million log lines per second. That would be 2 million check-and-puts per second to HBase. Putting 200 log lines in a single event results in 10,000 FlumeEvents per second, each with a UUID, resulting in 10,000 HBase check-and-puts per second. When putting a single log line per FlumeEvent I was hammering HBase with 2,000,000 check-and-puts per second. Additionally, I found I got orders of magnitude more throughput on a Flume flow doing this than I did by increasing batch size. The obvious trade-off: I'm not running stock Flume code.
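The batching-plus-UUID scheme above can be sketched roughly as follows. This is not the poster's actual code: the class and method names are illustrative, and an in-memory HashSet stands in for HBase's checkAndPut, which atomically succeeds only if the key is not already present.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

public class BatchedDedup {
    static final int LINES_PER_EVENT = 200;

    // Stand-in for HBase checkAndPut: returns true only the first time a key is seen,
    // so a replayed event after a crash is detected as a duplicate and skipped.
    static boolean checkAndPut(Set<String> store, String key) {
        return store.add(key);
    }

    public static void main(String[] args) {
        // With batching, dedup cost drops from one check per log line
        // to one check per event: 2,000,000 / 200 = 10,000 per second.
        long logLinesPerSecond = 2_000_000L;
        long eventsPerSecond = logLinesPerSecond / LINES_PER_EVENT;
        System.out.println("check-and-puts/sec with batching: " + eventsPerSecond);

        // Build one batched event and key it by a single UUID.
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < LINES_PER_EVENT; i++) {
            batch.add("log line " + i);
        }
        String eventId = UUID.randomUUID().toString();

        Set<String> seen = new HashSet<>();
        boolean firstWrite = checkAndPut(seen, eventId);  // true: safe to write batch to HDFS
        boolean replay = checkAndPut(seen, eventId);      // false: duplicate, drop the batch
        System.out.println("first write: " + firstWrite + ", replay: " + replay);
    }
}
```

The key point is that the whole batch shares one UUID, so a crash-and-replay either duplicates the entire event (caught by the one check) or none of it, which is why the per-event atomicity matters.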
From: Guillermo Ortiz [[EMAIL PROTECTED]]
Sent: Wednesday, December 03, 2014 4:46 PM
To: [EMAIL PROTECTED]
Subject: Re: Deal with duplicates in Flume with a crash.
That's interesting. Do you have the RegionServers on different nodes
than your Flume Agents? Because that could be a lot of traffic.
If you want to check duplicates for each log, the number of
check-and-puts is always the same. What's the point of putting several
logs in the same event?
2014-12-03 23:35 GMT+01:00 Mike Keane <[EMAIL PROTECTED]>: