Flume, mail # user - Log Events get Lost - flume 1.3


Kumar, Deepak8 2013-04-16, 08:16
Brock Noland 2013-04-16, 14:39
Kumar, Deepak8 2013-04-16, 18:36
Re: Log Events get Lost - flume 1.3
Israel Ekpo 2013-04-16, 19:02
Hello

You can append or prefix a unique value to the nano time since you want
each key to be unique per batch.

Here is the first approach:

// requires java.util.concurrent.atomic.AtomicLong
private static final AtomicLong idCounter = new AtomicLong();

public static String createSequenceID()
{
    return String.valueOf(idCounter.getAndIncrement());
}
You can also use the UUID random number generator to get unique values:

String uniqueID = UUID.randomUUID().toString();

http://docs.oracle.com/javase/6/docs/api/java/util/UUID.html#randomUUID()

I prefer the first option since it is more readable and more helpful,
especially when you are debugging issues.
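
For example, here is a rough sketch of combining the two ideas, assuming a
simple "nanoTime-counter" key layout (the class and method names below are
only illustrative):

import java.util.concurrent.atomic.AtomicLong;

public class RowKeyGenerator {

    private static final AtomicLong idCounter = new AtomicLong();

    // Append a monotonically increasing counter to the nano timestamp so that
    // two events created within the same nanosecond still get distinct keys.
    public static String createRowKey() {
        return System.nanoTime() + "-" + idCounter.getAndIncrement();
    }

    public static void main(String[] args) {
        // Even in a tight loop the generated keys never repeat.
        for (int i = 0; i < 5; i++) {
            System.out.println(createRowKey());
        }
    }
}

The counter suffix only guarantees uniqueness within a single JVM; across
agents you would presumably still include the host name your interceptor
already adds to the headers.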

I hope this helps.

On 16 April 2013 14:36, Kumar, Deepak8 <[EMAIL PROTECTED]> wrote:

>  Hi Brock,
>
> Thanks for assisting.
>
> Actually we have an interceptor implementation through which we generate
> our row key for HBase (HBase is the sink). If we have a larger batch size,
> the chances are that the timestamp gets repeated in the row key, which
> would overwrite rows in HBase.
>
> Could you please guide me whether there is any workaround so that I can
> have a larger batch size and still keep the row key from repeating? I am
> taking the count up to the nano timestamp.
>
> Regards,
> Deepak
>
> @Override
> public Event intercept(Event event) {
>   // eventCounter++;
>   // env, logType, appId, logPath and logFileName
>   Map<String, String> headers = event.getHeaders();
>   long now = System.currentTimeMillis();
>   String nowNano = Long.toString(System.nanoTime());
>   // nowNano = nowNano.substring(nowNano.length() - 5);
>
>   headers.put(TIMESTAMP, Long.toString(now));
>   headers.put(HOST_NAME, hostName);
>   headers.put(ENV, env);
>   headers.put(LOG_TYPE, logType);
>   headers.put(APP_ID, appId);
>   headers.put(LOG_FILE_PATH, logFilePath);
>   headers.put(LOG_FILE_NAME, logFileName);
>   headers.put(TIME_STAMP_NANO, nowNano);
>
>   return event;
> }
>
> @Override
> public List<Event> intercept(List<Event> events) {
>   for (Event event : events) {
>     intercept(event);
>   }
>   return events;
> }
>
> From: Brock Noland [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, April 16, 2013 10:39 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Log Events get Lost - flume 1.3
>
> Hi,
>
> There are two issues with your configuration:
>
> 1) A batch size of 1 with the file channel is an anti-pattern. This will
> result in extremely poor performance because the file channel has to do an
> fsync() (an expensive disk operation required to ensure no data loss) for
> each event. Your batch size should probably be in the hundreds or
> thousands.
>
> 2) tail -F *will* lose data. There is a writeup on this in the
> documentation. If you care about your data, you will want to use the
> Spooling Directory Source.
>
> Issue #2 is being worsened by issue #1. Since you have such a low batch
> size, the throughput of the file channel is extremely low. Because tail -F
> gives no feedback to the tail process, more data is being lost than would
> otherwise be the case due to the low channel throughput.
>
> Brock
>
> On Tue, Apr 16, 2013 at 3:16 AM, Kumar, Deepak8 <[EMAIL PROTECTED]>
> wrote:
>
> Hi,
>
> I have 10 flume agents configured on a single machine. A single log file
> has a frequency of 500 log events/sec. Hence across the 10 log files the
> logs are being generated at 5,000 log events per second.
>
> If my channel capacity is 1 million, more than 70% of the log events are
> lost! If I increase the channel capacity to 50 million, then the flume
> agent takes more than 24 hours to transfer the log events from source to
> sink.
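
As a rough illustration of Brock's two suggestions above (a batch size in
the hundreds and the Spooling Directory Source instead of tail -F), a Flume
1.3 agent configuration along these lines might be a starting point. The
agent and component names, paths, table and column family below are
placeholders; property names should be checked against the Flume 1.3 user
guide:

# agent1: spooling directory source -> file channel -> HBase sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Spooling Directory Source gives the source feedback on delivery,
# unlike exec/tail -F
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /var/log/flume-spool
agent1.sources.src1.channels = ch1

# File channel
agent1.channels.ch1.type = file
agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
agent1.channels.ch1.dataDirs = /var/flume/data

# HBase sink with a batch size in the hundreds rather than 1
agent1.sinks.sink1.type = org.apache.flume.sink.hbase.HBaseSink
agent1.sinks.sink1.table = log_events
agent1.sinks.sink1.columnFamily = cf
agent1.sinks.sink1.batchSize = 500
agent1.sinks.sink1.channel = ch1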
Brock Noland 2013-04-16, 19:17
Kumar, Deepak8 2013-04-17, 14:22