Flume, mail # user - Log Events get Lost - flume 1.3


Re: Log Events get Lost - flume 1.3
Brock Noland 2013-04-16, 19:17
As Israel said, you should not depend on the timestamp being unique. I'd
also proceed with caution with the sequence-id approach if it will be at
the front of your key, for the same reason a timestamp at the front of
your key can be a problem. More on that below.

You didn't specify where the timestamp is located in your key, but I'd
like to warn you about an extremely common problem for new users of HBase.
It's called "hot spotting", and using a timestamp as the first portion of
the key is nearly guaranteed to make your tables hot spot. I'd suggest
reading chapter 9 of HBase: The Definitive Guide.
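
One common mitigation is to "salt" the key with a short prefix derived
from the rest of the key, so sequential timestamps spread across regions.
A minimal sketch, purely for illustration (the bucket count and method
name are assumptions, not from this thread):

// Spread writes across regions by prefixing a bucket id derived from
// the natural key. Reads must apply the same salting to locate rows,
// and SALT_BUCKETS must stay fixed once data has been written.
private static final int SALT_BUCKETS = 16;

public static String saltRowKey(String naturalKey) {
    int bucket = (naturalKey.hashCode() % SALT_BUCKETS + SALT_BUCKETS) % SALT_BUCKETS;
    return String.format("%02d-%s", bucket, naturalKey); // e.g. "07-1366142220000123"
}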

Brock
On Tue, Apr 16, 2013 at 2:02 PM, Israel Ekpo <[EMAIL PROTECTED]> wrote:

> Hello
>
> You can append or prefix a unique value to the nano time since you want
> each key to be unique per batch.
>
> Here is the first approach:
>
> import java.util.concurrent.atomic.AtomicLong;
>
> private static AtomicLong idCounter = new AtomicLong();
>
> public static String createSequenceID()
> {
>     return String.valueOf(idCounter.getAndIncrement());
> }
>
>
> You can also use the UUID random generator to get unique values:
>
> String uniqueID = UUID.randomUUID().toString();
>
> http://docs.oracle.com/javase/6/docs/api/java/util/UUID.html#randomUUID()
>
> I prefer the first option since it is more readable and more helpful
> when you are debugging issues.
>
> I hope this helps.
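
Combining the two pieces as suggested above (a counter appended to the
nano time) might look like this minimal sketch; the helper name
createUniqueKey is an assumption, not from the thread:

import java.util.concurrent.atomic.AtomicLong;

// nanoTime alone can repeat across calls and across JVMs, so pair it
// with a process-local counter; the pair is unique within one JVM run.
private static final AtomicLong idCounter = new AtomicLong();

public static String createUniqueKey() {
    return System.nanoTime() + "-" + idCounter.getAndIncrement();
}

Note the counter restarts at zero when the agent restarts, so a hostname
(already present in Deepak's headers below) would still be needed in the
key for uniqueness across hosts and runs.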
>
> On 16 April 2013 14:36, Kumar, Deepak8 <[EMAIL PROTECTED]> wrote:
>
>>  Hi Brock,
>>
>> Thanks for assisting.
>>
>> Actually, we have an interceptor implementation through which we
>> generate our row key for HBase (HBase is the sink). If we use a larger
>> batch size, the chances are that the timestamp repeats in the row key,
>> which would overwrite rows in HBase.
>>
>> Could you please guide me to a workaround so that I can use a larger
>> batch size while the row key is not repeated? I am already taking the
>> time down to the nano timestamp.
>>
>> Regards,
>> Deepak
>>
>> @Override
>>   public Event intercept(Event event) {
>> //      eventCounter++;
>>     // env, logType, appId, logPath and logFileName
>>     Map<String, String> headers = event.getHeaders();
>>     long now = System.currentTimeMillis();
>>     String nowNano = Long.toString(System.nanoTime());
>>     // nowNano = nowNano.substring(nowNano.length() - 5);
>>
>>     headers.put(TIMESTAMP, Long.toString(now));
>>     headers.put(HOST_NAME, hostName);
>>     headers.put(ENV, env);
>>     headers.put(LOG_TYPE, logType);
>>     headers.put(APP_ID, appId);
>>     headers.put(LOG_FILE_PATH, logFilePath);
>>     headers.put(LOG_FILE_NAME, logFileName);
>>     headers.put(TIME_STAMP_NANO, nowNano);
>>
>>     return event;
>>   }
>>
>> @Override
>>   public List<Event> intercept(List<Event> events) {
>>     for (Event event : events) {
>>       intercept(event);
>>     }
>>     return events;
>>   }
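
Tying the pieces together: a hypothetical one-liner (reusing the
saltRowKey and idCounter sketches above, which are not from the thread)
for how a sink-side serializer could build a unique, non-hot-spotting
row key from these headers:

String rowKey = saltRowKey(headers.get(APP_ID) + "-"
    + headers.get(TIME_STAMP_NANO) + "-" + idCounter.getAndIncrement());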
>>
>>
>> From: Brock Noland [mailto:[EMAIL PROTECTED]]
>> Sent: Tuesday, April 16, 2013 10:39 AM
>> To: [EMAIL PROTECTED]
>> Subject: Re: Log Events get Lost - flume 1.3
>>
>> Hi,
>>
>> There are two issues with your configuration:
>>
>> 1) A batch size of 1 with the file channel is an anti-pattern. It will
>> result in extremely poor performance, because the file channel has to
>> do an fsync() (an expensive disk operation required to ensure no data
>> loss) for each event. Your batch size should probably be in the
>> hundreds or thousands.
>>
>> 2) tail -F *will* lose data. There is a writeup on this in the
>> documentation. If you care about your data, you will want to use the
>> Spooling Directory Source (both fixes are sketched in the config below).
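
A sketch of both fixes in agent-config form; the agent and component
names (agent1, spool, fc, sink1) and the paths are placeholders, not
taken from this thread:

# Spooling Directory Source instead of tail -F
agent1.sources.spool.type = spooldir
agent1.sources.spool.spoolDir = /var/log/app/spool
agent1.sources.spool.channels = fc

# File channel
agent1.channels.fc.type = file
agent1.channels.fc.checkpointDir = /var/flume/checkpoint
agent1.channels.fc.dataDirs = /var/flume/data

# HBase sink with a batch size in the hundreds, per the advice above
agent1.sinks.sink1.type = hbase
agent1.sinks.sink1.channel = fc
agent1.sinks.sink1.table = example_table
agent1.sinks.sink1.columnFamily = cf
agent1.sinks.sink1.batchSize = 500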
>>
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org