HBase, mail # user - coprocessor enabled put very slow, help please~~~

Re: coprocessor enabled put very slow, help please~~~
Michel Segel 2013-02-20, 14:14

What happens when you have a poem like "Mary Had a Little Lamb"?

Did you turn off the WAL on both table inserts, or just the index?

If you want to avoid processing duplicate docs... You could do this a couple of ways. The simplest way is to record the doc ID and a checksum for the doc. If the doc you are processing matches a recorded ID and checksum, you can simply do a NOOP for the lines in the doc. (This isn't the fastest, but it's easy.)
The other is to run a preprocessing step that removes duplicate docs from your directory, and then process the docs that remain...
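
To make the first option concrete, here is a minimal sketch in Python. It is not HBase code: the `seen` dict is only a stand-in for wherever you record the doc ID and checksum (for example a lookup table), and the names `checksum`, `DedupingIndexer` and `process` are made up for illustration. It only shows the control flow: compute the checksum, skip the doc if the ID and checksum are already recorded, otherwise record them and do plain increments.

```python
import hashlib


def checksum(text: str) -> str:
    """Return a hex SHA-256 digest of the document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class DedupingIndexer:
    """Hypothetical in-memory stand-in for a doc table plus a word-count index."""

    def __init__(self):
        # doc_id -> checksum of the last version processed.
        # Assumption: in a real system this would live in a persistent store.
        self.seen = {}
        # word -> count; stands in for the word-count index.
        self.word_counts = {}

    def process(self, doc_id: str, text: str) -> bool:
        """Index the doc unless the same ID and checksum were already recorded.

        Returns True if the doc was indexed, False if it was skipped (NOOP).
        """
        digest = checksum(text)
        if self.seen.get(doc_id) == digest:
            return False  # duplicate: do nothing, no increments
        # Note: a changed doc under an existing ID is indexed again here;
        # this sketch does not subtract the counts of the old version.
        self.seen[doc_id] = digest
        for word in text.lower().split():
            self.word_counts[word] = self.word_counts.get(word, 0) + 1
        return True


indexer = DedupingIndexer()
first = indexer.process("doc1", "Mary had a little lamb")   # indexed
second = indexer.process("doc1", "Mary had a little lamb")  # skipped as a duplicate
```

Because the duplicate is filtered before any index write, the index update itself can be an unconditional increment. Note that repeated lines inside a single doc are still counted every time they occur; only whole-doc duplicates are skipped.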

Third thing... Do a code review. Sloppy code will kill performance...

Sent from a remote device. Please excuse any typos...

Mike Segel

On Feb 20, 2013, at 5:26 AM, Prakash Kadel <[EMAIL PROTECTED]> wrote:

> Michael,
>   In fact, I don't care about latency between the doc write and the index write.
> Today I did some tests.
> It turns out that turning off the WAL does speed up the writes by about a factor of 2.
> Interestingly, enabling the bloom filter did little to improve the checkAndPut.
> Earlier you mentioned:
>>>>> The OP doesn't really get into the use case, so we don't know why the Check and Put in the M/R job.
>>>>> He should just be using put() and then a postPut().
> The main reason I use checkAndPut is to make sure the word count index doesn't get duplicate increments when duplicate documents come in. Additionally, I need to dump duplicate-free docs to HDFS for a legacy system that we have in place.
> Is there some way to avoid checkAndPut?
> Sincerely,
> Prakash
> On Feb 20, 2013, at 10:00 PM, Michel Segel <[EMAIL PROTECTED]> wrote:
>> I was suggesting removing the write to WAL on your write to the index table only.
>> The thing you have to realize is that true low-latency systems use databases as a sink. It's the end of the line, so to speak.
>> So if you're worried about a small latency between the write to your doc table and the write to your index, you are designing the wrong system.
>> Consider that it takes some time t to write the base record and then to write the indexes.
>> For that period, you have a Schrödinger's cat problem as to whether the row exists or not. Since HBase lacks multi-row transactions and ACID guarantees across tables, if you are trying to write a solution that requires that low latency, you are using the wrong tool.