HBase >> mail # user >> strange PerformanceEvaluation behaviour


Oliver Meyn 2012-02-14, 15:56
Stack 2012-02-14, 16:14
Stack 2012-02-15, 06:32
Oliver Meyn 2012-02-15, 08:09
Oliver Meyn 2012-02-15, 09:53
Re: strange PerformanceEvaluation behaviour
Oliver:
Thanks for digging.

Please file JIRAs for these issues.

On Feb 15, 2012, at 1:53 AM, "Oliver Meyn (GBIF)" <[EMAIL PROTECTED]> wrote:

> On 2012-02-15, at 9:09 AM, Oliver Meyn (GBIF) wrote:
>
>> On 2012-02-15, at 7:32 AM, Stack wrote:
>>
>>> On Tue, Feb 14, 2012 at 8:14 AM, Stack <[EMAIL PROTECTED]> wrote:
>>>>> 2) With that same randomWrite command line above, I would expect a resulting table with 10 * (1024 * 1024) rows (so 10485760 = roughly 10M rows).  Instead, what I'm seeing is that the randomWrite job reports writing that many rows (exactly) but running rowcounter against the table reveals only 6549899 rows.  A second attempt to build the table produces slightly different results (e.g. 6627689).  I see a similar discrepancy when using 50 instead of 10 clients (~35% smaller than expected).  Key collision could explain it, but it seems pretty unlikely (given I only need e.g. 10M keys from a potential 2B).
>>>>>
>>>>
>>>
>>> I just tried it here and got a similar result.  I wonder if it's the
>>> randomWrite?  What if you do sequentialWrite, do you get your 10M?
>>
>> Thanks for checking into this, Stack - when using sequentialWrite I get the expected 10485760 rows.  I'll hack around a bit on the PE to count the number of collisions, and try to think of a reasonable solution.
>
> So hacking around reveals that key collision is indeed the problem.  I thought the modulo part of the getRandomRow method was suspect, but while removing it improved the behaviour (I got ~8M rows instead of ~6.6M) it didn't fix it completely.  Since that's really what UUIDs are for, I gave that a shot (i.e. UUID.randomUUID()) and sure enough now I get the full 10M rows.  Those are 16-byte keys now though, instead of the 10-byte keys that the integers produced.  But because we're testing scan performance I think using a sequentially written table would probably be cheating, and so I'll stick with randomWrite with slightly bigger keys.  That means it's a little harder to compare to the results that other people get, but at least I know my internal tests are apples to apples.
>
> Oh and I removed the outer 10x loop, and that produced the desired number of mappers (i.e. what I passed in on the command line) but made no difference in the key generation/collision story.
>
> Should I file bugs for these 2 issues?
>
> Thanks,
> Oliver
>
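
The ~35% shortfall described above is what the collision arithmetic predicts: drawing N keys uniformly at random from a keyspace of only N values leaves about N * (1 - 1/e), roughly 63%, of them distinct, i.e. about 6.63M of the 10485760 attempted, which lines up with the 6549899 and 6627689 counts reported in the thread. A minimal sketch that reproduces the effect (this is not the PerformanceEvaluation source; the modulo-style draw is an assumption about how getRandomRow picks keys):

import java.util.BitSet;
import java.util.Random;

public class CollisionEstimate {
  public static void main(String[] args) {
    final int totalRows = 10 * 1024 * 1024;   // 10485760, the row count targeted above
    final Random random = new Random();
    final BitSet seen = new BitSet(totalRows);
    for (int i = 0; i < totalRows; i++) {
      // getRandomRow-style draw (an assumption): a random non-negative int reduced modulo totalRows
      seen.set(random.nextInt(Integer.MAX_VALUE) % totalRows);
    }
    // Expected distinct keys after N draws from a space of N values:
    // N * (1 - (1 - 1/N)^N) ~= N * (1 - 1/e) ~= 0.632 * N
    double expected = totalRows * (1 - Math.pow(1 - 1.0 / totalRows, totalRows));
    System.out.printf("distinct keys: %d, expectation: %.0f (~63%% of %d)%n",
        seen.cardinality(), expected, totalRows);
  }
}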
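
The UUID approach mentioned above removes collisions for all practical purposes: a random (version 4) UUID carries 122 random bits, so 10M draws essentially never repeat, at the cost of the 16-byte keys noted in the thread. A hypothetical helper (not the actual PE change) showing how such a key could be built:

import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidRowKey {
  // Pack the 128-bit random UUID into a 16-byte row key.
  static byte[] uuidRowKey() {
    UUID uuid = UUID.randomUUID();
    return ByteBuffer.allocate(16)
        .putLong(uuid.getMostSignificantBits())
        .putLong(uuid.getLeastSignificantBits())
        .array();
  }

  public static void main(String[] args) {
    System.out.println(uuidRowKey().length + " bytes per key");  // prints "16 bytes per key"
  }
}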
Oliver Meyn 2012-02-15, 14:37
Stack 2012-02-15, 16:39
Oliver Meyn 2012-02-16, 09:37