Pig, mail # user - HBaseStorage. Inconsistent result.


HBaseStorage. Inconsistent result.
Eugene Morozov 2013-04-09, 07:58
Hello everyone.

I have the following script:
pages = LOAD 'hbase://mmpages' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('t:d', '-loadKey');
pages2 = FOREACH pages GENERATE $0;
pages3 = DISTINCT pages2;
g_pages = GROUP pages3 all PARALLEL 1;
s_pages = FOREACH g_pages GENERATE 'count', COUNT(pages3);
DUMP s_pages;

It just counts the number of keys in the table.
The problem is that it gives me different results from run to run.
I did two launches:
    * first one - 7 copies of the script in parallel (I launched the same
script 7 times, trying to imitate a heavy workload)
    * second one - 9 copies in parallel.

All 7 runs in the first launch and 8 of the 9 in the second give me the
correct result, which is:

Input(s):
Successfully read 246419854 records (102194 bytes) from: "hbase://mmpages"
...
(count,246419854)
But the last one of the second launch gives a different result:
Input(s):
Successfully read 246419853 records (102194 bytes) from: "hbase://mmpages"
...
(count,246419853)

The number of bytes read is the same, but the number of rows is different.

There was definitely no change in mmpages. We do not use standard
Put/Delete, only bulkImport, and no major compaction was run on
this table. Even if one had run, it wouldn't have deleted anything,
because the TTL for this table is '2147483647'. Moreover, this table is for
debug purposes - nobody uses it but me.
The original issue I hit was actually the same, but with my own
HBaseStorage. It gives much less consistent results. For example, for 7
parallel runs it gives me:
--(count,246419854)
--(count,246419173) : Successfully read 246419173 records (2333164 bytes)
from: "hbase://mmpages"
--(count,246419854) : Successfully read 246419854 records (2333164 bytes)
from: "hbase://mmpages"
--(count,246419854) : Successfully read 246419854 records (2333164 bytes)
from: "hbase://mmpages"
--(count,246419173) : Successfully read 246419173 records (2333164 bytes)
from: "hbase://mmpages"
--(count,246418816) : Successfully read 246418816 records (2333164 bytes)
from: "hbase://mmpages"
--(count,246418690)
-- and one job failed due to a lease exception.
During runs with my own HBaseStorage I see many map tasks killed with a
"lease does not exist" exception, though the job usually finishes
successfully.

As you can see, the number of bytes read is exactly the same every time, but
the number of rows read differs. I get exactly the same behavior with the
native HBaseStorage, though the difference is really small.
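
To put numbers on the gap, here is a small sketch in Python (not part of the Pig job) that subtracts each reported count from the largest one. Treating 246419854 as the true row count is an assumption; it is simply the value most runs agree on.

```python
# Row counts reported by the 7 parallel runs with the custom HBaseStorage.
counts = [
    246419854,
    246419173,
    246419854,
    246419854,
    246419173,
    246418816,
    246418690,
]

# Assumption: the largest count is the true number of rows in the table.
expected = max(counts)

# Rows each run failed to read, relative to that assumed total.
shortfalls = [expected - c for c in counts]

for count, shortfall in zip(counts, shortfalls):
    print(f"count={count} missing={shortfall}")
```

Two of the runs come up short by the same amount (681 rows), which might mean the same part of the table was skipped both times, but that is only a guess from the counts.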

But anyway, I didn't expect the original HBaseStorage to show the same
behavior. So my question is now more about org.apache...HBaseStorage than
about my own HBaseStorage.

Any advice on
    how to prove that something is wrong in the native
org.apache...HBaseStorage so that it can be fixed, or
    what further experiments to run on the matter
would be really appreciated.
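
One possible experiment, as a rough sketch only: dump the row keys from a "good" run and a "short" run, sort each dump, and compare them to see which keys went missing. The Python below assumes the keys are already exported as sorted, de-duplicated text lines; the function name `missing_keys` and the sample keys are hypothetical, and nothing here is part of Pig or HBase.

```python
from typing import Iterable, Iterator


def missing_keys(reference: Iterable[str], candidate: Iterable[str]) -> Iterator[str]:
    """Yield keys present in `reference` but absent from `candidate`.

    Both inputs must be sorted ascending (plain string order) and
    contain no duplicates, so each is read only once.
    """
    cand_iter = iter(candidate)
    current = next(cand_iter, None)
    for key in reference:
        # Advance the candidate stream until it catches up with `key`.
        while current is not None and current < key:
            current = next(cand_iter, None)
        if current != key:
            yield key


if __name__ == "__main__":
    # Hypothetical sample data standing in for two key dumps.
    good_run = ["row-001", "row-002", "row-003", "row-004"]
    short_run = ["row-001", "row-003"]
    for key in missing_keys(good_run, short_run):
        print(key)
```

If the missing keys turn out to be contiguous, that would point at a skipped scanner batch or region rather than at random row loss, though this is an assumption to check, not a conclusion.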
--
Eugene Morozov
Developer of Grid Dynamics
Skype: morozov.evgeny
www.griddynamics.com
[EMAIL PROTECTED]