HBase, mail # user - Re: HBaseStorage. Inconsistent result.


Re: HBaseStorage. Inconsistent result.
Jean-Daniel Cryans 2013-04-17, 17:50
Can you run RowCounter a bunch of times to see if it exhibits the same
issue? That would tell us whether the problem is in HBase or in Pig.

http://hbase.apache.org/book.html#rowcounter

J-D
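For reference, a RowCounter run (per the book link above) looks roughly like the sketch below. This is a command sketch, not from the thread: it needs a running HBase/Hadoop cluster, the table name mmpages is taken from the thread, and the exact class path may vary by HBase version.

```shell
# Run the MapReduce RowCounter against the table several times and
# compare the reported ROWS counter across runs (sketch; requires a
# cluster, exact invocation depends on your HBase/Hadoop setup).
for i in 1 2 3; do
  hbase org.apache.hadoop.hbase.mapreduce.RowCounter mmpages \
    2>&1 | grep ROWS
done
```

If the ROWS counter varies across runs the same way the Pig counts do, the inconsistency is on the HBase/scan side rather than in Pig.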
On Tue, Apr 9, 2013 at 3:58 AM, Eugene Morozov <[EMAIL PROTECTED]> wrote:

> Hello everyone.
>
> I have the following script:
> pages = LOAD 'hbase://mmpages' USING
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('t:d', '-loadKey');
> pages2 = FOREACH pages GENERATE $0;
> pages3 = DISTINCT pages2;
> g_pages = GROUP pages3 all PARALLEL 1;
> s_pages = FOREACH g_pages GENERATE 'count', COUNT(pages3);
> DUMP s_pages;
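The dataflow of the script above, in miniature: a Python sketch (not from the thread) of what the Pig relations compute, with toy in-memory rows standing in for the HBase table.

```python
# Toy model of the Pig script's dataflow: load (key, value) rows,
# project the key ($0), de-duplicate (DISTINCT), then COUNT over a
# single group (GROUP ... all).
rows = [("k1", "a"), ("k2", "b"), ("k2", "b2"), ("k3", "c")]  # toy data

pages2 = [key for key, _ in rows]   # FOREACH pages GENERATE $0
pages3 = set(pages2)                # DISTINCT pages2
s_pages = ("count", len(pages3))    # GROUP ... all; COUNT(pages3)

print(s_pages)  # ('count', 3)
```

Given fixed input, every run of this logic must produce the same count, which is why varying results point at the scan over the table rather than at the script.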
>
> It simply counts the number of row keys in the table.
> The issue is that it gives me different results on different runs.
> I did two batches of runs:
>     * first batch - 7 jobs in parallel (I launched the same script 7 times
> trying to imitate a heavy workload)
>     * second batch - 9 jobs in parallel.
>
> All 7 jobs in the first batch and 8 of the 9 in the second gave the correct result:
>
> Input(s):
> Successfully read 246419854 records (102194 bytes) from: "hbase://mmpages"
> ...
> (count,246419854)
>
>
> But the last job of the second batch gave a different result:
> Input(s):
> Successfully read 246419853 records (102194 bytes) from: "hbase://mmpages"
> ...
> (count,246419853)
>
> The number of bytes read is the same, but the number of rows differs.
>
> There was definitely no change in mmpages. We do not use standard
> Put/Delete - only bulk import - and no major compaction has run on this
> table. Even if one had run, it wouldn't have deleted anything, because the
> TTL of this table is '2147483647' (Integer.MAX_VALUE, i.e. never expire).
> Moreover, this table is for debug purposes - nobody uses it but me.
>
>
> The original issue I hit was actually the same, but with my own
> HBaseStorage, which gives much less consistent results. For example, 7
> parallel runs give me:
> --(count,246419854)
> --(count,246419173) : Successfully read 246419173 records (2333164 bytes)
> from: "hbase://mmpages"
> --(count,246419854) : Successfully read 246419854 records (2333164 bytes)
> from: "hbase://mmpages"
> --(count,246419854) : Successfully read 246419854 records (2333164 bytes)
> from: "hbase://mmpages"
> --(count,246419173) : Successfully read 246419173 records (2333164 bytes)
> from: "hbase://mmpages"
> --(count,246418816) : Successfully read 246418816 records (2333164 bytes)
> from: "hbase://mmpages"
> --(count,246418690)
> -- and one job failed due to a lease exception.
> During runs with my own HBaseStorage I see many map tasks killed with a
> "lease does not exist" exception, though the job usually finishes
> successfully.
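An aside not from the thread: in HBase of this era, a "lease does not exist" error on scans typically means the region server expired the scanner's lease because the client spent longer than the configured lease period between successive next() calls (e.g. in a slow map task). The timeout is controlled by hbase.regionserver.lease.period (default 60000 ms in 0.9x); a hedged hbase-site.xml sketch raising it:

```xml
<!-- hbase-site.xml: raise the scanner lease timeout (sketch; this
     property name applies to 0.9x-era HBase, default 60000 ms) -->
<property>
  <name>hbase.regionserver.lease.period</name>
  <value>300000</value>
</property>
```

Lowering the scanner caching in the storage handler, so each next() batch takes less time to process, is an alternative to raising the timeout.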
>
> As you can see, the number of bytes read is exactly the same every time,
> but the numbers of rows read differ. I got exactly the same behavior with
> the native HBaseStorage, though there the difference is really small.
>
> Anyway, I didn't expect the original HBaseStorage to show the same
> problem, so my question is now more about org.apache...HBaseStorage than
> about my own HBaseStorage.
>
> Any advice on how
>     to verify the behavior of the native org.apache...HBaseStorage and fix it,
> or
>     to run more experiments on the matter
> would be really appreciated.
> --
> Eugene Morozov
> Developer at Grid Dynamics
> Skype: morozov.evgeny
> www.griddynamics.com
> [EMAIL PROTECTED]
>