Home | About | Sematext search-lucene.com search-hadoop.com
HBase >> mail # user >> Re: HBaseStorage. Inconsistent result.


Re: HBaseStorage. Inconsistent result.
Can you run RowCounter a bunch of times to see if it exhibits the same
issue? That would tell us whether it's HBase or Pig causing it.

http://hbase.apache.org/book.html#rowcounter
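For reference, repeated runs can be scripted from the shell. This is only a sketch: the table name mmpages comes from the thread below, and the exact counter line in RowCounter's output can vary between HBase versions.

```shell
# Run RowCounter a few times against the same table and keep only the
# ROWS counter line from each run; if the values differ across runs,
# the inconsistency is on the HBase side rather than in Pig.
for i in 1 2 3; do
  hbase org.apache.hadoop.hbase.mapreduce.RowCounter mmpages 2>&1 \
    | grep 'ROWS='
done
```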

J-D
On Tue, Apr 9, 2013 at 3:58 AM, Eugene Morozov <[EMAIL PROTECTED]> wrote:

> Hello everyone.
>
> I have the following script:
> pages = LOAD 'hbase://mmpages' USING
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('t:d', '-loadKey');
> pages2 = FOREACH pages GENERATE $0;
> pages3 = DISTINCT pages2;
> g_pages = GROUP pages3 all PARALLEL 1;
> s_pages = FOREACH g_pages GENERATE 'count', COUNT(pages3);
> DUMP s_pages;
>
> It just counts the number of row keys in the table.
> The issue is that it gives me different results.
> I did two runs:
>     * first one: 7 tasks in parallel (I launched the same script 7 times
> trying to imitate a heavy workload)
>     * second one: 9 tasks in parallel.
>
> All 7 jobs in the first run and 8 of the 9 in the second give me the correct result, which is:
>
> Input(s):
> Successfully read 246419854 records (102194 bytes) from: "hbase://mmpages"
> ...
> (count,246419854)
>
>
> But the last job of the second run gives a different result:
> Input(s):
> Successfully read 246419853 records (102194 bytes) from: "hbase://mmpages"
> ...
> (count,246419853)
>
> The number of bytes read is the same, but the number of rows is different.
>
> There was definitely no change in mmpages. We do not use standard
> Put/Delete, only bulk import, and no major compaction has run on
> this table. Even if one had run, it wouldn't have deleted anything,
> because the TTL of this table is '2147483647'. Moreover, this table is for
> debug purposes; nobody uses it but me.
>
>
> The original issue I hit was actually the same, but with my own HBaseStorage,
> which gives much less consistent results. For example, a run of 7 parallel jobs gives
> me:
> --(count,246419854)
> --(count,246419173) : Successfully read 246419173 records (2333164 bytes)
> from: "hbase://mmpages"
> --(count,246419854) : Successfully read 246419854 records (2333164 bytes)
> from: "hbase://mmpages"
> --(count,246419854) : Successfully read 246419854 records (2333164 bytes)
> from: "hbase://mmpages"
> --(count,246419173) : Successfully read 246419173 records (2333164 bytes)
> from: "hbase://mmpages"
> --(count,246418816) : Successfully read 246418816 records (2333164 bytes)
> from: "hbase://mmpages"
> --(count,246418690)
> -- and one job failed due to a lease exception.
> During runs with my own HBaseStorage I see many map tasks killed with a "lease
> does not exist" exception, though the job usually finishes successfully.
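> One detail worth noting (a sketch, assuming the killed tasks are hitting the
> scanner lease timeout, which this thread does not confirm): on HBase versions
> of that era the scanner lease is controlled by hbase.regionserver.lease.period
> in hbase-site.xml, and raising it is a common way to keep slow mappers from
> losing their scanners while debugging:
>
> ```xml
> <!-- Sketch for hbase-site.xml: lengthen the scanner lease so map tasks
>      that pause between next() calls do not lose their lease.
>      60000 ms is the usual default; 300000 ms is only an illustrative value. -->
> <property>
>   <name>hbase.regionserver.lease.period</name>
>   <value>300000</value>
> </property>
> ```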
>
> As you can see, the number of bytes read is exactly the same every time, but the
> numbers of rows read differ. I got exactly the same behavior with the native
> HBaseStorage, though there the difference is really small.
>
> In any case, I didn't expect the original HBaseStorage to show the same behavior,
> so my question is now more about org.apache...HBaseStorage than
> about my own HBaseStorage.
>
> Any advice,
>     either on how to prove anything regarding the native org.apache...HBaseStorage
> and fix it,
>     or on further experiments on the matter, would be really
> appreciated.
> --
> Eugene Morozov
> Developer of Grid Dynamics
> Skype: morozov.evgeny
> www.griddynamics.com
> [EMAIL PROTECTED]
>