HBase user mailing list: Performance between HBaseClient scan and HFileReaderV2


Thread:
Jerry Lam 2013-12-23, 20:18
Tom Hood 2013-12-30, 02:09
Jerry Lam 2014-01-02, 15:56
Vladimir Rodionov 2014-01-02, 18:30
Jean-Marc Spaggiari 2014-01-02, 18:35
Jerry Lam 2014-01-02, 21:32
Sergey Shelukhin 2014-01-02, 21:42
Sergey Shelukhin 2014-01-02, 21:43
Enis Söztutar 2014-01-02, 22:02

Re: Performance between HBaseClient scan and HFileReaderV2

Hello Sergey and Enis,

Thank you for the pointers! HBASE-8691 will definitely help, and HBASE-10076
(a very interesting/exciting feature, by the way!) is what I need. How can I
port it to 0.92.x, if that is at all possible?

I understand that my test is not realistic. However, since I have only 1
region with 1 HFile (this is by design), there should not be any
merge-sorted read going on.
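[Editor's note, not part of the original message: the merge-sorted read discussed here is essentially what HRegion's KeyValueHeap does across HFiles. A minimal, self-contained sketch of the idea, using plain sorted string lists as hypothetical stand-ins for HFile scanners:]

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Toy sketch of a merge-sorted read across several sorted "files",
// as HRegion's KeyValueHeap does across HFiles. Each "file" here is
// just a sorted list of string keys (a hypothetical stand-in).
public class MergeScan {
    private static class Cursor {
        final Iterator<String> it;
        String current;
        Cursor(List<String> file) { it = file.iterator(); current = it.next(); }
        void advance() { current = it.hasNext() ? it.next() : null; }
    }

    public static List<String> scan(List<List<String>> files) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> a.current.compareTo(b.current));
        for (List<String> f : files) {
            if (!f.isEmpty()) heap.add(new Cursor(f));
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();              // smallest current key wins
            out.add(c.current);
            c.advance();
            if (c.current != null) heap.add(c);  // re-seat the cursor
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> a = List.of("a", "d", "f");
        List<String> b = List.of("b", "c", "e");
        System.out.println(scan(List.of(a, b))); // keys come out globally sorted
    }
}
```

With a single file the heap degenerates to one cursor, which is Jerry's point: with 1 region and 1 HFile there is nothing to merge, only heap bookkeeping.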

One thing I'm not sure about: since I use Snappy compression, is the
value of each KeyValue decompressed at the region server? If so, I think
that is quite inefficient, because the decompression could be done on the
client side. Saving bandwidth saves a lot of time for the type of workload
I'm working on.
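[Editor's note, not part of the original message: a sketch of the bandwidth argument. The JDK has no Snappy codec, so java.util.zip's Deflater/Inflater stand in for it here; the point is only that shipping a repetitive value compressed and decompressing on the client shrinks the bytes on the wire.]

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch: compare bytes-on-the-wire for a repetitive cell value when
// the server decompresses before sending vs. shipping it compressed
// for client-side decompression. Deflater stands in for Snappy.
public class WireBytes {
    static byte[] compress(byte[] raw) {
        Deflater d = new Deflater();
        d.setInput(raw);
        d.finish();
        byte[] buf = new byte[raw.length + 64];
        int n = d.deflate(buf);
        d.end();
        return Arrays.copyOf(buf, n);
    }

    static byte[] decompress(byte[] packed, int rawLen) {
        try {
            Inflater inf = new Inflater();
            inf.setInput(packed);
            byte[] out = new byte[rawLen];
            inf.inflate(out);
            inf.end();
            return out;
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] value = new byte[64 * 1024];      // a repetitive 64 KB value
        Arrays.fill(value, (byte) 'x');
        byte[] onWire = compress(value);         // what a compressed RPC would carry
        System.out.println("raw=" + value.length + " compressed=" + onWire.length);
        // Client-side decompression recovers the original value.
        System.out.println(Arrays.equals(value, decompress(onWire, value.length)));
    }
}
```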

Best Regards,

Jerry

On Thu, Jan 2, 2014 at 5:02 PM, Enis Söztutar <[EMAIL PROTECTED]> wrote:

> Nice test!
>
> There are a couple of things here:
>
>  (1) HFileReader reads only one file, whereas an HRegion reads multiple
> files (into the KeyValueHeap) to do a merge scan. So, although there is
> only one file, there is some overhead of doing a merge-sorted read from
> multiple files in the region. For a more realistic test, you can try to do
> the reads using HRegion directly (instead of HFileReader). The overhead is
> not that much in my tests, though.
>  (2) For scanning with the client API, the results have to be serialized,
> deserialized, and sent over the network (or loopback for local). This is
> another overhead that is not there in HFileReader.
>  (3) The HBase scanner RPC implementation is NOT streaming. The RPC works by
> fetching a batch of records (batch size 10000) at a time, and cannot fully
> saturate the disk and network pipeline.
>
> In my tests for "MapReduce over snapshot files (HBASE-8369)", I measured a
> 5x difference because of (2) and (3). Please see my slides
> at http://www.slideshare.net/enissoz/mapreduce-over-snapshots
>
> I think we can do a much better job at (3), see HBASE-8691. However, there
> will always be "some" overhead, although it should not be 5-8x.
>
> As suggested above, in the meantime, you can take a look at the patch for
> HBASE-8369, and https://issues.apache.org/jira/browse/HBASE-10076 to see
> whether it suits your use case.
>
> Enis
>
>
> On Thu, Jan 2, 2014 at 1:43 PM, Sergey Shelukhin <[EMAIL PROTECTED]
> >wrote:
>
> > Er, using MR over snapshots, which reads files directly...
> > https://issues.apache.org/jira/browse/HBASE-8369
> > However, it was only committed to 98.
> > There was interest in 94 port (HBASE-10076), but it never happened...
> >
> >
> > On Thu, Jan 2, 2014 at 1:42 PM, Sergey Shelukhin <[EMAIL PROTECTED]
> > >wrote:
> >
> > > You might be interested in using
> > > https://issues.apache.org/jira/browse/HBASE-8369
> > > However, it was only committed to 98.
> > > There was interest in 94 port (HBASE-10076), but it never happened...
> > >
> > >
> > > On Thu, Jan 2, 2014 at 1:32 PM, Jerry Lam <[EMAIL PROTECTED]>
> wrote:
> > >
> > >> Hello Vladimir,
> > >>
> > >> In my use case, I guarantee that a major compaction is executed before
> > >> any scan happens, because the system we are building is read-only. There
> > >> will be no deleted cells. Additionally, I only need to read from a single
> > >> column family, and therefore I don't need to access multiple HFiles.
> > >>
> > >> Filter conditions are nice to have, because if I can read the HFile 8x
> > >> faster than using HBaseClient, I can do the filtering on the client side
> > >> and still perform faster than using HBaseClient.
> > >>
> > >> Thank you for your input!
> > >>
> > >> Jerry
> > >>
> > >>
> > >>
> > >> On Thu, Jan 2, 2014 at 1:30 PM, Vladimir Rodionov
> > >> <[EMAIL PROTECTED]>wrote:
> > >>
> > >> > The HBase scanner MUST guarantee the correct order of KeyValues (coming
> > >> > from different HFiles), apply filter conditions on the included column
> > >> > families and qualifiers, honor the time range and max versions, and
> > >> > correctly process deleted cells. A direct HFileReader does none of the
> > >> > above.
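[Editor's note, not part of the original message: a toy sketch, not HBase code, of the max-versions and delete handling Vladimir lists. The per-column bookkeeping looks roughly like this, and a raw HFileReader skips all of it:]

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of scanner-side bookkeeping a raw HFileReader skips:
// drop cells masked by a delete marker and cap versions per column.
// Cells arrive sorted by column, newest timestamp first (as in HBase).
public class VersionFilter {
    record Cell(String column, long ts, boolean delete, String value) {}

    public static List<Cell> visible(List<Cell> sorted, int maxVersions) {
        List<Cell> out = new ArrayList<>();
        String col = null;
        long deleteTs = Long.MIN_VALUE; // masks puts with ts <= deleteTs
        int kept = 0;
        for (Cell c : sorted) {
            if (!c.column().equals(col)) {   // new column: reset state
                col = c.column();
                deleteTs = Long.MIN_VALUE;
                kept = 0;
            }
            if (c.delete()) {                // delete marker masks older puts
                deleteTs = Math.max(deleteTs, c.ts());
            } else if (c.ts() > deleteTs && kept < maxVersions) {
                out.add(c);
                kept++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Cell> cells = List.of(
            new Cell("c", 5, false, "v5"),
            new Cell("c", 4, true, ""),
            new Cell("c", 3, false, "v3"));
        System.out.println(visible(cells, 2)); // only v5 survives the delete at ts=4
    }
}
```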
Ted Yu 2014-01-02, 23:35
lars hofhansl 2014-01-02, 21:45
lars hofhansl 2014-01-02, 21:44
Jerry Lam 2014-01-02, 23:53
Stack 2014-01-02, 16:23
Jerry Lam 2014-01-02, 17:18
Andrew Purtell 2014-01-02, 17:47
lars hofhansl 2014-01-02, 18:54