HBase, mail # user - Performance between HBaseClient scan and HFileReaderV2


Re: Performance between HBaseClient scan and HFileReaderV2
Enis Söztutar 2014-01-02, 22:02
Nice test!

There are a couple of things here:

 (1) HFileReader reads only one file, whereas an HRegion reads multiple
files (into the KeyValueHeap) to do a merge scan. So, although there is
only one file in your case, there is still some overhead from going through
the merge-sorted read path that handles multiple files in the region. For a
more realistic test, you can try to do the reads using HRegion directly
(instead of HFileReader). The overhead is not that much in my tests, though.
 (2) For scanning with the client API, the results have to be serialized,
sent over the network (or loopback for local), and deserialized. This is
another overhead that is not there in HFileReader.
 (3) The HBase scanner RPC implementation is NOT streaming. Each RPC
fetches one batch of records (batch size 10000), so it cannot fully
saturate the disk and network pipeline.
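
To make point (1) concrete, here is a minimal sketch of a heap-based merge over sorted inputs. It is not HBase code: the class name MergeScanSketch, the Cursor helper, and the use of plain strings in place of KeyValues are all assumptions for illustration. It only shows why every row passes through a heap operation even when the region holds a single file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Toy k-way merge over sorted "files"; strings stand in for KeyValues (hypothetical). */
final class MergeScanSketch {

    /** One heap entry: the current key of a file plus an iterator over the rest. */
    private static final class Cursor implements Comparable<Cursor> {
        String current;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.current = it.next(); // caller guarantees the file is non-empty
        }

        @Override
        public int compareTo(Cursor other) {
            return current.compareTo(other.current);
        }
    }

    /** Merges already-sorted inputs into one sorted list. */
    static List<String> merge(List<List<String>> sortedFiles) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<String> file : sortedFiles) {
            if (!file.isEmpty()) {
                heap.add(new Cursor(file.iterator()));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            // Every emitted row costs a poll, and usually a re-insert,
            // even if the heap contains only one cursor.
            Cursor top = heap.poll();
            out.add(top.current);
            if (top.rest.hasNext()) {
                top.current = top.rest.next();
                heap.add(top);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> files = Arrays.asList(
            Arrays.asList("row1", "row4"),
            Arrays.asList("row2", "row3"));
        System.out.println(merge(files));
    }
}
```

With a single input the result is just that input, but the per-row heap bookkeeping is still paid, which is the kind of fixed overhead a direct file read avoids.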

In my tests for "MapReduce over snapshot files (HBASE-8369)", I measured
a 5x difference because of layers (2) and (3). Please see my slides
at http://www.slideshare.net/enissoz/mapreduce-over-snapshots

I think we can do a much better job at (3), see HBASE-8691. However, there
will always be "some" overhead, although it should not be 5-8x.
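
As a rough way to reason about point (3), the toy model below compares a fetch-then-wait scan with an idealized pipelined one. It is not HBase code, and the class name ScanRpcModel, its methods, and the per-batch timings are hypothetical values chosen only for illustration, not measurements.

```java
/** Toy timing model for batch-at-a-time scans; all numbers fed to it are assumptions. */
final class ScanRpcModel {

    /** Number of RPCs needed to return `rows` rows at `batch` rows per call. */
    static long roundTrips(long rows, long batch) {
        return (rows + batch - 1) / batch; // ceiling division
    }

    /** Non-streaming: the client waits for each batch before requesting the next. */
    static double sequentialMillis(long rows, long batch,
                                   double serverMsPerBatch, double transferMsPerBatch) {
        return roundTrips(rows, batch) * (serverMsPerBatch + transferMsPerBatch);
    }

    /** Ideal pipelining: preparing batch n+1 overlaps with shipping batch n. */
    static double pipelinedMillis(long rows, long batch,
                                  double serverMsPerBatch, double transferMsPerBatch) {
        long n = roundTrips(rows, batch);
        if (n == 0) {
            return 0.0;
        }
        // First batch pays both stages; each later batch pays only the slower stage.
        return serverMsPerBatch + transferMsPerBatch
            + (n - 1) * Math.max(serverMsPerBatch, transferMsPerBatch);
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;
        long batch = 10_000L;   // the batch size mentioned in the thread
        double serverMs = 20.0; // hypothetical time to read one batch
        double wireMs = 20.0;   // hypothetical time to serialize and ship one batch
        System.out.println(roundTrips(rows, batch));
        System.out.println(sequentialMillis(rows, batch, serverMs, wireMs));
        System.out.println(pipelinedMillis(rows, batch, serverMs, wireMs));
    }
}
```

In this model the non-streaming scan leaves either the disk or the network idle during every batch, so the gap to the pipelined case grows with the number of round trips. It says nothing about the absolute numbers on a real cluster.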

As suggested above, in the meantime, you can take a look at the patch for
HBASE-8369, and https://issues.apache.org/jira/browse/HBASE-10076 to see
whether it suits your use case.

Enis
On Thu, Jan 2, 2014 at 1:43 PM, Sergey Shelukhin <[EMAIL PROTECTED]> wrote:

> Er, using MR over snapshots, which reads files directly...
> https://issues.apache.org/jira/browse/HBASE-8369
> However, it was only committed to 98.
> There was interest in 94 port (HBASE-10076), but it never happened...
>
>
> On Thu, Jan 2, 2014 at 1:42 PM, Sergey Shelukhin <[EMAIL PROTECTED]> wrote:
>
> > You might be interested in using
> > https://issues.apache.org/jira/browse/HBASE-8369
> > However, it was only committed to 98.
> > There was interest in 94 port (HBASE-10076), but it never happened...
> >
> >
> > On Thu, Jan 2, 2014 at 1:32 PM, Jerry Lam <[EMAIL PROTECTED]> wrote:
> >
> >> Hello Vladimir,
> >>
> >> In my use case, I guarantee that a major compaction is executed before
> >> any scan happens, because the system we are building is read-only. There
> >> will be no deleted cells. Additionally, I only need to read from a single
> >> column family and therefore I don't need to access multiple HFiles.
> >>
> >> Filter conditions are nice to have because if I can read HFile 8x faster
> >> than using HBaseClient, I can do the filter on the client side and still
> >> perform faster than using HBaseClient.
> >>
> >> Thank you for your input!
> >>
> >> Jerry
> >>
> >>
> >>
> >> On Thu, Jan 2, 2014 at 1:30 PM, Vladimir Rodionov
> >> <[EMAIL PROTECTED]> wrote:
> >>
> >> > HBase scanner MUST guarantee correct order of KeyValues (coming from
> >> > different HFiles), apply the filter condition plus the conditions on
> >> > included column families and qualifiers, time range and max versions,
> >> > and correctly process deleted cells.
> >> > Direct HFileReader does none of the above.
> >> >
> >> > Best regards,
> >> > Vladimir Rodionov
> >> > Principal Platform Engineer
> >> > Carrier IQ, www.carrieriq.com
> >> > e-mail: [EMAIL PROTECTED]
> >> >
> >> > ________________________________________
> >> > From: Jerry Lam [[EMAIL PROTECTED]]
> >> > Sent: Thursday, January 02, 2014 7:56 AM
> >> > To: user
> >> > Subject: Re: Performance between HBaseClient scan and HFileReaderV2
> >> >
> >> > Hi Tom,
> >> >
> >> > Good point. Note that I also ran the HBaseClient performance test
> >> > several times (as you can see from the chart). The caching should
> >> > also benefit the second run of the HBaseClient performance test, not
> >> > just the HFileReaderV2 test.
> >> >
> >> > I still don't understand what makes the HBaseClient perform so poorly
> >> > in comparison to accessing HDFS directly. I can understand maybe a
> >> > factor of 2 (even that is too much), but a factor of 8 is quite
> >> > unreasonable.