HBase >> mail # user >> Performance between HBaseClient scan and HFileReaderV2


Jerry Lam 2013-12-23, 20:18
Tom Hood 2013-12-30, 02:09
Jerry Lam 2014-01-02, 15:56
Vladimir Rodionov 2014-01-02, 18:30
Jean-Marc Spaggiari 2014-01-02, 18:35
Jerry Lam 2014-01-02, 21:32
Sergey Shelukhin 2014-01-02, 21:42
Sergey Shelukhin 2014-01-02, 21:43
Enis Söztutar 2014-01-02, 22:02
Re: Performance between HBaseClient scan and HFileReaderV2
Hello Sergey and Enis,

Thank you for the pointer! HBASE-8691 will definitely help. HBASE-10076
(a very interesting/exciting feature, by the way!) is what I need. How can I
port it to 0.92.x, if that is at all possible?

I understand that my test is not realistic. However, since I have only 1
region with 1 HFile (this is by design), there should not be any
merge-sorted read going on.

One thing I'm not sure about: since I use Snappy compression, is the value
of each KeyValue decompressed at the region server? If so, I think it is
quite inefficient, because the decompression could be done on the client
side. Saving bandwidth saves a lot of time for the type of workload I'm
working on.

Best Regards,

Jerry
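As a side note on HBASE-8691: the current scanner RPC fetches a fixed batch of rows per round trip rather than streaming. A minimal sketch of that model in plain Java, with hypothetical names (this is not the HBase API, just an illustration of the batching behavior):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (hypothetical class, not HBase API): a scanner that fetches rows
// in fixed-size batches, one batch per round trip, the way the HBase
// scanner RPC behaves with Scan.setCaching(n).
class BatchingScanner {
    private final List<String> serverRows; // stand-in for the region's rows
    private final int caching;             // rows returned per "RPC"
    private int cursor = 0;
    private int rpcCount = 0;

    BatchingScanner(List<String> serverRows, int caching) {
        this.serverRows = serverRows;
        this.caching = caching;
    }

    // One round trip: returns up to `caching` rows, empty list when done.
    List<String> nextBatch() {
        if (cursor >= serverRows.size()) return new ArrayList<>();
        rpcCount++;
        int end = Math.min(cursor + caching, serverRows.size());
        List<String> batch = new ArrayList<>(serverRows.subList(cursor, end));
        cursor = end;
        return batch;
    }

    int rpcCount() { return rpcCount; }
}
```

With 25 rows and a caching of 10 this costs 3 round trips, and between round trips the disk and network pipeline sits idle; that stall is the overhead HBASE-8691 aims to reduce.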

On Thu, Jan 2, 2014 at 5:02 PM, Enis Söztutar <[EMAIL PROTECTED]> wrote:

> Nice test!
>
> There are a couple of things here:
>
>  (1) HFileReader reads only one file, versus, an HRegion reads multiple
> files (into the KeyValueHeap) to do a merge scan. So, although there is
> only one file, there is some overhead of doing a merge-sorted read from
> multiple files in the region. For a more realistic test, you can try to do
> the reads using HRegion directly (instead of HFileReader). The overhead is
> not that much though in my tests.
>  (2) For scanning with the client API, the results have to be serialized,
> sent over the network (or loopback for local), and deserialized. This is
> another overhead that is not there in HFileReader.
>  (3) The HBase scanner RPC implementation is NOT streaming. The RPC fetches
> a batch of records at a time (batch size 10000), and cannot fully saturate
> the disk and network pipeline.
>
> In my tests for "MapReduce over snapshot files (HBASE-8369)", I have
> measured a 5x difference, because of (2) and (3). Please see my slides
> at http://www.slideshare.net/enissoz/mapreduce-over-snapshots
>
> I think we can do a much better job at (3), see HBASE-8691. However, there
> will always be "some" overhead, although it should not be 5-8x.
>
> As suggested above, in the meantime, you can take a look at the patch for
> HBASE-8369, and https://issues.apache.org/jira/browse/HBASE-10076 to see
> whether it suits your use case.
>
> Enis
>
>
> > On Thu, Jan 2, 2014 at 1:43 PM, Sergey Shelukhin <[EMAIL PROTECTED]> wrote:
>
> > Er, using MR over snapshots, which reads files directly...
> > https://issues.apache.org/jira/browse/HBASE-8369
> > However, it was only committed to 98.
> > There was interest in 94 port (HBASE-10076), but it never happened...
> >
> >
> > On Thu, Jan 2, 2014 at 1:42 PM, Sergey Shelukhin <[EMAIL PROTECTED]> wrote:
> >
> > > You might be interested in using
> > > https://issues.apache.org/jira/browse/HBASE-8369
> > > However, it was only committed to 98.
> > > There was interest in 94 port (HBASE-10076), but it never happened...
> > >
> > >
> > > On Thu, Jan 2, 2014 at 1:32 PM, Jerry Lam <[EMAIL PROTECTED]> wrote:
> > >
> > >> Hello Vladimir,
> > >>
> > >> In my use case, I guarantee that a major compaction is executed before
> > >> any scan happens, because the system we build is a read-only system.
> > >> There will be no deleted cells. Additionally, I only need to read from
> > >> a single column family and therefore I don't need to access multiple
> > >> HFiles.
> > >>
> > >> Filter conditions are nice to have, because if I can read an HFile 8x
> > >> faster than using HBaseClient, I can do the filtering on the client
> > >> side and still perform faster than using HBaseClient.
> > >>
> > >> Thank you for your input!
> > >>
> > >> Jerry
> > >>
> > >>
> > >>
> > >> On Thu, Jan 2, 2014 at 1:30 PM, Vladimir Rodionov
> > >> <[EMAIL PROTECTED]> wrote:
> > >>
> > >> > An HBase scanner MUST guarantee the correct order of KeyValues
> > >> > (coming from different HFiles), apply filter conditions on the
> > >> > included column families and qualifiers, honor the time range and
> > >> > max versions, and correctly process deleted cells.
> > >> > A direct HFileReader does none of the above.
Ted Yu 2014-01-02, 23:35
lars hofhansl 2014-01-02, 21:45
lars hofhansl 2014-01-02, 21:44
Jerry Lam 2014-01-02, 23:53
Stack 2014-01-02, 16:23
Jerry Lam 2014-01-02, 17:18
Andrew Purtell 2014-01-02, 17:47
lars hofhansl 2014-01-02, 18:54