RE: Scan vs Put vs Get
Hi

How many Gets are you batching together in one call? Is this equal to the Scan#setCaching() value you are using? If both are the same, you can be sure the number of network calls comes out almost the same.
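
For comparison, a minimal sketch of sizing the two the same way (the table name and batch size are placeholders; the 0.94-era client API is assumed):

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "myTable");   // placeholder table name

    // Scan side: rows fetched per RPC.
    Scan scan = new Scan();
    scan.setCaching(1000);

    // Get side: batch the same number of Gets so the RPC counts are comparable.
    java.util.Random rand = new java.util.Random();
    List<Row> gets = new ArrayList<Row>();
    for (int i = 0; i < 1000; i++) {
        byte[] key = new byte[24];
        rand.nextBytes(key);                      // random key, as in the test
        gets.add(new Get(key));
    }
    Object[] results = new Object[gets.size()];
    table.batch(gets, results);                   // one call, grouped per region server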

Also, you are using random keys in your Gets, while a scan is always sequential. It seems your get scenario produces very random reads, resulting in too many HFile block reads from HDFS. [Is block caching enabled?]
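
A quick way to check is to read it off the table descriptor, something like this sketch ("myTable" is a placeholder):

    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes("myTable"));
    for (HColumnDescriptor family : desc.getFamilies()) {
        // BLOCKCACHE defaults to true; false here would explain repeated HDFS reads
        System.out.println(family.getNameAsString()
                + " BLOCKCACHE=" + family.isBlockCacheEnabled());
    }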

Also, have you tried using Bloom filters? ROW blooms might improve your get performance.
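
A sketch of enabling ROW blooms on an existing family (placeholder names; blooms only apply to newly written HFiles, so a major compaction is needed before existing data benefits):

    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes("myTable"));
    HColumnDescriptor family = desc.getFamily(Bytes.toBytes("myFamily"));
    family.setBloomFilterType(StoreFile.BloomType.ROW);  // bloom on row keys only

    admin.disableTable("myTable");
    admin.modifyColumn("myTable", family);
    admin.enableTable("myTable");
    admin.majorCompact("myTable");  // rewrite HFiles so the blooms take effect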

-Anoop-
________________________________________
From: Jean-Marc Spaggiari [[EMAIL PROTECTED]]
Sent: Thursday, June 28, 2012 5:04 AM
To: user
Subject: Scan vs Put vs Get

Hi,

I have a small piece of code, for testing, which puts 1M lines into
an existing table, gets 3000 lines and scans 10000.

The table has one family and one column.

Everything is done randomly. Puts use a random key (24 bytes), fixed
family and column names, and random content (24 bytes).
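
For illustration, the put side looks roughly like this sketch ("f" and "c" stand in for the fixed family and column names; the batch size is arbitrary):

    java.util.Random rand = new java.util.Random();
    List<Put> puts = new ArrayList<Put>();
    for (int i = 0; i < 1000; i++) {
        byte[] key = new byte[24];
        rand.nextBytes(key);                  // random row key
        byte[] value = new byte[24];
        rand.nextBytes(value);                // random content
        Put p = new Put(key);
        p.add(Bytes.toBytes("f"), Bytes.toBytes("c"), value);  // fixed family/column
        puts.add(p);
    }
    table.put(puts);  // "table" is the HTable opened on the test table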

Gets (batched) use random keys, and the scan uses a RandomRowFilter.
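
And the scan side, roughly (a sketch; the 1% chance is an assumption, not the exact value used):

    Scan scan = new Scan();
    scan.setCaching(1000);                       // rows fetched per RPC
    scan.setFilter(new RandomRowFilter(0.01f));  // keep ~1% of rows at random
    ResultScanner scanner = table.getScanner(scan);
    int read = 0;
    for (Result r : scanner) {
        if (++read >= 10000) break;              // stop after 10000 lines
    }
    scanner.close();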

And here are the results.
Time to insert 1000000 lines: 43 seconds (23255 lines/seconds)
That's adequate for my needs given the poor performance of the
servers in the cluster. I'm fine with these results.

Time to read 3000 lines: 11444.0 mseconds (262 lines/seconds)
This is way too low, and I don't understand why. So I tried the random
scan because I couldn't figure out the issue.

Time to read 10000 lines: 108.0 mseconds (92593 lines/seconds)
This is impressive! I added that after I failed with the gets. I
went from 262 lines per second to almost 100K lines/second!!! It's
awesome!

However, I'm still wondering what's wrong with my gets.

The code is very simple. I'm using Get objects that I execute in a
batch. I tried adding a filter, but it's not helping. Here is an
extract of the code.

    List<Get> gets = new ArrayList<Get>();
    for (long l = 0; l < linesToRead; l++) {
        byte[] array1 = new byte[24];
        for (int i = 0; i < array1.length; i++) {
            array1[i] = (byte) Math.floor(Math.random() * 256);  // random 24-byte key
        }
        gets.add(new Get(array1));
    }
    Object[] results = new Object[gets.size()];
    System.out.println(new java.util.Date() + " \"gets\" created.");

    long timeBefore = System.currentTimeMillis();
    table.batch(gets, results);  // executes all Gets in one client call
    long timeAfter = System.currentTimeMillis();

    float duration = timeAfter - timeBefore;
    System.out.println("Time to read " + gets.size() + " lines: " + duration
            + " mseconds (" + Math.round(linesToRead / (duration / 1000))
            + " lines/seconds)");

What's wrong with it? I can't use setBatch or setCaching here because
it's not a scan. I tried different numbers of gets, but it's almost
always the same speed. Am I using it the wrong way? Does anyone have
any advice to improve this?

Thanks,

JM