Re: HBase Writes With Large Number of Columns
Correct me if I'm wrong, but I think the drop is expected, according to the
following math:

If you have a Put for a specific row key, and that row key weighs 100 bytes,
then with 20 columns you should add the following size to the combined size
of the columns:
20 x (100 bytes) = 2000 bytes
So the size of the Put sent to HBase should be roughly:
1500 bytes (sum of all column qualifier sizes) + 20 x 100 bytes (size of the row key, repeated per column).
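
As a rough sketch of that arithmetic (the numbers are the ones above, the variable names are just illustrative):

int rowKeyLen = 100;        // bytes per row key
int numColumns = 20;        // columns in the Put
int columnsBytes = 1500;    // combined size of the 20 column qualifiers/values

// each column carries its own copy of the row key
long approxPutSize = columnsBytes + (long) numColumns * rowKeyLen;   // 1500 + 2000 = 3500 bytes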

I add this 20 x 100 because, for each column qualifier, the Put object adds
another KeyValue member object, which duplicates the row key.
See here (taken from Put.java, v0.94.3 I think):

  public Put add(byte [] family, byte [] qualifier, long ts, byte [] value) {
    List<KeyValue> list = getKeyValueList(family);
    KeyValue kv = createPutKeyValue(family, qualifier, ts, value);
    list.add(kv);
    familyMap.put(kv.getFamily(), list);
    return this;
  }
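
To see that duplication concretely, here is a minimal sketch (0.94-style API; the row key, family and qualifier names are made up) that builds a Put with several columns and prints the row key carried by each KeyValue:

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutRowKeyDuplication {
    public static void main(String[] args) {
        byte[] rowKey = Bytes.toBytes("some-row-key");   // illustrative
        Put put = new Put(rowKey);
        for (int i = 0; i < 20; i++) {
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("q" + i), Bytes.toBytes("v" + i));
        }
        // every KeyValue in the Put carries its own copy of the row key
        for (KeyValue kv : put.getFamilyMap().get(Bytes.toBytes("cf"))) {
            System.out.println(Bytes.toString(kv.getRow()) + " -> " + Bytes.toString(kv.getQualifier()));
        }
    }
}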

Each KeyValue also adds more information which should be taken into account
per column qualifier (a rough per-KeyValue estimate is sketched below this list):
* KeyValue overhead - I think 2 longs
* Column Family length
* Timestamp - 1 long
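
As a rough per-KeyValue estimate based on those points (the *Len variables are illustrative, and the 2-long overhead is my guess from the list above, not an exact figure):

int rowKeyLen = 100, familyLen = 2, qualifierLen = 10, valueLen = 65;   // illustrative sizes
long approxKeyValueSize =
        rowKeyLen                     // row key, duplicated in every KeyValue
      + familyLen                     // column family name
      + qualifierLen                  // column qualifier
      + valueLen                      // cell value
      + Bytes.SIZEOF_LONG             // timestamp
      + 2 * Bytes.SIZEOF_LONG;        // assumed KeyValue overhead ("I think 2 longs")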

I wrote a class to calculate a rough size of the List<Put> sent to HBase, so
I can calculate the throughput:

import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseUtils {

    /** Sums the rough size of every Row (Put or Increment) in the batch. */
    public static long getSize(List<? extends Row> actions) {
        long size = 0;
        for (Row row : actions) {
            size += getSize(row);
        }
        return size;
    }

    public static long getSize(Row row) {
        if (row instanceof Increment) {
            return calcSizeIncrement((Increment) row);
        } else if (row instanceof Put) {
            return calcSizePut((Put) row);
        } else {
            throw new IllegalArgumentException("Can't calculate size for Row type " + row.getClass());
        }
    }

    /** Row key + family names + the full length of every KeyValue in the Put. */
    private static long calcSizePut(Put put) {
        long size = 0;
        size += put.getRow().length;

        Map<byte[], List<KeyValue>> familyMap = put.getFamilyMap();
        for (byte[] family : familyMap.keySet()) {
            size += family.length;
            List<KeyValue> kvs = familyMap.get(family);
            for (KeyValue kv : kvs) {
                size += kv.getLength();
            }
        }
        return size;
    }

    /** Row key + family names + qualifier names + one long per incremented column. */
    private static long calcSizeIncrement(Increment row) {
        long size = 0;
        size += row.getRow().length;

        Map<byte[], NavigableMap<byte[], Long>> familyMap = row.getFamilyMap();
        for (byte[] family : familyMap.keySet()) {
            size += family.length;
            NavigableMap<byte[], Long> qualifiersMap = familyMap.get(family);
            for (byte[] qualifier : qualifiersMap.keySet()) {
                size += qualifier.length;
                size += Bytes.SIZEOF_LONG;
            }
        }
        return size;
    }
}

Feel free to use it.
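
For example, a minimal usage sketch (the class and method names are just illustrative, and 'table' is assumed to be an already-open HTable for your table):

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

public class ThroughputExample {
    // writes the batch and reports a rough MB/s figure based on HBaseUtils.getSize()
    public static void writeAndReport(HTable table, List<Put> batch) throws IOException {
        long bytes = HBaseUtils.getSize(batch);
        long start = System.currentTimeMillis();
        table.put(batch);
        long millis = Math.max(System.currentTimeMillis() - start, 1);
        double mbPerSec = (bytes / (1024.0 * 1024.0)) / (millis / 1000.0);
        System.out.println("Wrote " + bytes + " bytes in " + millis + " ms, ~" + mbPerSec + " MB/s");
    }
}
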
On Tue, Mar 26, 2013 at 1:49 AM, Jean-Marc Spaggiari <
[EMAIL PROTECTED]> wrote:

> For a total of 1.5kb with 4 columns = 384 bytes/column
> bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -write 4:384:100
> -num_keys 1000000
> 13/03/25 14:54:45 INFO util.MultiThreadedAction: [W:100] Keys=991664,
> cols=3,8m, time=00:03:55 Overall: [keys/s= 4218, latency=23 ms]
> Current: [keys/s=4097, latency=24 ms], insertedUpTo=-1
>
> For a total of 1.5kb with 100 columns = 15 bytes/column
> bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -write 100:15:100
> -num_keys 1000000
> 13/03/25 16:27:44 INFO util.MultiThreadedAction: [W:100] Keys=999721,
> cols=95,3m, time=01:27:46 Overall: [keys/s= 189, latency=525 ms]
> Current: [keys/s=162, latency=616 ms], insertedUpTo=-1
>
> So overall, the speed is the same. A bit faster with 100 columns than
> with 4. I don't think there is any negative impact on HBase side
> because of all those columns. Might be interesting to test the same
> thing over Thrift...
>
> JM
>
> 2013/3/25 Pankaj Misra <[EMAIL PROTECTED]>:
> > Yes Ted, we have been observing Thrift API to clearly outperform Java
> native Hbase API, due to binary communication protocol, at higher loads.