HBase, mail # user - HBase Writes With Large Number of Columns


Pankaj Misra 2013-03-25, 16:55
Ted Yu 2013-03-25, 16:59
Pankaj Misra 2013-03-25, 17:18
Ted Yu 2013-03-25, 17:45
Pankaj Misra 2013-03-25, 18:03
Ted Yu 2013-03-25, 18:24
Jean-Marc Spaggiari 2013-03-25, 18:27
Pankaj Misra 2013-03-25, 18:40
Ted Yu 2013-03-25, 19:39
Pankaj Misra 2013-03-25, 20:54
Jean-Marc Spaggiari 2013-03-25, 23:49
ramkrishna vasudevan 2013-03-26, 06:19
Re: HBase Writes With Large Number of Columns
Asaf Mesika 2013-03-27, 21:52
Correct me if I'm wrong, but I think the drop is expected, according to the
following math:

If you have a Put, for a specific rowkey, and that rowkey weighs 100 bytes,
then if you have 20 columns you should add the following size to the
combined size of the columns:
20 x (100 bytes) = 2000 bytes
So the size of the Put sent to HBase should roughly be:
1500 bytes (the sum of all column qualifier and value sizes) + 20 x 100 bytes (the row key, duplicated for each column).

I add this 20 x 100 because, for each column qualifier, the Put object adds
another KeyValue member object, which duplicates the RowKey.
See here (taken from Put.java, v0.94.3 I think):

  public Put add(byte [] family, byte [] qualifier, long ts, byte [] value) {
    List<KeyValue> list = getKeyValueList(family);
    KeyValue kv = createPutKeyValue(family, qualifier, ts, value);
    list.add(kv);
    familyMap.put(kv.getFamily(), list);
    return this;
  }

Each KeyValue also adds more information which should be taken into
account per column qualifier:
* KeyValue overhead - I think 2 longs
* Column Family length
* Timestamp - 1 long
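
To make the math concrete, here is a rough back-of-the-envelope sketch; the
family length, qualifier length and per-KeyValue overhead below are assumptions
for illustration, not exact HBase constants:

public class PutSizeEstimate {
    public static void main(String[] args) {
        int columns = 20;
        int rowKeyLen = 100;            // bytes, as in the example above
        int totalValueLen = 1500;       // combined size of all column data, in bytes
        int familyLen = 2;              // assumed short family name, e.g. "cf"
        int qualifierLen = 4;           // assumed average qualifier length
        int perKvOverhead = 2 * 8 + 8;  // assumed: ~2 longs of KeyValue overhead + 1 long timestamp

        long estimate = totalValueLen
                + (long) columns * (rowKeyLen + familyLen + qualifierLen + perKvOverhead);
        // 1500 + 20 * (100 + 2 + 4 + 24) = 4100 bytes, of which the duplicated
        // row key alone accounts for 20 x 100 = 2000 bytes.
        System.out.println("Estimated Put size: " + estimate + " bytes");
    }
}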

I wrote a class to calculate a rough size of the List<Put> sent
to HBase, so I can calculate the throughput:

import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseUtils {

    /** Rough total size, in bytes, of a batch of Puts/Increments. */
    public static long getSize(List<? extends Row> actions) {
        long size = 0;
        for (Row row : actions) {
            size += getSize(row);
        }
        return size;
    }

    public static long getSize(Row row) {
        if (row instanceof Increment) {
            return calcSizeIncrement((Increment) row);
        } else if (row instanceof Put) {
            return calcSizePut((Put) row);
        } else {
            throw new IllegalArgumentException(
                    "Can't calculate size for Row type " + row.getClass());
        }
    }

    private static long calcSizePut(Put put) {
        long size = 0;
        size += put.getRow().length;

        // Each KeyValue already includes row key, family, qualifier, timestamp and value.
        Map<byte[], List<KeyValue>> familyMap = put.getFamilyMap();
        for (byte[] family : familyMap.keySet()) {
            size += family.length;
            List<KeyValue> kvs = familyMap.get(family);
            for (KeyValue kv : kvs) {
                size += kv.getLength();
            }
        }
        return size;
    }

    private static long calcSizeIncrement(Increment row) {
        long size = 0;
        size += row.getRow().length;

        // Increments carry a qualifier -> amount map; count the qualifier bytes plus one long each.
        Map<byte[], NavigableMap<byte[], Long>> familyMap = row.getFamilyMap();
        for (byte[] family : familyMap.keySet()) {
            size += family.length;
            NavigableMap<byte[], Long> qualifiersMap = familyMap.get(family);
            for (byte[] qualifier : qualifiersMap.keySet()) {
                size += qualifier.length;
                size += Bytes.SIZEOF_LONG;
            }
        }
        return size;
    }
}

Feel free to use it.
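
A minimal usage sketch, assuming the 0.94-era client API
(Put.add(family, qualifier, value)); the row keys, family and qualifiers
below are made-up placeholders:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseUtilsExample {
    public static void main(String[] args) {
        List<Row> actions = new ArrayList<Row>();
        for (int i = 0; i < 10; i++) {
            // One Put per row, 20 columns each, mirroring the scenario discussed above.
            Put put = new Put(Bytes.toBytes("row-" + i));
            for (int c = 0; c < 20; c++) {
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("q" + c), Bytes.toBytes("value-" + c));
            }
            actions.add(put);
        }
        System.out.println("Approximate batch size: " + HBaseUtils.getSize(actions) + " bytes");
    }
}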
On Tue, Mar 26, 2013 at 1:49 AM, Jean-Marc Spaggiari <
[EMAIL PROTECTED]> wrote:

> For a total of 1.5kb with 4 columns = 384 bytes/column
> bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -write 4:384:100
> -num_keys 1000000
> 13/03/25 14:54:45 INFO util.MultiThreadedAction: [W:100] Keys=991664,
> cols=3,8m, time=00:03:55 Overall: [keys/s= 4218, latency=23 ms]
> Current: [keys/s=4097, latency=24 ms], insertedUpTo=-1
>
> For a total of 1.5kb with 100 columns = 15 bytes/column
> bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -write 100:15:100
> -num_keys 1000000
> 13/03/25 16:27:44 INFO util.MultiThreadedAction: [W:100] Keys=999721,
> cols=95,3m, time=01:27:46 Overall: [keys/s= 189, latency=525 ms]
> Current: [keys/s=162, latency=616 ms], insertedUpTo=-1
>
> So overall, the speed is the same. A bit faster with 100 columns than
> with 4. I don't think there is any negative impact on HBase side
> because of all those columns. Might be interesting to test the same
> thing over Thrift...
>
> JM
>
> 2013/3/25 Pankaj Misra <[EMAIL PROTECTED]>:
> > Yes Ted, we have been observing Thrift API to clearly outperform Java
> native Hbase API, due to binary communication protocol, at higher loads.
Ted Yu 2013-03-27, 22:06
Asaf Mesika 2013-03-27, 22:28
Ted Yu 2013-03-27, 22:33
Mohammad Tariq 2013-03-25, 19:30