HBase, mail # user - HBase Writes With Large Number of Columns


Re: HBase Writes With Large Number of Columns
Ted Yu 2013-03-27, 22:06
For 0.95 and beyond, HBaseClient lets you specify codec and compressor classes
that encode / compress the CellBlock.
See the following in HBaseClient#Connection:

      builder.setCellBlockCodecClass(this.codec.getClass().getCanonicalName());

      if (this.compressor != null) {
        builder.setCellBlockCompressorClass(this.compressor.getClass().getCanonicalName());
      }
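A minimal sketch of how a client could turn this on through configuration.
The keys and class names below are my reading of the 0.95/0.96-era client
(hbase.client.rpc.codec / hbase.client.rpc.compressor), so verify them against
the HBaseClient source for your exact version:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    Configuration conf = HBaseConfiguration.create();
    // Codec that encodes the KeyValues in the CellBlock (assumed key / class):
    conf.set("hbase.client.rpc.codec",
        "org.apache.hadoop.hbase.codec.KeyValueCodec");
    // Optional Hadoop CompressionCodec applied on top of the codec (assumed key):
    conf.set("hbase.client.rpc.compressor",
        "org.apache.hadoop.io.compress.GzipCodec");
    HTable table = new HTable(conf, "t1");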
Cheers

On Wed, Mar 27, 2013 at 2:52 PM, Asaf Mesika <[EMAIL PROTECTED]> wrote:

> Correct me if I'm wrong, but I think the drop is expected, according to the
> following math:
>
> If you have a Put for a specific rowkey, and that rowkey weighs 100 bytes,
> then with 20 columns you should add the following size on top of the
> combined size of the columns:
> 20 x (100 bytes) = 2000 bytes
> So the size of the Put sent to HBase should be:
> 1500 bytes (sum of all column qualifier sizes) + 20x100 (rowkey repeated per
> column) = 3500 bytes, rather than the ~1600 bytes you might expect.
>
> I add this 20x100 since, for each column qualifier, the Put object adds
> another KeyValue member object, which duplicates the RowKey.
> See here (taken from Put.java, v0.94.3 I think):
>
>   public Put add(byte [] family, byte [] qualifier, long ts, byte [] value) {
>     List<KeyValue> list = getKeyValueList(family);
>     KeyValue kv = createPutKeyValue(family, qualifier, ts, value);
>     list.add(kv);
>     familyMap.put(kv.getFamily(), list);
>     return this;
>   }
>
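> To see the duplication concretely, here is a quick sketch against the
> 0.94-era client API (my own toy example, not benchmark code):
>
>     byte[] rowkey = new byte[100];        // the 100-byte rowkey from above
>     byte[] family = Bytes.toBytes("f");
>     Put put = new Put(rowkey);
>     for (int i = 0; i < 20; i++) {
>         put.add(family, Bytes.toBytes("q" + i), Bytes.toBytes("v"));
>     }
>     long kvBytes = 0;
>     for (KeyValue kv : put.getFamilyMap().get(family)) {
>         kvBytes += kv.getLength();        // each KeyValue embeds its own rowkey copy
>     }
>     System.out.println(kvBytes);          // ~20 x (100B rowkey + overhead) + payload
>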
> Each KeyValue also adds more information which should be taken into
> account per column qualifier:
> * KeyValue overhead - I think 2 longs
> * Column Family length
> * Timestamp - 1 long
>
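> For reference, my understanding of the classic KeyValue serialized layout
> (worth double-checking against KeyValue.java) puts the per-cell size at
> roughly:
>
>     // 4B key length + 4B value length + 2B row length + row
>     // + 1B family length + family + qualifier + 8B timestamp + 1B type + value
>     static long keyValueLength(int rowLen, int familyLen, int qualLen, int valLen) {
>         return 4 + 4 + 2 + rowLen + 1 + familyLen + qualLen + 8 + 1 + valLen;
>     }
>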
> I wrote a class to calculate a rough size of the List<Put> sent to HBase,
> so I can calculate the throughput:
>
> import java.util.List;
> import java.util.Map;
> import java.util.NavigableMap;
>
> import org.apache.hadoop.hbase.KeyValue;
> import org.apache.hadoop.hbase.client.Increment;
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.client.Row;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class HBaseUtils {
>
>     /** Rough combined size of a batch of actions. */
>     public static long getSize(List<? extends Row> actions) {
>         long size = 0;
>         for (Row row : actions) {
>             size += getSize(row);
>         }
>         return size;
>     }
>
>     public static long getSize(Row row) {
>         if (row instanceof Increment) {
>             return calcSizeIncrement((Increment) row);
>         } else if (row instanceof Put) {
>             return calcSizePut((Put) row);
>         } else {
>             throw new IllegalArgumentException(
>                     "Can't calculate size for Row type " + row.getClass());
>         }
>     }
>
>     private static long calcSizePut(Put put) {
>         long size = 0;
>         size += put.getRow().length;
>
>         // Each KeyValue's length already includes its own copy of the rowkey.
>         Map<byte[], List<KeyValue>> familyMap = put.getFamilyMap();
>         for (byte[] family : familyMap.keySet()) {
>             size += family.length;
>             List<KeyValue> kvs = familyMap.get(family);
>             for (KeyValue kv : kvs) {
>                 size += kv.getLength();
>             }
>         }
>         return size;
>     }
>
>     private static long calcSizeIncrement(Increment row) {
>         long size = 0;
>         size += row.getRow().length;
>
>         Map<byte[], NavigableMap<byte[], Long>> familyMap = row.getFamilyMap();
>         for (byte[] family : familyMap.keySet()) {
>             size += family.length;
>             NavigableMap<byte[], Long> qualifiersMap = familyMap.get(family);
>             for (byte[] qualifier : qualifiersMap.keySet()) {
>                 size += qualifier.length;
>                 size += Bytes.SIZEOF_LONG;
>             }
>         }
>         return size;
>     }
> }
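>
> A quick usage sketch (illustrative only; buildBatch() and table are
> placeholders for your own code):
>
>     List<Put> puts = buildBatch();
>     long bytes = HBaseUtils.getSize(puts);
>     long start = System.currentTimeMillis();
>     table.put(puts);
>     double seconds = (System.currentTimeMillis() - start) / 1000.0;
>     System.out.println((bytes / 1024.0) / seconds + " KB/s");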
>
> Feel free to use it.
>
>
>
>
> On Tue, Mar 26, 2013 at 1:49 AM, Jean-Marc Spaggiari <
> [EMAIL PROTECTED]> wrote:
>
> > For a total of 1.5kb with 4 columns = 384 bytes/column
> > bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -write 4:384:100
> > -num_keys 1000000
> > 13/03/25 14:54:45 INFO util.MultiThreadedAction: [W:100] Keys=991664,
> > cols=3,8m, time=00:03:55 Overall: [keys/s= 4218, latency=23 ms]
> > Current: [keys/s=4097, latency=24 ms], insertedUpTo=-1
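> > (Back-of-the-envelope: at ~4218 keys/s and ~1.5 KB per key, that run
> > moves roughly 4218 x 1.5 KB ≈ 6.2 MB/s of logical payload, before the
> > per-KeyValue overhead discussed above.)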
> >
> > For a total of 1.5kb with 100 columns = 15 bytes/column
> > bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -write 100:15:100