Review Request 24876: VectorizedBatchUtil.addRowToBatchFrom is not optimized for Vectorized execution and takes 25% CPU

This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24876/

Review request for hive.
Bugs: HIVE-7664
    https://issues.apache.org/jira/browse/HIVE-7664
Repository: hive-git
Description

In a group-by-heavy vectorized reducer vertex, 25% of CPU time is spent in VectorizedBatchUtil.addRowToBatchFrom().

I looked at the code of VectorizedBatchUtil.addRowToBatchFrom, and it does not appear to be optimized for vectorized processing.

addRowToBatchFrom is called for every row, and for each row it calls getPrimitiveCategory on every column in the batch to determine that column's type. Column types are stored in a HashMap. For VectorGroupByOperator, column types do not change between batches, so they should not be looked up for every row.

I recommend storing the column type in StructObjectInspector so that other components can leverage this optimization.
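
A minimal sketch of the caching idea, not the actual patch: column kinds are resolved once per schema and then read from an array on the per-row path. ColumnKind, FieldInspector, CountingInspector and ColumnKindCache are hypothetical names made up for illustration; they stand in for Hive's primitive categories and object inspectors and are not Hive APIs.

```java
import java.util.ArrayList;
import java.util.List;

final class ColumnKindCacheSketch {

    // Hypothetical stand-in for Hive's primitive categories; not the real enum.
    enum ColumnKind { LONG, DOUBLE, STRING }

    // Hypothetical stand-in for a per-field object inspector.
    interface FieldInspector {
        ColumnKind kind();
    }

    // Helper that counts how often its type is queried, to make the saving visible.
    static final class CountingInspector implements FieldInspector {
        private final ColumnKind kind;
        int lookups = 0;

        CountingInspector(ColumnKind kind) { this.kind = kind; }

        @Override
        public ColumnKind kind() {
            lookups++;
            return kind;
        }
    }

    // Resolves every column's kind once, then serves it from an array.
    static final class ColumnKindCache {
        private final ColumnKind[] kinds;

        ColumnKindCache(List<? extends FieldInspector> fields) {
            kinds = new ColumnKind[fields.size()];
            for (int i = 0; i < kinds.length; i++) {
                kinds[i] = fields.get(i).kind(); // one lookup per column, at setup
            }
        }

        ColumnKind kindOf(int column) { return kinds[column]; }

        int columnCount() { return kinds.length; }
    }

    public static void main(String[] args) {
        List<CountingInspector> fields = new ArrayList<>();
        fields.add(new CountingInspector(ColumnKind.LONG));
        fields.add(new CountingInspector(ColumnKind.DOUBLE));

        ColumnKindCache cache = new ColumnKindCache(fields);

        // Simulate one batch of rows: the per-row path only reads the array.
        for (int row = 0; row < 1024; row++) {
            for (int col = 0; col < cache.columnCount(); col++) {
                cache.kindOf(col);
            }
        }
        System.out.println("inspector lookups for column 0: " + fields.get(0).lookups);
    }
}
```

The cache is only valid while the schema stays fixed, which is the assumption stated above for VectorGroupByOperator; a caller whose column types can change would need to rebuild it.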

Also, addRowToBatchFrom executes a case statement for every row and every column to do the type casting. I recommend encapsulating the type logic in templatized methods.
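
One way to read "templatized methods" in Java is a small assigner object per column, chosen once at setup, so the per-row path has no switch. The sketch below illustrates that reading under stated assumptions: ColumnAssigner, the three assigner classes, buildAssigners and addRow are hypothetical names, plain arrays stand in for Hive's column vectors, and null handling is left out.

```java
import java.nio.charset.StandardCharsets;

final class ColumnAssignerSketch {

    // Hypothetical stand-in for Hive's primitive categories; not the real enum.
    enum ColumnKind { LONG, DOUBLE, STRING }

    // Writes one field value into slot `row` of a column's backing storage.
    interface ColumnAssigner {
        void assign(int row, Object value);
    }

    static final class LongAssigner implements ColumnAssigner {
        final long[] vector;
        LongAssigner(int size) { vector = new long[size]; }
        @Override
        public void assign(int row, Object value) {
            vector[row] = ((Number) value).longValue();
        }
    }

    static final class DoubleAssigner implements ColumnAssigner {
        final double[] vector;
        DoubleAssigner(int size) { vector = new double[size]; }
        @Override
        public void assign(int row, Object value) {
            vector[row] = ((Number) value).doubleValue();
        }
    }

    static final class StringAssigner implements ColumnAssigner {
        final byte[][] vector;
        StringAssigner(int size) { vector = new byte[size][]; }
        @Override
        public void assign(int row, Object value) {
            vector[row] = value.toString().getBytes(StandardCharsets.UTF_8);
        }
    }

    // The only switch on the column kind; it runs once per column at setup.
    static ColumnAssigner[] buildAssigners(ColumnKind[] kinds, int batchSize) {
        ColumnAssigner[] assigners = new ColumnAssigner[kinds.length];
        for (int i = 0; i < kinds.length; i++) {
            switch (kinds[i]) {
                case LONG:
                    assigners[i] = new LongAssigner(batchSize);
                    break;
                case DOUBLE:
                    assigners[i] = new DoubleAssigner(batchSize);
                    break;
                case STRING:
                    assigners[i] = new StringAssigner(batchSize);
                    break;
                default:
                    throw new IllegalArgumentException("unsupported kind: " + kinds[i]);
            }
        }
        return assigners;
    }

    // Per-row path: no type inspection, one interface call per column.
    // Assumes non-null field values of the expected Java types.
    static void addRow(ColumnAssigner[] assigners, int row, Object[] fields) {
        for (int col = 0; col < assigners.length; col++) {
            assigners[col].assign(row, fields[col]);
        }
    }

    public static void main(String[] args) {
        ColumnKind[] kinds = { ColumnKind.LONG, ColumnKind.DOUBLE, ColumnKind.STRING };
        ColumnAssigner[] assigners = buildAssigners(kinds, 4);

        addRow(assigners, 0, new Object[] { 42L, 9.5, "abc" });
        addRow(assigners, 1, new Object[] { 7, 1.25, "de" });

        System.out.println(((LongAssigner) assigners[0]).vector[0]);
        System.out.println(((DoubleAssigner) assigners[1]).vector[1]);
        System.out.println(((StringAssigner) assigners[2]).vector[0].length);
    }
}
```

Whether the interface call is cheaper than the original switch depends on how the JIT handles the call site, so this would need to be confirmed with the same profile as below.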

{code}
Stack Trace                                                 Sample Count  Percentage(%)
VectorizedBatchUtil.addRowToBatchFrom                                 86         26.543
   AbstractPrimitiveObjectInspector.getPrimitiveCategory()            34         10.494
   LazyBinaryStructObjectInspector.getStructFieldData                 25          7.716
   StandardStructObjectInspector.getStructFieldData                    4          1.235
{code}

The query used :
{code}
select
    ss_sold_date_sk
from
    store_sales
where
    ss_sold_date between '1998-01-01' and '1998-06-01'
group by ss_item_sk , ss_customer_sk , ss_sold_date_sk
having sum(ss_list_price) > 50000000000000;
{code}
Diffs

  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/ReduceRecordProcessor.java 2acd842
  ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizedBatchUtil.java 16454e7

Diff: https://reviews.apache.org/r/24876/diff/
Testing
Thanks,

Navis Ryu