Hive, mail # dev - Review Request: Improve RCFile::sync(long) by 10x


Gopal V 2013-04-26, 11:25

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10795/
-----------------------------------------------------------

Review request for hive, Ashutosh Chauhan and Gunther Hagleitner.


Description
-------

Speed up RCFile::sync() by reading large blocks of data from HDFS rather than using readByte() on the input stream.

This tightens the scan loop and reduces the number of calls to the synchronized read() methods within HDFS, resulting in a 10x performance boost for this function.

In wall-clock terms, a call that took up to a second now completes in under 100 ms, because data is read in 512-byte chunks instead of one byte at a time.
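
As an illustration only (this is not the code in the patch), the chunked approach can be sketched roughly as follows. The class name ChunkedSyncScan, the method findMarker, and the generic marker/chunkSize parameters are hypothetical; the sketch uses a plain java.io.InputStream rather than the HDFS stream, and assumes chunkSize is positive and the marker is non-empty. It issues one bulk read() per chunk and carries the last marker.length - 1 bytes forward so that a marker straddling a chunk boundary is still found.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: names and structure are illustrative, not taken from RCFile.java.
class ChunkedSyncScan {

    /**
     * Returns the stream offset of the first occurrence of marker, or -1 if
     * the stream ends first. Reads up to chunkSize bytes per read() call
     * instead of calling a single-byte read for every byte.
     */
    static long findMarker(InputStream in, byte[] marker, int chunkSize) throws IOException {
        // Room for a carried-over tail (marker.length - 1 bytes) plus one chunk.
        byte[] buf = new byte[marker.length - 1 + chunkSize];
        int carried = 0;   // bytes kept from the previous iteration
        long bufStart = 0; // stream offset corresponding to buf[0]

        while (true) {
            int n = in.read(buf, carried, chunkSize); // one bulk read per chunk
            if (n < 0) {
                return -1; // end of stream, marker not found
            }
            int len = carried + n;

            // Check every start position where the whole marker fits in the buffer.
            for (int i = 0; i + marker.length <= len; i++) {
                int j = 0;
                while (j < marker.length && buf[i + j] == marker[j]) {
                    j++;
                }
                if (j == marker.length) {
                    return bufStart + i;
                }
            }

            // Keep the unchecked tail so a marker split across chunks is not missed.
            int keep = Math.min(marker.length - 1, len);
            System.arraycopy(buf, len - keep, buf, 0, keep);
            bufStart += len - keep;
            carried = keep;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] marker = {1, 2, 3, 4};
        byte[] data = new byte[2000];
        // Place the marker across the first 512-byte chunk boundary.
        System.arraycopy(marker, 0, data, 510, marker.length);
        long pos = findMarker(new ByteArrayInputStream(data), marker, 512);
        System.out.println("marker found at offset " + pos);
    }
}
```

The saving comes from the number of read() calls, not from the comparison loop: with a 512-byte chunk, the stream's read method is entered once per 512 bytes scanned rather than once per byte.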

This addresses bug HIVE-4423.
    https://issues.apache.org/jira/browse/HIVE-4423


Diffs
-----

  ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java d3d98d0

Diff: https://reviews.apache.org/r/10795/diff/


Testing
-------

ant test -Dtestcase=TestRCFile -Dmodule=ql
ant test -Dtestcase=TestCliDriver -Dqfile_regex=.*rcfile.* -Dmodule=ql

Also benchmarked with count(1) on the store_sales RCFile table at scale=10:

before: 43.8, after: 39.5


Thanks,

Gopal V
