Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive >> mail # dev >> Review Request: Change ORC tree readers to return batches of rows instead of a row


+
Sarvesh Sakalanaga 2013-04-22, 22:26
Copy link to this message
-
Re: Review Request: Change ORC tree readers to return batches of rows instead of a row

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10712/#review19568
-----------------------------------------------------------
I didn't get a chance to go through the whole thing -- here are just some initial comments.
ql/src/java/org/apache/hadoop/hive/ql/io/orc/DynamicByteArray.java
<https://reviews.apache.org/r/10712/#comment40418>

    should be totalLength not totalLenght

ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReader.java
<https://reviews.apache.org/r/10712/#comment40420>

    the previous argument and the return value are always of class VectorizedRowBatch so why not declare them to be VectorizedRowBatch instead of Object?

ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40461>

    put parens around expression on right side of =, for clarity

ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40462>

    if noNulls is false, that means there are nulls, so you can't set noNulls back to true, can you?

ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40423>

    You should never have to create a new column vector here because there should always be one passed down into this code as part of the batch.
    
    Consider assuming there will always be a vector. The code will just fail if there isn't one.

ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40463>

    It appears you are replacing one isNull array (the one passed in) with another. We should be writing null information into the same Boolean array over and over. Am I missing something?

ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40424>

    this code for IntTreeReader appears identical to the ShortTreeReader case. Can the code be shared, e.g. by inheriting from super-class?
    
    Since this is a vector, consider calling it nextVector

ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40425>

    same comment -- can the code be shared for all int types?

ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40474>

    for float/double, vector entries are supposed to be set to NaN for null values.

ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40478>

    I think this could incorrectly decide something isRepeating if there are nulls, the way the code is now. But using NaN for nulls should prevent this.
    
    I think this isRepeating check is probably worth it for performance but it might not really pay off. It might be worth a performance test sometime in the future to determine this.

ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40486>

    I'd said that NULL string entries should be set to the empty string in the vector. I'm not sure if any code depends on that though. So this is probably okay to not set null entries.

ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java
<https://reviews.apache.org/r/10712/#comment40427>

    Please add comment to say what is the purpose of the file created

ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java
<https://reviews.apache.org/r/10712/#comment40489>

    Should create the batch up here, not use a null batch.
    
    Or at least have a second test that creates the batch outside and passes it down.

ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java
<https://reviews.apache.org/r/10712/#comment40490>

    style guidelines say put a blank line in front of all comments

ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java
<https://reviews.apache.org/r/10712/#comment40445>

    I think this should take and return VectorizedRowBatch since it would not make sense for it to be another type -- the method name even says "Batch"
    

ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java
<https://reviews.apache.org/r/10712/#comment40488>

    should add another test with some null values
- Eric Hanson
On April 22, 2013, 10:26 p.m., Sarvesh Sakalanaga wrote:
+
Sarvesh Sakalanaga 2013-04-24, 21:53
+
Scott Preece 2013-04-23, 14:31
+
Sarvesh Sakalanaga 2013-04-24, 21:52
+
Sarvesh Sakalanaga 2013-04-24, 21:53
+
Eric Hanson 2013-04-24, 23:01
+
Sarvesh Sakalanaga 2013-04-25, 00:59
+
Sarvesh Sakalanaga 2013-04-25, 01:00
+
Ashutosh Chauhan 2013-04-30, 18:25