Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # dev >> Review Request: Change ORC tree readers to return batches of rows instead of a row


Copy link to this message
-
Re: Review Request: Change ORC tree readers to return batches of rows instead of a row


> On April 23, 2013, 7:33 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReader.java, line 47
> > <https://reviews.apache.org/r/10712/diff/1/?file=283263#file283263line47>
> >
> >     the previous argument and the return value are always of class VectorizedRowBatch so why not declare them to be VectorizedRowBatch instead of Object?

I wanted to keep the interface same for both next and nextBatch, but it makes more sense to change it to VectorizedRowBatch.
> On April 23, 2013, 7:33 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 187
> > <https://reviews.apache.org/r/10712/diff/1/?file=283264#file283264line187>
> >
> >     if noNulls is false, that means there are nulls, so you can't set noNulls back to true, can you?

Change the logic a bit based on Scott's comments.It should look good now.
> On April 23, 2013, 7:33 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 336
> > <https://reviews.apache.org/r/10712/diff/1/?file=283264#file283264line336>
> >
> >     You should never have to create a new column vector here because there should always be one passed down into this code as part of the batch.
> >    
> >     Consider assuming there will always be a vector. The code will just fail if there isn't one.

The code is flexible enough to work either ways. If a vector is passed in it use it else will create a new one. This is the pattern followed in ORC reader and would like to keep it as such.
> On April 23, 2013, 7:33 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 342
> > <https://reviews.apache.org/r/10712/diff/1/?file=283264#file283264line342>
> >
> >     It appears you are replacing one isNull array (the one passed in) with another. We should be writing null information into the same Boolean array over and over. Am I missing something?

Changed a logic a bit and now the vector is being reused.
> On April 23, 2013, 7:33 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 393
> > <https://reviews.apache.org/r/10712/diff/1/?file=283264#file283264line393>
> >
> >     this code for IntTreeReader appears identical to the ShortTreeReader case. Can the code be shared, e.g. by inheriting from super-class?
> >    
> >     Since this is a vector, consider calling it nextVector

Creating one more class to abstract out the code creates two layers of redirection. Given that nextBatch method in TreeReaders have a dependency on the super class I would rather leave it as such.  

I will change the function name from nextBatch to nextVector as it makes more sense.
> On April 23, 2013, 7:33 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 452
> > <https://reviews.apache.org/r/10712/diff/1/?file=283264#file283264line452>
> >
> >     same comment -- can the code be shared for all int types?

As as above.
> On April 23, 2013, 7:33 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 531
> > <https://reviews.apache.org/r/10712/diff/1/?file=283264#file283264line531>
> >
> >     I think this could incorrectly decide something isRepeating if there are nulls, the way the code is now. But using NaN for nulls should prevent this.
> >    
> >     I think this isRepeating check is probably worth it for performance but it might not really pay off. It might be worth a performance test sometime in the future to determine this.

Yep we can do that once the end-end pipeline is build. I have fixed the NaN case and now isRepeating should be set correctly.
> On April 23, 2013, 7:33 p.m., Eric Hanson wrote:
> > ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java, line 105
> > <https://reviews.apache.org/r/10712/diff/1/?file=283266#file283266line105>
> >
> >     Should create the batch up here, not use a null batch.

This is fine as the next call to nextBatch passes a created batch down.

Change the logic of the same test to pass in null.
- Sarvesh
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10712/#review19568
On April 24, 2013, 9:53 p.m., Sarvesh Sakalanaga wrote: