Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive, mail # dev - Review Request: Vectorized Timestamp functions for long nanosecond based timestamps


+
Gopal V 2013-05-29, 23:36
+
Eric Hanson 2013-05-30, 21:29
Copy link to this message
-
Re: Review Request: Vectorized Timestamp functions for long nanosecond based timestamps
Gopal V 2013-05-31, 13:58


> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFTimestampFieldLong.java, line 56
> > <https://reviews.apache.org/r/11530/diff/1/?file=298364#file298364line56>
> >
> >     If you know in what cases ms can be negative, can you add that to the comment? That seems unusual.
> >    
> >     In think this because if the time is negative (before the epoch) you could get negative nanos. So you want to convert before creating the timestamp. Is that right? Please elaborate a little in the comment.

The issue has to do with the way sql.Timestamp stores values.

http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/sql/Timestamp.java#126
> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFTimestampFieldLong.java, line 62
> > <https://reviews.apache.org/r/11530/diff/1/?file=298364#file298364line62>
> >
> >     Would it be possible to re-use a timestamp that belongs to the class here, rather than calling new? If so, please do that to speed this up. I think you can do what you need with setTime() and setNanos().
> >    
> >     Eliminating new() in the inner loop of vector processing tends to speed things up a lot.

Will remove the extra timestamp created here.
> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFTimestampFieldLong.java, line 51
> > <https://reviews.apache.org/r/11530/diff/1/?file=298364#file298364line51>
> >
> >     Can you confirm this expression will be constant-folded by the compiler? Otherwise this should be evaluated by hand in advance.

in bytecode it turns into the appropriate 2 integer division ops.
   6: lload_1
   7: ldc2_w #3; //long 1000000000l
   10: ldiv
   11: ldc2_w #5; //long 1000l
   14: ldiv
> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFWeekOfYearLong.java, line 29
> > <https://reviews.apache.org/r/11530/diff/1/?file=298366#file298366line29>
> >
> >     Please explain why constant 4 is correct here and what it means.

This copied over from UDFWeekOfYear, which also leaves it undocumented.
> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFYearLong.java, line 33
> > <https://reviews.apache.org/r/11530/diff/1/?file=298367#file298367line33>
> >
> >     can you use minYear and maxYear here instead of literals?

will do.
> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFYearLong.java, line 43
> > <https://reviews.apache.org/r/11530/diff/1/?file=298367#file298367line43>
> >
> >     Given that this function is moderately heavy anyway (with the binary search) I think making it virtual will not slow things down much.
> >    
> >     But if it gets faster we should seriously consider creating a separate evaluate method for VectorUDFYearLong and make this a static function to avoid virtual method call overhead.
> >

Converting all leaf UDFs to be final classes, this triggers the virtual method cache and also makes the parent methods (from an abstract class) valid for inlining.
> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFYearLong.java, line 45
> > <https://reviews.apache.org/r/11530/diff/1/?file=298367#file298367line45>
> >
> >     Did you consider calculating approximate year using something like
> >    
> >     approxYear = yearBase + (int)(time / nanosPerYear) - 1;
> >    
> >     and then linearly search forward to find year boundary? I wonder if that would be faster than binary search.

A probing hash lookup will incur a bounds check operation in java, which is not triggered at all for Arrays functions or anything which uses GetPrimitiveArrayCritical (in the same thread) which skips boundary checks.

Could be done, but for now readability by others is preferred till first integration.

It tests a case where .isRepeating is not true, but all items are initialized to the exact same value. The % was to repeat smaller batches to fill up the entire batch size.

I used a home-made macro operation to generate all the test-cases to be identical with different UDFs

import sys, re;

data = sys.stdin.read();
macro = re.compile(r'begin-macro(.*)end-macro',re.M|re.S);
m = macro.search(data);
content = "\n".join(m.group(1).split("\n")[1:-1]);
occurrance = re.compile(r's\/([^/]*)\/([^/]*)\/g.*');
def expand(m):
original = m.group(1);
replacement = m.group(2);
return "%s\n%s" % (m.group(0),
  content.replace(original, replacement));
print re.sub(occurrance, expand, data);

will change it, this was copied off the String UDF tests.

The base test is Assert.assertEquals(vrg.cols[out].isNull[i], vrg.cols[in].isNull[i]); for both the sides of the if/else.

The logical extension being that vrg.cols[in].isNull[i] == true, I expect vrg.cols[out].isNull[i] == true.

It felt better to not simplify the logic and keep the assert as an invariant between the branches.

The input is repeated to fill up size times. I will add a comment.

Thanks, always makes sense to write more tests than impl, when reimplementing identical features.
- Gopal
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11530/#review21203
On May 29, 2013, 11:36 p.m., Gopal V wrote:
+
Gopal V 2013-05-31, 13:59
+
Eric Hanson 2013-05-31, 18:26