Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive, mail # user - Hive double-precision question


Copy link to this message
-
Hive double-precision question
Periya.Data 2012-12-07, 21:27
Hi Hive Users,
    I recently noticed an interesting behavior with Hive and I am unable to
find the reason for it. Your insights into this is much appreciated.

I am trying to compute the distance between two zip codes. I have the
distances computed in various 'platforms' - SAS, R, Linux+Java, Hive UDF
and using Hive's built-in functions. There are some discrepancies from the
3rd decimal place when I see the output got from using Hive UDF and Hive's
built-in functions. Here is an example:

zip1          zip 2          Hadoop Built-in function
SAS                      R                                       Linux +
Java
00501   11720   4.49493083698542000 4.49508858 4.49508858054005
4.49508857976933000
The formula used to compute distance is this (UDF):

        double long1 = Math.atan(1)/45 * ux;
        double lat1 = Math.atan(1)/45 * uy;
        double long2 = Math.atan(1)/45 * mx;
        double lat2 = Math.atan(1)/45 * my;

        double X1 = long1;
        double Y1 = lat1;
        double X2 = long2;
        double Y2 = lat2;

        double distance = 3949.99 * Math.acos(Math.sin(Y1) *
                Math.sin(Y2) + Math.cos(Y1) * Math.cos(Y2) * Math.cos(X1 -
X2));
The one used using built-in functions (same as above):
3949.99*acos(  sin(u_y_coord * (atan(1)/45 )) *
        sin(m_y_coord * (atan(1)/45 )) + cos(u_y_coord * (atan(1)/45 ))*
        cos(m_y_coord * (atan(1)/45 ))*cos(u_x_coord *
        (atan(1)/45) - m_x_coord * (atan(1)/45)) )
- The Hive's built-in functions used are acos, sin, cos and atan.
- for another try, I used Hive UDF, with Java's math library (Math.acos,
Math.atan etc)
- All variables used are double.

I expected the value from Hadoop UDF (and Built-in functions) to be
identical with that got from plain Java code in Linux. But they are not.
The built-in function (as well as UDF) gives 49493083698542000 whereas
simple Java program running in Linux gives 49508857976933000. The linux
machine is similar to the Hadoop cluster machines.

Linux version - Red Hat 5.5
Java - latest.
Hive - 0.7.1
Hadoop - 0.20.2

This discrepancy is very consistent across thousands of zip-code distances.
It is not a one-off occurrence. In some cases, I see the difference from
the 4th decimal place. Some more examples:

zip1          zip 2          Hadoop Built-in function
SAS                      R                                       Linux +
Java
   00602   00617   42.79095253903410000 42.79072812 42.79072812185650
42.79072812185640000  00603   00617   40.24044016655180000 40.2402289
40.24022889740920 40.24022889740910000  00605   00617   40.19191761288380000
40.19186416 40.19186415807060 40.19186415807060000
I have not tested the individual sin, cos, atan function returns. That will
be my next test. But, at the very least, why is there a difference in the
values between Hadoop's UDF/built-ins and that from Linux + Java?  I am
assuming that Hive's built-in mathematical functions are nothing but the
underlying Java functions.

Thanks,
PD.
+
Lauren Yang 2012-12-07, 22:12
+
Periya.Data 2012-12-07, 22:36
+
Johnny Zhang 2012-12-07, 23:29
+
Periya.Data 2012-12-08, 19:23
+
Ashutosh Chauhan 2012-12-08, 20:01
+
Johnny Zhang 2012-12-18, 00:13
+
Tom Brown 2012-12-18, 02:08
+
Mark Grover 2012-12-07, 22:02
+
Periya.Data 2012-12-07, 23:05
+
Johnny Zhang 2012-12-07, 21:32