-
Hive double-precision question (Periya.Data, 2012-12-07, 21:27)
Hi Hive Users,

I recently noticed an interesting behavior in Hive and I am unable to find the reason for it. Your insights are much appreciated.

I am trying to compute the distance between two zip codes. I have computed the distances on various "platforms": SAS, R, Linux + Java, a Hive UDF, and Hive's built-in functions. The output from the Hive UDF and from Hive's built-in functions differs from the others starting at the 3rd decimal place. Here is an example:

zip1   zip2   Hadoop built-in function  SAS         R                 Linux + Java
00501  11720  4.49493083698542000       4.49508858  4.49508858054005  4.49508857976933000

The formula used to compute the distance is this (UDF):

    double long1 = Math.atan(1)/45 * ux;
    double lat1 = Math.atan(1)/45 * uy;
    double long2 = Math.atan(1)/45 * mx;
    double lat2 = Math.atan(1)/45 * my;

    double X1 = long1;
    double Y1 = lat1;
    double X2 = long2;
    double Y2 = lat2;

    double distance = 3949.99 * Math.acos(Math.sin(Y1) * Math.sin(Y2)
        + Math.cos(Y1) * Math.cos(Y2) * Math.cos(X1 - X2));

The one using built-in functions (same formula as above):

    3949.99*acos( sin(u_y_coord * (atan(1)/45)) * sin(m_y_coord * (atan(1)/45))
        + cos(u_y_coord * (atan(1)/45)) * cos(m_y_coord * (atan(1)/45))
        * cos(u_x_coord * (atan(1)/45) - m_x_coord * (atan(1)/45)) )

- The Hive built-in functions used are acos, sin, cos and atan.
- As another try, I used a Hive UDF with Java's math library (Math.acos, Math.atan, etc.).
- All variables used are double.

I expected the value from the Hive UDF (and the built-in functions) to be identical to the value from plain Java code on Linux, but they are not. The built-in functions (as well as the UDF) give 4.49493083698542000, whereas a simple Java program running on Linux gives 4.49508857976933000. The Linux machine is similar to the Hadoop cluster machines.

Linux version - Red Hat 5.5
Java - latest
Hive - 0.7.1
Hadoop - 0.20.2

This discrepancy is very consistent across thousands of zip-code distances. It is not a one-off occurrence. In some cases, I see the difference from the 4th decimal place. Some more examples:

zip1   zip2   Hadoop built-in function  SAS          R                  Linux + Java
00602  00617  42.79095253903410000      42.79072812  42.79072812185650  42.79072812185640000
00603  00617  40.24044016655180000      40.2402289   40.24022889740920  40.24022889740910000
00605  00617  40.19191761288380000      40.19186416  40.19186415807060  40.19186415807060000

I have not yet tested the return values of the individual sin, cos and atan functions. That will be my next test. But, at the very least, why is there a difference between the values from Hadoop's UDF/built-ins and those from Linux + Java? I am assuming that Hive's built-in mathematical functions are nothing but the underlying Java functions.

Thanks,
PD.
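The "next test" mentioned above, checking each function on its own, could be set up along these lines. This is only a sketch: the class name ZipDistance, the helper cosCentralAngle and the coordinates in main are hypothetical (the thread does not give the zip-code coordinates); only the formula and the constant 3949.99 come from the message. It prints every intermediate value with 17 significant digits so that each one can be compared with the output of the matching Hive built-in on the same inputs.

```java
final class ZipDistance {
    // Constant from the original message, where it scales the central angle to miles.
    static final double RADIUS_MILES = 3949.99;
    // atan(1)/45 is pi/180, the degrees-to-radians factor used in the original formula.
    static final double DEG_TO_RAD = Math.atan(1) / 45;

    /** Cosine of the central angle: the argument handed to acos in the original formula. */
    static double cosCentralAngle(double ux, double uy, double mx, double my) {
        double x1 = DEG_TO_RAD * ux, y1 = DEG_TO_RAD * uy;
        double x2 = DEG_TO_RAD * mx, y2 = DEG_TO_RAD * my;
        return Math.sin(y1) * Math.sin(y2)
             + Math.cos(y1) * Math.cos(y2) * Math.cos(x1 - x2);
    }

    /** Distance in miles, term for term as in the original UDF. */
    static double distanceMiles(double ux, double uy, double mx, double my) {
        return RADIUS_MILES * Math.acos(cosCentralAngle(ux, uy, mx, my));
    }

    public static void main(String[] args) {
        // Hypothetical coordinates in degrees (x = longitude, y = latitude).
        double ux = -73.04, uy = 40.81, mx = -73.05, my = 40.87;
        // 17 significant digits are enough to identify a double uniquely.
        System.out.printf("atan(1)/45 = %.17g%n", DEG_TO_RAD);
        System.out.printf("sin(y1)    = %.17g%n", Math.sin(DEG_TO_RAD * uy));
        System.out.printf("cos(y1)    = %.17g%n", Math.cos(DEG_TO_RAD * uy));
        System.out.printf("acos arg   = %.17g%n", cosCentralAngle(ux, uy, mx, my));
        System.out.printf("distance   = %.17g%n", distanceMiles(ux, uy, mx, my));
    }
}
```

If the intermediate values agree between plain Java and Hive but the final distance does not, the inputs (the stored coordinates) are the next thing to compare.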
-
Re: Hive double-precision question (Johnny Zhang, 2012-12-07, 21:32)
Hi, Periya:
This is a problem for me also. I filed https://issues.apache.org/jira/browse/HIVE-3715 and have a patch working locally. I am doing more tests right now and will post it soon.

Thanks,
Johnny
-
Re: Hive double-precision question (Mark Grover, 2012-12-07, 22:02)
Periya:
If you want to see what the built-in Hive UDFs are doing, the code is here:
https://github.com/apache/hive/tree/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic
and
https://github.com/apache/hive/tree/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf

You can find out which UDF name maps to which class by looking at
https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java

If my memory serves me right, Hive does some "interesting" things when mapping Java types to Hive data types. I am not sure how relevant that is to this discussion; I will have to look further to comment more.

In the meantime, take a look at the UDF code and see whether your own Java code on Linux is equivalent to the Hive UDF code.

Keep us posted!
Mark
-
RE: Hive double-precision question (Lauren Yang, 2012-12-07, 22:12)
This sounds like https://issues.apache.org/jira/browse/HIVE-2586 , where comparing floats and doubles does not work because of the way floating-point numbers are represented.

Perhaps a float is being compared with a double somewhere, because of some internal representation in the Java library or in the UDF.

Ed Capriolo's book has a good section about workarounds and caveats for working with floats/doubles in Hive.

Thanks,
Lauren
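The float/double mismatch described in this reply can be shown in a few lines of plain Java. The sketch below is an illustration, not a diagnosis: the class name FloatWidening, its helpers and the longitude value are hypothetical. It shows that a float widened to double is not the same number as the double written with the same digits, and it converts the rounding error of a single coordinate into miles using the constant 3949.99 from the original message. For coordinates between 64 and 128 degrees, float spacing is 2^-17 degrees, so the rounding error can reach about 3.8e-6 degrees, or roughly 2.6e-4 miles along a great circle. That is the same order of magnitude as the discrepancies in the tables above, but whether a float is actually involved anywhere in the poster's pipeline is not established in the thread.

```java
final class FloatWidening {
    static final double RADIUS_MILES = 3949.99; // constant from the original message

    /** Absolute error introduced by storing a double in a float and widening it back. */
    static double floatRoundTripError(double value) {
        return Math.abs((double) (float) value - value);
    }

    /** Great-circle length, in miles, of an angle given in degrees. */
    static double degreesToMiles(double degrees) {
        return Math.toRadians(degrees) * RADIUS_MILES;
    }

    public static void main(String[] args) {
        // A float literal widened to double is not the double literal with the same text.
        System.out.println("0.1f == 0.1      : " + (0.1f == 0.1));

        double longitude = -73.04; // hypothetical coordinate in degrees
        double errDegrees = floatRoundTripError(longitude);
        System.out.println("error in degrees : " + errDegrees);
        System.out.println("error in miles   : " + degreesToMiles(errDegrees));
    }
}
```

A quick way to rule this out is to check that the coordinate columns in the Hive table schema, and every value passed into the UDF, are declared as double rather than float.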
-
Re: Hive double-precision question (Periya.Data, 2012-12-07, 22:36)
Thanks Lauren, Mark Grover and Zhang. I will have to read the Hive source code to see what is happening and whether I can make the results consistent.

I am interested to see Zhang's patch. I shall watch that Jira.

-PD
-
Re: Hive double-precision question (Periya.Data, 2012-12-07, 23:05)
Hi Mark,
Thanks for the pointers. I looked at the code, and it looks like my Java code and the Hive code are similar (I am a basic-level Java guy). The UDF below uses Math.sin, which is what I used to produce the "Linux + Java" result. I still have to see what DoubleWritable and Serde2 are all about.

    package org.apache.hadoop.hive.ql.udf;

    import org.apache.hadoop.hive.ql.exec.Description;
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.hive.serde2.io.DoubleWritable;

    /**
     * UDFSin.
     *
     */
    @Description(name = "sin",
        value = "_FUNC_(x) - returns the sine of x (x is in radians)",
        extended = "Example:\n "
        + " > SELECT _FUNC_(0) FROM src LIMIT 1;\n" + " 0")
    public class UDFSin extends UDF {
      private DoubleWritable result = new DoubleWritable();

      public UDFSin() {
      }

      public DoubleWritable evaluate(DoubleWritable a) {
        if (a == null) {
          return null;
        } else {
          result.set(Math.sin(a.get()));
          return result;
        }
      }
    }
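On the DoubleWritable question: a wrapper that stores a Java double and serializes it as eight bytes should not change the value. The sketch below assumes, without having checked the Hive 0.7.1 source, that DoubleWritable writes its value with DataOutput.writeDouble; it uses only java.io and no Hive or Hadoop classes. The class name DoubleRoundTrip is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class DoubleRoundTrip {
    /** Serializes a double to its 8-byte form and reads it back. */
    static double roundTrip(double value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeDouble(value);
            }
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readDouble();
            }
        } catch (IOException e) {
            // In-memory streams should not fail; rethrow unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        double original = 4.49508857976933; // "Linux + Java" value from the thread
        double restored = roundTrip(original);
        // Compare bit patterns rather than printed text.
        System.out.println("bit-identical: "
            + (Double.doubleToLongBits(original) == Double.doubleToLongBits(restored)));
    }
}
```

If that assumption holds, the wrapper itself is an unlikely source of a difference in the 4th decimal place, and the column types and input values are better suspects.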
-
Re: Hive double-precision question (Johnny Zhang, 2012-12-07, 23:29)
Hi, Periya:
I think it is also worth checking the workaround in "Programming Hive" (Ed Capriolo's book) first instead of waiting for the fix. Right now I am stuck at converting the accurate value to DoubleWritable/FloatWritable without losing accuracy, which may take a while to resolve. Thanks to Yang for the tips.

Johnny
-
Re: Hive double-precision question (Periya.Data, 2012-12-08, 19:23)
Hi Lauren and Zhang,
The book "Programming Hive" suggests using double (instead of float) and casting any literal value to double. I am already using double for all my computations (both in the Hive table schema and in my UDF). Furthermore, I am not comparing two floats/doubles. I am doing some computations involving doubles, and those minor differences are adding up.

It looks like what Mark Grover was describing: the mapping between Java data types and Hive data types. I have yet to look at that portion of the source code.

Thanks and will keep you posted,
/PD
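How much "minor differences adding up" can matter in this particular formula can be estimated. For a distance of about 4.5 miles the central angle is about 1.1e-3 radians, so the argument of acos is about 1 - 6.5e-7. A rounding error near 1e-16 in that argument is amplified to a relative error of roughly 1e-10 in the distance, which is about the size of the gap between the R and Linux + Java columns for the first zip pair. It is far smaller than the gap to the Hive column, so this effect alone would not explain the Hive numbers. The sketch below compares the thread's formula with the haversine formula, a different technique that is not used anywhere in the thread and that avoids taking acos of a value close to 1. The class name, method names and coordinates are hypothetical, and Math.toRadians is used in place of atan(1)/45.

```java
final class SmallAngleDistance {
    static final double RADIUS_MILES = 3949.99; // constant from the original message

    /** Spherical law of cosines, the formula used in the thread. Inputs in degrees. */
    static double lawOfCosines(double lon1, double lat1, double lon2, double lat2) {
        double y1 = Math.toRadians(lat1), y2 = Math.toRadians(lat2);
        double dx = Math.toRadians(lon1) - Math.toRadians(lon2);
        return RADIUS_MILES * Math.acos(
            Math.sin(y1) * Math.sin(y2) + Math.cos(y1) * Math.cos(y2) * Math.cos(dx));
    }

    /** Haversine formula: same geometric distance, better conditioned for short distances. */
    static double haversine(double lon1, double lat1, double lon2, double lat2) {
        double y1 = Math.toRadians(lat1), y2 = Math.toRadians(lat2);
        double sinHalfDy = Math.sin((y2 - y1) / 2);
        double sinHalfDx = Math.sin((Math.toRadians(lon2) - Math.toRadians(lon1)) / 2);
        double a = sinHalfDy * sinHalfDy
                 + Math.cos(y1) * Math.cos(y2) * sinHalfDx * sinHalfDx;
        // Clamp to 1 so rounding can never push asin outside its domain.
        return 2 * RADIUS_MILES * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    public static void main(String[] args) {
        // Hypothetical pair of points a few miles apart (degrees).
        double lon1 = -73.04, lat1 = 40.81, lon2 = -73.05, lat2 = 40.87;
        double a = lawOfCosines(lon1, lat1, lon2, lat2);
        double b = haversine(lon1, lat1, lon2, lat2);
        System.out.printf("law of cosines = %.17g%n", a);
        System.out.printf("haversine      = %.17g%n", b);
        System.out.printf("relative diff  = %.3e%n", Math.abs(a - b) / b);
    }
}
```

If two platforms disagree only around the 9th or 10th significant digit, the conditioning of acos is a plausible explanation; a disagreement in the 4th decimal place points at the inputs instead.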
-
Re: Hive double-precision question (Ashutosh Chauhan, 2012-12-08, 20:01)
There is work going on at https://issues.apache.org/jira/browse/HIVE-2693 to add support for BigDecimal in Hive. I think your use case will benefit from it.

Ashutosh
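For context on what BigDecimal would and would not change, here is a small plain-Java sketch (the class name DecimalVsDouble is hypothetical, and nothing here reflects how HIVE-2693 exposes the type inside Hive, which the thread does not describe). BigDecimal makes decimal addition, subtraction and multiplication exact, but java.math.BigDecimal has no sin, cos or acos, so a trigonometric formula like the one in this thread would still pass through double unless another library is used.

```java
import java.math.BigDecimal;

final class DecimalVsDouble {
    /** Binary floating point: 0.1 and 0.2 are not exactly representable. */
    static double doubleSum() {
        return 0.1 + 0.2;
    }

    /** Decimal arithmetic: values built from strings are held exactly. */
    static BigDecimal decimalSum() {
        return new BigDecimal("0.1").add(new BigDecimal("0.2"));
    }

    public static void main(String[] args) {
        System.out.println("double     : " + doubleSum());   // prints 0.30000000000000004
        System.out.println("BigDecimal : " + decimalSum());  // prints 0.3
        System.out.println("double == 0.3 ? " + (doubleSum() == 0.3));
        System.out.println("BigDecimal == 0.3 ? "
            + (decimalSum().compareTo(new BigDecimal("0.3")) == 0));
    }
}
```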
-
Re: Hive double-precision questionJohnny Zhang 2012-12-18, 00:13
Hi, Periya:
Can you take a look at the patch in https://issues.apache.org/jira/browse/HIVE-3715 and see whether a similar change would make sin/cos more accurate for your use case? Feel free to comment on the JIRA as well. Thanks.

Johnny
-
Re: Hive double-precision questionTom Brown 2012-12-18, 02:08
Doubles are not perfect fractional numbers. Because of rounding errors, a set of doubles added in different orders can produce different results (e.g., a+b+c != b+c+a). Because of this, if your computation happens in a different order locally than on the Hive server, you might end up with different results. I don't think Hive supports a native decimal type, unfortunately, so it's difficult to verify this.

--Tom
|
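The order-dependence Tom describes can be seen in a few lines of plain Java. This is only a toy sketch with values chosen for illustration (0.1, 0.2, 0.3) and a hypothetical helper name; it shows that the effect exists and does not attempt to reproduce the discrepancy reported in this thread.

```java
// Hypothetical demo class: the same three doubles summed in two orders.
public class SummationOrderDemo {

    // Adds the values strictly left to right, the way a+b+c is evaluated.
    static double sumInOrder(double... values) {
        double total = 0.0;
        for (double v : values) {
            total += v; // each addition rounds to the nearest double
        }
        return total;
    }

    public static void main(String[] args) {
        double a = 0.1, b = 0.2, c = 0.3;

        double abc = sumInOrder(a, b, c); // (a + b) + c
        double bca = sumInOrder(b, c, a); // (b + c) + a

        System.out.println("a+b+c = " + abc);
        System.out.println("b+c+a = " + bca);
        System.out.println("equal? " + (abc == bca));
    }
}
```

The two sums differ only in the last digit of the double, which is the scale at which a pure reordering of additions usually shows up.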