Hive user mailing list: Hive double-precision question


Re: Hive double-precision question
Hi, Periya:
I think it is also worth checking the workaround in "Programming Hive" (Ed
Capriolo's book) first instead of waiting for the fix. I am currently stuck
on converting the exact value to DoubleWritable/FloatWritable without
losing accuracy, which may take a while to resolve.
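
As a rough sketch of where that accuracy goes (the class name and the extra
digits below are made up, and it assumes the exact value arrives as a
BigDecimal): the moment the value is stored in a DoubleWritable it is rounded
to the nearest 64-bit double, which only carries about 15-17 significant
decimal digits.

    import java.math.BigDecimal;
    import org.apache.hadoop.io.DoubleWritable;

    public class WritablePrecision {
        public static void main(String[] args) {
            // Made-up value with more digits than a double can represent.
            BigDecimal exact = new BigDecimal("4.49508858054005123456789");

            // Storing it in a DoubleWritable rounds it to the nearest binary64
            // value, so the trailing digits are lost before Hive ever sees them.
            DoubleWritable w = new DoubleWritable(exact.doubleValue());

            System.out.println("exact value : " + exact.toPlainString());
            System.out.println("as writable : " + new BigDecimal(w.get()).toPlainString());
        }
    }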

Thanks for Yang's tips.

Johnny
On Fri, Dec 7, 2012 at 2:36 PM, Periya.Data <[EMAIL PROTECTED]> wrote:

> Thanks Lauren, Mark Grover and Zhang. Will have to see the source code in
> Hive to see what is happening and if I can make the results consistent...
>
> Interested to see Zhang's patch. I shall watch that Jira.
>
> -PD
>
>
> On Fri, Dec 7, 2012 at 2:12 PM, Lauren Yang <[EMAIL PROTECTED]> wrote:
>
>> This sounds like https://issues.apache.org/jira/browse/HIVE-2586,
>> where comparing floats/doubles will not work because of the way floating
>> point numbers are represented.
>>
>> Perhaps there is a comparison between a float and a double type because of
>> some internal representation in the Java library, or the UDF.
>>
>> Ed Capriolo's book has a good section about workarounds and caveats for
>> working with floats/doubles in Hive.
>>
>> Thanks,
>> Lauren
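
A small standalone illustration of the comparison problem HIVE-2586 describes
(the 0.1 value is the usual textbook example, not taken from this thread): the
same literal rounded to float and widened back to double is not equal to the
double rounding of it, so equality and range predicates misfire.

    public class FloatDoubleCompare {
        public static void main(String[] args) {
            float  f = 0.1f;   // nearest float to 0.1
            double d = 0.1;    // nearest double to 0.1 -- a different value

            System.out.println("(double) 0.1f = " + (double) f);         // 0.10000000149011612
            System.out.println("double   0.1  = " + d);                  // 0.1
            System.out.println("equal?          " + ((double) f == d));  // false
            System.out.println("0.1f > 0.1?     " + ((double) f > d));   // true
        }
    }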
>>
>> From: Periya.Data [mailto:[EMAIL PROTECTED]]
>> Sent: Friday, December 07, 2012 1:28 PM
>> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
>> Subject: Hive double-precision question
>>
>>
>> Hi Hive Users,
>>     I recently noticed an interesting behavior with Hive and I am unable
>> to find the reason for it. Your insights into this are much appreciated.
>>
>> I am trying to compute the distance between two zip codes. I have the
>> distances computed on various 'platforms' - SAS, R, Linux+Java, a Hive UDF,
>> and Hive's built-in functions. There are discrepancies starting at the
>> 3rd decimal place in the output obtained using the Hive UDF and Hive's
>> built-in functions. Here is an example:
>>
>> zip1     zip2     Hadoop Built-in function    SAS           R                  Linux + Java
>> 00501    11720    4.49493083698542000         4.49508858    4.49508858054005   4.49508857976933000
>>
>> The formula used to compute distance is this (UDF):
>>
>>         // atan(1)/45 == PI/180: convert the input coordinates from degrees to radians
>>         double long1 = Math.atan(1)/45 * ux;
>>         double lat1 = Math.atan(1)/45 * uy;
>>         double long2 = Math.atan(1)/45 * mx;
>>         double lat2 = Math.atan(1)/45 * my;
>>
>>         double X1 = long1;
>>         double Y1 = lat1;
>>         double X2 = long2;
>>         double Y2 = lat2;
>>
>>         // spherical law of cosines, scaled by ~3,950 (Earth's radius in miles)
>>         double distance = 3949.99 * Math.acos(Math.sin(Y1) *
>>                 Math.sin(Y2) + Math.cos(Y1) * Math.cos(Y2) * Math.cos(X1 - X2));
>>
>>
>> The version using Hive's built-in functions (same formula as above):
>> 3949.99*acos(  sin(u_y_coord * (atan(1)/45 )) *
>>         sin(m_y_coord * (atan(1)/45 )) + cos(u_y_coord * (atan(1)/45 ))*
>>         cos(m_y_coord * (atan(1)/45 ))*cos(u_x_coord *
>>         (atan(1)/45) - m_x_coord * (atan(1)/45)) )
>>
>>
>>
>>
>> - The Hive built-in functions used are acos, sin, cos and atan.
>> - For another run, I used a Hive UDF with Java's math library (Math.acos,
>> Math.atan, etc.).
>> - All variables used are double.
>>
>> I expected the value from the Hive UDF (and the built-in functions) to be
>> identical to the one obtained from plain Java code on Linux, but they are not.
>> The built-in functions (as well as the UDF) give 4.49493083698542000, whereas
>> a simple Java program running on Linux gives 4.49508857976933000. The Linux
>> machine is similar to the Hadoop cluster machines.
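
For what it is worth, here is a standalone sketch of the float vs. double
effect discussed in the replies (the coordinates are made up, not the real
centroids of 00501 and 11720). It applies the same formula twice, once with
double inputs and once with the inputs first narrowed to float, as could
happen if a FloatWritable or a float column sneaks into the pipeline, and
prints how far the results drift apart:

    public class CoordinatePrecision {
        // Same formula as the UDF above: degrees -> radians, then spherical law of cosines.
        static double dist(double lon1, double lat1, double lon2, double lat2) {
            double d2r = Math.atan(1) / 45;   // PI / 180
            double x1 = lon1 * d2r, y1 = lat1 * d2r;
            double x2 = lon2 * d2r, y2 = lat2 * d2r;
            return 3949.99 * Math.acos(Math.sin(y1) * Math.sin(y2)
                    + Math.cos(y1) * Math.cos(y2) * Math.cos(x1 - x2));
        }

        public static void main(String[] args) {
            // Hypothetical zip-code centroids a few miles apart (illustrative values only).
            double lon1 = -73.0456, lat1 = 40.8154;
            double lon2 = -73.0851, lat2 = 40.8623;

            double fromDoubles = dist(lon1, lat1, lon2, lat2);
            double fromFloats  = dist((float) lon1, (float) lat1, (float) lon2, (float) lat2);

            System.out.printf("double inputs      : %.17f%n", fromDoubles);
            System.out.printf("float-narrowed     : %.17f%n", fromFloats);
            System.out.printf("difference (miles) : %.17f%n", fromDoubles - fromFloats);
        }
    }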
>>
>> Linux version - Red Hat 5.5
>> Java - latest.
>> Hive - 0.7.1
>> Hadoop - 0.20.2
>>
>> This discrepancy is very consistent across thousands of zip-code
>> distances. It is not a one-off occurrence. In some cases, the difference
>> starts at the 4th decimal place. Some more examples:
>>
>> zip1          zip 2          Hadoop Built-in function