-
Review Request: float and double calculation is inaccurate in Hive
Johnny Zhang 2012-12-18, 00:10

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8653/
-----------------------------------------------------------

Review request for hive.

Description
-------

I found this while debugging the e2e test failures. Hive miscalculates float and double values. Take a float calculation as an example:

hive> select f from all100k limit 1;
48308.98
hive> select f/10 from all100k limit 1;
4830.898046875 <-- added 046875 at the end
hive> select f*1.01 from all100k limit 1;
48792.0702734375 <-- should be 48792.0698

It might be essentially the same problem as http://effbot.org/pyfaq/why-are-floating-point-calculations-so-inaccurate.htm
But since the e2e tests compare the results with MySQL, and MySQL seems to get it right, it is worth fixing in Hive.

This addresses bug HIVE-3715.
https://issues.apache.org/jira/browse/HIVE-3715

Diffs
-----

http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFOPDivide.java 1423224
http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFOPMultiply.java 1423224

Diff: https://reviews.apache.org/r/8653/diff/

Testing
-------

I tested by comparing the result with MySQL using its default float precision setting; the results are identical.

query: select f, f*1.01, f/10 from all100k limit 1;
mysql result: 48309 48792.0702734375 4830.898046875
hive result: 48308.98 48792.0702734375 4830.898046875

I applied this patch and ran the Hive e2e tests, and they all pass (without this patch, there are 5 related failures).

Thanks,

Johnny Zhang
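The digits reported above can be reproduced outside Hive. The following is a minimal sketch in plain Java, not Hive code. It assumes the column holds the 32-bit float nearest to 48308.98 and that the arithmetic happens after widening that float to a double, which is consistent with the outputs quoted in the description. The class and method names are made up for illustration.

```java
final class FloatWideningDemo {
    // Widen a float to double, as Java does implicitly in mixed float/double arithmetic.
    // Floats near 48308 are spaced 1/256 apart, so the stored value is 48308.98046875.
    static double widen(float f) {
        return (double) f;
    }

    // Division carried out in double precision on the widened value.
    static double divideByTen(float f) {
        return widen(f) / 10;
    }

    // Multiplication carried out in double precision on the widened value.
    static double timesOnePointZeroOne(float f) {
        return widen(f) * 1.01;
    }

    public static void main(String[] args) {
        float f = 48308.98f;
        System.out.println(f);                       // shortest decimal that identifies the float
        System.out.println(widen(f));                // the float's exact value, shown as a double
        System.out.println(divideByTen(f));          // compare with the f/10 output in the report
        System.out.println(timesOnePointZeroOne(f)); // compare with the f*1.01 output in the report
    }
}
```

Under these assumptions the "extra" digits are the tail of the stored float value becoming visible once it is treated as a double, rather than an error introduced by the division or multiplication itself.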
-
Re: Review Request: float and double calculation is inaccurate in Hive
Johnny Zhang 2012-12-18, 00:37

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8653/
-----------------------------------------------------------

(Updated Dec. 18, 2012, 12:37 a.m.)

Review request for hive.

Description (updated)
-------

I found this while debugging the e2e test failures. Hive miscalculates float and double values. Take a float calculation as an example:

hive> select f from all100k limit 1;
48308.98
hive> select f/10 from all100k limit 1;
4830.898046875 <-- added 046875 at the end
hive> select f*1.01 from all100k limit 1;
48792.0702734375 <-- should be 48792.0698

It might be essentially the same problem as http://effbot.org/pyfaq/why-are-floating-point-calculations-so-inaccurate.htm
But since the e2e tests compare the results with MySQL, and MySQL seems to get it right, it is worth fixing in Hive.

This addresses bug HIVE-3715.
https://issues.apache.org/jira/browse/HIVE-3715

Diffs
-----

http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFOPDivide.java 1423224
http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFOPMultiply.java 1423224

Diff: https://reviews.apache.org/r/8653/diff/

Testing
-------

I tested by comparing the result with MySQL using its default float precision setting; the results are identical.

query: select f, f*1.01, f/10 from all100k limit 1;
mysql result: 48309 48792.0702734375 4830.898046875
hive result: 48308.98 48792.0702734375 4830.898046875

I applied this patch and ran the Hive e2e tests, and they all pass (without this patch, there are 5 related failures).

Thanks,

Johnny Zhang
-
Re: Review Request: float and double calculation is inaccurate in Hive
Mark Grover 2012-12-18, 00:38

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8653/#review14625
-----------------------------------------------------------

http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFOPDivide.java
<https://reviews.apache.org/r/8653/#comment31047>

10 seems to be a rather arbitrary number for scale. Any particular reason you are using it? Maybe we should invoke the method where no scale needs to be specified.

http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFOPMultiply.java
<https://reviews.apache.org/r/8653/#comment31048>

You seem to be doing
DoubleWritable->String->BigDecimal

There probably is a way to do:
DoubleWritable->Double->BigDecimal

I am not sure if it's any more efficient than the present case. So, take this suggestion with a grain of salt :-)

- Mark Grover
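Some background on the scale question: in java.math.BigDecimal, divide(BigDecimal) with no scale and no MathContext throws ArithmeticException when the exact quotient has a non-terminating decimal expansion, so a bound has to come from somewhere. The sketch below contrasts a fixed scale of 10, as discussed in this review, with a MathContext that bounds significant digits instead. The class and method names are hypothetical, and RoundingMode.HALF_UP is an assumption, since the thread does not say which rounding mode the patch uses.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

final class DivideScaleDemo {
    // Fixed number of digits after the decimal point (scale = 10).
    // HALF_UP is an assumed rounding mode, chosen only for illustration.
    static BigDecimal divideFixedScale(BigDecimal a, BigDecimal b) {
        return a.divide(b, 10, RoundingMode.HALF_UP);
    }

    // Fixed number of significant digits (16, with HALF_EVEN rounding) instead of a fixed scale.
    static BigDecimal divideWithContext(BigDecimal a, BigDecimal b) {
        return a.divide(b, MathContext.DECIMAL64);
    }

    // Reports whether the exact, unbounded divide is rejected for these operands.
    static boolean exactDivideThrows(BigDecimal a, BigDecimal b) {
        try {
            a.divide(b);
            return false;
        } catch (ArithmeticException e) {
            return true; // non-terminating decimal expansion
        }
    }

    public static void main(String[] args) {
        BigDecimal one = BigDecimal.ONE;
        BigDecimal three = new BigDecimal("3");
        System.out.println(divideFixedScale(one, three));
        System.out.println(divideWithContext(one, three));
        System.out.println(exactDivideThrows(one, three));
        System.out.println(divideFixedScale(new BigDecimal("48308.98"), BigDecimal.TEN));
    }
}
```

A fixed scale gives the same number of fractional digits regardless of magnitude, while a MathContext keeps the number of significant digits constant; which of the two matches another database's output depends on that database's own rules.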
-
Re: Review Request: float and double calculation is inaccurate in Hive
Johnny Zhang 2012-12-18, 01:13

> On Dec. 18, 2012, 12:38 a.m., Mark Grover wrote:
> > http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFOPDivide.java, line 50
> > <https://reviews.apache.org/r/8653/diff/1/?file=240423#file240423line50>
> >
> > 10 seems to be a rather arbitrary number for scale. Any particular reason you are using it? Maybe we should invoke the method where no scale needs to be specified.
>
> Johnny Zhang wrote:
>     Hi Mark, thanks for reviewing it. The reason for using 10 is that it is the same as the MySQL default precision setting. I just want to make the calculation result identical to MySQL's.

I think I did try without specifying a scale, and the result was different from MySQL's. I agree that hard-coding the scale is not a good approach. I am open to other suggestions.

- Johnny

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8653/#review14625
-----------------------------------------------------------
-
Re: Review Request: float and double calculation is inaccurate in Hive
Bharath Mundlapudi 2012-12-18, 08:51

We have solved this issue recently. It is not just a problem in Hive. Contact me offline if you need more details.

-Bharath
-
Re: Review Request: float and double calculation is inaccurate in Hive
Mark Grover 2012-12-21, 21:43

Bharath,
I am interested in hearing more as well. Could you please comment on https://issues.apache.org/jira/browse/HIVE-2693 ?

Thanks in advance!

On Tue, Dec 18, 2012 at 12:51 AM, Bharath Mundlapudi <[EMAIL PROTECTED]> wrote:
> We have solved this issue recently. It is not just a problem in Hive.
> Contact me offline if you need more details.
>
> -Bharath
-
Re: Review Request: float and double calculation is inaccurate in Hive
Johnny Zhang 2012-12-18, 01:00

> On Dec. 18, 2012, 12:38 a.m., Mark Grover wrote:
> > http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFOPDivide.java, line 50
> > <https://reviews.apache.org/r/8653/diff/1/?file=240423#file240423line50>
> >
> > 10 seems to be a rather arbitrary number for scale. Any particular reason you are using it? Maybe we should invoke the method where no scale needs to be specified.

Hi Mark, thanks for reviewing it. The reason for using 10 is that it is the same as the MySQL default precision setting. I just want to make the calculation result identical to MySQL's.

> On Dec. 18, 2012, 12:38 a.m., Mark Grover wrote:
> > http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFOPMultiply.java, line 112
> > <https://reviews.apache.org/r/8653/diff/1/?file=240424#file240424line112>
> >
> > You seem to be doing
> > DoubleWritable->String->BigDecimal
> >
> > There probably is a way to do:
> > DoubleWritable->Double->BigDecimal
> >
> > I am not sure if it's any more efficient than the present case. So, take this suggestion with a grain of salt :-)

The reason for using the constructor with a String parameter is that the constructor with a double parameter would reduce the precision before the calculation. There is a similar discussion at http://www.coderanch.com/t/408226/java/java/Double-BigDecimal-Conversion-problems

"you will see the difference between creating an instance using a double (whose precision has already been compromised by forcing it into IEEE 754 standards) and creating an instance using a String (which can be translated accurately)."

- Johnny

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8653/#review14625
-----------------------------------------------------------
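The difference between the two BigDecimal constructors discussed here can be seen in a few lines. This is a minimal sketch using only java.math.BigDecimal; it is not the code from the patch, and the class and method names are invented for illustration. The last method assumes the operand is a float whose text form is taken with Float.toString, which is an assumption about one possible conversion path and not something the thread confirms.

```java
import java.math.BigDecimal;

final class BigDecimalSourceDemo {
    // new BigDecimal(double) captures the exact binary value of the double.
    static BigDecimal fromDouble(double d) {
        return new BigDecimal(d);
    }

    // Going through the String form captures the short decimal that Java prints for the double.
    static BigDecimal fromString(double d) {
        return new BigDecimal(Double.toString(d));
    }

    // Decimal multiplication on the printed form of a float (assumed conversion path).
    static BigDecimal multiplyViaString(float f, String factor) {
        return new BigDecimal(Float.toString(f)).multiply(new BigDecimal(factor));
    }

    public static void main(String[] args) {
        System.out.println(fromDouble(0.1));  // exact binary expansion, many digits
        System.out.println(fromString(0.1));  // the decimal as printed
        System.out.println(multiplyViaString(48308.98f, "1.01"));
    }
}
```

BigDecimal.valueOf(double) also goes through Double.toString, so it behaves like the String route without an explicit conversion in the calling code. Note that the String route recovers the shortest decimal that identifies the stored binary value, which is usually, but not necessarily, the decimal that was originally written.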
-
Re: Review Request: float and double calculation is inaccurate in Hive
Mark Grover 2012-12-21, 21:54

> On Dec. 18, 2012, 12:38 a.m., Mark Grover wrote:
> > http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFOPDivide.java, line 50
> > <https://reviews.apache.org/r/8653/diff/1/?file=240423#file240423line50>
> >
> > 10 seems to be a rather arbitrary number for scale. Any particular reason you are using it? Maybe we should invoke the method where no scale needs to be specified.
>
> Johnny Zhang wrote:
>     Hi Mark, thanks for reviewing it. The reason for using 10 is that it is the same as the MySQL default precision setting. I just want to make the calculation result identical to MySQL's.
>
> Johnny Zhang wrote:
>     I think I did try without specifying a scale, and the result was different from MySQL's. I agree that hard-coding the scale is not a good approach. I am open to other suggestions.

Fair enough. Thanks

- Mark

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8653/#review14625
-----------------------------------------------------------