BUG: Strings like '2E' are incorrectly parsed as valid floats #12215

bennorth · 2016-02-03T00:12:18Z

A work colleague, David Chase, encountered some surprising behaviour, which can be reduced to the following. The data-frame

DataFrame({'x': [2.5], 'y': [42], 'z': ['2E']})

does not round-trip correctly. The string '2E' is interpreted as a valid float, but it should not be (according to man strtod(3), which seems a reasonable spec).

This PR changes the three variants of xstrtod() to reject a string where no digits follow the 'e' or 'E', and includes tests for this case.
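For reference, the decimal grammar described in strtod(3) can be approximated with a regular expression. This is only an illustrative sketch, not code from this PR: the names `STRTOD_DECIMAL`, `is_decimal_float` and `longest_float_prefix` are hypothetical, and the hexadecimal, INF and NAN forms that strtod also accepts are left out.

```python
import re

# Decimal form only: optional sign, digits with an optional radix point,
# then an optional exponent whose digit sequence must be NONEMPTY.
# Hex floats, INF and NAN are deliberately omitted from this sketch.
STRTOD_DECIMAL = re.compile(
    r"[+-]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[eE][+-]?[0-9]+)?"
)


def is_decimal_float(text):
    """Return True if the whole of `text` matches the decimal grammar."""
    return STRTOD_DECIMAL.fullmatch(text) is not None


def longest_float_prefix(text):
    """Return the leading part of `text` that the grammar would consume."""
    match = STRTOD_DECIMAL.match(text)
    return match.group(0) if match else ""


print(is_decimal_float("2E3"))     # True
print(is_decimal_float("2E"))      # False: no digits after the 'E'
print(longest_float_prefix("2E"))  # 2
```

Under this grammar '2E' is not a float as a whole; only the leading '2' is, which is why the value in column 'z' should come back as the string '2E'.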

jreback · 2016-02-03T00:25:54Z

pandas/tests/frame/test_to_csv.py

@@ -1108,3 +1108,13 @@ def test_to_csv_with_dst_transitions(self):
            df.to_pickle(path)
            result = pd.read_pickle(path)
            assert_frame_equal(result, df)
+
+    def test_round_trip_scientific_no_exponent(self):
+        df = DataFrame({'x': [2.5], 'y': [42], 'z': ['2E']})


this needs to go in io/test_parsers.py

and use self.read_csv

as this tests under all engines (this might fail under other parsers)

bennorth · 2016-02-03T08:01:17Z

Thanks for feedback. Will tidy up the following:

move test to io/test_parsers.py
use self.read_csv in test
remove redundant assertion on behaviour with '2E3'

and force-push new commits.

bennorth · 2016-02-03T15:23:38Z

New commits address feedback from @jreback. Can squash or otherwise re-order if preferred, but for review, separate commits seemed cleaner.

jreback · 2016-02-05T15:54:52Z

pandas/tests/test_tseries.py

@@ -337,6 +337,12 @@ def test_convert_infs():
    assert (result.dtype == np.float64)


+def test_scientific_no_exponent():
+    arr = np.array(['42E', '2E', '99e', '6e'], dtype='O')


i don't see anything changed in the routines which support this, why is this test needed?

The changes this PR proposes to the xstrtod() functions mean that strings like that are rejected as floats. The current behaviour converts them as if they were '42E0' etc:

In [1]: import pandas as pd, numpy as np
In [2]: pd.__version__
Out[2]: '0.17.1'
In [3]: arr = np.array(['42E', '2E', '99e', '6e'], dtype='O')
In [4]: pd.lib.maybe_convert_numeric(arr, set(), False, True)
Out[4]: array([ 42., 2., 99., 6.])

which I believe is incorrect. The test checks that such strings are rejected as floats, and so give NaN.
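As a point of comparison, Python's built-in float() already rejects these strings. The sketch below uses it as a stand-in for the intended behaviour; it does not go through pandas' xstrtod() code path, and `to_float_or_nan` is a hypothetical helper rather than anything in pandas.

```python
import math


def to_float_or_nan(text):
    """Convert `text` with the built-in float(), giving NaN on failure."""
    try:
        return float(text)
    except ValueError:
        # float() raises ValueError for an exponent marker with no digits.
        return math.nan


values = [to_float_or_nan(s) for s in ["42E", "2E", "99e", "6e"]]
print(all(math.isnan(v) for v in values))  # True
print(to_float_or_nan("42E0"))             # 42.0
```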

jreback · 2016-02-05T15:56:19Z

pls add a whatsnew note (bug fixes)

bennorth · 2016-02-05T17:29:49Z

Have created an issue (#12237), to have something to refer to in the whatsnew entry.

jreback · 2016-02-05T17:30:30Z

@bennorth you didn't actually have to create an issue as you can just refer to the PR number itself (but ok)

bennorth · 2016-02-05T17:33:43Z

(Ah, OK, sorry. I saw that the other entries in whatsnew did refer to issues so followed that pattern.)

jreback · 2016-02-05T17:51:26Z

pandas/tests/test_tseries.py

+    # See PR 12215
+    arr = np.array(['42E', '2E', '99e', '6e'], dtype='O')
+    result = lib.maybe_convert_numeric(arr, set(), False, True)
+    assert np.all(np.isnan(result))


ahh, you fixed floatify, I get it now

use self.assertTrue(.....)

OK. (There are plenty of other bare asserts in that file though; shall I create an issue to update them?)

just pls create another issue, we shouldn't have any bare asserts.

bennorth · 2016-02-05T20:47:41Z

I should have seen that one --- there is no self in that top-level function. Will revert.

bennorth · 2016-02-06T07:29:32Z

(Build/test failure matches failure in current master and appears unrelated to this change.)

jreback · 2016-02-06T19:43:06Z

pandas/io/tests/test_parsers.py

@@ -210,6 +210,18 @@ def test_read_csv(self):
            # it works!
            read_csv(fname, index_col=0, parse_dates=True)

+    def float_precision_choices(self):
+        raise pd.core.common.AbstractMethodError(self)


import this at the top and just raise AbstractMethodError(self)
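A minimal sketch of the pattern under discussion, with assumptions stated: the built-in NotImplementedError stands in for pandas' AbstractMethodError so the snippet has no dependencies, the class names are hypothetical, and the per-engine lists of float_precision values are an assumption rather than a copy of the PR's code.

```python
class ParserTestsBase:
    """Shared tests; each engine-specific subclass supplies its own choices."""

    def float_precision_choices(self):
        # Stand-in for `raise AbstractMethodError(self)` in pandas.
        raise NotImplementedError(
            "{} must define float_precision_choices".format(type(self).__name__)
        )

    def precision_kwargs(self):
        """Build one read_csv keyword dict per supported precision."""
        return [{"float_precision": p} for p in self.float_precision_choices()]


class CEngineTests(ParserTestsBase):
    def float_precision_choices(self):
        # Assumed values for the C engine.
        return [None, "high", "round_trip"]


class PythonEngineTests(ParserTestsBase):
    def float_precision_choices(self):
        # Assumed: the Python engine has only the default.
        return [None]


print(len(CEngineTests().precision_kwargs()))       # 3
print(len(PythonEngineTests().precision_kwargs()))  # 1
```

A shared test can then loop over precision_kwargs() so that the same assertion runs once per variant for whichever engine the subclass covers.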

jreback · 2016-02-06T19:45:11Z

ok, couple of small comments. pls rebase on master & squash. ping when green.

bennorth · 2016-02-06T22:03:21Z

(Rebased on master and squashed.)

bennorth · 2016-02-07T09:26:33Z

(Force-push of b51a35a is fresh re-base onto master.)

bennorth · 2016-02-07T10:03:22Z

@jreback Pinging as requested --- squashed, rebased, and build is green.

jreback · 2016-02-08T15:08:37Z

@bennorth lgtm. But have a look here: #10133

maybe move your tests near this one (if they are not there now)

sorry, rereading this had to do with the decimal places in the mantissa I guess.

bennorth · 2016-02-08T16:36:43Z

Yes, it does make sense to move the new tests near related ones. New commit on the way. Thanks.

The man page for strtod(3) says: "A decimal exponent consists of an 'E' or 'e', followed by an optional plus or minus sign, followed by a NONEMPTY sequence of decimal digits". (Emphasis on 'nonempty' added.)

Currently, Pandas parses the string '2E' as a valid float, interpreting it as '2E0', i.e., 2.0. It should reject '2E'.

Update the functions

precise_xstrtod()
xstrtod() (two copies)

such that they require at least one digit after the 'e' or 'E'. If there are no digits, then there is not a valid exponent, and in that case, we rewind the 'next character' pointer back to point to the 'e' or 'E'.

Add tests:

test_scientific_no_exponent() in tests/test_tseries.py
ParserTests.test_scientific_no_exponent in io/tests/test_parsers.py (tests behaviour under C and Python engines; and for the three float_precision variants under the C engine)
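The rewind step can be illustrated in Python. This is a sketch of the idea only, not a port of the C functions: `scan_float` and `parse_whole_float` are hypothetical names, and the scanner handles just sign, digits, an optional radix point and an optional exponent.

```python
DIGITS = "0123456789"


def scan_float(s):
    """Return the index just past the longest float prefix of `s` (0 if none)."""
    i, n = 0, len(s)
    if i < n and s[i] in "+-":
        i += 1
    mantissa_digits = 0
    while i < n and s[i] in DIGITS:
        i += 1
        mantissa_digits += 1
    if i < n and s[i] == ".":
        i += 1
        while i < n and s[i] in DIGITS:
            i += 1
            mantissa_digits += 1
    if mantissa_digits == 0:
        return 0  # no digits at all, so nothing was converted
    if i < n and s[i] in "eE":
        exp_start = i  # remember where the 'e' or 'E' sits
        i += 1
        if i < n and s[i] in "+-":
            i += 1
        exp_digits = 0
        while i < n and s[i] in DIGITS:
            i += 1
            exp_digits += 1
        if exp_digits == 0:
            i = exp_start  # not a valid exponent: rewind to the 'e' or 'E'
    return i


def parse_whole_float(s):
    """Return float(s) only if the scanner consumed all of `s`, else None."""
    end = scan_float(s)
    if end == 0 or end != len(s):
        return None
    return float(s)


print(scan_float("2E"))          # 1
print(parse_whole_float("2E"))   # None
print(parse_whole_float("2E0"))  # 2.0
```

Because the pointer is rewound to the 'E', a caller that requires the whole field to be consumed sees trailing characters and treats '2E' as a non-float.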

bennorth · 2016-02-09T15:48:32Z

Recent build failures were in master and unrelated. Have just force-pushed newly-rebased commit.

jreback · 2016-02-09T22:03:44Z

thanks @bennorth

bennorth · 2016-02-09T22:15:54Z

Thanks!

DataFrame({'x': [2.5], 'y': [42], 'z': ['2E']}) does not round-trip correctly. The string '2E' is interpreted as a valid float, but it should not be.

This PR changes the three variants of `xstrtod()` to reject a string where no digits follow the 'e' or 'E', and includes tests for this case.

Author: Ben North <[email protected]>

Closes pandas-dev#12215 from bennorth/BUG-float-parsing and squashes the following commits:

8d2b583 [Ben North] BUG: Reject empty-exponent strings as non-floats

jreback reviewed Feb 3, 2016

jreback reviewed Feb 5, 2016

jreback added Bug IO CSV read_csv, to_csv labels Feb 5, 2016

bennorth mentioned this pull request Feb 5, 2016

When reading CSV, float-like strings lacking exponent digits are accepted as floats #12237

Closed

jreback reviewed Feb 5, 2016

jreback reviewed Feb 6, 2016

jreback added this to the 0.18.0 milestone Feb 6, 2016

bennorth force-pushed the BUG-float-parsing branch 2 times, most recently from 5449922 to 89c5ad9 Compare February 6, 2016 22:03

bennorth force-pushed the BUG-float-parsing branch from 89c5ad9 to b51a35a Compare February 7, 2016 09:25

bennorth force-pushed the BUG-float-parsing branch from b51a35a to 12b05a2 Compare February 8, 2016 16:37

bennorth force-pushed the BUG-float-parsing branch from 12b05a2 to 8d2b583 Compare February 9, 2016 15:48

jreback closed this in 517c559 Feb 9, 2016
