Skip to content

BUG: Strings like '2E' are incorrectly parsed as valid floats #12215

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

bennorth
Copy link
Contributor

@bennorth bennorth commented Feb 3, 2016

A work colleague, David Chase, encountered some surprising behaviour, which can be reduced to the following. The data-frame

DataFrame({'x': [2.5], 'y': [42], 'z': ['2E']})

does not round-trip correctly. The string '2E' is interpreted as a valid float, but it should not be (according to man strtod(3), which seems a reasonable spec).

This PR changes the three variants of xstrtod() to reject a string where no digits follow the 'e' or 'E', and includes tests for this case.

@@ -1108,3 +1108,13 @@ def test_to_csv_with_dst_transitions(self):
df.to_pickle(path)
result = pd.read_pickle(path)
assert_frame_equal(result, df)

def test_round_trip_scientific_no_exponent(self):
df = DataFrame({'x': [2.5], 'y': [42], 'z': ['2E']})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to go in io/test_parsers.py

and use self.read_csv

as this test under all engines (this might fail under other parsers)

@bennorth
Copy link
Contributor Author

bennorth commented Feb 3, 2016

Thanks for feedback. Will tidy up the following:

  • move test to io/test_parsers.py
  • use self.read_csv in test
  • remove redundant assertion on behaviour with '2E3'

and force-push new commits.

@bennorth
Copy link
Contributor Author

bennorth commented Feb 3, 2016

New commits address feedback from @jreback. Can squash or otherwise re-order if preferred, but for review, separate commits seemed cleaner.

@@ -337,6 +337,12 @@ def test_convert_infs():
assert (result.dtype == np.float64)


def test_scientific_no_exponent():
arr = np.array(['42E', '2E', '99e', '6e'], dtype='O')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't see anything changed in the routines which support this, why is this test needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes this PR proposes to the xstrtod() functions mean that strings like that are rejected as floats. The current behaviour converts them as if they were '42E0' etc:

In [1]: import pandas as pd, numpy as np

In [2]: pd.__version__
Out[2]: '0.17.1'

In [3]: arr = np.array(['42E', '2E', '99e', '6e'], dtype='O')

In [4]: pd.lib.maybe_convert_numeric(arr, set(), False, True)
Out[4]: array([ 42.,   2.,  99.,   6.])

which I believe is incorrect. The test checks that such strings are rejected as floats, and so give NaN.

@jreback
Copy link
Contributor

jreback commented Feb 5, 2016

pls add a whatsnew note (bug fixes)

@bennorth
Copy link
Contributor Author

bennorth commented Feb 5, 2016

Have created an issue (#12237), to have something to refer to in the whatsnew entry.

@jreback
Copy link
Contributor

jreback commented Feb 5, 2016

@bennorth you didn't actually have to create an issue as you can just refer to the PR number itself (but ok)

@bennorth
Copy link
Contributor Author

bennorth commented Feb 5, 2016

(Ah, OK, sorry. I saw that the other entries in whatsnew did refer to issues so followed that pattern.)

# See PR 12215
arr = np.array(['42E', '2E', '99e', '6e'], dtype='O')
result = lib.maybe_convert_numeric(arr, set(), False, True)
assert np.all(np.isnan(result))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, you fixed floatify, I get it now

use self.assertTrue(.....)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. (There are plenty of other bare asserts in that file though; shall I create an issue to update them?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just pls create another issue, we shouldn't have any bare asserts.

@bennorth
Copy link
Contributor Author

bennorth commented Feb 5, 2016

I should have seen that one --- there is no self in that top-level function. Will revert.

@bennorth
Copy link
Contributor Author

bennorth commented Feb 6, 2016

(Build/test failure matches failure in current master and appears unrelated to this change.)

@@ -210,6 +210,18 @@ def test_read_csv(self):
# it works!
read_csv(fname, index_col=0, parse_dates=True)

def float_precision_choices(self):
raise pd.core.common.AbstractMethodError(self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import this at the top and just raise AbstractMethodError(self)

@jreback jreback added this to the 0.18.0 milestone Feb 6, 2016
@jreback
Copy link
Contributor

jreback commented Feb 6, 2016

ok, couple of small comments. pls rebase on master & squash. ping when green.

@bennorth bennorth force-pushed the BUG-float-parsing branch 2 times, most recently from 5449922 to 89c5ad9 Compare February 6, 2016 22:03
@bennorth
Copy link
Contributor Author

bennorth commented Feb 6, 2016

(Rebased on master and squashed.)

@bennorth
Copy link
Contributor Author

bennorth commented Feb 7, 2016

(Force-push of b51a35a is fresh re-base onto master.)

@bennorth
Copy link
Contributor Author

bennorth commented Feb 7, 2016

@jreback Pinging as requested --- squashed, rebased, and build is green.

@jreback
Copy link
Contributor

jreback commented Feb 8, 2016

@bennorth lgtm. But have a look here: #10133

maybe move your tests near this one (if they are not there now)

sorry, rereading this had to do with the decimal places in the mantissa I guess.

@bennorth
Copy link
Contributor Author

bennorth commented Feb 8, 2016

Yes, it does make sense to move the new tests near related ones. New commit on the way. Thanks.

The man page for strode(3) says: "A decimal exponent consists of an 'E'
or 'e', followed by an optional plus or minus sign, followed by a
NONEMPTY sequence of decimal digits".  (Emphasis on 'nonempty' added.)
Currently, Pandas parses the string '2E' as a valid float, interpreting
it as '2E0', i.e., 2.0.  It should reject '2E'.

Update the functions

    precise_xstrtod()
    xstrtod() (two copies)

such that they require at least one digit after the 'e' or 'E'.  If
there are no digits, then there is not a valid exponent, and in that
case, we rewind the 'next character' pointer back to point to the 'e' or
'E'.

Add tests:

    test_scientific_no_exponent() in tests/test_tseries.py

    ParserTests.test_scientific_no_exponent in io/tests/test_parsers.py
    (tests behaviour under C and Python engines; and for the three
    float_precision variants under the C engine)
@bennorth
Copy link
Contributor Author

bennorth commented Feb 9, 2016

Recent build failures were in master and unrelated. Have just force-pushed newly-rebased commit.

@jreback jreback closed this in 517c559 Feb 9, 2016
@jreback
Copy link
Contributor

jreback commented Feb 9, 2016

thanks @bennorth

@bennorth
Copy link
Contributor Author

bennorth commented Feb 9, 2016

Thanks!

cldy pushed a commit to cldy/pandas that referenced this pull request Feb 11, 2016
DataFrame({'x': [2.5], 'y': [42], 'z': ['2E']}) does not round-trip correctly.
The string '2E' is interpreted as a valid float, but it should not be
This PR changes the three variants of `xstrtod()` to reject a string where no
digits follow the 'e' or 'E', and includes tests for this case.

Author: Ben North <[email protected]>

Closes pandas-dev#12215 from bennorth/BUG-float-parsing and squashes the following commits:

8d2b583 [Ben North] BUG: Reject empty-exponent strings as non-floats
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants