-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Strings like '2E' are incorrectly parsed as valid floats #12215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -1108,3 +1108,13 @@ def test_to_csv_with_dst_transitions(self): | |||
df.to_pickle(path) | |||
result = pd.read_pickle(path) | |||
assert_frame_equal(result, df) | |||
|
|||
def test_round_trip_scientific_no_exponent(self): | |||
df = DataFrame({'x': [2.5], 'y': [42], 'z': ['2E']}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to go in io/test_parsers.py
and use self.read_csv
as this test under all engines (this might fail under other parsers)
Thanks for feedback. Will tidy up the following:
and force-push new commits. |
New commits address feedback from @jreback. Can squash or otherwise re-order if preferred, but for review, separate commits seemed cleaner. |
@@ -337,6 +337,12 @@ def test_convert_infs(): | |||
assert (result.dtype == np.float64) | |||
|
|||
|
|||
def test_scientific_no_exponent(): | |||
arr = np.array(['42E', '2E', '99e', '6e'], dtype='O') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't see anything changed in the routines which support this, why is this test needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes this PR proposes to the xstrtod()
functions mean that strings like that are rejected as floats. The current behaviour converts them as if they were '42E0' etc:
In [1]: import pandas as pd, numpy as np
In [2]: pd.__version__
Out[2]: '0.17.1'
In [3]: arr = np.array(['42E', '2E', '99e', '6e'], dtype='O')
In [4]: pd.lib.maybe_convert_numeric(arr, set(), False, True)
Out[4]: array([ 42., 2., 99., 6.])
which I believe is incorrect. The test checks that such strings are rejected as floats, and so give NaN.
pls add a whatsnew note (bug fixes) |
Have created an issue (#12237), to have something to refer to in the whatsnew entry. |
@bennorth you didn't actually have to create an issue as you can just refer to the PR number itself (but ok) |
(Ah, OK, sorry. I saw that the other entries in whatsnew did refer to issues so followed that pattern.) |
# See PR 12215 | ||
arr = np.array(['42E', '2E', '99e', '6e'], dtype='O') | ||
result = lib.maybe_convert_numeric(arr, set(), False, True) | ||
assert np.all(np.isnan(result)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh, you fixed floatify
, I get it now
use self.assertTrue(.....)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. (There are plenty of other bare assert
s in that file though; shall I create an issue to update them?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just pls create another issue, we shouldn't have any bare asserts.
I should have seen that one --- there is no |
(Build/test failure matches failure in current |
@@ -210,6 +210,18 @@ def test_read_csv(self): | |||
# it works! | |||
read_csv(fname, index_col=0, parse_dates=True) | |||
|
|||
def float_precision_choices(self): | |||
raise pd.core.common.AbstractMethodError(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import this at the top and just raise AbstractMethodError(self)
ok, couple of small comments. pls rebase on master & squash. ping when green. |
5449922
to
89c5ad9
Compare
(Rebased on master and squashed.) |
89c5ad9
to
b51a35a
Compare
(Force-push of b51a35a is fresh re-base onto master.) |
@jreback Pinging as requested --- squashed, rebased, and build is green. |
Yes, it does make sense to move the new tests near related ones. New commit on the way. Thanks. |
b51a35a
to
12b05a2
Compare
The man page for strode(3) says: "A decimal exponent consists of an 'E' or 'e', followed by an optional plus or minus sign, followed by a NONEMPTY sequence of decimal digits". (Emphasis on 'nonempty' added.) Currently, Pandas parses the string '2E' as a valid float, interpreting it as '2E0', i.e., 2.0. It should reject '2E'. Update the functions precise_xstrtod() xstrtod() (two copies) such that they require at least one digit after the 'e' or 'E'. If there are no digits, then there is not a valid exponent, and in that case, we rewind the 'next character' pointer back to point to the 'e' or 'E'. Add tests: test_scientific_no_exponent() in tests/test_tseries.py ParserTests.test_scientific_no_exponent in io/tests/test_parsers.py (tests behaviour under C and Python engines; and for the three float_precision variants under the C engine)
12b05a2
to
8d2b583
Compare
Recent build failures were in |
thanks @bennorth |
Thanks! |
DataFrame({'x': [2.5], 'y': [42], 'z': ['2E']}) does not round-trip correctly. The string '2E' is interpreted as a valid float, but it should not be This PR changes the three variants of `xstrtod()` to reject a string where no digits follow the 'e' or 'E', and includes tests for this case. Author: Ben North <[email protected]> Closes pandas-dev#12215 from bennorth/BUG-float-parsing and squashes the following commits: 8d2b583 [Ben North] BUG: Reject empty-exponent strings as non-floats
A work colleague, David Chase, encountered some surprising behaviour, which can be reduced to the following. The data-frame
does not round-trip correctly. The string '2E' is interpreted as a valid float, but it should not be (according to man strtod(3), which seems a reasonable spec).
This PR changes the three variants of
xstrtod()
to reject a string where no digits follow the 'e' or 'E', and includes tests for this case.