-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: read_csv not recognizing numbers appropriately when decimal is set #38420
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/io/parsers.py
Outdated
@@ -2346,9 +2346,17 @@ def __init__(self, f: Union[FilePathOrBuffer, List], **kwds): | |||
raise ValueError("Only length-1 decimal markers supported") | |||
|
|||
if self.thousands is None: | |||
self.nonnum = re.compile(fr"[^-^0-9^{self.decimal}]+") | |||
regex = fr"^\-?[0-9]*({self.decimal}[0-9]*)?([0-9](E|e)\-?[0-9]*)?$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is self.decimal already escaped by the time we get here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, seems not to be the case. Escaped it.
pandas/io/parsers.py
Outdated
@@ -3036,7 +3044,7 @@ def _search_replace_num_columns(self, lines, search, replace): | |||
not isinstance(x, str) | |||
or search not in x | |||
or (self._no_thousands_columns and i in self._no_thousands_columns) | |||
or self.nonnum.search(x.strip()) | |||
or not self.nonnum.search(x.strip()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this flips the meaning of nonnum right? should rename?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep is better, done
pandas/io/parsers.py
Outdated
if "." == thousands: | ||
thousands = fr"\{thousands}" | ||
regex = ( | ||
fr"^\-?([0-9]+{thousands}|[0-9])*({self.decimal}[0-9]*)?" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be out of scope, but do we want to restrict to 3 digits between thousands characters (IIRC in some locales they separate on ten-thousands, not sure if we support that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was a thing I was wondering too when opening this PR. Not quite sure how strict we would want to be here. We could maybe check for at least three digits between? The old version did not check anything related to this
pandas/io/parsers.py
Outdated
else: | ||
self.nonnum = re.compile(fr"[^-^0-9^{self.thousands}^{self.decimal}]+") | ||
thousands = self.thousands | ||
if "." == thousands: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe re.escape is more appropriate here? Although we do not expect many different inputs, so special casing is maybe more explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this raises a good point. Independently of the expected input (we do not say anything about this in the docs), maybe we should escape decimals too as @jbrockmendel mentioned above
� Conflicts: � doc/source/whatsnew/v1.3.0.rst
� Conflicts: � doc/source/whatsnew/v1.3.0.rst
""" | ||
) | ||
result = python_parser_only.read_csv( | ||
data, "\t", decimal=",", engine="python", thousands=thousands |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data, "\t", decimal=",", engine="python", thousands=thousands | |
data, "\t", decimal=",", thousands=thousands |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry wrong commit button above. C works perfectly here. Already have tests therefore. Would probably makes sense unifying them as a follow up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO would do it in this PR, but follow-up also works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is fine as a followup
Co-authored-by: gfyoung <[email protected]>
� Conflicts: � doc/source/whatsnew/v1.3.0.rst
doc/source/whatsnew/v1.3.0.rst
Outdated
@@ -255,6 +255,7 @@ I/O | |||
^^^ | |||
|
|||
- Bug in :meth:`Index.__repr__` when ``display.max_seq_items=1`` (:issue:`38415`) | |||
- Bug in :func:`read_csv` not recognizing scientific notation if decimal is set (:issue:`31920`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should say engine='python' right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
great ping on green (suggestion to combine the python/c tests as a followup) |
@jreback green |
thanks @phofl if you can do a PR or issue for the followup would be great |
Will open an issue. Hopefully I will be able to get back to this in the coming days |
Opened #38926 |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
The old regex was shorter but pretty buggy.
Let's assume
decimal=","
andthousands="."
For example something like
1a.2,3
was interpreted as numeric by the regex and converted to1a2.3
, because the.
was not escaped in the regex.Also something like
1,2,3
was interpreted as numeric by the regex and converted to1.2.3
.This one is not quite finished. We have to define a few thousand separator related things:
How strict would we want to be? Should
1.2,3
be interpreted as numeric and be converted to12.3
or is only something like1.234,5
relevant as thousands separator? C Engine validation is not strictAdditionally should
,2
be the number0.2
? -> Currently it is, because the C engine behaves the same