Skip to content

BUG: read_csv not recognizing numbers appropriately when decimal is set #38420

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Jan 3, 2021

Conversation

phofl
Copy link
Member

@phofl phofl commented Dec 12, 2020

The old regex was shorter but pretty buggy.

Let's assume decimal="," and thousands="."

For example something like 1a.2,3 was interpreted as numeric by the regex and converted to 1a2.3, because the . was not escaped in the regex.
Also something like 1,2,3 was interpreted as numeric by the regex and converted to 1.2.3.

This one is not quite finished. We have to define a few thousand separator related things:

How strict would we want to be? Should 1.2,3 be interpreted as numeric and be converted to 12.3 or is only something like 1.234,5 relevant as thousands separator? C Engine validation is not strict
Additionally should ,2 be the number 0.2? -> Currently it is, because the C engine behaves the same

@pep8speaks
Copy link

pep8speaks commented Dec 12, 2020

Hello @phofl! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-01-01 22:36:56 UTC

@phofl phofl added the IO CSV read_csv, to_csv label Dec 12, 2020
@@ -2346,9 +2346,17 @@ def __init__(self, f: Union[FilePathOrBuffer, List], **kwds):
raise ValueError("Only length-1 decimal markers supported")

if self.thousands is None:
self.nonnum = re.compile(fr"[^-^0-9^{self.decimal}]+")
regex = fr"^\-?[0-9]*({self.decimal}[0-9]*)?([0-9](E|e)\-?[0-9]*)?$"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is self.decimal already escaped by the time we get here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, seems not to be the case. Escaped it.

@@ -3036,7 +3044,7 @@ def _search_replace_num_columns(self, lines, search, replace):
not isinstance(x, str)
or search not in x
or (self._no_thousands_columns and i in self._no_thousands_columns)
or self.nonnum.search(x.strip())
or not self.nonnum.search(x.strip())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this flips the meaning of nonnum right? should rename?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep is better, done

if "." == thousands:
thousands = fr"\{thousands}"
regex = (
fr"^\-?([0-9]+{thousands}|[0-9])*({self.decimal}[0-9]*)?"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be out of scope, but do we want to restrict to 3 digits between thousands characters (IIRC in some locales they separate on ten-thousands, not sure if we support that)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was a thing I was wondering too when opening this PR. Not quite sure how strict we would want to be here. We could maybe check for at least three digits between? The old version did not check anything related to this

else:
self.nonnum = re.compile(fr"[^-^0-9^{self.thousands}^{self.decimal}]+")
thousands = self.thousands
if "." == thousands:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe re.escape is more appropriate here? Although we do not expect many different inputs, so special casing is maybe more explicit.

Copy link
Member Author

@phofl phofl Dec 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this raises a good point. Independently of the expected input (we do not say anything about this in the docs), maybe we should escape decimals too as @jbrockmendel mentioned above

"""
)
result = python_parser_only.read_csv(
data, "\t", decimal=",", engine="python", thousands=thousands
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
data, "\t", decimal=",", engine="python", thousands=thousands
data, "\t", decimal=",", thousands=thousands

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry wrong commit button above. C works perfectly here. Already have tests therefore. Would probably makes sense unifying them as a follow up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO would do it in this PR, but follow-up also works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is fine as a followup

@jreback jreback added this to the 1.3 milestone Dec 29, 2020
@@ -255,6 +255,7 @@ I/O
^^^

- Bug in :meth:`Index.__repr__` when ``display.max_seq_items=1`` (:issue:`38415`)
- Bug in :func:`read_csv` not recognizing scientific notation if decimal is set (:issue:`31920`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should say engine='python' right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

@jreback
Copy link
Contributor

jreback commented Jan 1, 2021

great ping on green (suggestion to combine the python/c tests as a followup)

@phofl
Copy link
Member Author

phofl commented Jan 2, 2021

@jreback green

@jreback jreback merged commit b337b61 into pandas-dev:master Jan 3, 2021
@jreback
Copy link
Contributor

jreback commented Jan 3, 2021

thanks @phofl if you can do a PR or issue for the followup would be great

@phofl phofl deleted the 31920 branch January 3, 2021 18:19
@phofl
Copy link
Member Author

phofl commented Jan 3, 2021

Will open an issue. Hopefully I will be able to get back to this in the coming days

@phofl
Copy link
Member Author

phofl commented Jan 3, 2021

Opened #38926

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pd.read_csv does not recognize scientific notation if 'decimal' attribute is set with engine=python
6 participants