BUG: read_csv not recognizing numbers appropriately when decimal is set #38420

phofl · 2020-12-12T00:54:19Z

closes #31920 (pd.read_csv does not recognize scientific notation if 'decimal' attribute is set with engine=python)
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

The old regex was shorter but buggy.

Let's assume decimal="," and thousands=".".

For example, something like 1a.2,3 was interpreted as numeric by the regex and converted to 1a2.3, because the . was not escaped in the regex.
Also, something like 1,2,3 was interpreted as numeric by the regex and converted to 1.2.3.
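
The 1,2,3 case can be reproduced with the standard library alone. The snippet below is a minimal sketch, not the parser's actual code path: it rebuilds the old pattern from the removed diff line for decimal="," with no thousands separator, and the variable names are illustrative.

```python
import re

decimal = ","

# Old pattern (thousands unset): a character class of everything that is NOT
# "-", "^", a digit, or the decimal marker. A value counted as numeric when
# this pattern found nothing.
old_nonnum = re.compile(fr"[^-^0-9^{decimal}]+")

value = "1,2,3"

# No "non-numeric" character is found, so the value is treated as a number ...
treated_as_numeric = old_nonnum.search(value) is None

# ... and every decimal marker is replaced, which yields an invalid float string.
converted = value.replace(decimal, ".")

print(treated_as_numeric, converted)  # True 1.2.3
```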

This one is not quite finished. We still have to decide a few things related to the thousands separator:

How strict do we want to be? Should 1.2,3 be interpreted as numeric and converted to 12.3, or should only something like 1.234,5 count as using a thousands separator? The C engine's validation is not strict.
Additionally, should ,2 be the number 0.2? Currently it is, because the C engine behaves the same way.

pep8speaks · 2020-12-12T00:54:23Z

Hello @phofl! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-01-01 22:36:56 UTC

jbrockmendel · 2020-12-17T18:39:03Z

pandas/io/parsers.py

@@ -2346,9 +2346,17 @@ def __init__(self, f: Union[FilePathOrBuffer, List], **kwds):
            raise ValueError("Only length-1 decimal markers supported")

        if self.thousands is None:
-            self.nonnum = re.compile(fr"[^-^0-9^{self.decimal}]+")
+            regex = fr"^\-?[0-9]*({self.decimal}[0-9]*)?([0-9](E|e)\-?[0-9]*)?$"


is self.decimal already escaped by the time we get here?

No, seems not to be the case. Escaped it.
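
To illustrate why the escaping matters, here is a small sketch that builds the pattern from the diff above twice, once with the raw marker and once passed through re.escape. The helper name build_number_regex is hypothetical and only mirrors the f-string in the diff; decimal="." is used because an unescaped dot matches any character.

```python
import re


def build_number_regex(decimal, escape):
    """Hypothetical helper mirroring the f-string from the diff above."""
    if escape:
        decimal = re.escape(decimal)
    return re.compile(fr"^\-?[0-9]*({decimal}[0-9]*)?([0-9](E|e)\-?[0-9]*)?$")


unescaped = build_number_regex(".", escape=False)
escaped = build_number_regex(".", escape=True)

# Unescaped, "." matches any character, so "1a2" passes as a number.
print(bool(unescaped.search("1a2")))  # True
# Escaped, only a literal "." is accepted as the decimal marker.
print(bool(escaped.search("1a2")))  # False
print(bool(escaped.search("1.2")))  # True
```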

jbrockmendel · 2020-12-17T18:40:34Z

pandas/io/parsers.py

@@ -3036,7 +3044,7 @@ def _search_replace_num_columns(self, lines, search, replace):
                    not isinstance(x, str)
                    or search not in x
                    or (self._no_thousands_columns and i in self._no_thousands_columns)
-                    or self.nonnum.search(x.strip())
+                    or not self.nonnum.search(x.strip())


this flips the meaning of nonnum right? should rename?

Yep is better, done
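
A simplified sketch of the condition after the flip, assuming the renamed pattern now describes what a number looks like rather than what it does not. The function replace_in_numeric and the name num are illustrative; the real method also walks nested rows and checks _no_thousands_columns, both of which are left out here.

```python
import re

# Pattern from the diff above with decimal="," (no escaping needed for ",").
num = re.compile(r"^\-?[0-9]*(,[0-9]*)?([0-9](E|e)\-?[0-9]*)?$")


def replace_in_numeric(values, search, replace, num):
    """Replace `search` with `replace` only in strings that look numeric."""
    out = []
    for x in values:
        if (
            not isinstance(x, str)
            or search not in x
            or not num.search(x.strip())  # positive match now means "numeric"
        ):
            out.append(x)  # left untouched
        else:
            out.append(x.replace(search, replace))
    return out


result = replace_in_numeric(["1,5", "a,b", 3, "1,2,3", "2,5e-3"], ",", ".", num)
print(result)  # ['1.5', 'a,b', 3, '1,2,3', '2.5e-3']
```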

jbrockmendel · 2020-12-17T18:42:23Z

pandas/io/parsers.py

+            if "." == thousands:
+                thousands = fr"\{thousands}"
+            regex = (
+                fr"^\-?([0-9]+{thousands}|[0-9])*({self.decimal}[0-9]*)?"


might be out of scope, but do we want to restrict to 3 digits between thousands characters? (IIRC in some locales they separate on ten-thousands, not sure if we support that)

That was something I was wondering about too when opening this PR. Not quite sure how strict we would want to be here. We could maybe check for at least three digits in between? The old version did not check anything related to this.
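
One way to picture the stricter option discussed here is a pattern that only accepts a thousands separator when it is followed by exactly three digits. This is a hypothetical sketch, not what the PR implements: the function name strict_number_regex and the pattern itself are assumptions, exponents are ignored for brevity, and it would reject locales that group digits differently.

```python
import re


def strict_number_regex(decimal=",", thousands="."):
    """Hypothetical strict pattern: separators must delimit 3-digit groups."""
    dec = re.escape(decimal)
    tho = re.escape(thousands)
    # Either plain digits, or 1-3 leading digits followed by one or more
    # groups of "<thousands><3 digits>"; then an optional decimal part.
    return re.compile(
        fr"^-?([0-9]+|[0-9]{{1,3}}({tho}[0-9]{{3}})+)({dec}[0-9]*)?$"
    )


strict = strict_number_regex()

for value in ["1.234,5", "1.234.567", "1234,5", "1.2,3", "12.34"]:
    print(value, bool(strict.search(value)))
```

Under this sketch, 1.234,5 and 1234,5 are accepted while 1.2,3 is rejected, which is the stricter of the two behaviours raised in the question.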

cmulders · 2020-12-17T06:48:24Z

pandas/io/parsers.py

        else:
-            self.nonnum = re.compile(fr"[^-^0-9^{self.thousands}^{self.decimal}]+")
+            thousands = self.thousands
+            if "." == thousands:


Maybe re.escape is more appropriate here? Although we do not expect many different inputs, so special casing is maybe more explicit.

Actually this raises a good point. Independently of the expected input (the docs say nothing about this), maybe we should escape the decimal marker too, as @jbrockmendel mentioned above.
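
A short sketch of the difference between the two approaches. special_case_escape is a hypothetical stand-in for the "." check in the diff, and the pattern is deliberately simplified to digits, separator, digits; "|" is chosen only as an example of another regex metacharacter.

```python
import re


def special_case_escape(sep):
    """Hypothetical stand-in for the diff's approach: only "." is escaped."""
    if sep == ".":
        return fr"\{sep}"
    return sep


sep = "|"

# With special-casing, "|" stays unescaped and becomes an alternation.
special = re.compile(fr"^[0-9]+{special_case_escape(sep)}[0-9]+$")
# re.escape handles any metacharacter, not just ".".
escaped = re.compile(fr"^[0-9]+{re.escape(sep)}[0-9]+$")

print(bool(special.match("12x")))    # True: matches via the "^[0-9]+" branch
print(bool(escaped.match("12x")))    # False
print(bool(escaped.match("12|34")))  # True

# For "." both approaches produce the same escaped text.
print(special_case_escape(".") == re.escape("."))  # True
```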

gfyoung · 2020-12-29T10:40:11Z

pandas/tests/io/parser/test_python_parser_only.py

+    """
+    )
+    result = python_parser_only.read_csv(
+        data, "\t", decimal=",", engine="python", thousands=thousands


Suggested change

-        data, "\t", decimal=",", engine="python", thousands=thousands
+        data, "\t", decimal=",", thousands=thousands

Sorry, wrong commit button above. The C engine works perfectly here, and we already have tests for it. It would probably make sense to unify them as a follow-up.

IMO would do it in this PR, but follow-up also works.

this is fine as a followup

jreback · 2020-12-29T23:20:11Z

doc/source/whatsnew/v1.3.0.rst

@@ -255,6 +255,7 @@ I/O
 ^^^

 - Bug in :meth:`Index.__repr__` when ``display.max_seq_items=1`` (:issue:`38415`)
+- Bug in :func:`read_csv` not recognizing scientific notation if decimal is set (:issue:`31920`)


this should say engine='python' right?

jreback · 2021-01-01T22:41:18Z

Great, ping on green. (Suggestion: combine the Python/C tests as a follow-up.)

phofl · 2021-01-02T00:15:18Z

@jreback green

jreback · 2021-01-03T16:36:34Z

Thanks @phofl. If you can open a PR or issue for the follow-up, that would be great.

phofl · 2021-01-03T18:19:51Z

Will open an issue. Hopefully I will be able to get back to this in the coming days.

phofl · 2021-01-03T18:47:35Z

Opened #38926

phofl added 3 commits December 12, 2020 01:44

BUG: read_csv not recognizing numbers appropriately when decimal is set

f49d007

Add corner case

cc7dd1b

Add another corner case

189ca80

phofl added 2 commits December 12, 2020 01:54

Run black

567423f

Fix pep8 issue

ba163cf

phofl added the IO CSV read_csv, to_csv label Dec 12, 2020

phofl and others added 3 commits December 12, 2020 02:08

Fix pep8

a5e568b

Merge branch 'master' into 31920

8dc8cb7

Merge branch 'master' into 31920

b98954b

jbrockmendel reviewed Dec 17, 2020

cmulders reviewed Dec 18, 2020

phofl added 3 commits December 19, 2020 19:40

Refactor code

514f45a

Merge branch 'master' of https://github.com/pandas-dev/pandas into 31920

c20767b

Conflicts: doc/source/whatsnew/v1.3.0.rst

Merge branch '31920' of https://github.com/phofl/pandas into 31920

d8d94af

Conflicts: doc/source/whatsnew/v1.3.0.rst

gfyoung reviewed Dec 29, 2020

gfyoung reviewed Dec 29, 2020

phofl and others added 4 commits December 29, 2020 21:58

Update pandas/tests/io/parser/test_python_parser_only.py

a0eced5

Co-authored-by: gfyoung

Merge branch 'master' of https://github.com/pandas-dev/pandas into 31920

efd78d7

Conflicts: doc/source/whatsnew/v1.3.0.rst

Merge branch '31920' of https://github.com/phofl/pandas into 31920

e9d08c4

Fix black bug from autocommit

a611dad

jreback added this to the 1.3 milestone Dec 29, 2020

jreback requested changes Dec 29, 2020

phofl added 2 commits January 1, 2021 23:36

Modify whatsnew

9ec9954

Merge branch 'master' of https://github.com/pandas-dev/pandas into 31920

56e7702

jreback approved these changes Jan 1, 2021

jreback merged commit b337b61 into pandas-dev:master Jan 3, 2021

phofl deleted the 31920 branch January 3, 2021 18:19

phofl mentioned this pull request Jan 3, 2021

CLN: Unify number recognition tests in read_csv for all parsers #38926

Closed

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021

BUG: read_csv not recognizing numbers appropriately when decimal is s…

12ee1db

…et (pandas-dev#38420)
