BUG: read_csv not recognizing numbers appropriately when decimal is set #38420

Merged: 17 commits, Jan 3, 2021
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.3.0.rst
@@ -259,6 +259,7 @@ I/O
^^^

- Bug in :meth:`Index.__repr__` when ``display.max_seq_items=1`` (:issue:`38415`)
- Bug in :func:`read_csv` not recognizing scientific notation if decimal is set for ``engine="python"`` (:issue:`31920`)
- Bug in :func:`read_csv` interpreting ``NA`` value as comment, when ``NA`` does contain the comment string fixed for ``engine="python"`` (:issue:`34002`)
- Bug in :func:`read_csv` raising ``IndexError`` with multiple header columns and ``index_col`` specified when file has no data rows (:issue:`38292`)
- Bug in :func:`read_csv` not accepting ``usecols`` with different length than ``names`` for ``engine="python"`` (:issue:`16469`)
12 changes: 9 additions & 3 deletions pandas/io/parsers.py
@@ -2344,10 +2344,16 @@ def __init__(self, f: Union[FilePathOrBuffer, List], **kwds):
         if len(self.decimal) != 1:
             raise ValueError("Only length-1 decimal markers supported")

+        decimal = re.escape(self.decimal)
         if self.thousands is None:
-            self.nonnum = re.compile(fr"[^-^0-9^{self.decimal}]+")
+            regex = fr"^\-?[0-9]*({decimal}[0-9]*)?([0-9](E|e)\-?[0-9]*)?$"
         else:
-            self.nonnum = re.compile(fr"[^-^0-9^{self.thousands}^{self.decimal}]+")
+            thousands = re.escape(self.thousands)
+            regex = (
+                fr"^\-?([0-9]+{thousands}|[0-9])*({decimal}[0-9]*)?"
+                fr"([0-9](E|e)\-?[0-9]*)?$"
+            )
+        self.num = re.compile(regex)

     def _set_no_thousands_columns(self):
         # Create a set of column ids that are not to be stripped of thousands
@@ -3039,7 +3045,7 @@ def _search_replace_num_columns(self, lines, search, replace):
                 not isinstance(x, str)
                 or search not in x
                 or (self._no_thousands_columns and i in self._no_thousands_columns)
-                or self.nonnum.search(x.strip())
+                or not self.num.search(x.strip())
             ):
                 rl.append(x)
             else:
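The change above replaces the old reject-on-any-non-numeric-character check (`self.nonnum`) with a full-match pattern (`self.num`), so exponent markers like `e`/`E` are accepted when a custom `decimal` is set. A standalone sketch of the new pattern, with the construction lifted from the diff (the surrounding parser class is omitted, and `build_num_regex` is just an illustrative wrapper, not a pandas function):

```python
import re


def build_num_regex(decimal_sep, thousands_sep=None):
    # Mirrors the pattern built in __init__ above: optional sign,
    # integer part (with optional thousands groups), optional decimal
    # part, optional exponent.
    decimal = re.escape(decimal_sep)
    if thousands_sep is None:
        regex = fr"^\-?[0-9]*({decimal}[0-9]*)?([0-9](E|e)\-?[0-9]*)?$"
    else:
        thousands = re.escape(thousands_sep)
        regex = (
            fr"^\-?([0-9]+{thousands}|[0-9])*({decimal}[0-9]*)?"
            fr"([0-9](E|e)\-?[0-9]*)?$"
        )
    return re.compile(regex)


num = build_num_regex(",", ".")
print(bool(num.search("1.234,5e-1")))  # True: thousands + decimal + exponent
print(bool(num.search("1,2,2")))       # False: two decimal separators
```

Because the pattern is anchored with `^...$` and must match the whole field, malformed tokens fall through unchanged, which is what the new tests below rely on.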
46 changes: 46 additions & 0 deletions pandas/tests/io/parser/test_python_parser_only.py
@@ -305,3 +305,49 @@ def test_malformed_skipfooter(python_parser_only):
     msg = "Expected 3 fields in line 4, saw 5"
     with pytest.raises(ParserError, match=msg):
         parser.read_csv(StringIO(data), header=1, comment="#", skipfooter=1)


@pytest.mark.parametrize("thousands", [None, "."])
@pytest.mark.parametrize(
    "value, result_value",
    [
        ("1,2", 1.2),
        ("1,2e-1", 0.12),
        ("1,2E-1", 0.12),
        ("1,2e-10", 0.0000000012),
        ("1,2e1", 12.0),
        ("1,2E1", 12.0),
        ("-1,2e-1", -0.12),
        ("0,2", 0.2),
        (",2", 0.2),
    ],
)
def test_decimal_and_exponential(python_parser_only, thousands, value, result_value):
    # GH#31920
    data = StringIO(
        f"""a b
1,1 {value}
"""
    )
    result = python_parser_only.read_csv(
        data, "\t", decimal=",", engine="python", thousands=thousands
    )
    expected = DataFrame({"a": [1.1], "b": [result_value]})
    tm.assert_frame_equal(result, expected)

Review comment (Member), suggested change:
-        data, "\t", decimal=",", engine="python", thousands=thousands
+        data, "\t", decimal=",", thousands=thousands

Member Author: Sorry, wrong commit button above. The C engine works perfectly here and already has tests for this. It would probably make sense to unify the tests as a follow-up.

Member: IMO I would do it in this PR, but a follow-up also works.

Contributor: This is fine as a follow-up.
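For reference, the user-visible behaviour this test pins down can be reproduced directly (assuming a pandas build that includes this fix):

```python
import io

import pandas as pd

# With the fix, the python engine parses "1,2e-1" as 0.12 when decimal=","
# is set; previously the value was left as the string "1,2e-1".
csv = "a\tb\n1,1\t1,2e-1\n"
df = pd.read_csv(io.StringIO(csv), sep="\t", decimal=",", engine="python")
print(df["b"].iloc[0])  # 0.12
```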


@pytest.mark.parametrize("thousands", [None, "."])
@pytest.mark.parametrize(
    "value",
    ["e11,2", "1e11,2", "1,2,2", "1,2.1", "1,2e-10e1", "--1,2", "1a.2,1", "1..2,3"],
)
def test_decimal_and_exponential_erroneous(python_parser_only, thousands, value):
    # GH#31920
    data = StringIO(
        f"""a b
1,1 {value}
"""
    )
    result = python_parser_only.read_csv(data, "\t", decimal=",", thousands=thousands)
    expected = DataFrame({"a": [1.1], "b": [value]})
    tm.assert_frame_equal(result, expected)
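The erroneous cases above are left untouched rather than misparsed, so the column stays object dtype. A quick check of that behaviour (again assuming a pandas build with the fix):

```python
import io

import pandas as pd

# A malformed value such as "1,2,2" does not full-match the numeric
# pattern, so it is not coerced and the column remains object dtype.
csv = "a\tb\n1,1\t1,2,2\n"
df = pd.read_csv(io.StringIO(csv), sep="\t", decimal=",", engine="python")
print(df["b"].iloc[0])  # the unchanged string "1,2,2"
print(df["b"].dtype)    # object
```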