
BUG: Read CSV on python engine fails when skiprows and chunk size are specified (#55677, #56323) #56250


Merged · 11 commits · Dec 5, 2023
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.2.0.rst
@@ -420,6 +420,7 @@ MultiIndex

I/O
^^^
- Bug in :func:`read_csv` where ``engine="python"`` raised a ``TypeError`` when a callable ``skiprows`` and a ``chunksize`` were specified (:issue:`55677`)
- Bug in :func:`read_csv` where ``on_bad_lines="warn"`` would write to ``stderr`` instead of raising a Python warning; this now yields a :class:`.errors.ParserWarning` (:issue:`54296`)
- Bug in :func:`read_csv` with ``engine="pyarrow"`` where ``usecols`` wasn't working with a csv with no headers (:issue:`54459`)
- Bug in :func:`read_excel`, with ``engine="xlrd"`` (``xls`` files) erroring when file contains NaNs/Infs (:issue:`54564`)
45 changes: 30 additions & 15 deletions pandas/io/parsers/python_parser.py
@@ -1117,18 +1117,33 @@ def _get_lines(self, rows: int | None = None) -> list[list[Scalar]]:
new_rows = []
try:
if rows is not None:
rows_to_skip = 0
if self.skiprows is not None and self.pos is not None:
# Only read additional rows if pos is in skiprows
rows_to_skip = len(
set(self.skiprows) - set(range(self.pos))
)

for _ in range(rows + rows_to_skip):
# assert for mypy, data is Iterator[str] or None, would
# error in next
assert self.data is not None
new_rows.append(next(self.data))
if callable(self.skiprows):
row_index = 0
row_ct = 0
offset = self.pos if self.pos is not None else 0
while row_ct < rows:
# assert for mypy, data is Iterator[str] or None, would
# error in next
assert self.data is not None
new_row = next(self.data)
if not self.skipfunc(offset + row_index):
row_ct += 1
row_index += 1
new_rows.append(new_row)
else:
# Maintain legacy chunking behavior
Member commented:

We aren't planning on removing this branch - it just serves the non-callable case, right? If so, I think this comment is a bit confusing and can just be removed.

@Flytre (Contributor, author) commented on Dec 3, 2023:

There's some weird behavior in this branch, and I'd argue to remove it.

Consider this csv file:

col_a
0
1
2
3
4
5
6
7
8
9

If we read this file in via read_csv:

text_file_reader = pd.read_csv("dummy.csv",
                               engine='$ENGINE',
                               skiprows=[1, 2, 3, 7, 10],
                               chunksize=2)

With the python engine we get the following result:

   col_a
0      3
1      4
   col_a
2      5
3      7
4      8

With c engine we get a different result:

   col_a
0      3
1      4
   col_a
2      5
3      7
   col_a
4      8

With the python engine and skiprows=lambda x: x in [1, 2, 3, 7, 10] (this PR adds this behavior; currently, with these parameters, pandas raises an exception):

   col_a
0      3
1      4
   col_a
2      5
3      7
   col_a
4      8

If we remove the entire else branch and use the logic this PR adds for the 'callable' case, the python engine will match the c engine result in both cases, whether given a list or callable skiprows.

Thoughts? If you'd like to go ahead with the proposed update, should I create a separate issue for it and also link it to this PR?
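The target semantics in the comparison above can be sketched without pandas: the skip predicate sees raw line numbers (header included as line 0), and each chunk groups exactly ``chunksize`` *kept* rows, except possibly the last. A minimal standalone sketch (``chunk_rows`` is a hypothetical helper, not a pandas API):

```python
def chunk_rows(raw_lines, skipfunc, chunksize):
    # C-engine-style chunking: apply the skip predicate to raw line
    # indices (line 0 is the header), then group the surviving data
    # rows into fixed-size chunks; only the last chunk may be short.
    kept = [
        int(line)
        for i, line in enumerate(raw_lines)
        if i > 0 and not skipfunc(i)
    ]
    return [kept[i : i + chunksize] for i in range(0, len(kept), chunksize)]

raw = ["col_a"] + [str(v) for v in range(10)]  # the dummy.csv from above
chunks = chunk_rows(raw, lambda x: x in [1, 2, 3, 7, 10], chunksize=2)
print(chunks)  # [[3, 4], [5, 7], [8]] -- the C-engine grouping
```

This reproduces the c engine result quoted above, which is what the PR makes the python engine match for callable skiprows.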

@WillAyd (Member) commented on Dec 4, 2023:

In that case I would go ahead and do your proposed change now. If it makes the behavior exactly match the c engine, we might as well do the fix all at once.
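Unifying the two branches essentially amounts to normalizing the list form of skiprows into a predicate, which is what pandas' parsers already do when building ``self.skipfunc``. A sketch of that normalization (``make_skipfunc`` is an illustrative name, not the pandas internal):

```python
def make_skipfunc(skiprows):
    # Normalize both accepted forms of skiprows into one predicate so a
    # single chunk-reading loop can serve list and callable inputs alike.
    if callable(skiprows):
        return skiprows
    skipset = set(skiprows or ())
    return lambda i: i in skipset
```

With one predicate, the callable-skiprows loop added in this PR can serve the non-callable case too, removing the legacy over-read branch.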

rows_to_skip = 0
if self.skiprows is not None and self.pos is not None:
# Only read additional rows if pos is in skiprows
rows_to_skip = len(
set(self.skiprows) - set(range(self.pos))
)

for _ in range(rows + rows_to_skip):
# assert for mypy, data is Iterator[str] or None, would
# error in next
assert self.data is not None
new_rows.append(next(self.data))

len_new_rows = len(new_rows)
new_rows = self._remove_skipped_rows(new_rows)
@@ -1137,11 +1152,11 @@ def _get_lines(self, rows: int | None = None) -> list[list[Scalar]]:
rows = 0

while True:
new_row = self._next_iter_line(row_num=self.pos + rows + 1)
next_row = self._next_iter_line(row_num=self.pos + rows + 1)
rows += 1

if new_row is not None:
new_rows.append(new_row)
if next_row is not None:
new_rows.append(next_row)
len_new_rows = len(new_rows)

except StopIteration:
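The loop this hunk adds can be isolated as a small stateful helper. The sketch below filters skipped lines inline for brevity, whereas the real parser appends every raw line and filters afterwards via ``_remove_skipped_rows`` (``read_chunk`` is an illustrative name, not the parser's API):

```python
def read_chunk(data, pos, rows, skipfunc):
    # Mirror of the callable-skiprows loop: only rows the predicate keeps
    # count toward the chunk quota, but every raw line consumed advances
    # the absolute position, so later chunks evaluate skipfunc against
    # the correct line numbers.
    kept, consumed = [], 0
    while len(kept) < rows:
        line = next(data)  # StopIteration propagates, like next(self.data)
        if not skipfunc(pos + consumed):
            kept.append(line)
        consumed += 1
    return kept, pos + consumed

stream = iter(f"row{i}" for i in range(12))
skip = lambda i: i % 3 == 0  # skip absolute lines 0, 3, 6, 9
chunk1, pos = read_chunk(stream, 0, 4, skip)    # keeps row1, row2, row4, row5
chunk2, pos = read_chunk(stream, pos, 4, skip)  # keeps row7, row8, row10, row11
```

Passing the updated ``pos`` back in is the sketch's analogue of the ``offset = self.pos`` bookkeeping above; without it, a second chunk would re-evaluate the predicate from zero and skip the wrong lines.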
28 changes: 28 additions & 0 deletions pandas/tests/io/parser/test_skiprows.py
@@ -301,3 +301,31 @@ def test_skip_rows_and_n_rows(all_parsers):
result = parser.read_csv(StringIO(data), nrows=5, skiprows=[2, 4, 6])
expected = DataFrame({"a": [1, 3, 5, 7, 8], "b": ["a", "c", "e", "g", "h"]})
tm.assert_frame_equal(result, expected)


@xfail_pyarrow
def test_skip_rows_with_chunks(all_parsers):
# GH 55677
data = """col_a
10
20
30
40
50
60
70
80
90
100
"""
parser = all_parsers
reader = parser.read_csv(
        StringIO(data), engine=parser.engine, skiprows=lambda x: x in [1, 4, 5], chunksize=4
)
df1 = next(reader)
df2 = next(reader)

tm.assert_frame_equal(
df1, DataFrame({"col_a": [20, 30, 60, 70]}, index=[0, 1, 2, 3])
    )

Member commented:

nit, but you don't need to specify index here; dropping it will help condense this to one line.
tm.assert_frame_equal(df2, DataFrame({"col_a": [80, 90, 100]}, index=[4, 5, 6]))