-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: read_csv with specified kwargs #21176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @r00ta! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on June 19, 2018 at 03:30 Hours UTC |
Codecov Report
@@ Coverage Diff @@
## master #21176 +/- ##
=======================================
Coverage 91.92% 91.92%
=======================================
Files 153 153
Lines 49594 49594
=======================================
Hits 45590 45590
Misses 4004 4004
Continue to review full report at Codecov.
|
pandas/tests/io/parser/common.py
Outdated
@@ -238,6 +238,17 @@ def test_csv_mixed_type(self): | |||
out = self.read_csv(StringIO(data)) | |||
tm.assert_frame_equal(out, expected) | |||
|
|||
def test_csv_index_col_and_nrows(self): | |||
data = """A,B,C |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment with the GitHub issue number here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I think it needs a more descriptive test name - how about test_read_csv_low_memory_no_index_cols_rows
?
pandas/tests/io/parser/common.py
Outdated
""" | ||
out = self.read_csv(StringIO(data), low_memory=True, index_col=0, | ||
nrows=0) | ||
tm.assert_index_equal(out.columns, pd.Index(['A', 'B', 'C'])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you instead specify an expected DataFrame
and use tm.assert_frame_equal
?
result = self.read_csv(...)
expected = pd.DataFrame(...)
tm.assert_frame_equal(result, expected)
pandas/io/parsers.py
Outdated
@@ -3208,8 +3208,7 @@ def _get_empty_meta(columns, index_col, index_names, dtype=None): | |||
for k, v in compat.iteritems(_dtype): | |||
col = columns[k] if is_integer(k) else k | |||
dtype[col] = v | |||
|
|||
if index_col is None or index_col is False: | |||
if index_col is None or index_col is False or index_names is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to simplify this to just be if not (index_col or index_names)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. index_col
can be a list ( like in the issue #21141 ) and bool([0] or None)
is True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment here on what is going on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I can help, the reason the error happens is because index_names
is None
and it is being iterated over in https://github.com/r00ta/pandas/blob/master/pandas/io/parsers.py#L3215
To prevent this, there's an additional check for it being not None.
@r00ta you could add a comment here saying # also create empty index if index_names is None
@jreback is that clear enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean in the code itself, e.g. why we are checking these things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of a good use case for nrows
to be 0 so alternately could just raise if that is the case instead of all of the changes here (unless someone else does have a use case for that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd The parser already checks that the parameter nrows
is not a negative integer ( https://github.com/pandas-dev/pandas/blob/master/pandas/io/parsers.py#L382 ). It raises ValueError: 'nrows' must be an integer >=0
. For this reason i thought that by contract the case nrows=0
is allowable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cgopalan do you want to submit that as a separate PR? I think that would be preferable to this.
Just glancing at your code / tests make sure you cover the case that nrows
equals 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd sure I can do that. Yes, I will definitely add a test for nrows
equal 0.
pandas/io/parsers.py
Outdated
@@ -3208,8 +3208,7 @@ def _get_empty_meta(columns, index_col, index_names, dtype=None): | |||
for k, v in compat.iteritems(_dtype): | |||
col = columns[k] if is_integer(k) else k | |||
dtype[col] = v | |||
|
|||
if index_col is None or index_col is False: | |||
if index_col is None or index_col is False or index_names is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment here on what is going on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some comments from @mroeschke and need a whatsnew entry
pandas/io/parsers.py
Outdated
@@ -3208,8 +3208,7 @@ def _get_empty_meta(columns, index_col, index_names, dtype=None): | |||
for k, v in compat.iteritems(_dtype): | |||
col = columns[k] if is_integer(k) else k | |||
dtype[col] = v | |||
|
|||
if index_col is None or index_col is False: | |||
if index_col is None or index_col is False or index_names is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean in the code itself, e.g. why we are checking these things
Superseded by #21431. I think given the discussions above, I don't think there is much objection to that... (if there is, please comment, and we can reopen no problem) |
Per #21431 (comment), we're back! Let's see if we can clean this up then. |
Marking this for |
@gfyoung the main difference between the |
The patch should be in |
@gfyoung |
@cgopalan : The handling I think should be independent of |
@gfyoung so the current changes in the PR are fine? |
@cgopalan : Sorry for not responding sooner! Yes, these changes are fine, and we will try to merge this. @jschendel @mroeschke @jreback : All comments that you guys made have been addressed. Waiting for CI to confirm whether it likes what I did. 🙏 |
@jreback : All is green. |
@jreback : Friendly ping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments.
@@ -238,6 +238,21 @@ def test_csv_mixed_type(self): | |||
out = self.read_csv(StringIO(data)) | |||
tm.assert_frame_equal(out, expected) | |||
|
|||
def test_read_csv_low_memory_no_rows_with_index(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a class that you can put the will specifically only have low_memory set, just search for self.low_memory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, there is, but I deliberately allowed this test to run for both the C and Python engines. As the Python engine doesn't support low_memory
, I need to do this.
doc/source/whatsnew/v0.23.2.txt
Outdated
@@ -59,6 +59,7 @@ Bug Fixes | |||
|
|||
**I/O** | |||
|
|||
- Bug in :func:`read_csv` when ``nrows=0``, ``low_memory=True``, and ``index_col`` was not ``None`` (:issue:`21141`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a bit about that has changed from a user POV
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing. Done.
* nrows = 0 * low_memory=True * index_col != None Closes pandas-devgh-21141
git diff upstream/master -u -- "*.py" | flake8 --diff
Solves the issue 21141.