Skip to content

BUG: read_csv with specified kwargs #21176

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 19, 2018
Merged

Conversation

r00ta
Copy link
Contributor

@r00ta r00ta commented May 22, 2018

Solves the issue 21141.

@pep8speaks
Copy link

pep8speaks commented May 22, 2018

Hello @r00ta! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 19, 2018 at 03:30 Hours UTC

@codecov
Copy link

codecov bot commented May 22, 2018

Codecov Report

Merging #21176 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21176   +/-   ##
=======================================
  Coverage   91.92%   91.92%           
=======================================
  Files         153      153           
  Lines       49594    49594           
=======================================
  Hits        45590    45590           
  Misses       4004     4004
Flag Coverage Δ
#multiple 90.32% <100%> (ø) ⬆️
#single 41.82% <0%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.46% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6131a59...eedd45e. Read the comment docs.

@jschendel jschendel added Bug IO CSV read_csv, to_csv labels May 23, 2018
@@ -238,6 +238,17 @@ def test_csv_mixed_type(self):
out = self.read_csv(StringIO(data))
tm.assert_frame_equal(out, expected)

def test_csv_index_col_and_nrows(self):
data = """A,B,C
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment with the GitHub issue number here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think it needs a more descriptive test name - how about test_read_csv_low_memory_no_index_cols_rows?

"""
out = self.read_csv(StringIO(data), low_memory=True, index_col=0,
nrows=0)
tm.assert_index_equal(out.columns, pd.Index(['A', 'B', 'C']))
Copy link
Member

@mroeschke mroeschke May 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you instead specify an expected DataFrame and use tm.assert_frame_equal?

result = self.read_csv(...)
expected = pd.DataFrame(...)
tm.assert_frame_equal(result, expected)

@jreback jreback changed the title Issue 21141 BUG: read_csv with specified kwargs May 23, 2018
@@ -3208,8 +3208,7 @@ def _get_empty_meta(columns, index_col, index_names, dtype=None):
for k, v in compat.iteritems(_dtype):
col = columns[k] if is_integer(k) else k
dtype[col] = v

if index_col is None or index_col is False:
if index_col is None or index_col is False or index_names is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to simplify this to just be if not (index_col or index_names)?

Copy link
Contributor Author

@r00ta r00ta May 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. index_colcan be a list ( like in the issue #21141 ) and bool([0] or None) is True

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here on what is going on

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I can help, the reason the error happens is because index_names is None and it is being iterated over in https://github.com/r00ta/pandas/blob/master/pandas/io/parsers.py#L3215
To prevent this, there's an additional check for it being not None.
@r00ta you could add a comment here saying # also create empty index if index_names is None
@jreback is that clear enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean in the code itself, e.g. why we are checking these things

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of a good use case for nrows to be 0 so alternately could just raise if that is the case instead of all of the changes here (unless someone else does have a use case for that)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd The parser already checks that the parameter nrowsis not a negative integer ( https://github.com/pandas-dev/pandas/blob/master/pandas/io/parsers.py#L382 ). It raises ValueError: 'nrows' must be an integer >=0. For this reason i thought that by contract the case nrows=0 is allowable

Copy link

@cgopalan cgopalan Jun 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd @r00ta The solution where we validate nrows to be >= 1 looks good to me. This will take care of this weird corner case.

The changes for that can be made here to pass in a min_val argument of 1.
Also the tests need to be changed here and here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cgopalan do you want to submit that as a separate PR? I think that would be preferable to this.

Just glancing at your code / tests make sure you cover the case that nrows equals 0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd sure I can do that. Yes, I will definitely add a test for nrows equal 0.

@@ -3208,8 +3208,7 @@ def _get_empty_meta(columns, index_col, index_names, dtype=None):
for k, v in compat.iteritems(_dtype):
col = columns[k] if is_integer(k) else k
dtype[col] = v

if index_col is None or index_col is False:
if index_col is None or index_col is False or index_names is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here on what is going on

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some comments from @mroeschke and need a whatsnew entry

@@ -3208,8 +3208,7 @@ def _get_empty_meta(columns, index_col, index_names, dtype=None):
for k, v in compat.iteritems(_dtype):
col = columns[k] if is_integer(k) else k
dtype[col] = v

if index_col is None or index_col is False:
if index_col is None or index_col is False or index_names is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean in the code itself, e.g. why we are checking these things

@gfyoung
Copy link
Member

gfyoung commented Jun 11, 2018

Superseded by #21431. I think given the discussions above, I don't think there is much objection to that...

(if there is, please comment, and we can reopen no problem)

@gfyoung gfyoung closed this Jun 11, 2018
@gfyoung gfyoung removed this from the 0.24.0 milestone Jun 11, 2018
@gfyoung
Copy link
Member

gfyoung commented Jun 12, 2018

Per #21431 (comment), we're back!

Let's see if we can clean this up then.

@gfyoung gfyoung reopened this Jun 12, 2018
@gfyoung gfyoung added this to the 0.23.2 milestone Jun 13, 2018
@gfyoung
Copy link
Member

gfyoung commented Jun 13, 2018

Marking this for 0.23.2 because if this still isn't merged by then, I can help bring it to the finish line.

@cgopalan
Copy link

@gfyoung the main difference between the low_memory=True and low_memory=False case seems to be that in the True case, the read method here raises an exception, so it goes into _get_empty_meta. Would it be good to check for the combination of args before that read method is called? Or would it be somewhere much earlier?

@gfyoung
Copy link
Member

gfyoung commented Jun 13, 2018

The patch should be in _get_empty_meta.

@cgopalan
Copy link

cgopalan commented Jun 13, 2018

@gfyoung _get_empty_meta does not have any knowledge of n_rows. How exactly do you propose to patch it up there?

@cgopalan
Copy link

@gfyoung @r00ta one possible solution is to add nrows parameter to _get_empty_meta and then check nrows=0 instead of checking index_names in the changed if clause. Would that be an acceptable solution?

@gfyoung
Copy link
Member

gfyoung commented Jun 15, 2018

@cgopalan : The handling I think should be independent of nrows as it is in the current PR.

@cgopalan
Copy link

@gfyoung so the current changes in the PR are fine?

@gfyoung
Copy link
Member

gfyoung commented Jun 16, 2018

@cgopalan : Sorry for not responding sooner! Yes, these changes are fine, and we will try to merge this.

@jschendel @mroeschke @jreback : All comments that you guys made have been addressed. Waiting for CI to confirm whether it likes what I did. 🙏

@gfyoung
Copy link
Member

gfyoung commented Jun 16, 2018

@jreback : All is green.

@gfyoung gfyoung self-assigned this Jun 16, 2018
@gfyoung gfyoung removed their assignment Jun 16, 2018
@gfyoung
Copy link
Member

gfyoung commented Jun 18, 2018

@jreback : Friendly ping.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments.

@@ -238,6 +238,21 @@ def test_csv_mixed_type(self):
out = self.read_csv(StringIO(data))
tm.assert_frame_equal(out, expected)

def test_read_csv_low_memory_no_rows_with_index(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a class that you can put the will specifically only have low_memory set, just search for self.low_memory

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, there is, but I deliberately allowed this test to run for both the C and Python engines. As the Python engine doesn't support low_memory, I need to do this.

@@ -59,6 +59,7 @@ Bug Fixes

**I/O**

- Bug in :func:`read_csv` when ``nrows=0``, ``low_memory=True``, and ``index_col`` was not ``None`` (:issue:`21141`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a bit about that has changed from a user POV

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing. Done.

* nrows = 0
* low_memory=True
* index_col != None

Closes pandas-devgh-21141
@jreback jreback merged commit c2da06c into pandas-dev:master Jun 19, 2018
@jreback
Copy link
Contributor

jreback commented Jun 19, 2018

thanks @r00ta @gfyoung

jorisvandenbossche pushed a commit that referenced this pull request Jun 29, 2018
jorisvandenbossche pushed a commit that referenced this pull request Jul 2, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_csv errors when low_memory=True, index_col is not None, and nrows=0
9 participants