Skip to content

BUG: Nrows cannot be zero for read_csv. Fixes #21141 #21431

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

cgopalan
Copy link

@cgopalan cgopalan commented Jun 11, 2018

@gfyoung
Copy link
Member

gfyoung commented Jun 11, 2018

@jreback : Do we need to deprecate nrow=0 before disallowing? It's a (small, yet annoying) corner case, which is why I ask about it.

@cgopalan : This will be important, since it determines the type of whatsnew entry you're going to add.

@gfyoung gfyoung added this to the 0.24.0 milestone Jun 11, 2018
@gfyoung gfyoung added the IO CSV read_csv, to_csv label Jun 11, 2018
@@ -995,11 +995,18 @@ def test_read_excel_nrows_greater_than_nrows_in_file(self, ext):

def test_read_excel_nrows_non_integer_parameter(self, ext):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be better to parametrize this test for cases like [-1, 0, '1'] instead of having separate tests

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd sure, I will make these changes as soon as we decide if we are keeping this PR :)

@@ -375,6 +375,9 @@ def test_read_nrows(self):
with tm.assert_raises_regex(ValueError, msg):
self.read_csv(StringIO(self.data1), nrows=-1)

with tm.assert_raises_regex(ValueError, msg):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you reference the issue number in a comment above this line.

@gfyoung gfyoung removed this from the 0.24.0 milestone Jun 11, 2018
@jschendel
Copy link
Member

I'm not sure about disallowing nrows=0. It's generally useful when you need to dynamically read a subset of columns, where you do a two pass procedure along the lines of:

  1. Read in the columns only, e.g. all_cols = pd.read_csv('file.csv', nrows=0).columns
  2. Dynamically filter the columns, e.g. good_cols = [c for c in all_cols if c.startswith('foo')]
  3. Read in only the desired columns, e.g. df = pd.read_csv('file.csv', usecols=good_cols)

I doubt you lose much performance by using nrows=1 instead, so it's not the end of the world if this does get removed. However, there are some established StackOverflow answers with nrows=0 similar to my procedure above, so I imagine there are some users that have nrows=0 in their codebase. If this is going to be removed, it definitely seems like it should be deprecated first.

@gfyoung
Copy link
Member

gfyoung commented Jun 12, 2018

@jschendel : Good to know. In the case of your example, couldn't you just read the first line of the file using the builtin open ? Don't really need pandas for that. 🤷‍♂️

@jschendel
Copy link
Member

Yeah, I'm not saying that it's the right solution, just that it's a solution people are likely using. Certainly more concise than using open if you method chain to do the filtering. I wouldn't complain if this was removed, but doing so without deprecating seems likely to break things for users.

@gfyoung
Copy link
Member

gfyoung commented Jun 12, 2018

I wouldn't complain if this was removed, but doing so without deprecating seems likely to break things for users.

@jschendel : Nope, that's totally fair. Let's deprecate then.

@cgopalan : Can you update your PR to deprecate nrows=0 ?

@gfyoung gfyoung added the Deprecate Functionality to remove in pandas label Jun 12, 2018
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nrows=0 is valid
the combinations of options is invalid

pr needs to limit scope

@gfyoung
Copy link
Member

gfyoung commented Jun 12, 2018

nrows=0 is valid
the combinations of options is invalid

@jreback : nrows=0 has limited use and was a clean way to patch the corner case (though it seems like @jschendel has a case to keep it).

In any case, should we just disallow that combination of inputs then? If nrows=0 is considered valid, I'm a little uncertain as to why we would consider the combination invalid (it shouldn't depend on what the value of low_memory is IMO).

@codecov
Copy link

codecov bot commented Jun 12, 2018

Codecov Report

Merging #21431 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21431   +/-   ##
=======================================
  Coverage   91.89%   91.89%           
=======================================
  Files         153      153           
  Lines       49596    49596           
=======================================
  Hits        45576    45576           
  Misses       4020     4020
Flag Coverage Δ
#multiple 90.29% <100%> (ø) ⬆️
#single 41.86% <100%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.46% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4807905...947168b. Read the comment docs.

@cgopalan
Copy link
Author

@jreback @gfyoung i too agree that just the value of low_memory should not render the combination of values invalid. Because of this corner case, this is actually exhibiting weird behavior on an unrelated set of conditions. Better to suppress this corner case by removing/deprecating nrows=0.
Is there a way to deprecate just a value for an arg? I would assume we want to keep nrows as an argument.

@jreback
Copy link
Contributor

jreback commented Jun 12, 2018

@jreback @gfyoung i too agree that just the value of low_memory should not render the combination of values invalid. Because of this corner case, this is actually exhibiting weird behavior on an unrelated set of conditions. Better to suppress this corner case by removing/deprecating nrows=0.
Is there a way to deprecate just a value for an arg? I would assume we want to keep nrows as an argument.

no reason to remove or even deprecate nrows=0, you simply need to addees for this particular combination of options

@cgopalan
Copy link
Author

cgopalan commented Jun 12, 2018

@jreback so read_csv(StringIO(data), low_memory=False, index_col=0, nrows=0) is allowed but read_csv(StringIO(data), low_memory=True, index_col=0, nrows=0) is not?

@gfyoung
Copy link
Member

gfyoung commented Jun 12, 2018

no reason to remove or even deprecate nrows=0, you simply need to addees for this particular combination of options

@jreback : What you're saying sounds like we should revisit #21176?

@jreback
Copy link
Contributor

jreback commented Jun 12, 2018

yeah i thought that was fine

@gfyoung
Copy link
Member

gfyoung commented Jun 12, 2018

@jreback : 🤦‍♂️ well, I guess we're heading back there then. 😄

@gfyoung gfyoung closed this Jun 12, 2018
@cgopalan
Copy link
Author

@gfyoung should i delete this branch?

@gfyoung
Copy link
Member

gfyoung commented Jun 13, 2018

@cgopalan : Sure thing.

@cgopalan cgopalan deleted the read_csv_nrows_arg branch June 13, 2018 00:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Deprecate Functionality to remove in pandas IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_csv errors when low_memory=True, index_col is not None, and nrows=0
5 participants