BUG: read_csv with specified kwargs #21176

r00ta · 2018-05-22T20:36:04Z

[+] closes read_csv errors when low_memory=True, index_col is not None, and nrows=0 #21141
[+] tests added / passed
[+] passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Resolves issue #21141.

pep8speaks · 2018-05-22T20:36:07Z

Hello @r00ta! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 19, 2018 at 03:30 Hours UTC

codecov · 2018-05-22T21:38:01Z

Codecov Report

Merging #21176 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #21176   +/-   ##
=======================================
  Coverage   91.92%   91.92%           
=======================================
  Files         153      153           
  Lines       49594    49594           
=======================================
  Hits        45590    45590           
  Misses       4004     4004

Flag         Coverage Δ
#multiple    90.32% <100%> (ø)    ⬆️
#single      41.82% <0%> (ø)      ⬆️

Impacted Files          Coverage Δ
pandas/io/parsers.py    95.46% <100%> (ø)    ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6131a59...eedd45e.

jschendel · 2018-05-23T04:57:56Z

pandas/tests/io/parser/common.py

@@ -238,6 +238,17 @@ def test_csv_mixed_type(self):
        out = self.read_csv(StringIO(data))
        tm.assert_frame_equal(out, expected)

+    def test_csv_index_col_and_nrows(self):
+        data = """A,B,C


Can you add a comment with the GitHub issue number here?

Also, I think it needs a more descriptive test name - how about test_read_csv_low_memory_no_index_cols_rows?

mroeschke · 2018-05-23T05:09:51Z

pandas/tests/io/parser/common.py

+"""
+        out = self.read_csv(StringIO(data), low_memory=True, index_col=0,
+                            nrows=0)
+        tm.assert_index_equal(out.columns, pd.Index(['A', 'B', 'C']))


Could you instead specify an expected DataFrame and use tm.assert_frame_equal?

    result = self.read_csv(...)
    expected = pd.DataFrame(...)
    tm.assert_frame_equal(result, expected)

WillAyd · 2018-05-24T08:00:06Z

pandas/io/parsers.py

@@ -3208,8 +3208,7 @@ def _get_empty_meta(columns, index_col, index_names, dtype=None):
        for k, v in compat.iteritems(_dtype):
            col = columns[k] if is_integer(k) else k
            dtype[col] = v
-
-    if index_col is None or index_col is False:
+    if index_col is None or index_col is False or index_names is None:


Is it possible to simplify this to just be if not (index_col or index_names)?

Nope. index_col can be a list (like in issue #21141), and bool([0] or None) is True.
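
The truthiness point in the reply above can be checked in plain Python. This is a minimal sketch, assuming index_col=[0] and index_names=None as in the failing call; both helper names are hypothetical and are not part of pandas.

```python
def simplified_check(index_col, index_names):
    # The proposed shorthand: true only when both values are falsy.
    return not (index_col or index_names)


def explicit_check(index_col, index_names):
    # The condition from the diff under review.
    return index_col is None or index_col is False or index_names is None


# Assumed values for the failing call: index_col=[0], index_names=None.
print(simplified_check([0], None))  # False, because [0] is truthy
print(explicit_check([0], None))    # True, because index_names is None
```

With the shorthand, the failing case would skip the "no index" branch, which is the branch the fix needs it to take.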

can you add a comment here on what is going on

If I can help: the error happens because index_names is None and is being iterated over in https://github.com/r00ta/pandas/blob/master/pandas/io/parsers.py#L3215
To prevent this, there's an additional check for it being not None.
@r00ta you could add a comment here saying # also create empty index if index_names is None
@jreback is that clear enough?
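
The failure mode described above (iterating over None) and the effect of the extra check can be sketched without pandas. The function below is a hypothetical, heavily simplified stand-in for the branch in question; it only returns index names and does not build any pandas objects.

```python
def empty_index_names(index_col, index_names):
    """Hypothetical stand-in for the guarded branch; not pandas code."""
    # Also create an empty index if index_names is None: there are no
    # names to iterate over, so take the "no index" path.
    if index_col is None or index_col is False or index_names is None:
        return []
    # Without the guard, this loop raises TypeError when index_names is None.
    return [name for name in index_names]


try:
    [name for name in None]
except TypeError as exc:
    print("without the guard:", exc)

print(empty_index_names([0], None))   # []
print(empty_index_names([0], ["A"]))  # ['A']
```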

I mean in the code itself, e.g. why we are checking these things

I can't think of a good use case for nrows to be 0 so alternately could just raise if that is the case instead of all of the changes here (unless someone else does have a use case for that)

@WillAyd The parser already checks that the parameter nrows is not a negative integer (https://github.com/pandas-dev/pandas/blob/master/pandas/io/parsers.py#L382). It raises ValueError: 'nrows' must be an integer >=0. For this reason I thought that, by contract, the case nrows=0 is allowable.

@WillAyd @r00ta The solution where we validate nrows to be >= 1 looks good to me. This will take care of this weird corner case.

The changes for that can be made here to pass in a min_val argument of 1.
Also the tests need to be changed here and here
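
The alternative proposed here (rejecting nrows=0 by raising the minimum allowed value) could look roughly like the sketch below. The function name and body are assumptions made for illustration, not the pandas implementation; only the min_val argument and the error message format come from the discussion above.

```python
def validate_integer(name, val, min_val=0):
    """Hypothetical validator sketch; not the pandas implementation."""
    msg = "'{0}' must be an integer >={1}".format(name, min_val)
    if val is None:
        return None
    # Reject bools and non-integers; bool is a subclass of int in Python.
    if isinstance(val, bool) or not isinstance(val, int):
        raise ValueError(msg)
    if val < min_val:
        raise ValueError(msg)
    return val


print(validate_integer("nrows", 0))  # accepted with the default min_val=0

try:
    validate_integer("nrows", 0, min_val=1)
except ValueError as exc:
    print(exc)  # 'nrows' must be an integer >=1
```

The trade-off is that min_val=1 turns a currently valid call into an error, which is why it was suggested as a separate PR.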

@cgopalan do you want to submit that as a separate PR? I think that would be preferable to this.

Just glancing at your code / tests: make sure you cover the case where nrows equals 0.

@WillAyd sure I can do that. Yes, I will definitely add a test for nrows equal 0.

jreback · 2018-05-29T10:45:46Z

pandas/io/parsers.py

@@ -3208,8 +3208,7 @@ def _get_empty_meta(columns, index_col, index_names, dtype=None):
        for k, v in compat.iteritems(_dtype):
            col = columns[k] if is_integer(k) else k
            dtype[col] = v
-
-    if index_col is None or index_col is False:
+    if index_col is None or index_col is False or index_names is None:


can you add a comment here on what is going on

jreback

some comments from @mroeschke and need a whatsnew entry

jreback · 2018-06-01T00:15:49Z

pandas/io/parsers.py

@@ -3208,8 +3208,7 @@ def _get_empty_meta(columns, index_col, index_names, dtype=None):
        for k, v in compat.iteritems(_dtype):
            col = columns[k] if is_integer(k) else k
            dtype[col] = v
-
-    if index_col is None or index_col is False:
+    if index_col is None or index_col is False or index_names is None:


I mean in the code itself, e.g. why we are checking these things

gfyoung · 2018-06-11T22:44:56Z

Superseded by #21431. Given the discussions above, I don't think there is much objection to that...

(if there is, please comment, and we can reopen no problem)

gfyoung · 2018-06-12T23:03:07Z

Per #21431 (comment), we're back!

Let's see if we can clean this up then.

gfyoung · 2018-06-13T03:58:11Z

Marking this for 0.23.2 because if this still isn't merged by then, I can help bring it to the finish line.

cgopalan · 2018-06-13T13:55:13Z

@gfyoung the main difference between the low_memory=True and low_memory=False case seems to be that in the True case, the read method here raises an exception, so it goes into _get_empty_meta. Would it be good to check for the combination of args before that read method is called? Or would it be somewhere much earlier?
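
The control flow described in this comment can be sketched as a read that falls back to an "empty result" helper when the reader produces nothing. The sketch assumes the exception signals an exhausted reader (StopIteration is used here as an assumption), and both names are hypothetical.

```python
def read_or_empty(row_iter, on_empty):
    """Return all rows, or call on_empty() if the reader yields nothing."""
    try:
        first = next(row_iter)
    except StopIteration:
        # Nothing was read: defer to the helper that builds empty metadata.
        return on_empty()
    return [first] + list(row_iter)


print(read_or_empty(iter([]), lambda: "empty metadata"))
print(read_or_empty(iter([[1, 2], [3, 4]]), lambda: "empty metadata"))
```

In this shape the fallback helper only runs when no rows were read, so it does not need to know the requested row count.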

gfyoung · 2018-06-13T17:41:13Z

The patch should be in _get_empty_meta.

cgopalan · 2018-06-13T17:48:42Z

@gfyoung _get_empty_meta does not have any knowledge of nrows. How exactly do you propose to patch it up there?

cgopalan · 2018-06-15T19:00:26Z

@gfyoung @r00ta one possible solution is to add an nrows parameter to _get_empty_meta and then check nrows=0 instead of checking index_names in the changed if clause. Would that be an acceptable solution?

gfyoung · 2018-06-15T20:04:19Z

@cgopalan : The handling I think should be independent of nrows as it is in the current PR.

cgopalan · 2018-06-15T20:16:26Z

@gfyoung so the current changes in the PR are fine?

gfyoung · 2018-06-16T08:25:46Z

@cgopalan : Sorry for not responding sooner! Yes, these changes are fine, and we will try to merge this.

@jschendel @mroeschke @jreback : All comments that you guys made have been addressed. Waiting for CI to confirm whether it likes what I did. 🙏

gfyoung · 2018-06-16T19:34:08Z

@jreback : All is green.

gfyoung · 2018-06-18T23:53:06Z

@jreback : Friendly ping.

jreback

minor comments.

jreback · 2018-06-19T01:27:34Z

pandas/tests/io/parser/common.py

@@ -238,6 +238,21 @@ def test_csv_mixed_type(self):
        out = self.read_csv(StringIO(data))
        tm.assert_frame_equal(out, expected)

+    def test_read_csv_low_memory_no_rows_with_index(self):


I think there is a class you can put this in that will specifically only have low_memory set; just search for self.low_memory

Indeed, there is, but I deliberately allowed this test to run for both the C and Python engines. As the Python engine doesn't support low_memory, I need to do this.

jreback · 2018-06-19T01:27:57Z

doc/source/whatsnew/v0.23.2.txt

@@ -59,6 +59,7 @@ Bug Fixes

 **I/O**

+- Bug in :func:`read_csv` when ``nrows=0``, ``low_memory=True``, and ``index_col`` was not ``None`` (:issue:`21141`)


can you add a bit about what has changed from a user POV

Sure thing. Done.

jreback · 2018-06-19T11:26:54Z

thanks @r00ta @gfyoung

jschendel added Bug IO CSV read_csv, to_csv labels May 23, 2018

jschendel reviewed May 23, 2018

mroeschke reviewed May 23, 2018

jreback changed the title ~~Issue 21141~~ BUG: read_csv with specified kwargs May 23, 2018

WillAyd reviewed May 24, 2018

jreback requested changes May 29, 2018

jreback requested changes Jun 1, 2018

jorisvandenbossche added this to the 0.24.0 milestone Jun 6, 2018

gfyoung mentioned this pull request Jun 11, 2018

read_csv errors when low_memory=True, index_col is not None, and nrows=0 #21141

Closed

gfyoung closed this Jun 11, 2018

gfyoung removed this from the 0.24.0 milestone Jun 11, 2018

gfyoung mentioned this pull request Jun 12, 2018

BUG: Nrows cannot be zero for read_csv. Fixes #21141 #21431

Closed

4 tasks

gfyoung reopened this Jun 12, 2018

gfyoung added this to the 0.23.2 milestone Jun 13, 2018

gfyoung force-pushed the issue_21141 branch from 1d680be to a99a122 Compare June 16, 2018 08:24

gfyoung self-assigned this Jun 16, 2018

gfyoung removed their assignment Jun 16, 2018

jreback requested changes Jun 19, 2018

jreback added the Needs Backport label Jun 19, 2018

BUG: Handle read_csv corner case

eedd45e

* nrows = 0
* low_memory=True
* index_col != None

Closes gh-21141

gfyoung force-pushed the issue_21141 branch from a99a122 to eedd45e Compare June 19, 2018 03:30

jreback approved these changes Jun 19, 2018

jreback merged commit c2da06c into pandas-dev:master Jun 19, 2018

jorisvandenbossche removed the Needs Backport label Jun 29, 2018

jorisvandenbossche pushed a commit that referenced this pull request Jun 29, 2018

BUG: Handle read_csv corner case (#21176)

930617c

Closes gh-21141 (cherry picked from commit c2da06c)

jorisvandenbossche pushed a commit that referenced this pull request Jul 2, 2018

BUG: Handle read_csv corner case (#21176)

030a058

Closes gh-21141 (cherry picked from commit c2da06c)

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

BUG: Handle read_csv corner case (pandas-dev#21176)

20a430a

Closes pandas-devgh-21141
