Skip to content

BUG: Respect usecols even with empty data #12506

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Mar 2, 2016

closes #12493

in which the usecols argument was not being respected for empty data. This is because no filtering was applied when the first (and only) chunk was being read.

@gfyoung
Copy link
Member Author

gfyoung commented Mar 2, 2016

Tests are passing. Should be good to merge.


def check_usecols(stringIO):
df = read_csv(stringIO, names=names, usecols=usecols)
assert_array_equal(df.columns, usecols)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NEVER use numpy.testing imports!!!!!

ALWAYS use pandas testing functions. These are much more comprehensive and fully test all behavior. It is also confusing to a reader.

Fully test vs an expected frame and use assert_frame_equal

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I got it with the first exclamation point. ;) - fixed.

@jreback jreback added Bug IO CSV read_csv, to_csv labels Mar 2, 2016
@gfyoung gfyoung force-pushed the empty_usecols branch 6 times, most recently from e58ddd0 to f25f3c6 Compare March 3, 2016 22:17
@gfyoung
Copy link
Member Author

gfyoung commented Mar 3, 2016

If some could cancel this build #18023 (it's an old build), that would be great. Thanks!

@gfyoung gfyoung force-pushed the empty_usecols branch 2 times, most recently from 4fb3b0e to 84d7cb3 Compare March 6, 2016 08:15
@gfyoung
Copy link
Member Author

gfyoung commented Mar 6, 2016

@jreback : Tests are passing once more. Should be good to merge AFAICT.

while self.line_pos <= hr:
line = self._next_line()

except StopIteration:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you adding all of this code and making this very complicated?

the idea is to make things simpler. pls see if you can find a general soln.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How complicated is this code? It's essentially boilerplate and logic from the C engine code. I'm not writing anything new here. I don't understand why people would have a hard time understanding this code but yet the C engine code is considered comprehensible.

@jreback
Copy link
Contributor

jreback commented Mar 6, 2016

@gfyoung you had to change a whole bunch of things to get your code to work (e.g. changing long established behavior in excel/html parsing), the StopIteration can now also be a ValueError. That is just crazy. My point is to make minimal changes that get the job done, use general methods where possible, and don't duplicate code.

@gfyoung gfyoung force-pushed the empty_usecols branch 4 times, most recently from 3304d2e to 5addc69 Compare March 8, 2016 06:27
@gfyoung
Copy link
Member Author

gfyoung commented Mar 8, 2016

@jreback : Simplified things a little by changing the error from ValueError to StopIteration. I still kept some of the error messages from the C engine because I find them to be quite useful.

@jreback
Copy link
Contributor

jreback commented Mar 8, 2016

at a glance looks ok. still working on 0.18.0, so will get to next week.

@jreback jreback added this to the 0.18.1 milestone Mar 8, 2016
@gfyoung
Copy link
Member Author

gfyoung commented Mar 8, 2016

👍 will ping on all of them next week then.

@gfyoung gfyoung force-pushed the empty_usecols branch 3 times, most recently from efc8523 to c5bdde3 Compare March 9, 2016 02:57
@gfyoung gfyoung force-pushed the empty_usecols branch 7 times, most recently from 45d5876 to e4bc4ec Compare April 12, 2016 01:48
@gfyoung
Copy link
Member Author

gfyoung commented Apr 12, 2016

@jorisvandenbossche : Any update on this PR? I think @jreback is waiting for your feedback before he merges this in.

@jorisvandenbossche
Copy link
Member

Sorry, slowly getting back on track :-) I will take a look now.

Just a quick question, my previous comment was about HTML and Excel parser. The whatsnew note now does not mention those anymore. There are no longer changes to to read_html and read_excel ?

@gfyoung
Copy link
Member Author

gfyoung commented Apr 12, 2016

@jorisvandenbossche : No worries. 😄 There are in fact changes to read_html and read_excel. If you look at the changed files, both changed the exception they are catching from StopIteration to EmptyDataError when there are empty tables. However, those changes are a result of the changes made to the parsers and are not direct changes to read_html and read_excel.

df = self.read_csv(StringIO(''), names=names, usecols=usecols)

def test_read_with_bad_header(self):
name = self.__class__.__name__
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait...why did flake8 not catch that? No, I don't think so AFAICT.

@jorisvandenbossche
Copy link
Member

@gfyoung But they only catch it under the hood? (I mean, it doesn't bubble up to the user?)


In [1]: df = pd.read_csv(StringIO(''), engine='c')
...
pandas.io.common.EmptyDataError: No columns to parse from file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it actually shown with the full name? (just a question, didn't test it, didn't fetch the PR, but I would just show the same as in an actual console)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which is what I did. 😄 - FYI you can observe this full out name thing if you trigger any current CParserError.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, perfect! (I just wondered if it was the case)

@gfyoung
Copy link
Member Author

gfyoung commented Apr 12, 2016

@jorisvandenbossche : Not AFAICT (it's an explicit try-except block for read_html and read_excel)

@jorisvandenbossche
Copy link
Member

@jreback @gfyoung I didn't look at the code changes in detail, but the whatsnew is in any case very clear and sounds logical! So good to go for me

@@ -179,6 +179,43 @@ New Behavior:
# Output is a DataFrame
df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x[['value']].sum())

Changes in ``read_csv`` errors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sub-section should be in API changes.

add a direction above for referncding this: .._whatsnew.......

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

say Change inread_csvexceptions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@jreback
Copy link
Contributor

jreback commented Apr 13, 2016

@gfyoung ok just a couple of minor doc comments. Pls also have a look thru io.rst and see if any exceptions are mentioned (and if so fix them), as well as doc-strings.

In Python, when reading an empty file, it used to throw a StopIteration
error with no error message. This PR helps to differentiate the case
when no columns are inferable, which now leads to an EmptyDataError for
both the C and Python engines.

[ci skip]

In addition to this error change, several others have been made as well:

- ``CParserError`` is now a ``ValueError`` instead of just an ``Exception`` (:issue:`12551`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this whatsnew just not put in before? (the PR was already merged)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I moved it into this section because it's related

@jreback jreback closed this in 827745d Apr 13, 2016
@jreback
Copy link
Contributor

jreback commented Apr 13, 2016

@gfyoung thanks!

@gfyoung
Copy link
Member Author

gfyoung commented Apr 13, 2016

@jreback : Sure thing! FYI, you can also cancel my build here.

@gfyoung gfyoung deleted the empty_usecols branch April 13, 2016 01:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

usecols is not respected when the file is empty
4 participants