Skip to content

ENH: Accept callable for skiprows in read_csv #15059

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Jan 4, 2017

Title is self-explanatory.

xref #10882.

@gfyoung
Copy link
Member Author

gfyoung commented Jan 4, 2017

Right now, I have added [ci skip] because I cannot get the C engine implementation to work yet. Currently, if you pass in a callable, I get a segfault. However, I'm not sure how to fix it. Suggestions / help on what to do? Is it some reference thing that I am missing?

@chris-b1
Copy link
Contributor

chris-b1 commented Jan 4, 2017

One possibility - is the GIL released when you're trying to call that function in the C code?

@gfyoung
Copy link
Member Author

gfyoung commented Jan 4, 2017

@chris-b1 : Can't check at the moment, but could you explain why that would cause a segfault?

@chris-b1
Copy link
Contributor

chris-b1 commented Jan 4, 2017

To be honest, not sure if/how it would cause a segfault, but the general rule is you can't do anything involving a PyObject without the GIL.
https://docs.python.org/2/c-api/init.html#thread-state-and-the-global-interpreter-lock

@gfyoung gfyoung force-pushed the skiprows-callable branch from aeca2bd to c8bb881 Compare January 4, 2017 22:29
@gfyoung
Copy link
Member Author

gfyoung commented Jan 4, 2017

Might be of interest or relevance: scipy/scipy #6509

@gfyoung
Copy link
Member Author

gfyoung commented Jan 5, 2017

@chris-b1 : You were right! I needed to ensure that we were in a state to call PyObject_CallFunction. The C engine is now happy. PR should be ready to be reviewed soon.

@gfyoung gfyoung force-pushed the skiprows-callable branch 2 times, most recently from 5f24be0 to 3581921 Compare January 5, 2017 06:00
@gfyoung gfyoung changed the title WIP: Accept callable for skiprows in read_csv ENH: Accept callable for skiprows in read_csv Jan 5, 2017
@gfyoung gfyoung force-pushed the skiprows-callable branch from 3581921 to bca4a63 Compare January 5, 2017 06:13
@codecov-io
Copy link

codecov-io commented Jan 5, 2017

Current coverage is 84.75% (diff: 100%)

Merging #15059 into master will decrease coverage by <.01%

@@             master     #15059   diff @@
==========================================
  Files           145        145          
  Lines         51220      51238    +18   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43414      43426    +12   
- Misses         7806       7812     +6   
  Partials          0          0          

Powered by Codecov. Last update b8e7c34...d15e3a3

@sinhrks sinhrks added IO CSV read_csv, to_csv Enhancement labels Jan 5, 2017
@gfyoung
Copy link
Member Author

gfyoung commented Jan 10, 2017

@jreback @jorisvandenbossche @chris-b1 : Thoughts on this?

Copy link
Contributor

@chris-b1 chris-b1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, couple comments on the c-engine calls


def test_skiprows_callable(self):
data = 'a\n1\n2\n3\n4\n5'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add tests with bad functions (raises and returns non-bool) - in particular want to make sure error propagates from c-engine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Done.


if (self->skipfunc != NULL) {
state = PyGILState_Ensure();
should_skip = PyObject_IsTrue(PyObject_CallFunction(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyObject_CallFunction returns a new reference, so I think you need to capture the result and decref it (Py_DECREF). Related to my comment on the tests, you may also need some error-trapping here in case the function fails.

https://docs.python.org/2/c-api/object.html#c.PyObject_CallFunction

Copy link
Member Author

@gfyoung gfyoung Jan 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch (in so many senses of the word 😄). You caught a bug with this, as attempts to trap this error revealed a bug in raise_parser_error in which we were assuming PyErr_Fetch always returned an Exception-type for *value, but that was not the case (nor is it specified in the documentation).

at the start of the file.

If callable, the callable function will be evaluated against the row
indices, returning indices where the callable function evaluates to True.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"returning indices where the callable function evaluates to True" -> should this be "False" ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we never "return" any indices. That should be changed. Good catch.

@gfyoung
Copy link
Member Author

gfyoung commented Jan 13, 2017

@jreback : @jorisvandenbossche and @chris-b1 have approved these changes. I'm not sure if they're waiting for you to have a look, but if there are no other concerns, this should be good to merge I think.

if value != NULL:
old_exc = <object> value
Py_XDECREF(value)
raise old_exc

# PyErr_Fetch only returned the error message in *value,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was this not necessary before? this seems like lots of hoop jumping

Copy link
Member Author

@gfyoung gfyoung Jan 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think because we got lucky. This should have been caught before and processed accordingly. It's annoying, but that's what happens unfortunately when the Python C API can't promise you anything.

@jreback
Copy link
Contributor

jreback commented Jan 13, 2017

give a perf test on this, e.g. when the skiprows is NOT a callable perf should be roughly unchanged.

@gfyoung
Copy link
Member Author

gfyoung commented Jan 14, 2017

@jreback : Ran benchmarks, and no major change in performance when using skiprows.

@jreback jreback added this to the 0.20.0 milestone Jan 14, 2017
@jreback jreback closed this in 7ad6c65 Jan 14, 2017
@jreback
Copy link
Contributor

jreback commented Jan 14, 2017

thanks!

@gfyoung gfyoung deleted the skiprows-callable branch January 14, 2017 19:47
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
Title is self-explanatory.

xref pandas-dev#10882.

Author: gfyoung <[email protected]>

Closes pandas-dev#15059 from gfyoung/skiprows-callable and squashes the following commits:

d15e3a3 [gfyoung] ENH: Accept callable for skiprows
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants