Skip to content

BUG: GH11786 Thread safety issue with read_csv #11790

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

jdeschenes
Copy link
Contributor

closes #11786

Fixed an issue with thread safety when calling read_csv with a StringIO object.

The issue was caused by a misplaced PyGilSate_Ensure()

@jreback
Copy link
Contributor

jreback commented Dec 7, 2015

@jdeschenes gr8! can you add in the example from the issue as a smoke test. (e.g. just have it run), then read in with a single trhead and compare.

and pls add a release note when you are satisified.

@jreback jreback added Bug IO CSV read_csv, to_csv labels Dec 7, 2015
@jreback jreback added this to the 0.18.0 milestone Dec 7, 2015
@jdeschenes
Copy link
Contributor Author

Alright, I did this quickly as I don't have time to work on this right now. How long until next release?

@jreback
Copy link
Contributor

jreback commented Dec 7, 2015

@jdeschenes oh have a while.....when you have a chance...thanks!

@jreback
Copy link
Contributor

jreback commented Dec 26, 2015

@jdeschenes if you can have a look at this again would be great.

@jdeschenes
Copy link
Contributor Author

ping! @jreback

files = [BytesIO(b) for b in bytes]

# Read all files in many threads
pool = ThreadPool(8)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert that the read in values match a single threaded reader. (e.g. compare frames)

@mrocklin
Copy link
Contributor

mrocklin commented Jan 5, 2016

Thank you both for keeping up on this.

@jreback
Copy link
Contributor

jreback commented Jan 6, 2016

@jdeschenes IIRC this issue is repro with actual files. Is that not the case? is it only StringIO/BytesIO. are they not thread-safe?

@jreback
Copy link
Contributor

jreback commented Jan 11, 2016

@jdeschenes can you respond to my comments.

@jdeschenes
Copy link
Contributor Author

Hi @jreback,

the issue is solely reproducible with StringIO. The root cause of this bug is in function buffer_rd_bytes in
pandas/src/parser/io.c. This function is only used when a StringIO/BytesIO is passed to the read_csv function.

The function was calling Py_XDECREF before ensuring that the thread had the GIL. This behavior could not be seen before since the GIL was always locked throughout the read_csv function call.

I am not aware of any issues when reading from disk and this pull request will not fix any problem related to this.

I think that the release notes should be kept as is.

Let me know what you think.

@jreback
Copy link
Contributor

jreback commented Jan 19, 2016

ok, can you add a test that validates the issue that reading from a disk with multiple threads is ok (so we don't regress).

- Bug in vectorized ``DateOffset`` when ``n`` parameter is ``0`` (:issue:`11370`)
- Compat for numpy 1.11 w.r.t. ``NaT`` comparison changes (:issue:`12049`)

- Bug in ``read_csv`` when reading from a StringIO in threads (:issue:`11790`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double backticks around StringIO

@jreback
Copy link
Contributor

jreback commented Jan 19, 2016

pls run git diff master | flake8 --diff as much PEP checking has been one on these files.

@jdeschenes jdeschenes force-pushed the pandas-11786 branch 2 times, most recently from 1e257fe to a9a2513 Compare January 19, 2016 02:36
@mrocklin
Copy link
Contributor

FWIW using BytesIO has actual use cases in distributed computing, it isn't just a test case.

Many parallel storage systems won't give you access to the hard disk but will instead deliver a bunch of bytes. In this case the best way I've found to use pd.read_csv is to hand it a BytesIO object.

@jreback
Copy link
Contributor

jreback commented Jan 19, 2016

@mrocklin oh of course. just covering the bases. I suspect people have tried multi-threading to read files as well :)

…tringIO object., pandas-dev#11786

The issue was caused by a misplaced PyGilSate_Ensure()
@jdeschenes
Copy link
Contributor Author

It would be very interesting to see if there is any benefit in using a ThreadPool for reading from a BytesIO. We are spending a lot of time into the GIL, thanks to the buffer_rd_bytes function. It should probably be benchmarked.

I have a suspicion that it doesn't help at all(It might be even a net loss).

I added the test for the file read. I didn't do it for the BytesIO. The code would effectively look a lot like what I did up top... Grabbing a list of BytesIO and processing them in a ThreadPool. I can take a look at this a bit later, if that is required.

@jreback jreback closed this in 567bc5c Jan 19, 2016
@jreback
Copy link
Contributor

jreback commented Jan 19, 2016

@jdeschenes thanks!

certainly would take addtl benchmarks / fixes!

jreback added a commit that referenced this pull request Jan 19, 2016
@kayvonr
Copy link

kayvonr commented Jan 27, 2016

Hey all - any estimate of when this will be go out in a production release? Encountering this bug very very frequently with 0.17.1, and would like to get back up to a newer version of pandas again soon

Thanks

@jreback
Copy link
Contributor

jreback commented Jan 27, 2016

planning on a RC in about 2 weeks, so release should be roughly mid-feb or so

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_csv in multiple theads causes segmentation fault
4 participants