-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: GH11786 Thread safety issue with read_csv #11790
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@jdeschenes gr8! can you add in the example from the issue as a smoke test. (e.g. just have it run), then read in with a single trhead and compare. and pls add a release note when you are satisified. |
Alright, I did this quickly as I don't have time to work on this right now. How long until next release? |
@jdeschenes oh have a while.....when you have a chance...thanks! |
@jdeschenes if you can have a look at this again would be great. |
ce82e4c
to
2c4df56
Compare
ping! @jreback |
files = [BytesIO(b) for b in bytes] | ||
|
||
# Read all files in many threads | ||
pool = ThreadPool(8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert that the read in values match a single threaded reader. (e.g. compare frames)
Thank you both for keeping up on this. |
@jdeschenes IIRC this issue is repro with actual files. Is that not the case? is it only |
@jdeschenes can you respond to my comments. |
2c4df56
to
6252eeb
Compare
Hi @jreback, the issue is solely reproducible with StringIO. The root cause of this bug is in function buffer_rd_bytes in The function was calling Py_XDECREF before ensuring that the thread had the GIL. This behavior could not be seen before since the GIL was always locked throughout the read_csv function call. I am not aware of any issues when reading from disk and this pull request will not fix any problem related to this. I think that the release notes should be kept as is. Let me know what you think. |
ok, can you add a test that validates the issue that reading from a disk with multiple threads is ok (so we don't regress). |
- Bug in vectorized ``DateOffset`` when ``n`` parameter is ``0`` (:issue:`11370`) | ||
- Compat for numpy 1.11 w.r.t. ``NaT`` comparison changes (:issue:`12049`) | ||
|
||
- Bug in ``read_csv`` when reading from a StringIO in threads (:issue:`11790`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use double backticks around StringIO
pls run |
1e257fe
to
a9a2513
Compare
FWIW using Many parallel storage systems won't give you access to the hard disk but will instead deliver a bunch of bytes. In this case the best way I've found to use |
@mrocklin oh of course. just covering the bases. I suspect people have tried multi-threading to read files as well :) |
…tringIO object., pandas-dev#11786 The issue was caused by a misplaced PyGilSate_Ensure()
a9a2513
to
505f6a6
Compare
It would be very interesting to see if there is any benefit in using a ThreadPool for reading from a BytesIO. We are spending a lot of time into the GIL, thanks to the buffer_rd_bytes function. It should probably be benchmarked. I have a suspicion that it doesn't help at all(It might be even a net loss). I added the test for the file read. I didn't do it for the BytesIO. The code would effectively look a lot like what I did up top... Grabbing a list of BytesIO and processing them in a ThreadPool. I can take a look at this a bit later, if that is required. |
@jdeschenes thanks! certainly would take addtl benchmarks / fixes! |
Hey all - any estimate of when this will be go out in a production release? Encountering this bug very very frequently with 0.17.1, and would like to get back up to a newer version of pandas again soon Thanks |
planning on a RC in about 2 weeks, so release should be roughly mid-feb or so |
closes #11786
Fixed an issue with thread safety when calling read_csv with a StringIO object.
The issue was caused by a misplaced PyGilSate_Ensure()