CI/BUG: pyarrow read_csv deadlock #43650

Closed
mzeitlin11 opened this issue Sep 18, 2021 · 5 comments
Labels
Arrow pyarrow functionality CI Continuous Integration IO CSV read_csv, to_csv Testing pandas testing functions or related to the test suite

Comments

@mzeitlin11
Member

xref #43611, #43643

While investigating the Azure timeout issues, a deadlock appeared to be occurring in the parser code, so pyarrow makes sense as the culprit. Tests with unusual input seem to trigger it, for example some of the parse_dates tests. A specific reproducer is the test:

pandas/tests/io/parser/common/test_ints.py::test_outside_int64_uint64_range

I can't reproduce this on current pyarrow, but Azure uses 0.17.0, with which I can reproduce a deadlock on macOS just by running pandas/tests/io/parser/common/test_ints.py::test_outside_int64_uint64_range. It doesn't happen consistently, but when it does, the process hangs to the point that it needs a SIGKILL to stop, which explains why pytest-timeout didn't catch it.
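Since the hang is intermittent and only a SIGKILL ends it, one way to hunt for it is to run the test repeatedly in a child process under a hard timeout enforced from outside. The following is a minimal sketch of that idea, not something taken from the pandas CI; the helper names, attempt count, and timeout are all hypothetical.

```python
import subprocess
import sys


def run_with_hard_timeout(cmd, timeout_s):
    """Run cmd in a child process.

    Returns the exit code, or None if the child had to be killed.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising; on POSIX
        # Popen.kill() sends SIGKILL, which a stuck process cannot ignore.
        return None
    return completed.returncode


def hunt_for_hang(cmd, attempts, timeout_s):
    """Repeat cmd up to `attempts` times.

    Returns the index of the first attempt that hung, or None if none did.
    """
    for attempt in range(attempts):
        if run_with_hard_timeout(cmd, timeout_s) is None:
            return attempt
    return None


def pytest_command(test_id):
    """Build a pytest invocation for one test id (hypothetical helper)."""
    return [sys.executable, "-m", "pytest", test_id]


# Example (attempt count and timeout are arbitrary assumptions):
# hunt_for_hang(
#     pytest_command(
#         "pandas/tests/io/parser/common/test_ints.py"
#         "::test_outside_int64_uint64_range"
#     ),
#     attempts=20,
#     timeout_s=60,
# )
```

Because the timeout lives in the parent process, it still fires when the child is stuck somewhere an in-process timeout cannot interrupt.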

cc @lithomas1 if any thoughts here

@mzeitlin11 mzeitlin11 added Arrow pyarrow functionality CI Continuous Integration IO CSV read_csv, to_csv Testing pandas testing functions or related to the test suite labels Sep 18, 2021
@jorisvandenbossche
Member

Could we increase the minimum required version of pyarrow for the CSV reading functionality?

It might also be good to make a reproducible example to report to Arrow. Although I suppose it is fixed now (given it only happens on older pyarrow versions), it can still be useful to add it as a test over there.

@mzeitlin11
Member Author

Makes sense - from initial testing I know the issue is present at least for pyarrow <= 1.0.0. Will try to find the minimum working version and report to pyarrow if some previously failing cases don't look tested (but someone is welcome to beat me to it :)

@lithomas1
Member

I'll try to test this some more. It is also possible that pyarrow is getting stuck because of the TextIOWrapper that we use on our side to force pyarrow to read StringIO objects, which would be a bug on our side.
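One way to test that hypothesis would be to bypass any wrapper and hand pyarrow a plain in-memory bytes buffer built from the StringIO contents. The sketch below assumes UTF-8 input and uses hypothetical helper names; it is a debugging aid, not the code path pandas uses.

```python
import io


def stringio_to_bytesio(text_buffer, encoding="utf-8"):
    """Copy the full contents of a StringIO into a fresh BytesIO."""
    # getvalue() returns everything, regardless of the current position.
    return io.BytesIO(text_buffer.getvalue().encode(encoding))


def read_with_pyarrow(text_buffer):
    """Parse CSV text with pyarrow directly, with no wrapper in between."""
    from pyarrow import csv  # optional dependency, imported lazily

    return csv.read_csv(stringio_to_bytesio(text_buffer))
```

If the hang disappears with a plain `BytesIO` but persists through the wrapper, that would point at the pandas side rather than at pyarrow.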

@jbrockmendel
Member

@lithomas1 @mzeitlin11 did this ever get sorted out?

@mroeschke
Member

Now that our minimum version is 6.0, I believe we shouldn't hit this issue anymore. IIRC I was experiencing this with pyarrow 2.0 and had skipped those versions in the CI due to the deadlock.

Closing since we haven't seen this in a while but we can reopen if this shows up again
