BUG: properly close files opened by parsers #13940


Closed
wants to merge 14 commits into from

Conversation

agraboso
Contributor

@agraboso agraboso commented Aug 8, 2016

Addresses most of the ResourceWarnings from #13932. There were three tests that opened files but did not close them. The rest were due to parsers (CParserWrapper and PythonParser) opening files or streams but never closing them. There was one case — the mmap one — that was tricky because a file handle went into it and immediately reached a refcount of zero.

I am at a loss about writing new tests for this. Is it even possible?
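One possible approach (a sketch, assuming CPython's prompt finalization of unreferenced file objects; `leaks_file_handle`, `leaky` and `clean` are hypothetical names, not pandas APIs) is to record warnings and force a collection:

```python
import gc
import warnings

def leaks_file_handle(func, *args, **kwargs):
    # Record every warning, run the function, then force finalization so
    # that any file left open emits its ResourceWarning inside this block.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        func(*args, **kwargs)
        gc.collect()
    return any(issubclass(w.category, ResourceWarning) for w in caught)

def leaky(path):
    open(path).read()       # file object is never explicitly closed

def clean(path):
    with open(path) as f:   # context manager closes the file
        f.read()
```

On CPython, `leaks_file_handle(leaky, some_path)` returns True and `leaks_file_handle(clean, some_path)` returns False; on interpreters without refcounting the timing of the warning is less predictable.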

@agraboso
Contributor Author

agraboso commented Aug 8, 2016

Argh, there are still some ResourceWarnings left... Let me investigate.

@jreback jreback added IO CSV read_csv, to_csv IO Stata read_stata, to_stata Clean labels Aug 8, 2016
@jreback
Contributor

jreback commented Aug 8, 2016

@agraboso looks a lot cleaner! https://travis-ci.org/pydata/pandas/jobs/150768150

you can leave the mmap one (if it's not easy to fix). If you can get the stata ones, that would be great!

@@ -972,3 +972,5 @@ Bug Fixes

- Bug in ``Index`` raises ``KeyError`` displaying incorrect column when column is not in the df and columns contains duplicate values (:issue:`13822`)
- Bug in ``Period`` and ``PeriodIndex`` creating wrong dates when frequency has combined offset aliases (:issue:`13874`)

- Bug where files opened by parsers (used by `pd.read_csv`, e.g.) were not closed properly. (:issue:`13940`)
Contributor

Use double backticks around pd.read_csv. Put this next to the other CSV bug fixes.

Contributor

say when these were not closed properly (e.g. nrows not None; when else?)

Contributor

are we sure that on iteration (e.g. chunksize / iterator specified) we are closing properly? (e.g. I am not sure if it closes, then re-opens each iteration).

cc @gfyoung do you know?

Member

It does not re-open after each iteration AFAIK. You can see for yourself by writing to a CSV file, opening it with read_csv and passing in chunksize. Try then deleting the file in a separate window. You should be able to do so.

Contributor Author

@agraboso agraboso Aug 9, 2016

Indeed, when getting an iterator (chunksize or iterator specified) we do not close the file. When the iterator has been consumed and raises StopIteration, we close the file. If a user requests an iterator but does not consume it fully, they must call .close() on it — though we could define a __del__ and take care of that.
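The semantics described here can be sketched like this (`ChunkIterator` is a hypothetical stand-in, not the actual pandas reader class):

```python
class ChunkIterator:
    """The file stays open while iterating and is closed on exhaustion;
    a caller who abandons the iterator early must call close() itself."""

    def __init__(self, path, chunksize):
        self._f = open(path)
        self._chunksize = chunksize

    def __iter__(self):
        return self

    def __next__(self):
        chunk = []
        for _ in range(self._chunksize):
            line = self._f.readline()
            if line:
                chunk.append(line)
        if not chunk:
            self.close()        # close only once the data is exhausted
            raise StopIteration
        return chunk

    def close(self):
        if not self._f.closed:
            self._f.close()
```

Fully consuming the iterator (e.g. with `list()`) closes the file; stopping after one `next()` leaves it open until `close()` is called, mirroring the behavior discussed above.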

@codecov-io

codecov-io commented Aug 8, 2016

Current coverage is 85.31% (diff: 61.44%)

Merging #13940 into master will increase coverage by 0.02%

@@             master     #13940   diff @@
==========================================
  Files           139        139          
  Lines         50162      50370   +208   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42785      42975   +190   
- Misses         7377       7395    +18   
  Partials          0          0          

Powered by Codecov. Last update 5df9123...3fa7d25

try:
    self._engine._reader.close()
except:
    pass
self._engine.close()
Member

Why can we get rid of this try-except block?

Contributor Author

@agraboso agraboso Aug 9, 2016

I shouldn't have removed it. But this was only used by CParserWrapper, so I have put it there now in c7e9c9c.

@agraboso
Contributor Author

agraboso commented Aug 9, 2016

@jreback The mmap one is fixed since the first commit — I was just making the point that it took me a while to understand what was wrong.

I have fixed the stata cases too:

  • StataWriter was easy: I moved the opening of the file to the write_file method. The file is thus always opened and closed within it, whether errors are raised or not.
  • StataReader did not close the file when raising an error. I fixed it by wrapping most of the logic in its read method in a try/except block. This even takes care of closing the file when a StopIteration is raised.
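The close-on-error pattern described for StataReader.read can be sketched as follows (`ReaderSketch` is a hypothetical minimal model, not the real class):

```python
class ReaderSketch:
    """A reader that closes its file if reading fails, while still
    letting the error propagate to the caller."""

    def __init__(self, path):
        self._f = open(path)

    def close(self):
        self._f.close()

    def read(self):
        try:
            data = self._f.read()
            if not data:
                # stand-in for an error raised mid-read
                raise ValueError("no data to read")
        except Exception:
            self.close()    # release the file handle...
            raise           # ...but re-raise so the error is still visible
        return data
```

On failure the caller sees the original exception and the file is closed; on success the file stays open for further reads until `close()` is called.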

if read_len <= 0:
    # Iterator has finished, should never be here unless
    # we are reading the file incrementally
    try:
Contributor

@jreback jreback Aug 9, 2016

you don't need the try/except over everything here. Please narrow it down as much as possible.

Contributor Author

I narrowed it down to the few places where an explicit ValueError might come from another StataReader method. It catches the cases that appear in the tests, but other lines could still raise an error that is not caught — which would leave an open file handle.

@jreback
Contributor

jreback commented Aug 9, 2016

I don't want to just wrap things at a high level; these should be as granular as possible. These can be actual errors (or errors in the tests; it's hard to tell). Let uncaught ResourceWarnings show through (and then we can investigate them individually).

@agraboso
Contributor Author

agraboso commented Aug 9, 2016

Notice that my proposal of wrapping most of StataReader.read in a try/except block had both a self.close() and a raise, so that we would still see all warnings and errors while making sure we do close the file.

@@ -253,6 +253,9 @@ def __init__(self, filepath_or_buffer, index=None, encoding='ISO-8859-1',

self._read_header()

def close(self):
    self.filepath_or_buffer.close()
Contributor

is this always a filehandle-like?

Contributor Author

It's either a file handle coming from L244 or a BytesIO object coming from L252. So it is always safe to close it.
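Both objects expose the same close() interface; a minimal check:

```python
import io

# A BytesIO object supports close()/closed just like a real file handle,
# so calling close() unconditionally is safe for either case.
buf = io.BytesIO(b"payload")
buf.close()
assert buf.closed
```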

@jreback
Contributor

jreback commented Aug 10, 2016

hah! this turned into a rabbit hole! lmk when you are ready.

@agraboso
Contributor Author

agraboso commented Aug 11, 2016

hah! this turned into a rabbit hole!

Tell me about it! It's possible that other parsers leave open files too, but I'd leave that for another PR. I think this one is ready.

lmk when you are ready.

And it's all green!

@jreback jreback closed this in 4a80521 Aug 11, 2016
@jreback
Contributor

jreback commented Aug 11, 2016

thanks @agraboso, I merged. Left a couple of comments that might be relevant for the next PR :)

we need to try to fix the remaining ResourceWarnings, then raise if ANY are detected (it's actually pretty easy). In pandas.util.testing.TestCase.setUpClass you can set the warning to error. This way it's not set globally when testing_mode() is called (in case pandas.util.testing is just imported by a user and the test suite isn't actually run).
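A sketch of that suggestion (`ResourceStrictTestCase` is a hypothetical name; the real hook would live on pandas.util.testing.TestCase, and the filter applies process-wide once set):

```python
import unittest
import warnings

class ResourceStrictTestCase(unittest.TestCase):
    # Escalate ResourceWarning to an error only when the test suite
    # actually runs, not merely when the testing module is imported.
    @classmethod
    def setUpClass(cls):
        warnings.simplefilter("error", ResourceWarning)
```

After setUpClass runs, any test that triggers a ResourceWarning fails with an exception instead of printing a warning.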

@jreback jreback added this to the 0.19.0 milestone Aug 11, 2016
@@ -393,11 +393,15 @@ def _read(filepath_or_buffer, kwds):
        raise NotImplementedError("'nrows' and 'chunksize' cannot be used"
                                  " together yet.")
    elif nrows is not None:
-       return parser.read(nrows)
+       data = parser.read(nrows)
Contributor

@jreback jreback Aug 11, 2016

if you do

try:
    return parser.read(nrows)
finally:
    parser.close()

I think you can eliminate the .close() calls where you raise on an exception (this also needs changing at L402), and it's slightly more idiomatic.

That way exceptions will bubble up, but we will still close.
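A runnable sketch of this suggestion, with `DummyParser` as a hypothetical stand-in for the real parser object:

```python
class DummyParser:
    """Hypothetical parser used only to demonstrate the control flow."""

    def __init__(self, rows, fail=False):
        self.rows = rows
        self.fail = fail
        self.closed = False

    def read(self, nrows):
        if self.fail:
            raise ValueError("parse error")
        return self.rows[:nrows]

    def close(self):
        self.closed = True

def read_and_close(parser, nrows):
    try:
        return parser.read(nrows)
    finally:
        parser.close()   # runs whether read succeeds or raises
```

On success the rows are returned and the parser is closed; on failure the exception bubbles up and the parser is still closed, with no explicit close-on-error calls needed.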

Labels
Clean IO CSV read_csv, to_csv IO Stata read_stata, to_stata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CLN: resource warnings
4 participants