Commit e3ff389

zhezherun authored and gfyoung committed
BUG: Fixing memory leaks in read_csv
* Move allocation of na_hashset down to avoid a leak on continue
* Delete na_hashset if there is an exception
* Clean up table before raising an exception

Closes gh-21353.
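The first two points of the commit message can be sketched in plain Python. This is an illustrative model only, not the pandas code: `Tracker`, `convert_columns`, and the `skip`/`fail` parameters are hypothetical names standing in for the C allocator, the per-column loop, and the converter/error paths. The resource is allocated only after the early `continue`, and a `try`/`finally` releases it on both success and error.

```python
class Tracker:
    """Hypothetical stand-in for a C allocator; counts live allocations."""

    def __init__(self):
        self.live = 0

    def alloc(self):
        self.live += 1
        return object()

    def free(self, handle):
        self.live -= 1


def convert_columns(columns, tracker, skip=(), fail=()):
    """Model of a per-column loop with an early exit and a fallible step."""
    results = {}
    for name in columns:
        if name in skip:
            # Early exit: nothing has been allocated yet, so nothing leaks.
            results[name] = "converted-by-user"
            continue

        # Allocate only after every early exit has been passed.
        na_set = tracker.alloc()
        try:
            if name in fail:
                raise ValueError(f"cannot convert {name}")
            results[name] = "parsed"
        finally:
            # Runs whether the conversion succeeded or raised.
            tracker.free(na_set)
    return results
```

With this shape, `tracker.live` returns to zero after the loop whether a column was skipped, parsed, or raised an error.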
1 parent 0ab8eb2 commit e3ff389

2 files changed: +21 -18 lines changed

doc/source/whatsnew/v0.24.0.rst

+1
@@ -1382,6 +1382,7 @@ Notice how we now instead output ``np.nan`` itself instead of a stringified form
 - Bug in :func:`DataFrame.to_string()` that caused representations of :class:`DataFrame` to not take up the whole window (:issue:`22984`)
 - Bug in :func:`DataFrame.to_csv` where a single level MultiIndex incorrectly wrote a tuple. Now just the value of the index is written (:issue:`19589`).
 - Bug in :meth:`HDFStore.append` when appending a :class:`DataFrame` with an empty string column and ``min_itemsize`` < 8 (:issue:`12242`)
+- Bug in :func:`read_csv()` in which memory leaks occurred in the C engine when parsing ``NaN`` values due to insufficient cleanup on completion or error (:issue:`21353`)
 - Bug in :func:`read_csv()` in which incorrect error messages were being raised when ``skipfooter`` was passed in along with ``nrows``, ``iterator``, or ``chunksize`` (:issue:`23711`)
 - Bug in :meth:`read_csv()` in which :class:`MultiIndex` index names were being improperly handled in the cases when they were not provided (:issue:`23484`)
 - Bug in :meth:`read_html()` in which the error message was not displaying the valid flavors when an invalid one was provided (:issue:`23549`)

pandas/_libs/parsers.pyx

+20 -18
@@ -1070,18 +1070,6 @@ cdef class TextReader:

             conv = self._get_converter(i, name)

-            # XXX
-            na_flist = set()
-            if self.na_filter:
-                na_list, na_flist = self._get_na_list(i, name)
-                if na_list is None:
-                    na_filter = 0
-                else:
-                    na_filter = 1
-                    na_hashset = kset_from_list(na_list)
-            else:
-                na_filter = 0
-
             col_dtype = None
             if self.dtype is not None:
                 if isinstance(self.dtype, dict):
@@ -1106,13 +1094,26 @@ cdef class TextReader:
                                               self.c_encoding)
                 continue

-            # Should return as the desired dtype (inferred or specified)
-            col_res, na_count = self._convert_tokens(
-                i, start, end, name, na_filter, na_hashset,
-                na_flist, col_dtype)
+            # XXX
+            na_flist = set()
+            if self.na_filter:
+                na_list, na_flist = self._get_na_list(i, name)
+                if na_list is None:
+                    na_filter = 0
+                else:
+                    na_filter = 1
+                    na_hashset = kset_from_list(na_list)
+            else:
+                na_filter = 0

-            if na_filter:
-                self._free_na_set(na_hashset)
+            try:
+                # Should return as the desired dtype (inferred or specified)
+                col_res, na_count = self._convert_tokens(
+                    i, start, end, name, na_filter, na_hashset,
+                    na_flist, col_dtype)
+            finally:
+                if na_filter:
+                    self._free_na_set(na_hashset)

             if upcast_na and na_count > 0:
                 col_res = _maybe_upcast(col_res)
@@ -2059,6 +2060,7 @@ cdef kh_str_t* kset_from_list(list values) except NULL:

         # None creeps in sometimes, which isn't possible here
         if not isinstance(val, bytes):
+            kh_destroy_str(table)
             raise ValueError('Must be all encoded bytes')

         k = kh_put_str(table, PyBytes_AsString(val), &ret)
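
The last hunk applies a general rule: a builder that owns a manually managed resource has to release it before raising, because the caller never receives a handle to free. A minimal Python sketch of that rule follows. `Table` and `set_from_list` are hypothetical stand-ins for the khash table and `kset_from_list`, and the `live` counter exists only to make a leak observable.

```python
class Table:
    """Hypothetical stand-in for a manually managed hash table."""

    live = 0  # number of tables created but not yet destroyed

    def __init__(self):
        Table.live += 1
        self.keys = set()

    def destroy(self):
        Table.live -= 1


def set_from_list(values):
    """Build a Table from bytes values, releasing it if validation fails."""
    table = Table()
    for val in values:
        if not isinstance(val, bytes):
            # The caller never sees `table` on this path, so free it here.
            table.destroy()
            raise ValueError("Must be all encoded bytes")
        table.keys.add(val)
    return table
```

On the success path ownership passes to the caller, who is then responsible for calling `destroy()`. On the error path the function cleans up after itself before the exception propagates.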
