Skip to content

BUG: memory leak on get_c_string #16060

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 20, 2017
Merged

Conversation

chris-b1
Copy link
Contributor

@chris-b1
Copy link
Contributor Author

@jreback - this fixes the leak, but on a closer look, I'm not sure we were safe releasing the GIL in StringHashTable - the c strings are views back on the original python objects, so not threadsafe?

@chris-b1
Copy link
Contributor Author

Here's a contrived case that creates a data race:

setup

def f1(arr):
    return pd.unique(arr)

def f2(arr):
    arr.iloc[-1] = 'a new long string'

from concurrent import futures
import time

Then depending on timing, unique can change

In [38]: x = pd.Series(np.arange(10**6) % 10**3).astype(str)
    ...: x.iloc[-1] = 'long string #1'
    ...: 
    ...: with futures.ThreadPoolExecutor(max_workers=2) as exc:
    ...:     fut1 = exc.submit(f1, x)
    ...:     fut2 = exc.submit(f2, x)
    ...: print(fut1.result()[-1])
a new long string

In [39]: x = pd.Series(np.arange(10**6) % 10**3).astype(str)
    ...: x.iloc[-1] = 'long string #1'
    ...: 
    ...: with futures.ThreadPoolExecutor(max_workers=2) as exc:
    ...:     fut1 = exc.submit(f1, x)
    ...:     time.sleep(2)
    ...:     fut2 = exc.submit(f2, x)
    ...: print(fut1.result()[-1])
long string #1

@jreback
Copy link
Contributor

jreback commented Apr 19, 2017

hmm it should be ok u r not manipulating python objects strings are fine
though since it's s pointer to an internal python object may be suspect

@jreback
Copy link
Contributor

jreback commented Apr 19, 2017

cc @wesm @cpcloud ?

@codecov
Copy link

codecov bot commented Apr 20, 2017

Codecov Report

Merging #16060 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #16060   +/-   ##
=======================================
  Coverage   90.79%   90.79%           
=======================================
  Files         156      156           
  Lines       50537    50537           
=======================================
  Hits        45886    45886           
  Misses       4651     4651
Flag Coverage Δ
#multiple 88.56% <ø> (ø) ⬆️
#single 40.44% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dd5cef5...8910efd. Read the comment docs.

@codecov
Copy link

codecov bot commented Apr 20, 2017

Codecov Report

Merging #16060 into master will increase coverage by 0.05%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16060      +/-   ##
==========================================
+ Coverage   90.79%   90.84%   +0.05%     
==========================================
  Files         156      159       +3     
  Lines       50537    50794     +257     
==========================================
+ Hits        45886    46146     +260     
+ Misses       4651     4648       -3
Flag Coverage Δ
#multiple 88.63% <ø> (+0.06%) ⬆️
#single 40.36% <ø> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/core/common.py 90.68% <0%> (-0.35%) ⬇️
pandas/io/formats/format.py 95.02% <0%> (-0.1%) ⬇️
pandas/core/frame.py 97.64% <0%> (-0.02%) ⬇️
pandas/io/formats/common.py 94.44% <0%> (ø)
pandas/io/formats/excel.py 96.64% <0%> (ø)
pandas/io/formats/css.py 100% <0%> (ø)
pandas/io/formats/style.py 96.36% <0%> (+0.07%) ⬆️
pandas/util/testing.py 79.81% <0%> (+0.18%) ⬆️
pandas/io/excel.py 80.62% <0%> (+0.88%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dd5cef5...e1b90eb. Read the comment docs.

@chris-b1
Copy link
Contributor Author

Yeah, not sure this is actually an issue, not like mutating is ever thread safe. I was thinking there could possibly be a use after free problem if one of the python strings got deleted during the nogil block, but I can't seem make it happen.

@jreback jreback added the Performance Memory or execution speed performance label Apr 20, 2017
@jreback jreback added this to the 0.20.0 milestone Apr 20, 2017
@jreback
Copy link
Contributor

jreback commented Apr 20, 2017

@chris-b1 your above example doesn't actually create a race. The actual conversion (IOW, the calling of get_c_string is WITH the GIL. After that we just have a pointer to the string, so even if there is mutation, it would have to be within a c-thread that itself mutates a PyObject (which is a no-no). So this looks ok.

@wesm
Copy link
Member

wesm commented Apr 20, 2017

from https://github.com/pandas-dev/pandas/blob/648ae4f03622d8eafe1ca3b833bd6a99f56bece4/pandas/_libs/hashtable_class_helper.pxi.in it looks like the GIL is always held when calling this function. It doesn't look like you are checking NULL though, so exceptions are going unchecked for a while. Perhaps you want except NULL in

inline char *get_c_string(object)
?

@chris-b1
Copy link
Contributor Author

Thanks @jreback, @wesm. Added the except NULL - on the StringHashTable path we've pre-checked that the object array are all strings so it shouldn't ever hit, but as a general utility function I agree it should check.

@jreback jreback merged commit b17e286 into pandas-dev:master Apr 20, 2017
@jreback
Copy link
Contributor

jreback commented Apr 20, 2017

thanks @chris-b1

analyticalmonk pushed a commit to analyticalmonk/pandas that referenced this pull request Apr 20, 2017
* BUG: memory leak on get_c_string

* add except NULL
pcluo pushed a commit to pcluo/pandas that referenced this pull request May 22, 2017
* BUG: memory leak on get_c_string

* add except NULL
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: memory leak in unique() with object dtype
3 participants