-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Errors in pandas._libs.lib.maybe_convert_objects leading to use of uninitialized memory #31944
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@ahmadia - Thank you for the details! |
Thanks for the report. FYI, you probably want cc @jbrockmendel if you have thoughts on the proposed change to |
@ahmadia just to check, at the end of "code sample 3", are you intentionally re-printing
Based on a cursory glance, it isnt obvious that adding |
@jbrockmendel - Indeed, I am intentionally re-printing Let me take a closer look at how |
If I'm understanding the |
Great. PR would be welcome. |
This seems to be addressed in pandas 1.1.1. On code sample 3, with pandas 1.0.1, I get: >>> z1 = pd.Series([True, False, pd.NA]).value_counts(dropna=False)
>>> print(z1)
True 1
False 1
True 1
dtype: int64 With pandas 1.1.1, I get: >>> z1 = pd.Series([True, False, pd.NA]).value_counts(dropna=False)
>>> print(z1)
True 1
False 1
NaN 1
dtype: int64 So the A PR that adds a test for this would be welcome. |
take |
Code Sample 1
Code Sample 2
Code Sample 3
Problem description
This problem was originally detected by @benfuja when he noticed inconsistent output in the
value_counts
function whendropna=False
for otherwise Boolean arrays containing apd.NA
indicator. Code Sample 3 is Ben's original example showing the odd behavior (note that z1 is printed twice, and depending on your malloc library, will likely change values between calls). When we tried to isolate what was happening, we ended up down a couple of rabbit holes leading to themaybe_convert_objects
function.What I suspect is happening is that when
maybe_convert_objects
is called by the Series index formatter, it is expecting that if a mixture of Booleans andnp.nan
are present, that the presence ofnp.nan
will be preserved. Instead, I think what happens is thatmaybe_convert_objects
erroneously returns garbage values when either Booleans or datetimes/timedeltas are present along withnp.nan
by returning a corresponding array with uninitialized empty memory in the locations where thenp.nan
values were present. Code Samples 1 and 2 are examples of this.As noted in #27417, there are probably a few opportunities to improve this function and its dependencies. I'm not particularly familiar with this code base, but it seems to me that the most correct minimal fix may be in the Seen class:
Proposed Minimal Fix
Rationale
The
maybe_convert_objects
method relies on theSeen
class to tell it if it is safe to consider an array as containing only values of a specific type. Whenmaybe_convert_objects
encounters aNaN
or aNone
, it stores the encountered value in thefloats
andcomplexes
arrays and sets theseen.nan_
/seen.null_
flags. Later checks for whether an array is of a specific type will miss that these values were encountered if these flags are not checked for. This seemed like the most logical place to put these checks inSeen
, since bothself.null_
andself.nan_
are stored in thefloats/complexes
arrays earlier. Another option would be to fix the logic inmaybe_convert_objects
to more carefully check theself.nan_/seen.null_
flags in the case statements.Output of
pd.show_versions()
pandas : 1.0.1
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 45.0.0.post20200113
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None
The text was updated successfully, but these errors were encountered: