Errors in pandas._libs.lib.maybe_convert_objects leading to use of uninitialized memory #31944

Closed
ahmadia opened this issue Feb 13, 2020 · 8 comments · Fixed by #36783
Labels: good first issue · NA - MaskedArrays (related to pd.NA and nullable extension arrays) · Needs Tests (unit test(s) needed to prevent regressions)

@ahmadia

ahmadia commented Feb 13, 2020

Code Sample 1

import numpy as np
import pandas as pd
from pandas._libs import lib as lib
print(lib.maybe_convert_objects(np.array([np.nan, True], dtype='object'), safe=1))

Code Sample 2

import numpy as np
import pandas as pd
from pandas._libs import lib as lib
print(lib.maybe_convert_objects(np.array([np.nan, np.datetime64()], dtype='object'), safe=1, convert_datetime=1))

Code Sample 3

import math
import pandas as pd

z1 = pd.Series([True, False, pd.NA]).value_counts(dropna=False)
print(z1)
z2 = pd.Series({True: 1, False: 1, math.nan:1})
print(z1)

Problem description

This problem was originally detected by @benfuja, who noticed inconsistent output from value_counts(dropna=False) on otherwise-Boolean arrays containing a pd.NA indicator. Code Sample 3 is Ben's original example showing the odd behavior (note that z1 is deliberately printed twice; depending on your malloc library, its values will likely change between calls). When we tried to isolate what was happening, we went down a couple of rabbit holes that led to the maybe_convert_objects function.

What I suspect is happening: when maybe_convert_objects is called by the Series index formatter, it expects that if a mixture of Booleans and np.nan is present, the np.nan values will be preserved. Instead, maybe_convert_objects appears to return garbage when Booleans (or datetimes/timedeltas) appear alongside np.nan: the returned array contains uninitialized memory in the positions where the np.nan values were. Code Samples 1 and 2 demonstrate this.
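For context, the same mix behaves correctly through the public Series constructor — a minimal sketch showing the expected fallback to object dtype with the NaN preserved, rather than a bool array holding garbage:

```python
import numpy as np
import pandas as pd

# A mix of booleans and NaN cannot be represented by a plain bool array,
# so pandas should fall back to object dtype and keep the NaN, rather
# than emit uninitialized values in its place.
s = pd.Series([np.nan, True])
print(s.dtype)         # object
print(s.isna().sum())  # 1 — the NaN survives the conversion
```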

As noted in #27417, there are probably a few opportunities to improve this function and its dependencies. I'm not particularly familiar with this code base, but it seems to me that the most correct minimal fix may be in the Seen class:

Proposed Minimal Fix

    @property
    def numeric_(self):
-        return self.complex_ or self.float_ or self.int_
+        return self.complex_ or self.float_ or self.int_ or self.null_ or self.nan_ 

Rationale

The maybe_convert_objects function relies on the Seen class to tell it whether it is safe to treat an array as containing only values of a specific type. When maybe_convert_objects encounters a NaN or a None, it stores the value in the floats and complexes arrays and sets the seen.nan_ / seen.null_ flags. Later checks for whether an array is of a specific type will miss that these values were encountered unless those flags are consulted. Seen.numeric_ seemed like the most logical place for the checks, since both null and NaN values are already stored in the floats/complexes arrays. Another option would be to fix the logic in maybe_convert_objects itself to check the seen.nan_ / seen.null_ flags more carefully in its case statements.
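The bookkeeping described above can be sketched in pure Python. This is an illustrative simplification — the flag names mirror the Cython Seen class in pandas/_libs/lib.pyx, but this is not the real implementation:

```python
# Simplified sketch of the Seen flag bookkeeping (illustrative only).
class Seen:
    def __init__(self):
        self.int_ = self.float_ = self.complex_ = False
        self.nan_ = self.null_ = self.bool_ = False

    @property
    def numeric_(self):
        # Without "or self.nan_ or self.null_", an array holding only
        # booleans plus a NaN looks purely boolean to later checks, so
        # the bool result array is returned with the NaN slots unwritten.
        return self.complex_ or self.float_ or self.int_

seen = Seen()
seen.bool_ = True  # encountered True
seen.nan_ = True   # encountered np.nan
# The unpatched check misses that a non-boolean value was seen:
assert not seen.numeric_
```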

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : None python : 3.8.1.final.0 python-bits : 64 OS : Darwin OS-release : 18.7.0 machine : x86_64 processor : i386 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8

pandas : 1.0.1
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 45.0.0.post20200113
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@mmccarty
Contributor

@ahmadia - Thank you for the details!
@TomAugspurger - Can you advise on the proposed fix or point us to the right maintainer?

@TomAugspurger
Contributor

Thanks for the report. FYI, you probably want Series([True, False, pd.NA], dtype='boolean') to get our new nullable dtype rather than object dtype.
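For example (a short illustration of the nullable dtype suggested above):

```python
import pandas as pd

# With the nullable extension dtype, pd.NA is a first-class missing
# value and value_counts(dropna=False) reports it directly, instead of
# the series falling back to object dtype.
s = pd.Series([True, False, pd.NA], dtype="boolean")
print(s.dtype)  # boolean
counts = s.value_counts(dropna=False)
print(counts)   # one entry each for True, False, and <NA>
```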

cc @jbrockmendel if you have thoughts on the proposed change to Seen.numeric_. I'm not personally familiar with it.

@jbrockmendel
Member

@ahmadia just to check, at the end of "code sample 3", are you intentionally re-printing z1, or should that be print(z2)?

if you have thoughts on the proposed change to Seen.numeric_

Based on a cursory glance, it isn't obvious that adding or self.null_ is correct. Wouldn't that incorrectly identify pd.NaT as numeric?

@ahmadia
Author

ahmadia commented Feb 18, 2020

@jbrockmendel - Indeed, I am intentionally re-printing z1. On my laptop (and, I suspect, most OS X computers), you'll see two different outputs due to the use of uninitialized memory in the maybe_convert_objects functionality.

Let me take a closer look at how pd.NaT is handled currently.

@ahmadia
Author

ahmadia commented Feb 18, 2020

If I'm understanding the Seen class definition correctly (I'm looking at your master branch), it has separate flags for nat, nan, and null. This will not lead to a situation where pd.NaT is identified as numeric, so long as Seen.nat_ is set.

@jbrockmendel
Member

Great. PR would be welcome.

@ShaharNaveh ShaharNaveh added this to the Contributions Welcome milestone Mar 14, 2020
@Dr-Irv Dr-Irv added the NA - MaskedArrays Related to pd.NA and nullable extension arrays label Sep 12, 2020
@Dr-Irv
Contributor

Dr-Irv commented Sep 12, 2020

This seems to be addressed in pandas 1.1.1. On code sample 3, with pandas 1.0.1, I get:

>>> z1 = pd.Series([True, False, pd.NA]).value_counts(dropna=False)
>>> print(z1)
True     1
False    1
True     1
dtype: int64

With pandas 1.1.1, I get:

>>> z1 = pd.Series([True, False, pd.NA]).value_counts(dropna=False)
>>> print(z1)
True     1
False    1
NaN      1
dtype: int64

So the NaN appeared in the value_counts().

A PR that adds a test for this would be welcome.
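A minimal regression check along those lines might look like the following — written here as a plain script for illustration; in the pandas test suite it would live in a pytest test function, and the exact placement and name are not prescribed by this issue:

```python
import pandas as pd

# Regression check: the missing value must appear in value_counts, and
# repeated calls must agree (the original bug surfaced as output that
# changed between calls due to uninitialized memory).
result = pd.Series([True, False, pd.NA]).value_counts(dropna=False)
repeat = pd.Series([True, False, pd.NA]).value_counts(dropna=False)

assert len(result) == 3                 # True, False, and the NA bucket
assert result.sum() == 3                # one occurrence of each value
assert result.index.isna().sum() == 1   # the NA bucket is present
assert result.equals(repeat)            # deterministic across calls
```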

@Dr-Irv Dr-Irv added Needs Tests Unit test(s) needed to prevent regressions good first issue labels Sep 12, 2020
@icanhazcodeplz
Contributor

take
