Errors in pandas._libs.lib.maybe_convert_objects leading to use of uninitialized memory #31944

Closed
ahmadia opened this issue Feb 13, 2020 · 8 comments · Fixed by #36783
Labels: good first issue · NA - MaskedArrays (related to pd.NA and nullable extension arrays) · Needs Tests (unit test(s) needed to prevent regressions)

@ahmadia

ahmadia commented Feb 13, 2020

Code Sample 1

import numpy as np
import pandas as pd
from pandas._libs import lib as lib
print(lib.maybe_convert_objects(np.array([np.nan, True], dtype='object'), safe=1))

Code Sample 2

import numpy as np
import pandas as pd
from pandas._libs import lib as lib
print(lib.maybe_convert_objects(np.array([np.nan, np.datetime64()], dtype='object'), safe=1, convert_datetime=1))

Code Sample 3

import math
import pandas as pd

z1 = pd.Series([True, False, pd.NA]).value_counts(dropna=False)
print(z1)
z2 = pd.Series({True: 1, False: 1, math.nan:1})
print(z1)

Problem description

This problem was originally detected by @benfuja, who noticed inconsistent output from value_counts(dropna=False) on otherwise-Boolean arrays containing a pd.NA indicator. Code Sample 3 is Ben's original example showing the odd behavior (note that z1 is deliberately printed twice; depending on your malloc library, its values will likely change between calls). When we tried to isolate what was happening, we went down a couple of rabbit holes that led to the maybe_convert_objects function.

What I suspect is happening: when maybe_convert_objects is called by the Series index formatter, it expects that if a mixture of Booleans and np.nan is present, the np.nan values will be preserved. Instead, maybe_convert_objects appears to return garbage when Booleans (or datetimes/timedeltas) appear alongside np.nan: the returned array contains uninitialized memory in the positions where the np.nan values were. Code Samples 1 and 2 demonstrate this.
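For context, the same mix behaves correctly through the public Series constructor — a minimal sketch showing the expected fallback to object dtype with the NaN preserved, rather than a bool array holding garbage:

```python
import numpy as np
import pandas as pd

# A mix of booleans and NaN cannot be represented by a plain bool array,
# so pandas should fall back to object dtype and keep the NaN, rather
# than emit uninitialized values in its place.
s = pd.Series([np.nan, True])
print(s.dtype)         # object
print(s.isna().sum())  # 1 — the NaN survives the conversion
```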

As noted in #27417, there are probably a few opportunities to improve this function and its dependencies. I'm not particularly familiar with this code base, but it seems to me that the most correct minimal fix may be in the Seen class:

Proposed Minimal Fix

    @property
    def numeric_(self):
-        return self.complex_ or self.float_ or self.int_
+        return self.complex_ or self.float_ or self.int_ or self.null_ or self.nan_ 

Rationale

The maybe_convert_objects function relies on the Seen class to tell it whether it is safe to treat an array as containing only values of a specific type. When maybe_convert_objects encounters a NaN or a None, it stores the value in the floats and complexes arrays and sets the seen.nan_ / seen.null_ flags. Later checks for whether an array is of a specific type will miss that these values were encountered unless those flags are consulted. Seen.numeric_ seemed like the most logical place for the checks, since both null and NaN values are already stored in the floats/complexes arrays. Another option would be to fix the logic in maybe_convert_objects itself to check the seen.nan_ / seen.null_ flags more carefully in its case statements.
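The bookkeeping described above can be sketched in pure Python. This is an illustrative simplification — the flag names mirror the Cython Seen class in pandas/_libs/lib.pyx, but this is not the real implementation:

```python
# Simplified sketch of the Seen flag bookkeeping (illustrative only).
class Seen:
    def __init__(self):
        self.int_ = self.float_ = self.complex_ = False
        self.nan_ = self.null_ = self.bool_ = False

    @property
    def numeric_(self):
        # Without "or self.nan_ or self.null_", an array holding only
        # booleans plus a NaN looks purely boolean to later checks, so
        # the bool result array is returned with the NaN slots unwritten.
        return self.complex_ or self.float_ or self.int_

seen = Seen()
seen.bool_ = True  # encountered True
seen.nan_ = True   # encountered np.nan
# The unpatched check misses that a non-boolean value was seen:
assert not seen.numeric_
```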

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : None python : 3.8.1.final.0 python-bits : 64 OS : Darwin OS-release : 18.7.0 machine : x86_64 processor : i386 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8

pandas : 1.0.1
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 45.0.0.post20200113
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@mmccarty
Contributor

@ahmadia - Thank you for the details!
@TomAugspurger - Can you advise on the proposed fix or point us to the right maintainer?

@TomAugspurger
Contributor

Thanks for the report. FYI, you probably want Series([True, False, pd.NA], dtype='boolean') to get our new nullable dtype rather than object dtype.
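For example (a short illustration of the nullable dtype suggested above):

```python
import pandas as pd

# With the nullable extension dtype, pd.NA is a first-class missing
# value and value_counts(dropna=False) reports it directly, instead of
# the series falling back to object dtype.
s = pd.Series([True, False, pd.NA], dtype="boolean")
print(s.dtype)  # boolean
counts = s.value_counts(dropna=False)
print(counts)   # one entry each for True, False, and <NA>
```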

cc @jbrockmendel if you have thoughts on the proposed change to Seen.numeric_. I'm not personally familiar with it.

@jbrockmendel
Member

@ahmadia just to check, at the end of "code sample 3", are you intentionally re-printing z1, or should that be print(z2)?

if you have thoughts on the proposed change to Seen.numeric_

Based on a cursory glance, it isn't obvious that adding or self.null_ is correct. Wouldn't that incorrectly identify pd.NaT as numeric?

@ahmadia
Author

ahmadia commented Feb 18, 2020

@jbrockmendel - Indeed, I am intentionally re-printing z1. On my laptop (and, I suspect, most OS X computers), you'll see two different outputs due to the use of uninitialized memory in the maybe_convert_objects functionality.

Let me take a closer look at how pd.NaT is handled currently.

@ahmadia
Author

ahmadia commented Feb 18, 2020

If I'm understanding the Seen class definition correctly (I'm looking at your master branch), it has separate flags for nat, nan, and null. This will not lead to a situation where pd.NaT is identified as numeric, so long as Seen.nat_ is set.

@jbrockmendel
Member

Great. PR would be welcome.

@ShaharNaveh ShaharNaveh added this to the Contributions Welcome milestone Mar 14, 2020
@Dr-Irv Dr-Irv added the NA - MaskedArrays Related to pd.NA and nullable extension arrays label Sep 12, 2020
@Dr-Irv
Contributor

Dr-Irv commented Sep 12, 2020

This seems to be addressed in pandas 1.1.1. On code sample 3, with pandas 1.0.1, I get:

>>> z1 = pd.Series([True, False, pd.NA]).value_counts(dropna=False)
>>> print(z1)
True     1
False    1
True     1
dtype: int64

With pandas 1.1.1, I get:

>>> z1 = pd.Series([True, False, pd.NA]).value_counts(dropna=False)
>>> print(z1)
True     1
False    1
NaN      1
dtype: int64

So the NaN appeared in the value_counts().

A PR that adds a test for this would be welcome.
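A minimal regression check along those lines might look like the following — written here as a plain script for illustration; in the pandas test suite it would live in a pytest test function, and the exact placement and name are not prescribed by this issue:

```python
import pandas as pd

# Regression check: the missing value must appear in value_counts, and
# repeated calls must agree (the original bug surfaced as output that
# changed between calls due to uninitialized memory).
result = pd.Series([True, False, pd.NA]).value_counts(dropna=False)
repeat = pd.Series([True, False, pd.NA]).value_counts(dropna=False)

assert len(result) == 3                 # True, False, and the NA bucket
assert result.sum() == 3                # one occurrence of each value
assert result.index.isna().sum() == 1   # the NA bucket is present
assert result.equals(repeat)            # deterministic across calls
```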

@Dr-Irv Dr-Irv added Needs Tests Unit test(s) needed to prevent regressions good first issue labels Sep 12, 2020
@icanhazcodeplz
Contributor

take
