str accessor functions returns float(NaN) instead of pd.NA #30966

tsvikas · 2020-01-13T14:28:56Z

Code Sample

import numpy as np  # needed for np.nan below
import pandas as pd
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'], dtype="string")
type(s.str.lower()[5])  # returns <class 'float'>

Problem description

The str accessor, when working on a string-typed Series, should return a string-typed Series whose values are only strings or pd.NA. However, some functions (see the list below) can return a Series that contains float('nan').

Affected functions

As of now, I have found these str accessor functions to be affected:
upper, lower, replace
Also, extract(expand=False) on a string-typed Series returns an object-typed Series, which seems unintended as well.
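
One way to check other str accessor functions for the same problem is to collect the Python types of the missing entries in each result. This is only a sketch: the helper name missing_value_types and the arguments passed to replace are illustrative assumptions, not part of pandas or of the original report. On a pandas version with the bug, the affected functions would report float; a fixed version should report only NAType.

```python
import numpy as np
import pandas as pd


def missing_value_types(series):
    """Return the names of the Python types used for missing entries.

    Hypothetical helper for illustration; not part of pandas.
    """
    return {type(v).__name__ for v in series if pd.isna(v)}


s = pd.Series(["A", "Aaba", np.nan, "dog"], dtype="string")

# The method names come from the list above; the replace arguments are
# arbitrary and chosen only so that the call does something visible.
results = {
    "upper": s.str.upper(),
    "lower": s.str.lower(),
    "replace": s.str.replace("a", "b", regex=False),
}

for name, result in results.items():
    # Print the result dtype and which missing-value types it contains.
    print(name, result.dtype, missing_value_types(result))
```

The same helper can be pointed at any other str method by adding an entry to the results dictionary.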

Expected Output

<class 'pandas._libs.missing.NAType'>

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.8.0.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.0.0-38-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_IL
LOCALE           : en_IL.UTF-8

pandas           : 1.0.0rc0
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 18.1
setuptools       : 40.8.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : None


TomAugspurger · 2020-01-13T14:48:22Z

Thanks for the report. The problem seems to be with the StringArray constructor when it is passed an ndarray:

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: values = np.array(['a', np.nan], dtype=object)

In [6]: pd.arrays.StringArray(values)
Out[6]:
<StringArray>
['a', nan]
Length: 2, dtype: string

TomAugspurger · 2020-01-13T14:53:36Z

I don't recall whether we expect / document that the values passed to StringArray should be strings / NA (not NaN). When we go through pd.array, things work correctly, since StringArray._from_sequence does the NaN -> NA substitution. My preference is for

StringArray(np.array(['a', np.nan], dtype=object)) to raise, since the user has passed NaN instead of NA
core/strings.py to be updated to use NA in the correct places, or to go through _from_sequence.
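
For comparison, the pd.array path mentioned above can be checked directly. This is a minimal sketch that assumes a pandas version where the "string" dtype uses pd.NA as its missing value. It avoids calling the StringArray constructor itself, since that constructor's handling of NaN is exactly what is under discussion here.

```python
import numpy as np
import pandas as pd

values = np.array(["a", np.nan], dtype=object)

# Going through pd.array with dtype="string" converts the NaN entry to pd.NA,
# per the comment above about StringArray._from_sequence.
arr = pd.array(values, dtype="string")

print(type(arr[1]))
print(arr[1] is pd.NA)
```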

tsvikas · 2020-01-13T14:54:14Z

Thanks for the reply! Any way to help to solve the issue?

On a similar note, str.extract(expand=False) on a string-typed Series returns an object-typed DataFrame, which seems unintended as well. (For comparison, str.extract(expand=True) on a string-typed Series returns a string-typed DataFrame.)

TomAugspurger · 2020-01-13T15:37:44Z

To have StringArray(np.array(['a', np.nan], dtype=object)) raise, we'll need to update _libs.lib.is_string_array. Other places use it, so we'll need an additional flag like na_only to exclude NaN from the NA check.
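
The idea of such a flag can be sketched in pure Python. This is not the pandas implementation (the real check lives in compiled code in _libs.lib), and both the function name is_string_array_sketch and the exact meaning given to na_only here are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def is_string_array_sketch(values, na_only=False):
    """Pure-Python illustration only; not the pandas implementation.

    With na_only=False, any scalar that pd.isna treats as missing is accepted.
    With na_only=True, pd.NA is the only accepted missing value, so NaN fails.
    """
    for v in values:
        if isinstance(v, str):
            continue
        if na_only:
            if v is not pd.NA:
                return False
        elif not pd.isna(v):
            return False
    return True


with_nan = np.array(["a", np.nan], dtype=object)
with_na = np.array(["a", pd.NA], dtype=object)

print(is_string_array_sketch(with_nan))                # NaN accepted by default
print(is_string_array_sketch(with_nan, na_only=True))  # NaN rejected
print(is_string_array_sketch(with_na, na_only=True))   # pd.NA accepted
```

Keeping the default permissive leaves the other callers mentioned above unaffected, while a strict caller such as the StringArray constructor could opt in to the NA-only check.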

TomAugspurger · 2020-01-13T15:39:17Z

The expand=False case does look buggy. I thought there were cases where we returned a tuple of matches, but apparently not. Can you open an issue just for that?


TomAugspurger added this to the 1.0.0 milestone Jan 13, 2020

TomAugspurger added the Strings, ExtensionArray, and Missing-data labels Jan 13, 2020

tsvikas changed the title ~~str accesor functions returns float(NaN) instead of pd.NA~~ str accessor functions returns float(NaN) instead of pd.NA Jan 13, 2020

tsvikas mentioned this issue Jan 13, 2020

str.extract(expand=False) on a string typed series returns an object typed dataframe #30969

Closed

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 13, 2020

Disallow NaN in StringArray constructor

b162d1e

Closes pandas-dev#30966

TomAugspurger mentioned this issue Jan 13, 2020

API: Disallow NaN in StringArray constructor #30980

Merged

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 13, 2020

Disallow NaN in StringArray constructor

6f3e367

Closes pandas-dev#30966

jreback closed this as completed in #30980 Jan 14, 2020

lithomas1 mentioned this issue Dec 29, 2021

PERF: avoid copies in lib.infer_dtype #45057

Merged

4 tasks
