str.extract output assignment behavior inconsistent and broken in some cases #25582

Closed
Fealthas opened this issue Mar 7, 2019 · 4 comments
Labels
Duplicate Report: Duplicate issue or pull request
Indexing: Related to indexing on series/frames, not to indexes themselves

Comments

Fealthas commented Mar 7, 2019

import pandas as pd
aval = ['A','B','C']
bval = ['[0-10)','[10-20)','[20-30)']

df = pd.DataFrame({'A':aval, 'B':bval})

#case1
#df.loc[:,'num'] = df['B'].str.extract( r'^\[([0-9]{1,2})' )#fails

#case2
#df['num'] = df['B'].str.extract( r'^\[([0-9]{1,2})' )#works

#case3  
df.loc[:,'num']=''
df.loc[:,'num'] = df['B'].str.extract( r'^\[([0-9]{1,2})' ) #works

print(df)

Problem description

Assigning the extracted groups to a column fails when using the .loc indexer: the column is filled with NaN regardless of whether there is a match. Pre-filling the column with empty strings 'fixes' it.

This behavior is strange, and I spent a long time trying to figure out what was wrong, since it fails silently without raising an error.

case 1 output

   A        B  num
0  A   [0-10)  NaN
1  B  [10-20)  NaN
2  C  [20-30)  NaN

Expected case 1 Output

   A        B num
0  A   [0-10)   0
1  B  [10-20)  10
2  C  [20-30)  20

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.24.1
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.0
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 2.2.3
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml.etree: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Member

WillAyd commented Mar 7, 2019

So I think the problem here is with how assignment gets aligned across all axes when using .loc as mentioned in the first warning here in the docs:

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#basics

By default, the str.extract call produces a DataFrame with a column label of 0. Since .loc expects alignment on all axes, only the following would really work in one pass:

df.loc[:,0] = df['B'].str.extract( r'^\[([0-9]{1,2})' ) 
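
The column label can be checked directly. The following is a minimal sketch that reuses the frame from the original report and assumes a pandas version where expand=True is the default for str.extract (as in the reported 0.24.1); the variable name `extracted` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['A', 'B', 'C'],
                   'B': ['[0-10)', '[10-20)', '[20-30)']})

# With the default expand=True, a single unnamed capture group
# yields a one-column DataFrame rather than a Series.
extracted = df['B'].str.extract(r'^\[([0-9]{1,2})')

print(type(extracted).__name__)     # DataFrame
print(list(extracted.columns))      # [0]
print('num' in extracted.columns)   # False
```

Because the only column label on the right-hand side is 0, there is no 'num' column for .loc to align against.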

Alternatively, you can return a Series from str.extract, which won't have that limitation:

df.loc[:, 'num'] = df['B'].str.extract( r'^\[([0-9]{1,2})', expand=False)
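
As a self-contained sketch of the expand=False variant, again using the frame from the original report (the intermediate name `num` is only illustrative): a Series carries row labels but no column labels, so only the row index has to line up with df.

```python
import pandas as pd

df = pd.DataFrame({'A': ['A', 'B', 'C'],
                   'B': ['[0-10)', '[10-20)', '[20-30)']})

# expand=False with a single capture group returns a Series
num = df['B'].str.extract(r'^\[([0-9]{1,2})', expand=False)

# The Series shares df's row index, so the values land in the new column
df.loc[:, 'num'] = num

print(type(num).__name__)    # Series
print(df['num'].tolist())    # ['0', '10', '20']
```

Note that the extracted values are strings; converting them to numbers would be a separate step.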

I can see where this is confusing; I'm just not sure off the top of my head what solution there is, if any. Let's see what others think.

WillAyd added the Indexing and Duplicate Report labels Mar 7, 2019

Member

WillAyd commented Mar 7, 2019

@Fealthas actually closing this as a duplicate of #10440 - feel free to give that a read as it may clarify some things and comment as necessary therein

WillAyd closed this as completed Mar 7, 2019

Author

Fealthas commented Mar 7, 2019

@WillAyd thanks for looking into it.
I'm not sure that this has anything to do with MultiIndex (my example doesn't use anything but a standard index), but I don't know enough to disagree 🤷‍♂️.

Member

WillAyd commented Mar 7, 2019 via email
