Fatal error with astype if duplicate columns are supplied for categorical #24704
Comments
It also seems to be the case when casting from int64 (so it is not specific to object):

```python
import pandas as pd

df = pd.DataFrame({'a': ['1', 1, 3], 'b': [1, 2, 3]})
print(df.dtypes)
categoricals = list(df.select_dtypes(include='int64').columns.values)
categoricals = categoricals + ['b']
df[categoricals] = df[categoricals].astype('category')
print(df.dtypes)
```
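One way to sidestep the crash (my own sketch, not from this thread) is to deduplicate the column list and cast column by column, so no label is assigned twice:

```python
import pandas as pd

df = pd.DataFrame({'a': ['1', 1, 3], 'b': [1, 2, 3]})
categoricals = list(df.select_dtypes(include='int64').columns.values)
categoricals = categoricals + ['b']  # 'b' is now listed twice

# Drop duplicates while preserving order, then cast one column at a time
for col in dict.fromkeys(categoricals):
    df[col] = df[col].astype('category')

print(df.dtypes)  # 'b' is category, 'a' stays object
```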
But not when the astype target is 'int64':

```python
import pandas as pd

df = pd.DataFrame({'a': ['1', 1, 3], 'b': [1, 2, 3]})
print(df.dtypes)
categoricals = list(df.select_dtypes(include='object').columns.values)
categoricals = categoricals + ['a']
df[categoricals] = df[categoricals].astype('int64')
print(df.dtypes)
```

So it is probably related to category.
You are trying to set with a duplicate.
It looks like there are actually two issues here:

1. `astype` with duplicate columns:

```python
In [2]: df = pd.DataFrame([[1, 2], [1, 1], [3, 2]], columns=['a', 'a'])

In [3]: df
Out[3]:
   a  a
0  1  2
1  1  1
2  3  2

In [4]: df.astype('category')
---------------------------------------------------------------------------
RecursionError: maximum recursion depth exceeded

In [5]: df.astype('Int64')
---------------------------------------------------------------------------
RecursionError: maximum recursion depth exceeded
```

This works for other dtypes when duplicate columns are present, and the fix looks easy, so we could probably support it.

2. Setting with duplicate columns:

Even if item 1 was fixed, the setting process would result in an object dtype instead of a categorical/extension dtype:

```python
In [6]: df = pd.DataFrame({'a': [1, 1, 2], 'b': ['foo', 'bar', 'baz']})

In [7]: df
Out[7]:
   a    b
0  1  foo
1  1  bar
2  2  baz

In [8]: df.dtypes
Out[8]:
a     int64
b    object
dtype: object

In [9]: df_aa = pd.concat([pd.Series([10, 20, 30], name='a', dtype='category'),
   ...:                    pd.Series([11, 22, 33], name='a', dtype='category')], axis=1)

In [10]: df_aa
Out[10]:
    a   a
0  10  11
1  20  22
2  30  33

In [11]: df_aa.dtypes
Out[11]:
a    category
a    category
dtype: object

In [12]: df['a'] = df_aa

In [13]: df
Out[13]:
    a    b
0  10  foo
1  20  bar
2  30  baz

In [14]: df.dtypes
Out[14]:
a    object
b    object
dtype: object

In [15]: df[['a', 'a']] = df_aa

In [16]: df
Out[16]:
    a    b
0  10  foo
1  20  bar
2  30  baz

In [17]: df.dtypes
Out[17]:
a    object
b    object
dtype: object
```

I'm not sure that this should be supported. The operation doesn't really make sense to me, and I'm a bit surprised that it didn't raise.

@jreback: what are your thoughts on items 1 and 2?
yeah 1) is ok, 2) is somewhat tricky and prob ok to not support right away.
I am not able to reproduce this issue with the code posted as the faulty example.
Since this looks fixed on main, closing |
Code Sample
Running the following code to change the dtype to category works perfectly
which returns
If an extra column is mistakenly added ('a' is added again):
Python crashes with
Problem description
One would expect pandas to raise an error about the duplicate columns, or to drop the duplicates, instead of crashing.
I'm using Python 3.6.1 and pandas 0.23.4.
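The behaviour the reporter expects can be sketched as a small wrapper (`astype_checked` is a hypothetical helper, not a pandas API) that raises a clear error instead of crashing:

```python
import pandas as pd

def astype_checked(df, dtype):
    """Hypothetical guard: fail loudly on duplicate column labels."""
    if not df.columns.is_unique:
        dupes = list(df.columns[df.columns.duplicated()])
        raise ValueError(f"duplicate column labels: {dupes}")
    return df.astype(dtype)

df = pd.DataFrame([[1, 2], [1, 1]], columns=['a', 'a'])
try:
    astype_checked(df, 'category')
except ValueError as exc:
    print(exc)  # duplicate column labels: ['a']
```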
Expected Output
Output of `pd.show_versions()`:
```
INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-1011-gcp
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 9.0.1
setuptools: 40.6.2
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
```