DataFrame.__setitem__ converts Index to RangeIndex for length-zero value #22060

Closed
s-wakaba opened this issue Jul 26, 2018 · 4 comments · Fixed by #41712
Labels
Dtype Conversions Unexpected or buggy dtype conversions good first issue Indexing Related to indexing on series/frames, not to indexes themselves Needs Tests Unit test(s) needed to prevent regressions

@s-wakaba

Code Sample

#!/usr/bin/env python3
import pandas as pd
from datetime import datetime

a = pd.DataFrame([[datetime.now(), 1234, 3.1415]], columns=['col0', 'col1', 'col2']).iloc[[]]
print(a.shape)
# (0, 3) <- 0 rows, 3 columns
print(a.dtypes)
# col0    datetime64[ns]
# col1             int64
# col2           float64
# dtype: object

b = a.set_index(['col0'])
print(b.reset_index().dtypes)
# col0    datetime64[ns] <- all preserved
# col1             int64
# col2           float64
# dtype: object
b['col3'] = []
print(b.reset_index().dtypes)
# index      int64 <- column name is lost and dtype is changed to int64
# col1       int64
# col2     float64
# col3     float64
# dtype: object

c = a.set_index(['col0', 'col1'])
print(c.reset_index().dtypes)
# col0    float64 <- column names are preserved but dtypes are changed to float64
# col1    float64 <-
# col2    float64
# dtype: object
c['col3'] = []
print(c.reset_index().dtypes)
# col0    float64 <- column names are still preserved
# col1    float64 <-
# col2    float64
# col3    float64
# dtype: object

Problem description

When some operations are executed on zero-row DataFrames, various pieces of information about their indices are lost. Furthermore, which information is lost and what triggers the loss are inconsistent between a MultiIndex and a normal Index.

For a DataFrame with a non-MultiIndex, both the index dtype and x.index.name are lost when a new column is added by assigning an empty list.
For a DataFrame with a MultiIndex, the dtypes are lost as soon as x.set_index([x, y,...]) is called. However, x.index.names are preserved when a new column is added.

Expected Output

In my opinion, preserving all dtype(s) and name(s) in each of these example cases would have little downside, and it would be consistent with how the same operations behave on DataFrames with non-zero rows.
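
For comparison, the non-zero-row behaviour referred to here can be checked with a short sketch. It reuses the frame from the code sample but keeps its single row. The fixed timestamp and the names roundtrip and preserved are assumptions made for illustration.

```python
import pandas as pd
from datetime import datetime

# Same columns as the code sample, but the single row is kept.
a = pd.DataFrame(
    [[datetime(2018, 7, 26), 1234, 3.1415]],
    columns=["col0", "col1", "col2"],
)

# Move two columns into a MultiIndex and back again.
roundtrip = a.set_index(["col0", "col1"]).reset_index()

# With at least one row, the column dtypes survive the round trip.
preserved = list(roundtrip.dtypes) == list(a.dtypes)
print(preserved)
```

The zero-row case in the report differs from this only by the trailing .iloc[[]], which is why the dtype changes there look like a bug rather than intended behaviour.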

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.3.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

There are likely two different cases

The first, b['col3'] = [], converts the DatetimeIndex to a RangeIndex, which shouldn't happen.

In [51]: b = a.set_index(['col0'])

In [52]: b.index
Out[52]: DatetimeIndex([], dtype='datetime64[ns]', name='col0', freq=None)

In [53]: b['col3'] = []

In [54]: b.index
Out[54]: RangeIndex(start=0, stop=0, step=1)

Not sure about the second issue.

@TomAugspurger TomAugspurger added Indexing Related to indexing on series/frames, not to indexes themselves Dtype Conversions Unexpected or buggy dtype conversions Difficulty Intermediate labels Jul 26, 2018
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Jul 26, 2018
@s-wakaba
Author

In the first case (NOT MultiIndex), the index type is lost in the method DataFrame._ensure_valid_index() in pandas/core/frame.py. This method is used by __setitem__() and insert(), so DataFrame.insert() has the same problem.

In [6]: b.shape
Out[6]: (0, 2)

In [7]: b.index
Out[7]: DatetimeIndex([], dtype='datetime64[ns]', name='col0', freq=None)

In [8]: b.insert(2, 'col3', [])

In [9]: b.index
Out[9]: RangeIndex(start=0, stop=0, step=1)

In the _ensure_valid_index() method:

# pandas/core/frame.py
class DataFrame(NDFrame):
    ...
    def _ensure_valid_index(self, value):
        """
        ensure that if we don't have an index, that we can create one from the
        passed value
        """
        # GH5632, make sure that we are a Series convertible
        if not len(self.index) and is_list_like(value):
            ...

I'm not sure that it's appropriate that if not len(self.index) means "if we don't have an index"
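
The concern can be made concrete with a small sketch: an index can have length zero and still carry a type and a name, so a length check alone cannot tell "no index yet" apart from "an empty but meaningful index". The helper carries_metadata is hypothetical and not part of pandas. It only illustrates one possible way to draw the distinction.

```python
import pandas as pd


def carries_metadata(index: pd.Index) -> bool:
    """Hypothetical helper (not a pandas API): does the index hold
    information beyond its length, such as a name or a specialised type?"""
    has_name = any(name is not None for name in index.names)
    is_specialised = isinstance(index, (pd.DatetimeIndex, pd.MultiIndex))
    return has_name or is_specialised


default_empty = pd.DataFrame().index             # index of a brand-new frame
typed_empty = pd.DatetimeIndex([], name="col0")  # empty, but typed and named

for idx in (default_empty, typed_empty):
    # Both have length 0, so `not len(idx)` is True for both.
    print(type(idx).__name__, len(idx), idx.name, carries_metadata(idx))
```

Under this assumption, only the first index would be a candidate for being replaced by one derived from the assigned value.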

Furthermore, assigning a non-zero-length list as a new column of a zero-row DataFrame also behaves strangely.

import pandas as pd
from datetime import datetime
a = pd.DataFrame([[datetime.now(), 1234, 3.1415]], columns=['col0', 'col1', 'col2']).iloc[[]]

b = a.set_index(['col0']) # b has 0 rows, 2 columns and non-multi index
b['col3'] = [1,2,3] # <- NO ERROR!
print(b)
#    col1  col2  col3
# 0   NaN   NaN     1
# 1   NaN   NaN     2
# 2   NaN   NaN     3

c = a.set_index(['col0', 'col1']) # c has 0 rows, 1 column and multi index
c['col3'] = [1,2,3] # <- ERROR!
# ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
print(c)

If the DataFrame has non-zero rows, assigning a list of mismatched length as a new column raises "ValueError: Length of values does not match length of index" in both the non-MultiIndex and MultiIndex cases.
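
The non-zero-row behaviour can be reproduced with a minimal sketch. The frame contents and the variable name message are assumptions for illustration, and the exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [0.5, 1.5]})  # two rows

try:
    df["col3"] = [1, 2, 3]  # three values for a two-row frame
except ValueError as exc:
    message = str(exc)  # the length mismatch is rejected
else:
    message = None      # would mean the assignment was silently accepted

print(message)
```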

@TomAugspurger
Contributor

I'm not sure that it's appropriate that if not len(self.index) means "if we don't have an index"

I don't think it is appropriate.

@TomAugspurger TomAugspurger changed the title from "Lost of index information by operation for zero-rows DataFrame" to "DataFrame.__setitem__ converts Index to RangeIndex for length-zero value" Jan 22, 2019
@phofl
Member

phofl commented Nov 9, 2020

This seems to work now

Code is returning:

(0, 3)
col0    datetime64[ns]
col1             int64
col2           float64
dtype: object
col0    datetime64[ns]
col1             int64
col2           float64
dtype: object
col0    datetime64[ns]
col1             int64
col2           float64
col3           float64
dtype: object
col0    datetime64[ns]
col1             int64
col2           float64
dtype: object
col0    datetime64[ns]
col1             int64
col2           float64
col3           float64
dtype: object

Process finished with exit code 0
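
A regression check for the single-level case could look roughly like the sketch below. This is not the test that was added to pandas. The function name and the frame layout are assumptions, and the check is expected to pass only on pandas versions where the behaviour reported in this issue is fixed.

```python
import pandas as pd


def check_empty_assignment_keeps_index() -> None:
    """Hypothetical regression check: assigning or inserting an empty list
    into a zero-row frame must leave the index type and name untouched."""
    index = pd.DatetimeIndex([], name="col0")
    df = pd.DataFrame(columns=["col1", "col2"], index=index)
    expected = df.index.copy()

    df["col3"] = []  # the __setitem__ path
    pd.testing.assert_index_equal(df.index, expected)

    df.insert(len(df.columns), "col4", [])  # the insert() path
    pd.testing.assert_index_equal(df.index, expected)


check_empty_assignment_keeps_index()
```

pd.testing.assert_index_equal compares the index class and names by default, so a silent conversion to an unnamed RangeIndex would raise an AssertionError.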

@phofl phofl added Needs Tests Unit test(s) needed to prevent regressions good first issue labels Nov 9, 2020
@mroeschke mroeschke mentioned this issue May 29, 2021
15 tasks
@mroeschke mroeschke modified the milestones: Contributions Welcome, 1.3 May 29, 2021