DataFrame.__setitem__ converts Index to RangeIndex for length-zero value #22060

s-wakaba · 2018-07-26T06:31:49Z

Code Sample

#!/usr/bin/env python3
import pandas as pd
from datetime import datetime

a = pd.DataFrame([[datetime.now(), 1234, 3.1415]], columns=['col0', 'col1', 'col2']).iloc[[]]
print(a.shape)
# (0, 3) <- 0 rows, 3 columns
print(a.dtypes)
# col0    datetime64[ns]
# col1             int64
# col2           float64
# dtype: object

b = a.set_index(['col0'])
print(b.reset_index().dtypes)
# col0    datetime64[ns] <- all preserved
# col1             int64
# col2           float64
# dtype: object
b['col3'] = []
print(b.reset_index().dtypes)
# index      int64 <- column name is lost and dtype is changed to int64
# col1       int64
# col2     float64
# col3     float64
# dtype: object

c = a.set_index(['col0', 'col1'])
print(c.reset_index().dtypes)
# col0    float64 <- column names are preserved but dtypes are changed to float64
# col1    float64 <-
# col2    float64
# dtype: object
c['col3'] = []
print(c.reset_index().dtypes)
# col0    float64 <- column names are still preserved
# col1    float64 <-
# col2    float64
# col3    float64
# dtype: object

Problem description

When certain operations are executed on zero-row DataFrames, various pieces of index information are lost. Furthermore, which information is lost and what triggers the loss are inconsistent between a MultiIndex and a regular Index.

For a DataFrame with a non-MultiIndex, both the index dtype and x.index.name are lost when a new column is added by assigning an empty list.
For a DataFrame with a MultiIndex, the index dtypes are lost as soon as x.set_index([x, y,...]) is called. However, x.index.names are preserved when a new column is added.

Expected Output

In my opinion, there is little downside to preserving all dtype(s) and name(s) in every one of these examples, and doing so would be consistent with the same operations on non-zero-row DataFrames.
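The expectation above can be phrased as a small before/after check. This is a minimal sketch, not part of the original report: `index_summary` is a hypothetical helper, a fixed timestamp replaces `datetime.now()` so the run is reproducible, and the result after the assignment depends on the installed pandas version (the report was filed against 0.23.0).

```python
import pandas as pd
from datetime import datetime


def index_summary(df):
    """Hypothetical helper: return (index class name, index name, index dtype)."""
    idx = df.index
    return (type(idx).__name__, idx.name, str(idx.dtype))


# Build a one-row frame, then drop every row while keeping columns and dtypes.
a = pd.DataFrame(
    [[datetime(2018, 7, 26), 1234, 3.1415]],
    columns=["col0", "col1", "col2"],
).iloc[[]]

b = a.set_index(["col0"])
before = index_summary(b)

b["col3"] = []  # assign a length-zero value as a new column
after = index_summary(b)

print("before:", before)
print("after: ", after)
print("index preserved:", before == after)
```

If the index is preserved, the last line reports `True`; on a version affected by this issue the index class, name and dtype all change.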

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.3.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


TomAugspurger · 2018-07-26T12:25:02Z

There are likely two different cases

The first, b['col3'] = [], converts the DatetimeIndex to a RangeIndex, which shouldn't happen.

In [51]: b = a.set_index(['col0'])

In [52]: b.index
Out[52]: DatetimeIndex([], dtype='datetime64[ns]', name='col0', freq=None)

In [53]: b['col3'] = []

In [54]: b.index
Out[54]: RangeIndex(start=0, stop=0, step=1)

Not sure about the second issue.

s-wakaba · 2018-07-30T06:54:19Z

In the first case (NOT MultiIndex), the index type is lost in the method DataFrame._ensure_valid_index() in pandas/core/frame.py. This method is used by both __setitem__() and insert(), so DataFrame.insert() has the same problem.

In [6]: b.shape
Out[6]: (0, 2)

In [7]: b.index
Out[7]: DatetimeIndex([], dtype='datetime64[ns]', name='col0', freq=None)

In [8]: b.insert(2, 'col3', [])

In [9]: b.index
Out[9]: RangeIndex(start=0, stop=0, step=1)

The relevant part of the _ensure_valid_index() method:

# pandas/core/frame.py
class DataFrame(NDFrame):
    ...
    def _ensure_valid_index(self, value):
        """
        ensure that if we don't have an index, that we can create one from the
        passed value
        """
        # GH5632, make sure that we are a Series convertible
        if not len(self.index) and is_list_like(value):
            ...

I'm not sure that it's appropriate that if not len(self.index) means "if we don't have an index"
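One way to make the distinction concrete is to separate "the index is empty" from "the value can supply new rows". The sketch below is purely illustrative: `should_rebuild_index` is a hypothetical standalone function, not pandas code and not a proposed patch, and the rule it encodes (rebuild only for a non-empty, sized list-like value) is an assumption about one possible behaviour.

```python
import pandas as pd
from pandas.api.types import is_list_like


def should_rebuild_index(index, value):
    """Hypothetical rule: replace an empty index only when the value can supply rows."""
    if len(index) > 0:
        # The frame already has rows, so its index must be kept.
        return False
    if not is_list_like(value) or not hasattr(value, "__len__"):
        # Scalars and unsized iterables cannot define a new index here.
        return False
    # An empty value carries no rows, so an empty but typed index is kept.
    return len(value) > 0


empty_dt = pd.DatetimeIndex([], name="col0")
print(should_rebuild_index(empty_dt, []))
print(should_rebuild_index(empty_dt, [1, 2, 3]))
print(should_rebuild_index(pd.RangeIndex(3), [1, 2, 3]))
```

Under this assumed rule an empty DatetimeIndex survives the assignment of `[]`, while a non-empty list may still define new rows.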

Furthermore, assigning a non-zero-length list as a new column of a zero-row DataFrame also behaves strangely.

import pandas as pd
from datetime import datetime
a = pd.DataFrame([[datetime.now(), 1234, 3.1415]], columns=['col0', 'col1', 'col2']).iloc[[]]

b = a.set_index(['col0']) # b has 0 rows, 2 columns (col1, col2) and a non-MultiIndex
b['col3'] = [1,2,3] # <- NO ERROR!
print(b)
#    col1  col2  col3
# 0   NaN   NaN     1
# 1   NaN   NaN     2
# 2   NaN   NaN     3

c = a.set_index(['col0', 'col1']) # c has 0 rows, 1 column (col2) and a MultiIndex
c['col3'] = [1,2,3] # <- ERROR!
# ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
print(c)

If the DataFrame has a non-zero number of rows, assigning a list of mismatched length as a new column raises "ValueError: Length of values does not match length of index" in both the non-MultiIndex and the MultiIndex case.
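For comparison, the non-zero-row behaviour described above can be reproduced with a one-row frame. This is a minimal sketch: the fixed timestamp and the `messages` dictionary are illustrative choices, and the exact wording of the error message may differ between pandas versions.

```python
import pandas as pd
from datetime import datetime

# One-row frame with the same columns as in the report.
a = pd.DataFrame(
    [[datetime(2018, 7, 26), 1234, 3.1415]],
    columns=["col0", "col1", "col2"],
)

messages = {}
for label, keys in [("single index", ["col0"]), ("MultiIndex", ["col0", "col1"])]:
    df = a.set_index(keys)
    try:
        df["col3"] = [1, 2, 3]  # three values for a one-row frame
    except ValueError as exc:
        messages[label] = str(exc)

for label, message in messages.items():
    print(f"{label}: {message}")
```

Both index kinds reject the assignment here, which is the consistency the zero-row case lacks.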

TomAugspurger · 2018-07-30T13:06:04Z

I'm not sure that it's appropriate that if not len(self.index) means "if we don't have an index"

I don't think it is appropriate.

phofl · 2020-11-09T22:52:12Z

This seems to work now

Code is returning:

(0, 3)
col0    datetime64[ns]
col1             int64
col2           float64
dtype: object
col0    datetime64[ns]
col1             int64
col2           float64
dtype: object
col0    datetime64[ns]
col1             int64
col2           float64
col3           float64
dtype: object
col0    datetime64[ns]
col1             int64
col2           float64
dtype: object
col0    datetime64[ns]
col1             int64
col2           float64
col3           float64
dtype: object

Process finished with exit code 0

TomAugspurger added the Indexing, Dtype Conversions, and Difficulty Intermediate labels Jul 26, 2018

TomAugspurger added this to the Contributions Welcome milestone Jul 26, 2018

p-himik mentioned this issue Jan 22, 2019

Adding a column via assign or __setitem__ on an empty DataFrame changes index type, removes index name #24878

Closed

TomAugspurger changed the title ~~Lost of index information by operation for zero-rows DataFrame~~ DataFrame.__setitem__ converts Index to RangeIndex for length-zero value Jan 22, 2019

jbrockmendel removed Effort Medium labels Oct 21, 2019

phofl added the Needs Tests and good first issue labels Nov 9, 2020

mroeschke mentioned this issue May 29, 2021

TST: More old issues #41712

Merged

15 tasks

mroeschke modified the milestones: Contributions Welcome, 1.3 May 29, 2021

jreback closed this as completed in #41712 May 31, 2021
