reset_index() on MultiIndexed empty dataframe does not preserve dtypes #19602

alberto-dellera · 2018-02-08T19:21:52Z

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame(data=[[0, 0, 0]], columns=['level_1', 'level_2', 'payload'])

# make dataframe empty
df = df[ df.payload == -1 ]

# columns are all int64 here

df.info()
#output: level_1    0 non-null int64
#output: level_2    0 non-null int64
#output: payload    0 non-null int64

# set MultiIndex - levels are still int64 
df = df.set_index(['level_1','level_2'])

print(str(df.index.levels[0].dtype))
print(str(df.index.levels[1].dtype))
#output: int64
#output: int64

# reset_index - former-levels columns are now float64
df = df.reset_index()

df.info()
#output: level_1    0 non-null float64
#output: level_2    0 non-null float64
#output: payload    0 non-null int64

Problem description

The dtypes are preserved, however, if either
a) the index is not a MultiIndex, or
b) the dataframe is not empty.

Case (b) matters most: programs that compute subsets of dataframes sometimes end up with
an empty one, and downstream code that expects a certain dtype fails when it finds
a float64 instead.
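
For contrast, here is a minimal sketch of the two cases that keep their dtypes. The helper name make_frame is only for illustration; the column names are the ones from the code sample above.

```python
import pandas as pd

def make_frame(rows):
    # Illustrative helper: same columns as in the code sample above.
    return pd.DataFrame(data=rows, columns=["level_1", "level_2", "payload"])

non_empty = make_frame([[0, 0, 0]])
empty = non_empty[non_empty.payload == -1]

# Case (b): MultiIndex on a non-empty frame, round-tripped through reset_index().
roundtrip_non_empty = non_empty.set_index(["level_1", "level_2"]).reset_index()

# Case (a): single-level index on an empty frame.
roundtrip_single = empty.set_index("level_1").reset_index()

print(roundtrip_non_empty.dtypes)
print(roundtrip_single.dtypes)
```

Only the combination of an empty frame and a MultiIndex shows the float64 columns.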

Real-world scenario: sampling a system (a collection of processes or threads) at regular intervals
and collecting some measures (CPU used, or other resources or figures), a very common strategy
in performance investigation software (e.g. see Oracle's v$active_session_history).
Here, the natural index is (sample_time, process_id), with sample_time being datetime64 (a time series).
Even more naturally, we want to compute differences of sample_time, yielding a timedelta64,
and divide them by np.timedelta64(1,'s') to get the elapsed time in seconds; but when the
initial dataframe is empty, we end up dividing float64 by np.timedelta64(1,'s') and get an exception.
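
A small sketch of that elapsed-time computation, with made-up sample values and an assumed cpu_used column. It shows the path that works as long as sample_time keeps its datetime64 dtype, including on an empty slice.

```python
import numpy as np
import pandas as pd

# Made-up samples; the column names follow the scenario described above.
samples = pd.DataFrame({
    "sample_time": pd.to_datetime([
        "2018-02-08 19:00:00",
        "2018-02-08 19:00:05",
        "2018-02-08 19:00:15",
    ]),
    "process_id": [1, 1, 1],
    "cpu_used": [10, 12, 17],
})

# Differences of a datetime64 column are timedelta64 (the first one is NaT).
elapsed = samples["sample_time"].diff()

# Dividing by a one-second timedelta gives plain float seconds.
elapsed_seconds = elapsed / np.timedelta64(1, "s")

# The same computation on an empty slice that still has the datetime64 dtype.
empty = samples.iloc[0:0]
empty_seconds = empty["sample_time"].diff() / np.timedelta64(1, "s")

print(elapsed_seconds)
print(empty_seconds.dtype)
```

The division only makes sense for a timedelta64 operand, which is why a column silently turned into float64 breaks it.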

An obvious workaround is to check for empty dataframes after EVERY reset_index()
and coerce the float64 columns back to their correct dtype, but that easily becomes a maintenance/coverage nightmare :O
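
To make that workaround concrete, here is a rough sketch of a wrapper. The function name reset_index_keep_dtypes is hypothetical (it is not a pandas API), and it assumes the index levels are named.

```python
import pandas as pd

def reset_index_keep_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: reset_index(), then restore the index dtypes if the frame is empty."""
    index = df.index
    if isinstance(index, pd.MultiIndex):
        # Remember the dtype of every level before the index is dropped.
        level_dtypes = {name: level.dtype for name, level in zip(index.names, index.levels)}
    else:
        level_dtypes = {index.name: index.dtype}

    out = df.reset_index()
    if out.empty:
        # Only the empty case needs the coercion; unnamed levels are skipped.
        wanted = {name: dtype for name, dtype in level_dtypes.items() if name in out.columns}
        out = out.astype(wanted)
    return out

df = pd.DataFrame(data=[[0, 0, 0]], columns=["level_1", "level_2", "payload"])
df = df[df.payload == -1].set_index(["level_1", "level_2"])
restored = reset_index_keep_dtypes(df)
print(restored.dtypes)
```

Every call site still has to remember to use the wrapper instead of reset_index(), which is exactly the maintenance problem.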

Expected Output

The reset columns should keep their initial dtype.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: None
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.1
openpyxl: 2.4.9
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


TomAugspurger · 2018-02-09T17:14:42Z

Yeah, I could see these preserving the index dtypes. Interested in making a PR?

alberto-dellera · 2018-02-09T17:47:07Z

@TomAugspurger : unfortunately I am not skilled enough for a PR yet, sorry :(

TomAugspurger · 2018-02-09T19:18:15Z

I just came across this code for an unrelated PR. The issue may be at

pandas/pandas/core/frame.py

Lines 3387 to 3389 in a214915

    
    if mask.all():
        values = np.empty(len(mask))
        values.fill(np.nan)

mask is empty, but mask.all() is true, so we do np.empty(), which is float.

If we check for mask.all() and len(mask) on that line, things may be fixed. Haven't tried yet.
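
A quick NumPy-only sketch of the two facts above, outside of pandas. The variable names are illustrative, and the last line is just one reading of the "mask.all() and len(mask)" check, not a tested patch.

```python
import numpy as np

# An empty boolean mask, as you would get for a frame with no rows.
mask = np.array([], dtype=bool)

# all() over an empty array is vacuously True.
all_true = bool(mask.all())

# np.empty() without a dtype argument defaults to float64.
values = np.empty(len(mask))
values.fill(np.nan)

print(all_true, values.dtype, values.shape)

# With the extra length check, an empty mask no longer takes the NaN-filling branch.
takes_nan_branch = bool(len(mask) and mask.all())
print(takes_nan_branch)
```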

allComputableThings · 2018-02-26T22:16:34Z

Would it be resolved with something like this?

    if mask.all():
        values = np.empty(len(mask), dtype=index.dtype)
        values.fill(np.nan)

TomAugspurger · 2018-02-27T15:27:09Z

Not sure. Could you try it out and see?


vaibhawc · 2018-10-26T21:24:40Z

Has there been any progress on this?
I recently ran into the same problem.

TomAugspurger · 2018-10-29T20:28:25Z

Still open. @vaibhawc can you investigate the issue and make a PR?

vaibhawc · 2018-10-30T08:03:24Z

My issue was resolved by what @stuz5000 suggested, though I didn't quite understand why it worked.
:| Should I make a PR?

TomAugspurger · 2018-10-31T19:13:03Z

Sure, give that a shot.

vaibhawc · 2018-11-01T12:36:35Z

OK, could you please help me with the base and compare branches?

TomAugspurger · 2018-11-01T22:43:08Z

Contributing guidelines are at http://pandas-docs.github.io/pandas-docs-travis/contributing.html. Post if you have any issues.

agralak-queueco · 2019-09-06T12:17:15Z

btw, as a quick workaround, you can copy the previous dtypes and assign them back after grouping:

    old_dtypes = dict(data.dtypes)
    filtered = ...  # insert your filter here
    filtered = filtered.astype({
        # restore the saved dtype of every column that is still present
        **{key: value for key, value in old_dtypes.items() if key in filtered.columns},
        **{'new_columns': np.int16},
    })

TomAugspurger added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), MultiIndex, and Effort Medium labels Feb 9, 2018

TomAugspurger mentioned this issue Nov 21, 2018

Empty dataframe with a multi-index loses the type information about the index types #23842

Closed

vaibhawc mentioned this issue Nov 22, 2018

Preserve dtype on reset_index() in an empty DataFrame #23856

Closed

ldmnt mentioned this issue Jul 12, 2019

BUG: Wrong dtype when resetting a multiindex with missing values. (#1… #27370

Closed

6 tasks

TomAugspurger mentioned this issue Aug 14, 2019

Empty series dtype discarded by reset_index with MultiIndex #27913

Closed

jbrockmendel removed Effort Medium labels Oct 21, 2019

Dr-Irv mentioned this issue Jan 7, 2020

DataFrame.set_index() may not preserve dtype #30517

Open

TomAugspurger mentioned this issue Apr 21, 2020

BUG: Incorrect dtypes after resetting a multi-index for a data frame with no rows #33698

Closed

3 tasks

rhshadrach mentioned this issue Jun 22, 2020

BUG: reset_index doesn't preserve dtype on empty frame with MultiIndex #34942

Merged

5 tasks

jreback added this to the 1.1 milestone Jun 23, 2020

jreback closed this as completed in #34942 Jun 23, 2020

itholic mentioned this issue Jul 31, 2020

Matching behavior with pandas 1.1.2 databricks/koalas#1688

Merged

11 tasks

TomAugspurger mentioned this issue Aug 6, 2020

BUG: DataFrame.reset_index() discards MultiIndex dtypes if DataFrame is empty #35589

Closed

3 tasks
