PERF: reindex default fill_value dtype #47281

lukemanley · 2022-06-08T03:34:12Z

closes PERF: reindex unnecessarily introduces block with new dtype, preventing consolidation #45857
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.

Cast the default fill_value (np.nan) to the frame's interleaved_dtype when possible. This avoids upcasting and allows block consolidation.
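To illustrate the idea outside of pandas internals, here is a minimal NumPy-only sketch. The helper name fill_like is hypothetical and is not part of this PR. It only shows why a float64 NaN fill forces an upcast next to float32 data, while a NaN fill in the matching dtype does not.

```python
import numpy as np


def fill_like(values, n_cols, fill_value=np.nan):
    """Hypothetical helper: NaN-filled columns in the float dtype of `values`."""
    if np.issubdtype(values.dtype, np.floating):
        dtype = values.dtype  # keep float32 as float32
    else:
        dtype = np.dtype("float64")  # fall back to NumPy's default for NaN
    return np.full((values.shape[0], n_cols), fill_value, dtype=dtype)


existing = np.zeros((4, 3), dtype="f4")

# Default behaviour: np.nan becomes a float64 array.
default_fill = np.full((4, 2), np.nan)
upcast = np.concatenate([existing, default_fill], axis=1)

# Matching the existing dtype keeps everything in float32.
matched_fill = fill_like(existing, 2)
same = np.concatenate([existing, matched_fill], axis=1)

print(upcast.dtype, same.dtype)
```

In a DataFrame the two dtypes would end up in separate blocks rather than being concatenated, which is what prevents consolidation.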

Using the example from the OP:

import pandas as pd
import numpy as np

df = pd.concat([
    pd.DataFrame(np.zeros((1000, 1000), dtype='f4')),
], axis=1).reindex(columns=np.arange(5, 1005))
print(df._data.nblocks) 

df.values
print(df._data.nblocks)  # <- now consolidating

%timeit df.values

main:

2
2
789 µs ± 135 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

PR:

2
1
2.01 µs ± 48.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
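For anyone who wants to rerun this outside IPython, a rough script version using the standard timeit module might look like the following. It is a sketch with some assumptions: it drops the single-element concat, and it uses the number of distinct column dtypes as a public stand-in for the private block count (each distinct dtype needs at least one block). Results depend on the pandas version, so no expected numbers are given.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((1000, 1000), dtype="f4")).reindex(
    columns=np.arange(5, 1005)
)

# Public proxy for the block count: more than one dtype means more than one block.
n_dtypes = df.dtypes.nunique()
print("distinct dtypes:", n_dtypes)

# Average time of df.values over a fixed number of calls.
n_loops = 100
per_call = timeit.timeit(lambda: df.values, number=n_loops) / n_loops
print(f"{per_call * 1e6:.1f} µs per df.values call")
```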

lukemanley · 2022-06-09T02:02:38Z

@jbrockmendel - any suggestions here? The test added in this PR is failing for certain builds. Unfortunately, I'm unable to reproduce the failures locally so I'm not sure where to start.

jbrockmendel · 2022-06-17T21:55:42Z

@lukemanley do you happen to be using a Windows machine or a 32-bit build locally?
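One quick way to answer this is to check the platform and pointer size of the running interpreter. This is a small standard-library sketch, not something taken from the PR:

```python
import platform
import struct
import sys

# Size of a C pointer in bits: 32 on a 32-bit build, 64 on a 64-bit build.
pointer_bits = struct.calcsize("P") * 8

# sys.maxsize exceeds 2**32 only on 64-bit builds.
is_64bit = sys.maxsize > 2**32

print(platform.system(), platform.machine(), f"{pointer_bits}-bit")
```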

jbrockmendel · 2022-06-17T21:56:19Z

pandas/core/internals/managers.py

+            dtype = interleaved_dtype([blk.dtype for blk in self.blocks])
+            if is_float_dtype(dtype):
+                # GH45857 avoid unnecessary upcasting
+                dtype = cast(np.dtype, dtype)


jbrockmendel (Jun 17, 2022): Is there a risk of getting e.g. Float64Dtype here?

lukemanley (Jun 18, 2022): I don't think there is a risk in a DataFrame of only float32 values.

jbrockmendel (Jun 18, 2022): Right, I'm talking about a case where the DataFrame isn't all-float32 but includes at least one Float64Dtype column.
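A small check of the concern above, using only public APIs. It does not call interleaved_dtype. It only shows that a masked Float64Dtype passes is_float_dtype without being an np.dtype, and that typing.cast does nothing at runtime, so the cast in the diff would not convert it.

```python
from typing import cast

import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype

plain = np.dtype("float32")
masked = pd.Float64Dtype()

for dtype in (plain, masked):
    # typing.cast only informs the type checker; the object is returned unchanged.
    narrowed = cast(np.dtype, dtype)
    print(
        repr(dtype),
        "is_float_dtype:", is_float_dtype(dtype),
        "is np.dtype:", isinstance(narrowed, np.dtype),
    )
```

If the interleaved dtype could be a masked float dtype, an explicit isinstance(dtype, np.dtype) check would be one way to guard the branch.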

jbrockmendel · 2022-06-17T21:59:46Z

There's a comment in NDFrame._reindex_with_indexers ("# TODO: speed up on homogeneous DataFrame objects (see _reindex_multi)") that I think might be useful in these cases. Could that be helpful?
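As a rough picture of what a homogeneous fast path could look like, here is a sketch that reindexes the columns of a single 2D float array directly. The function reindex_columns_homogeneous is hypothetical and is not a description of what _reindex_multi actually does. It assumes a float dtype so that NaN can be stored without upcasting.

```python
import numpy as np
import pandas as pd


def reindex_columns_homogeneous(values, old_cols, new_cols):
    """Hypothetical sketch: column reindex on one 2D float array."""
    # get_indexer returns the position of each new label, or -1 if missing.
    indexer = pd.Index(old_cols).get_indexer(new_cols)
    # Fancy indexing copies; -1 temporarily picks the last column.
    out = values[:, indexer]
    # Overwrite the missing columns with NaN in the array's own dtype.
    out[:, indexer == -1] = np.nan
    return out


values = np.arange(6, dtype="f4").reshape(2, 3)
result = reindex_columns_homogeneous(values, [0, 1, 2], [1, 2, 3])
print(result.dtype)
```

Because the fill happens inside the existing array, no second dtype is introduced.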

lukemanley · 2022-06-18T03:35:57Z

> do you happen to be using a Windows machine or a 32-bit build locally?

Nope. I'm unable to reproduce those errors on macOS or 64-bit Windows.

> There's a comment in NDFrame._reindex_with_indexers ("# TODO: speed up on homogeneous DataFrame objects (see _reindex_multi)") that I think might be useful in these cases. Could that be helpful?

I'll take a look, thanks.

lukemanley · 2022-07-02T20:08:42Z

I’m going to close this PR. I cannot reproduce the failures locally. I tried a few different things but was unsuccessful in identifying where the float64 dtype comes from. I do not see that behavior locally.

reindex default fill_value dtype

9667a59

lukemanley added the Performance (memory or execution speed performance) and Dtype Conversions (unexpected or buggy dtype conversions) labels on Jun 8, 2022

mypy

95093d6

jreback added this to the 1.5 milestone Jun 10, 2022

lukemanley added 2 commits June 10, 2022 20:57

Merge remote-tracking branch 'upstream/main' into GH45857

20c98c1

Merge remote-tracking branch 'upstream/main' into GH45857

27ab683

jbrockmendel reviewed Jun 17, 2022


lukemanley added 5 commits July 2, 2022 08:58

Merge remote-tracking branch 'upstream/main' into GH45857

8ab7209

infer dtype logic

c596952

one last try... infer_dtype_from_scalar for floats

a196212

expand test

109eb2d

try np.issubdtype

7e969b1

lukemanley closed this Jul 2, 2022

lukemanley deleted the GH45857 branch September 10, 2022 00:49
