PERF: reindex default fill_value dtype #47281

Closed
lukemanley wants to merge 9 commits

Conversation

lukemanley
Member

@lukemanley lukemanley commented Jun 8, 2022

Default fill_value (np.nan) to match interleaved_dtype when possible to avoid upcasting and allow for block consolidation.

Using the example from the OP:

import pandas as pd
import numpy as np

df = pd.concat([
    pd.DataFrame(np.zeros((1000, 1000), dtype='f4')),
], axis=1).reindex(columns=np.arange(5, 1005))
print(df._data.nblocks) 

df.values
print(df._data.nblocks)  # <- now consolidating

%timeit df.values

main:

2
2
789 µs ± 135 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

PR:

2
1
2.01 µs ± 48.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
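For context, a minimal sketch of the dtype mismatch described above. It is not the code from this PR: the frame size and column labels are made up for illustration, and it relies only on the public `fill_value` argument of `DataFrame.reindex`. The default fill value is a plain `np.nan`, which NumPy treats as float64, so it cannot share storage with float32 data without upcasting. Passing a float32 NaN explicitly keeps the new columns in the frame's own dtype.

```python
import numpy as np
import pandas as pd

# A small float32 frame; the shape and labels are illustrative only.
df = pd.DataFrame(np.zeros((4, 6), dtype="f4"))

# A bare np.nan is a Python float, which NumPy stores as float64.
default_fill_dtype = np.array(np.nan).dtype

# Mixing float32 data with a float64 fill forces float64 as the common dtype.
common_dtype = np.result_type(np.float32, default_fill_dtype)

# Supplying a float32 NaN keeps the newly created columns (6 and 7) in float32.
out = df.reindex(columns=np.arange(2, 8), fill_value=np.float32("nan"))
result_dtypes = set(out.dtypes)

print(default_fill_dtype, common_dtype, result_dtypes)
```

With every column sharing one dtype, nothing has to be upcast when the frame is materialized through `df.values`.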

@lukemanley lukemanley added the Performance (Memory or execution speed performance) and Dtype Conversions (Unexpected or buggy dtype conversions) labels Jun 8, 2022
@lukemanley
Member Author

@jbrockmendel - any suggestions here? The test added in this PR is failing for certain builds. Unfortunately, I'm unable to reproduce the failures locally so I'm not sure where to start.

@jreback jreback added this to the 1.5 milestone Jun 10, 2022
@jbrockmendel
Member

@lukemanley do you happen to be using a windows machine or 32bit build locally?

dtype = interleaved_dtype([blk.dtype for blk in self.blocks])
if is_float_dtype(dtype):
    # GH45857 avoid unnecessary upcasting
    dtype = cast(np.dtype, dtype)
Member

Is there a risk of getting e.g. Float64Dtype here?

Member Author

I don't think there is a risk in a DataFrame of only float32 values.

Member

Right, I'm talking about a case where the DataFrame isn't all-float32 but includes at least one Float64Dtype column.
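A small sketch of why that question matters, using only public pandas and NumPy calls. Whether `interleaved_dtype` can actually return a `Float64Dtype` on this code path is not settled in this thread. The sketch only shows that if it did, `is_float_dtype` would let it through, and `typing.cast` would not turn it into a NumPy dtype. The helper `is_numpy_float` is a hypothetical stricter guard, not something from the PR.

```python
from typing import cast

import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype

ext_dtype = pd.Float64Dtype()

# The nullable extension dtype passes the generic float check...
passes_float_check = is_float_dtype(ext_dtype)

# ...but it is not a NumPy dtype.
is_numpy_dtype = isinstance(ext_dtype, np.dtype)

# typing.cast only informs the type checker; at runtime it returns its argument.
casted = cast(np.dtype, ext_dtype)
still_extension = casted is ext_dtype


def is_numpy_float(dtype) -> bool:
    """Hypothetical guard: accept only NumPy floating dtypes."""
    return isinstance(dtype, np.dtype) and dtype.kind == "f"


print(passes_float_check, is_numpy_dtype, still_extension)
print(is_numpy_float(np.dtype("float32")), is_numpy_float(ext_dtype))
```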

@jbrockmendel
Member

There's a comment in NDFrame._reindex_with_indexers, "# TODO: speed up on homogeneous DataFrame objects (see _reindex_multi)", that I think might be useful in these cases. Could that be helpful?
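To illustrate the idea behind that TODO, here is a rough NumPy-only sketch of reindexing the columns of a homogeneous 2D array in a single pass. It is not pandas' `_reindex_multi`. The function name `reindex_columns_homogeneous` and its arguments are hypothetical, and it assumes unique column labels.

```python
import numpy as np


def reindex_columns_homogeneous(values, old_labels, new_labels):
    """Hypothetical helper: reindex columns of one homogeneous 2D array."""
    # Map each old label to its position; labels are assumed unique.
    position = {label: i for i, label in enumerate(old_labels)}
    indexer = np.array(
        [position.get(label, -1) for label in new_labels], dtype=np.intp
    )

    # Fancy indexing copies; -1 temporarily picks the last column.
    out = values[:, indexer]

    missing = indexer == -1
    if missing.any():
        # Non-float data has no NaN, so it must be upcast; float data is not.
        if out.dtype.kind != "f":
            out = out.astype(np.float64)
        out[:, missing] = np.nan
    return out


values = np.arange(6, dtype="f4").reshape(2, 3)
result = reindex_columns_homogeneous(values, [0, 1, 2], [1, 2, 3])
print(result.dtype)
print(result)
```

Because the whole array has one dtype, the result stays a single float32 array and no separate float64 block is created for the missing label.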

@lukemanley
Member Author

> do you happen to be using a windows machine or 32bit build locally?

Nope. I'm unable to reproduce those errors on mac or 64bit windows.

> There's a comment in NDFrame._reindex_with_indexers, "# TODO: speed up on homogeneous DataFrame objects (see _reindex_multi)", that I think might be useful in these cases. Could that be helpful?

I'll take a look, thanks.

@lukemanley
Member Author

I’m going to close this PR. I cannot reproduce the failures locally. I tried a few different things but was unsuccessful in identifying where the float64 dtype comes from. I do not see that behavior locally.

@lukemanley lukemanley closed this Jul 2, 2022
@lukemanley lukemanley deleted the GH45857 branch September 10, 2022 00:49
Labels
Dtype Conversions (Unexpected or buggy dtype conversions), Performance (Memory or execution speed performance)
Development

Successfully merging this pull request may close these issues.

PERF: reindex unnecessarily introduces block with new dtype, preventing consolidation
3 participants