
PERF: reindex unnecessarily introduces block with new dtype, preventing consolidation #45857


Closed
2 of 3 tasks
batterseapower opened this issue Feb 7, 2022 · 1 comment · Fixed by #57272
Labels
Dtype Conversions (Unexpected or buggy dtype conversions), Performance (Memory or execution speed performance), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@batterseapower
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

I frequently wrap large arrays with a DataFrame, do some operations on them, and then extract the .values as a numpy array, e.g. for passing to a nopython numba function or just for fast indexing (.values[i, j] is often much faster than .iloc[i, j]).
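
For context, a minimal sketch of that usage pattern (my own illustration, not part of the original report; the array shape and the indexing are arbitrary):

import numpy as np
import pandas as pd

arr = np.random.rand(1000, 1000).astype('f4')   # large float32 array
df = pd.DataFrame(arr)                          # wrap it in a DataFrame
values = df.values                              # cheap when the frame is a single block
x = values[10, 20]                              # much faster in a hot loop than df.iloc[10, 20]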

I recently got bitten by an issue where Pandas unexpectedly added a new block with a different dtype to my wrapped array. This caused .values to slow down by a factor of about 1000, because it has to build a new array every time it is called rather than simply returning the single existing block.

Specifically, if you have a single float32 block and you reindex it, Pandas adds a float64 block to the BlockManager to hold the NaNs. This seems like potentially a bad default -- in the case where there is an existing block, it would be better to reuse the dtype of that existing block, so long as it's possible to store the fill_value in a block of that dtype.
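
A quick way to see the upcast (my own check, using only public API; output dtypes are what I'd expect from the behaviour described above):

small = pd.DataFrame(np.zeros((2, 2), dtype='f4'))
print(small.dtypes.tolist())                             # [float32, float32]
print(small.reindex(columns=[0, 1, 2]).dtypes.tolist())  # expected: [float32, float32, float64]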

Example with consolidation not working:

df = pd.concat([
    pd.DataFrame(np.zeros((1000, 1000), dtype='f4')),
], axis=1).reindex(columns=np.arange(5, 1005))
print(df._data.nblocks) # 2

df.values
print(df._data.nblocks) # 2

%timeit df.values # 2.74ms

Example with consolidation working:

df = pd.concat([
    pd.DataFrame(np.zeros((1000, 1000), dtype='f4')),
], axis=1).reindex(columns=np.arange(5, 1005), fill_value=np.float32(np.nan))
print(df._data.nblocks) # 2

df.values
print(df._data.nblocks) # 1

%timeit df.values # 3.38 µs
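
A possible workaround on current pandas (my suggestion, not from the report) is to cast the reindexed frame back to float32; once all blocks share a dtype, accessing .values should let the BlockManager consolidate them into a single block:

df = pd.concat([
    pd.DataFrame(np.zeros((1000, 1000), dtype='f4')),
], axis=1).reindex(columns=np.arange(5, 1005)).astype('f4')

df.values               # triggers consolidation of the same-dtype blocks
print(df._data.nblocks) # expected: 1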

Installed Versions

INSTALLED VERSIONS
------------------
commit : 945c9ed
python : 3.8.12.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 7, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252

pandas : 1.3.4
numpy : 1.21.3
pytz : 2021.3
dateutil : 2.8.2
pip : 21.1.1
setuptools : 58.0.4
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : 7.29.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.2
fsspec : 2021.08.1
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : 0.19.0
xlrd : 2.0.1
xlwt : None
numba : 0.53.1

Prior Performance

I don't think this is a regression.

@batterseapower added the Needs Triage and Performance labels Feb 7, 2022
@jbrockmendel
Member

This seems like potentially a bad default -- in the case where there is an existing block, it would be better to reuse the dtype of that existing block, so long as it's possible to store the fill_value in a block of that dtype.

Agreed. PR would be welcome.

@mroeschke added the Reshaping and Dtype Conversions labels and removed the Needs Triage label Feb 7, 2022
@jreback added this to the 1.5 milestone Jun 10, 2022
@mroeschke removed this from the 1.5 milestone Aug 15, 2022