PERF: dropna with SparseArray exhibits much worse time complexity #60179
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Calling `df.dropna()` on a dataframe backed by `pd.arrays.SparseArray` incurs a huge penalty that increases exponentially with the size of the dataframe. Using the following benchmark script (`pytest`, `pandas`, and `pytest-benchmark` are needed; it can be run with `pytest <script>`), it can be seen that, for only 1,000 samples, the sparse array is 120 times slower than the standard dense approach. This makes the feature unusable even for medium-sized datasets of 1,000 to 10,000 samples.
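The script below is a minimal sketch of such a benchmark, not the original one: the single-column shape, the NaN density, and the `make_data` helper are assumptions made for illustration. It parametrizes over a dense float column and the same data wrapped in `pd.arrays.SparseArray`, and times `df.dropna()` with the `benchmark` fixture provided by `pytest-benchmark`.

```python
import numpy as np
import pandas as pd
import pytest

N = 1_000  # the 120x slowdown was reported at roughly this size


def make_data(n: int) -> np.ndarray:
    """Hypothetical helper: dense float array with many NaNs (the sparse-friendly case)."""
    rng = np.random.default_rng(0)
    data = rng.random(n)
    data[rng.random(n) < 0.9] = np.nan  # ~90% missing values (assumed density)
    return data


@pytest.mark.parametrize("sparse", [False, True], ids=["dense", "sparse"])
def test_dropna(benchmark, sparse):
    data = make_data(N)
    col = pd.arrays.SparseArray(data) if sparse else data
    df = pd.DataFrame({"a": col})
    # dropna does not mutate df, so it is safe for the fixture to call it repeatedly
    benchmark(df.dropna)
```

Running `pytest <script>` then prints a pytest-benchmark comparison table for the dense and sparse cases.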
Installed Versions
INSTALLED VERSIONS
commit : 0691c5c
python : 3.10.11
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 151 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United Kingdom.1252
pandas : 2.2.3
numpy : 1.23.5
pytz : 2024.1
dateutil : 2.9.0.post0
pip : 23.0.1
Cython : 0.29.37
sphinx : None
IPython : 8.25.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.4
lxml.etree : 5.2.2
matplotlib : None
numba : 0.60.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : 8.2.2
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
Prior Performance
No response