Significant performance degradation in 0.9.1 for SparseDataFrame methods like to_dense() and save() and for arithmetic operations #2273

Closed
@bluefir


This is what I have in version 0.9.0:

import pandas as pd
pd.__version__

'0.9.0'

barra_industry_exposures

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 253738 entries, (20061229, '00036110') to (20120928, 'Y8564W10')
Data columns:
MINING_METALS 253738 non-null values
GOLD 253738 non-null values
FORESTRY_PAPER 253738 non-null values
CHEMICAL 253738 non-null values
ENERGY_RESERVES 253738 non-null values
OIL_REFINING 253738 non-null values
OIL_SERVICES 253738 non-null values
FOOD_BEVERAGES 253738 non-null values
ALCOHOL 253738 non-null values
TOBACCO 253738 non-null values
HOME_PRODUCTS 253738 non-null values
GROCERY_STORES 253738 non-null values
CONSUMER_DURABLES 253738 non-null values
MOTOR_VEHICLES 253738 non-null values
APPAREL_TEXTILES 253738 non-null values
CLOTHING_STORES 253738 non-null values
SPECIALTY_RETAIL 253738 non-null values
DEPARTMENT_STORES 253738 non-null values
CONSTRUCTION 253738 non-null values
PUBLISHING 253738 non-null values
MEDIA 253738 non-null values
HOTELS 253738 non-null values
RESTAURANTS 253738 non-null values
ENTERTAINMENT 253738 non-null values
LEISURE 253738 non-null values
ENVIRONMENTAL_SERVICES 253738 non-null values
HEAVY_ELECTRICAL_EQUIPMENT 253738 non-null values
HEAVY_MACHINERY 253738 non-null values
INDUSTRIAL_PARTS 253738 non-null values
ELECTRICAL_UTILITY 253738 non-null values
GAS_WATER_UTILITY 253738 non-null values
RAILROADS 253738 non-null values
AIRLINES 253738 non-null values
FREIGHT 253738 non-null values
MEDICAL_SERVICES 253738 non-null values
MEDICAL_PRODUCTS 253738 non-null values
DRUGS 253738 non-null values
ELECTRONIC_EQUIPMENT 253738 non-null values
SEMICONDUCTORS 253738 non-null values
COMPUTER_HARDWARE 253738 non-null values
COMPUTER_SOFTWARE 253738 non-null values
DEFENCE_AEROSPACE 253738 non-null values
TELEPHONE 253738 non-null values
WIRELESS 253738 non-null values
INFORMATION_SERVICES 253738 non-null values
INDUSTRIAL_SERVICES 253738 non-null values
LIFE_HEALTH_INSURANCE 253738 non-null values
PROPERTY_CASUALTY_INSURANCE 253738 non-null values
BANKS 253738 non-null values
THRIFTS 253738 non-null values
ASSET_MANAGEMENT 253738 non-null values
FINANCIAL_SERVICES 253738 non-null values
INTERNET 253738 non-null values
REITS 253738 non-null values
BIOTECH 253738 non-null values
dtypes: int64(55)

sparse = barra_industry_exposures.to_sparse(fill_value=0)
sparse

<class 'pandas.sparse.frame.SparseDataFrame'>
MultiIndex: 253738 entries, (20061229, '00036110') to (20120928, 'Y8564W10')
Columns: 55 entries, AIRLINES to WIRELESS
dtypes: float64(55)

%timeit sparse / 100.

100 loops, best of 3: 6.64 ms per loop

%timeit sparse.to_dense()

10 loops, best of 3: 127 ms per loop

%timeit sparse.save('test.pkl')

1 loops, best of 3: 16.9 ms per loop

Now this is what I get in 0.9.1:

import pandas as pd
pd.__version__

'0.9.1'

%timeit sparse / 100.

1 loops, best of 3: 92.2 s per loop

%timeit sparse.to_dense()

1 loops, best of 3: 99.8 s per loop

%timeit sparse.save('test.pkl')

1 loops, best of 3: 100 s per loop

So, in the new version, SparseDataFrame operations that used to take between about 7 ms and 130 ms now take more than 90 s each. Ouch! What happened?
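To put a number on the regression, the slowdown per operation can be worked out from the `%timeit` figures quoted above. This is only a small arithmetic sketch: the names `timings` and `slowdown_factors` are made up for illustration, and the values are the best-of-3 timings from the two sessions, converted to seconds.

```python
# Best-of-3 timings copied from the %timeit output above, in seconds:
# (pandas 0.9.0, pandas 0.9.1)
timings = {
    "sparse / 100.": (6.64e-3, 92.2),
    "sparse.to_dense()": (127e-3, 99.8),
    "sparse.save('test.pkl')": (16.9e-3, 100.0),
}


def slowdown_factors(timings):
    """Return {operation: time_in_0.9.1 / time_in_0.9.0} for each operation."""
    return {op: new / old for op, (old, new) in timings.items()}


if __name__ == "__main__":
    for op, factor in slowdown_factors(timings).items():
        print(f"{op:<26} {factor:>8,.0f}x slower in 0.9.1")
```

By these figures the slowdown ranges from roughly 800x for `to_dense()` to nearly 14,000x for the division.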
