Description
This is what I have in version 0.9.0:
import pandas as pd
pd.__version__
'0.9.0'
barra_industry_exposures
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 253738 entries, (20061229, '00036110') to (20120928, 'Y8564W10')
Data columns:
MINING_METALS 253738 non-null values
GOLD 253738 non-null values
FORESTRY_PAPER 253738 non-null values
CHEMICAL 253738 non-null values
ENERGY_RESERVES 253738 non-null values
OIL_REFINING 253738 non-null values
OIL_SERVICES 253738 non-null values
FOOD_BEVERAGES 253738 non-null values
ALCOHOL 253738 non-null values
TOBACCO 253738 non-null values
HOME_PRODUCTS 253738 non-null values
GROCERY_STORES 253738 non-null values
CONSUMER_DURABLES 253738 non-null values
MOTOR_VEHICLES 253738 non-null values
APPAREL_TEXTILES 253738 non-null values
CLOTHING_STORES 253738 non-null values
SPECIALTY_RETAIL 253738 non-null values
DEPARTMENT_STORES 253738 non-null values
CONSTRUCTION 253738 non-null values
PUBLISHING 253738 non-null values
MEDIA 253738 non-null values
HOTELS 253738 non-null values
RESTAURANTS 253738 non-null values
ENTERTAINMENT 253738 non-null values
LEISURE 253738 non-null values
ENVIRONMENTAL_SERVICES 253738 non-null values
HEAVY_ELECTRICAL_EQUIPMENT 253738 non-null values
HEAVY_MACHINERY 253738 non-null values
INDUSTRIAL_PARTS 253738 non-null values
ELECTRICAL_UTILITY 253738 non-null values
GAS_WATER_UTILITY 253738 non-null values
RAILROADS 253738 non-null values
AIRLINES 253738 non-null values
FREIGHT 253738 non-null values
MEDICAL_SERVICES 253738 non-null values
MEDICAL_PRODUCTS 253738 non-null values
DRUGS 253738 non-null values
ELECTRONIC_EQUIPMENT 253738 non-null values
SEMICONDUCTORS 253738 non-null values
COMPUTER_HARDWARE 253738 non-null values
COMPUTER_SOFTWARE 253738 non-null values
DEFENCE_AEROSPACE 253738 non-null values
TELEPHONE 253738 non-null values
WIRELESS 253738 non-null values
INFORMATION_SERVICES 253738 non-null values
INDUSTRIAL_SERVICES 253738 non-null values
LIFE_HEALTH_INSURANCE 253738 non-null values
PROPERTY_CASUALTY_INSURANCE 253738 non-null values
BANKS 253738 non-null values
THRIFTS 253738 non-null values
ASSET_MANAGEMENT 253738 non-null values
FINANCIAL_SERVICES 253738 non-null values
INTERNET 253738 non-null values
REITS 253738 non-null values
BIOTECH 253738 non-null values
dtypes: int64(55)
sparse = barra_industry_exposures.to_sparse(fill_value=0)
sparse
<class 'pandas.sparse.frame.SparseDataFrame'>
MultiIndex: 253738 entries, (20061229, '00036110') to (20120928, 'Y8564W10')
Columns: 55 entries, AIRLINES to WIRELESS
dtypes: float64(55)
%timeit sparse / 100.
100 loops, best of 3: 6.64 ms per loop
%timeit sparse.to_dense()
10 loops, best of 3: 127 ms per loop
%timeit sparse.save('test.pkl')
1 loops, best of 3: 16.9 ms per loop
Now this is what I get in 0.9.1:
import pandas as pd
pd.__version__
'0.9.1'
%timeit sparse / 100.
1 loops, best of 3: 92.2 s per loop
%timeit sparse.to_dense()
1 loops, best of 3: 99.8 s per loop
%timeit sparse.save('test.pkl')
1 loops, best of 3: 100 s per loop
So in the new version, SparseDataFrame methods that used to run in roughly 7 to 130 ms now take more than 90 s. Ouch! What happened?
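To put numbers on the regression, here is a small sketch that turns the %timeit figures quoted above into slowdown factors. It only does arithmetic on the timings already posted and does not touch pandas. The names `timings` and `slowdown` are my own, chosen for illustration.

```python
# Timings copied from the %timeit output above, converted to seconds.
# Each entry is (time in 0.9.0, time in 0.9.1).
timings = {
    "sparse / 100.": (6.64e-3, 92.2),
    "sparse.to_dense()": (127e-3, 99.8),
    "sparse.save('test.pkl')": (16.9e-3, 100.0),
}


def slowdown(before_s, after_s):
    """Return how many times slower the newer timing is than the older one."""
    return after_s / before_s


# Slowdown factor per operation, keyed by the expression that was timed.
slowdowns = {op: slowdown(old, new) for op, (old, new) in timings.items()}

for op, factor in slowdowns.items():
    print(f"{op:<26} {factor:>10,.0f}x slower in 0.9.1")
```

That works out to roughly 14,000x for the division, 800x for to_dense() and 6,000x for save(). All three calls land in the same 90 to 100 s range in 0.9.1, which might point to one shared cost rather than three separate slowdowns, but that is only a guess from the timings.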