This is what I have in version 0.9.0:
import pandas as pd
pd.__version__
'0.9.0'
barra_industry_exposures
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 253738 entries, (20061229, '00036110') to (20120928, 'Y8564W10')
Data columns:
MINING_METALS                  253738 non-null values
GOLD                           253738 non-null values
FORESTRY_PAPER                 253738 non-null values
CHEMICAL                       253738 non-null values
ENERGY_RESERVES                253738 non-null values
OIL_REFINING                   253738 non-null values
OIL_SERVICES                   253738 non-null values
FOOD_BEVERAGES                 253738 non-null values
ALCOHOL                        253738 non-null values
TOBACCO                        253738 non-null values
HOME_PRODUCTS                  253738 non-null values
GROCERY_STORES                 253738 non-null values
CONSUMER_DURABLES              253738 non-null values
MOTOR_VEHICLES                 253738 non-null values
APPAREL_TEXTILES               253738 non-null values
CLOTHING_STORES                253738 non-null values
SPECIALTY_RETAIL               253738 non-null values
DEPARTMENT_STORES              253738 non-null values
CONSTRUCTION                   253738 non-null values
PUBLISHING                     253738 non-null values
MEDIA                          253738 non-null values
HOTELS                         253738 non-null values
RESTAURANTS                    253738 non-null values
ENTERTAINMENT                  253738 non-null values
LEISURE                        253738 non-null values
ENVIRONMENTAL_SERVICES         253738 non-null values
HEAVY_ELECTRICAL_EQUIPMENT     253738 non-null values
HEAVY_MACHINERY                253738 non-null values
INDUSTRIAL_PARTS               253738 non-null values
ELECTRICAL_UTILITY             253738 non-null values
GAS_WATER_UTILITY              253738 non-null values
RAILROADS                      253738 non-null values
AIRLINES                       253738 non-null values
FREIGHT                        253738 non-null values
MEDICAL_SERVICES               253738 non-null values
MEDICAL_PRODUCTS               253738 non-null values
DRUGS                          253738 non-null values
ELECTRONIC_EQUIPMENT           253738 non-null values
SEMICONDUCTORS                 253738 non-null values
COMPUTER_HARDWARE              253738 non-null values
COMPUTER_SOFTWARE              253738 non-null values
DEFENCE_AEROSPACE              253738 non-null values
TELEPHONE                      253738 non-null values
WIRELESS                       253738 non-null values
INFORMATION_SERVICES           253738 non-null values
INDUSTRIAL_SERVICES            253738 non-null values
LIFE_HEALTH_INSURANCE          253738 non-null values
PROPERTY_CASUALTY_INSURANCE    253738 non-null values
BANKS                          253738 non-null values
THRIFTS                        253738 non-null values
ASSET_MANAGEMENT               253738 non-null values
FINANCIAL_SERVICES             253738 non-null values
INTERNET                       253738 non-null values
REITS                          253738 non-null values
BIOTECH                        253738 non-null values
dtypes: int64(55)
sparse = barra_industry_exposures.to_sparse(fill_value=0)
sparse

<class 'pandas.sparse.frame.SparseDataFrame'>
MultiIndex: 253738 entries, (20061229, '00036110') to (20120928, 'Y8564W10')
Columns: 55 entries, AIRLINES to WIRELESS
dtypes: float64(55)
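For anyone without the Barra data, a synthetic frame of 0/1 dummies behaves the same way under the old to_sparse / SparseDataFrame API used here (since removed from pandas); everything below, including the shape and fill pattern, is made up purely for illustration:

import numpy as np
import pandas as pd

# Synthetic stand-in for the exposure matrix: mostly zeros, a single 1 per row.
n_rows, n_cols = 250000, 55
arr = np.zeros((n_rows, n_cols), dtype=np.int64)
arr[np.arange(n_rows), np.random.randint(0, n_cols, n_rows)] = 1
dense = pd.DataFrame(arr)

# fill_value=0 tells the sparse structures to store only the non-zero entries.
sp = dense.to_sparse(fill_value=0)
sp.density   # fraction of cells actually stored, roughly 1/55 here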
%timeit sparse / 100.
100 loops, best of 3: 6.64 ms per loop
%timeit sparse.to_dense()
10 loops, best of 3: 127 ms per loop
%timeit sparse.save('test.pkl')
1 loops, best of 3: 16.9 ms per loop
Now this is what I get in 0.9.1:
'0.9.1'
1 loops, best of 3: 92.2 s per loop
1 loops, best of 3: 99.8 s per loop
1 loops, best of 3: 100 s per loop
So, in the new version the same SparseDataFrame operations that used to run in roughly 7-130 ms now take more than 90 s each. Ouch! What happened?
We need more performance benchmarks in the vbench suite. Thanks for the feedback. We'll investigate.
This looks like 4a5b75b, though I'm not sure why take is so expensive. The pending #2253 (3688e53) fixes the problem for me.
Testcase:
import pandas as pd
from numpy.random import randint  # randint was implicitly available via pylab; import it explicitly

num = 250000
l1 = [randint(0, 1000) for x in range(num)]
l2 = [randint(0, 20000) for x in range(num)]
l3 = [randint(0, 20000) for x in range(num)]
l4 = [randint(0, 20000) for x in range(num)]

a = pd.DataFrame(dict(zip([0, 1, 2, 3], [l1, l2, l3, l4]))).set_index([0, 1])
b = a.to_sparse()

%timeit b / 100
%timeit b.to_dense()
%timeit b.save('test.pkl')
Edit: but perhaps there's another issue at play; I can't reproduce anything like a 90 s runtime on this data.
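To see where the time is actually going (for instance, whether take really dominates), profiling the slow call is a quick check; the lines below are only a suggestion, not part of the original investigation:

# In IPython, %prun runs a statement under cProfile; -l 15 limits the report
# to the 15 most expensive entries.
%prun -l 15 b / 100

# Equivalent outside IPython:
import cProfile
cProfile.run('b / 100', sort='cumulative')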
Doh, this will teach me to review PRs more carefully; this is theoretically what vbench is for. I will fix.
Ugh, iteritems for all DataFrames has borked performance. Guess we're going to see 0.9.2 sooner rather than later.
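The iteritems claim is easy to check in isolation, without going through the sparse code paths; this micro-benchmark is just an illustration, not something from the thread:

import numpy as np
import pandas as pd

# A plain dense frame, roughly the shape of the sparse one above.
df = pd.DataFrame(np.random.randn(250000, 55))

# DataFrame.iteritems yields (column name, Series) pairs; if the regression is in
# iteritems itself, simply consuming the iterator should already be slow.
%timeit list(df.iteritems())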
Commits referencing this issue:
bdbca8e: iteritems
9202540: VB: add test for iteritems performance (catch the regression noted in #2273 next time)
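For reference, a vbench entry is essentially a statement string plus a setup string; the sketch below only illustrates what such a test might look like, assuming vbench's Benchmark(code, setup, name=...) constructor, with sizes and name invented rather than taken from the actual commit:

from vbench.api import Benchmark

setup = """
import numpy as np
from pandas import DataFrame
df = DataFrame(np.random.randn(10000, 100))
"""

# Time a full pass over the columns via iteritems; vbench runs the statement
# across revisions, so a regression like this one shows up as a jump in the graph.
frame_iteritems = Benchmark("list(df.iteritems())", setup, name='frame_iteritems')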