performance of Series label-indexing with a list of labels #16285
Comments
Yeah, this is slow. Note that it is much faster if the keys are wrapped in an array.
In [11]: %timeit [dct[k] for k in keys]
10 loops, best of 3: 79.5 ms per loop
In [12]: %timeit sdct[keys]
1 loop, best of 3: 645 ms per loop
In [13]: %timeit sdct[np.array(keys)]
10 loops, best of 3: 65.1 ms per loop
Bottleneck seems to be in |
Ah, I see, good catch, I could have tried that. This makes it comparable in speed to the comprehension. Shouldn't it be significantly faster though? I assume the comprehension is interpreted, whereas the Series lookup is one call to the C extension. Or does it boil down to the efficiency of pandas's and CPython's hash tables?
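For anyone wanting to reproduce the comparison, here is a minimal sketch of the three lookups timed in this thread. The names `dct`, `sdct`, and `keys` come from the timings quoted in the thread; the data size, the integer keys, and the helper `best_ms` are assumptions made for illustration, not the original code sample, so the absolute numbers will differ.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: size and key type are assumptions, not the issue's sample.
n = 100_000
dct = {i: float(i) for i in range(n)}
sdct = pd.Series(dct)
keys = list(range(0, n, 2))


def lookup_dict():
    # Plain Python: one dict lookup per key.
    return [dct[k] for k in keys]


def lookup_series_list():
    # Series label-indexing with a Python list of labels.
    return sdct[keys]


def lookup_series_array():
    # Same lookup, with the labels wrapped in a NumPy array first.
    return sdct[np.array(keys)]


def best_ms(fn, number=3, repeat=3):
    # Hypothetical helper: best per-call time in milliseconds.
    return min(timeit.repeat(fn, number=number, repeat=repeat)) / number * 1e3


for fn in (lookup_dict, lookup_series_list, lookup_series_array):
    print(f"{fn.__name__}: {best_ms(fn):.2f} ms per call")
```

The gap between the list and array variants depends heavily on the pandas version, so this is best read as a way to check your own installation rather than as a reproduction of the numbers quoted in the thread.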
The bulk of the time in this operation is actually in placing the new values, not the hash table lookups. Below I skip the hash table in both cases
You are right that these are C operations that avoid Python overhead, but they are also basic Python ops on optimized data structures, so there is not as much speedup as you might guess.
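The split between "hash table lookup" and "placing the new values" can be sketched roughly as follows. This is not the commenter's original snippet, which did not survive in this copy of the thread, and it is not necessarily how pandas implements the lookup internally. It only uses the public `Index.get_indexer` and NumPy `take` to separate the two steps; the data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen only for illustration.
n = 100_000
sdct = pd.Series(np.arange(n, dtype="float64"), index=np.arange(n))
keys = list(range(0, n, 2))

# Step 1: hash-table lookup, mapping each label to its integer position.
# get_indexer returns -1 for labels that are not present in the index.
positions = sdct.index.get_indexer(keys)

# Step 2: "placing" the values, i.e. gathering them into a new array.
values = sdct.to_numpy().take(positions)

# The pure-Python counterpart of step 2, also with the hash table skipped:
# positional access into a list, one element at a time.
lst = sdct.tolist()
values_py = [lst[i] for i in positions]
```

Timing step 2 on its own (for example with `%timeit`) against the list comprehension gives a feel for how much of the total cost is the gather rather than the hashing.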
This is indeed the case:
@chris-b1 do you understand the purpose of |
so you can look at #16295, but this is actually quite subtle. you cannot simply |
closes pandas-dev#16285 (cherry picked from commit ce4eef3)
Code Sample, a copy-pastable example if possible
Problem description
I would expect the Series performance to be comparable to, if not faster than, the Python comprehension.
Output of pd.show_versions()
pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 35.0.2
Cython: None
numpy: 1.12.1
scipy: 0.19.0
statsmodels: 0.6.1
xarray: None
IPython: 6.0.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
matplotlib: 2.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
boto: None
pandas_datareader: None