Severe performance problems with multi-index joins #14775
Comments
after you reset, you need to
So, I think one problem is here: Line 2644 in 2bd9c95
If the levels are the same, we try to do a 'simple' join, but it involves populating a huge and expensive hash table of tuples, xref #13904. I think that could be easily fixed (the join case, not the general issue). This would explain why the joins on non-identical levels are faster.
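A sketch of the reset-and-merge workaround hinted at above, which sidesteps joining on identical MultiIndex levels by merging on plain columns instead. All names, level layout, and sizes here are assumptions for illustration, not code from the issue:

```python
import numpy as np
import pandas as pd

# Two frames sharing an identical two-level MultiIndex (hypothetical shape).
N = 200_000
idx = pd.MultiIndex.from_product([range(N // 100), range(100)], names=["x", "y"])
left = pd.DataFrame({"a": np.arange(N)}, index=idx)
right = pd.DataFrame({"b": np.arange(N)}, index=idx)

# Path under discussion: join on the two identical index levels.
j1 = left.join(right)

# Workaround: reset to columns, merge on them, restore the index.
j2 = (
    left.reset_index()
    .merge(right.reset_index(), on=["x", "y"])
    .set_index(["x", "y"])
)

# Both paths produce the same result.
assert j1.equals(j2)
```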
Thanks for this info; that sounds very plausible. (I didn't get any performance warnings, though.)
Ah, so I suppose the large differences between subsequent tries are also due to the calculation of this hash table.
Wouldn't we then run into the problem that "merging with more than one level overlap is not implemented", though?
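For contrast, a minimal sketch of a join on non-identical levels (the case reported as faster), where the two indexes share only a single level. The names and sizes are made up; the "not implemented" error mentioned above concerns sharing more than one level between differently-shaped indexes, whereas a single shared level joins fine:

```python
import numpy as np
import pandas as pd

# Left frame: two-level MultiIndex (levels 'x' and 'y').
idx = pd.MultiIndex.from_product([range(100), range(100)], names=["x", "y"])
left = pd.DataFrame({"a": np.arange(len(idx))}, index=idx)

# Right frame: single-level index sharing only the 'x' level.
right = pd.DataFrame({"b": np.arange(100)}, index=pd.Index(range(100), name="x"))

# Join on the one overlapping level; 'b' is broadcast across 'y'.
out = left.join(right)
```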
Any update? I'm facing the same problem with multi-index joins on 0.21.0, with about 6M rows in each dataframe.
It takes only about 2 sec for
Looks like these benchmarks use a fairly old version of pandas. We can reopen if a simplified benchmark shows this is still an issue in a modern version of pandas.
Code Sample (microbenchmark)
Problem description
We are getting severe performance bottlenecks in our code due to join operations on multiple index levels.
To analyze and illustrate the problem, we have created a microbenchmark which executes joins on data frames with different indices and one data column. The benchmark is constructed so that the resulting data frame always has the same number of rows (10,000,000 in this case, though we also have to deal with larger data sets). We would therefore expect all join operations to take a similar amount of time, perhaps scaling roughly linearly with the number of index levels. Instead, we see up to 200x longer execution times in some cases (see output).
The benchmark works as follows:
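A rough sketch of such a microbenchmark (the reduced row count, level names, and structure here are assumptions for illustration, not the original code). It builds two frames on an identical two-level MultiIndex and times the same join twice, since the report notes the first attempt can be much slower:

```python
import time
import numpy as np
import pandas as pd

# Smaller than the reported 10M rows, so the sketch runs quickly.
N = 1_000_000

idx = pd.MultiIndex.from_arrays(
    [np.arange(N) // 1000, np.arange(N) % 1000],
    names=["outer", "inner"],
)
left = pd.DataFrame({"a": np.random.rand(N)}, index=idx)
right = pd.DataFrame({"b": np.random.rand(N)}, index=idx)

# Time the identical join twice; the issue reports the first
# attempt often being far slower than the second.
for attempt in range(2):
    t0 = time.perf_counter()
    joined = left.join(right)
    print(f"join attempt {attempt + 1}: {time.perf_counter() - t0:.3f} s")
```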
The numbers show the times of both joins executed on the same data frames. Interestingly, the first join often takes much longer than the second. For the alternative implementations, the numbers in parentheses show the number of seconds spent in the main inner operation (either join or concat).
We would be grateful for answers to any of the following questions:
Output
test_join_results.txt
Expected Output
We would expect all execution times to be less than a second, usually much less.
Output of pd.show_versions()
pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 20.10.1
Cython: None
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None