Pre-compute the size of an outer merge and raise MemoryError if merge will be too large #15068
Comments
We have a doc warning for this: https://github.com/pandas-dev/pandas/pull/14788/files. I suppose this could be added, though you would have to recast the above to use internal functions. And it would have to raise only in the case of a huge number (we don't directly sample memory).
@jreback @mproffitt Can I work on this issue?
sure
@IshankGulati I too am happy for you to pick this up. For information purposes, I did put together a more complete version of the `merge_size` function. The more complete version is:

```python
def merge_size(left_frame, right_frame, group_by, how='inner'):
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    intersection = right_keys & left_keys
    left_diff = left_keys - intersection
    right_diff = right_keys - intersection

    # NaN != NaN, so comparing a column with itself flags its NaN rows
    left_nan = len(left_frame[left_frame[group_by] != left_frame[group_by]])
    right_nan = len(right_frame[right_frame[group_by] != right_frame[group_by]])
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan

    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_nan * right_nan]

    left_size = [left_groups[group_name] for group_name in left_diff]
    right_size = [right_groups[group_name] for group_name in right_diff]
    if how == 'inner':
        return sum(sizes)
    elif how == 'left':
        return sum(sizes + left_size)
    elif how == 'right':
        return sum(sizes + right_size)
    return sum(sizes + left_size + right_size)
```

As mentioned in that issue, this function doesn't handle a multi-index for `group_by`. A workaround is:

```python
min([merge_size(df1, df2, label, how) for label in group_by])
```

I am aware that this is a very inefficient way of calculating sizes for a multi-index. If that can be solved, it would remove a headache. Cheers
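To sanity-check the approach on toy data, here is a condensed restatement of the outer-merge case (the helper name `merge_size_outer` and the example frames are made up for illustration; it assumes a single key column with no NaNs):

```python
import pandas as pd

def merge_size_outer(left, right, key):
    # Matched keys contribute the product of their per-side counts;
    # unmatched keys carry over their own counts, one NaN-padded row each.
    lc, rc = left[key].value_counts(), right[key].value_counts()
    both = lc.index.intersection(rc.index)
    return int((lc.loc[both] * rc.loc[both]).sum()
               + lc.drop(both).sum() + rc.drop(both).sum())

df1 = pd.DataFrame({'key': ['a', 'a', 'b', 'c'], 'x': range(4)})
df2 = pd.DataFrame({'key': ['a', 'b', 'b', 'd'], 'y': range(4)})
predicted = merge_size_outer(df1, df2, 'key')
actual = len(df1.merge(df2, on='key', how='outer'))
print(predicted, actual)  # both 6: 2*1 for 'a', 1*2 for 'b', plus 'c' and 'd'
```

The prediction matches the actual merge because the output row count depends only on the per-key counts on each side, not on the row contents.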
@mproffitt
@IshankGulati That is the desired behaviour, yes, but in practice passing a list of columns as … As an example, if …
@jreback I was going through the code. From what I have understood, the columns on which the merge is applied are stored in `right_join_keys` and `left_join_keys`. So should I use them as `group_by` in the above `merge_size` method?
Yes, this would be a computation that would occur late in the initialization phase (e.g. we know what keys we have already); then you can feed the 'magic function' and get an answer. We would then have an option on merge for what to do.
Looks like this idea never took off. Additionally, it would be hard to know when to raise, since a user's available memory can vary wildly, so closing for now.
When performing an outer merge on large dataframes, pandas tries to compute the merge, and only once the system has run out of memory will a MemoryError be thrown; in more extreme cases, the kernel will simply step in and kill the application. In those extreme cases, this makes it impossible to handle the error and/or switch to an alternate approach.
Testing:
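The reproduction snippet originally attached under this heading is not preserved in this copy of the issue. As a sketch of the failure mode (the data is made up): a key repeated n times on both sides produces n × n output rows, so the merge result grows quadratically in the input size.

```python
import pandas as pd

n = 1000  # at n ~ 10**5 the 10**10-row result would exhaust memory
df1 = pd.DataFrame({'key': ['dup'] * n, 'x': range(n)})
df2 = pd.DataFrame({'key': ['dup'] * n, 'y': range(n)})

# Every left row matches every right row, giving a Cartesian product.
merged = df1.merge(df2, on='key', how='outer')
print(len(merged))  # n * n = 1,000,000 rows from only 2,000 input rows
```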
Problem description
The real issue at hand here is the use of actual memory exhaustion as the only signal that a merge cannot be carried out. Because merges are computationally heavy operations, systems can quickly become unstable as the merge takes place and memory is not left available for other processes.
Secondly, because a MemoryError is not raised until the system has already run out of memory, it is not always possible to handle the error within the application.
By first looking at the intersection of groups within the dataframes being merged, it is possible to predict how many rows will be contained within the final merged dataframe. This can be achieved with the `merge_size` function quoted in the comments above.
Using this method, it is then possible to calculate how much memory will be required by the final merge.
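One way such a calculation could look (a sketch, not a pandas API; `approx_merge_bytes` is a hypothetical helper, it assumes fixed-width dtypes, and object columns would need a deeper estimate):

```python
import numpy as np
import pandas as pd

def approx_merge_bytes(left, right, key, predicted_rows):
    # The merge result holds one copy of each non-key column from both
    # sides plus the key column itself; multiply per-row bytes by the
    # predicted row count to approximate total memory.
    row_bytes = sum(dt.itemsize for col, dt in left.dtypes.items() if col != key)
    row_bytes += sum(dt.itemsize for col, dt in right.dtypes.items() if col != key)
    row_bytes += left[key].dtype.itemsize
    return predicted_rows * row_bytes

df1 = pd.DataFrame({'key': np.arange(4), 'x': np.zeros(4)})
df2 = pd.DataFrame({'key': np.arange(4), 'y': np.zeros(4)})
print(approx_merge_bytes(df1, df2, 'key', 4))  # 4 rows * (8+8+8) bytes = 96
```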
At this point it can be determined, before the merge starts, whether the memory required far outweighs the amount available to the system; if so, a MemoryError can be raised early without causing instability within the application or the surrounding system.
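Putting the two steps together, a user-side guard is already possible today (a sketch only: `checked_outer_merge`, its row budget, and the single-key, NaN-free assumption are illustrative, not pandas API):

```python
import pandas as pd

def checked_outer_merge(left, right, key, max_rows=50_000_000):
    # Pre-compute the output row count and refuse to start the merge if
    # it exceeds the caller's budget, instead of exhausting memory mid-merge.
    lc, rc = left[key].value_counts(), right[key].value_counts()
    both = lc.index.intersection(rc.index)
    predicted = int((lc.loc[both] * rc.loc[both]).sum()
                    + lc.drop(both).sum() + rc.drop(both).sum())
    if predicted > max_rows:
        raise MemoryError(f'outer merge would produce {predicted:,} rows '
                          f'(limit {max_rows:,})')
    return left.merge(right, on=key, how='outer')

big1 = pd.DataFrame({'key': ['dup'] * 10_000, 'x': range(10_000)})
big2 = pd.DataFrame({'key': ['dup'] * 10_000, 'y': range(10_000)})
try:
    checked_outer_merge(big1, big2, 'key', max_rows=1_000_000)
except MemoryError as exc:
    print(exc)  # raised before any 10**8-row allocation is attempted
```

Because the check runs on per-key counts only, it is cheap relative to the merge itself, and the caller can catch the error and fall back to a chunked or on-disk strategy.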
Output of `pd.show_versions()`
```
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.2.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
pandas: 0.19.0
nose: 1.3.7
pip: 9.0.1
setuptools: 20.10.1
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
```