
Pre-compute the size of an outer merge and raise MemoryError if merge will be too large #15068


Closed
mproffitt opened this issue Jan 5, 2017 · 9 comments
Labels
Enhancement · Error Reporting · Performance · Reshaping

Comments

@mproffitt

When performing an outer merge on large dataframes, pandas tries to compute the merge, and a MemoryError is only thrown once the system has already run out of memory; in more extreme cases, the kernel simply takes over and kills the application.

In the more extreme cases, this makes it impossible to handle the error and / or switch to an alternate approach.

Testing:

import math
import pandas as pd
import numpy as np

def create_table(rows, cols, name_len=2):
    alphabet = list('abcdefghijklmnopqrstuvwxyz')

    # initial table: (cols - 1) numeric columns plus a single-character 'name' key
    table = pd.DataFrame(np.random.rand(rows, cols - 1))
    table = table.assign(name=[np.random.choice(alphabet) for _ in range(len(table))])

    # extra rows whose keys (a character repeated name_len times) can be unique to each frame
    extra = math.ceil(len(table) / 12)
    c = pd.DataFrame(np.random.rand(extra, cols - 1))
    col = []
    for _ in range(extra):
        ch = np.random.choice(alphabet)
        col.append(ch * name_len)  # e.g. 'aaa' when name_len == 3
    table = table.append(c.assign(name=col))
    return table.reset_index(drop=True)

df1 = create_table(13000, 4)
df2 = create_table(370000, 4, name_len=3)
df1.merge(df2, on='name', how='outer')  # Killed

Problem description

The real issue at hand here is that memory exhaustion is what ultimately determines whether a merge can be carried out. Because merges are computationally heavy operations, the system can quickly become unstable while the merge runs, leaving no memory available for other processes.

Secondly, because a MemoryError is not raised until the system has already run out of memory, it is not always possible to handle this within the application.

By first looking at the intersection of groups within the dataframes being merged, it is possible to predict how many rows will be contained within the final merged dataframe. This can be achieved with:

def merge_size(left_frame, right_frame, group_by):
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    intersection = right_keys & left_keys
    left_diff = left_keys - intersection
    right_diff = right_keys - intersection
    
    # if the joining column contains np.nan values, these get missed by the intersections
    # but are present in the merge. These need to be added separately.
    left_nan = len(left_frame.query('{0} != {0}'.format(group_by)))
    right_nan = len(right_frame.query('{0} != {0}'.format(group_by)))
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan
    
    # keys present on both sides multiply (every left row pairs with every right row
    # in that group); keys unique to one side contribute their own row counts
    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_groups[group_name] for group_name in left_diff]
    sizes += [right_groups[group_name] for group_name in right_diff]
    sizes += [left_nan * right_nan]
    return sum(sizes)

Using this method, it is then possible to calculate how much memory will be required by the final merge, such that:

rows = merge_size(df1, df2, 'name')  # merging the 14084-row and 400834-row frames -> 185030090 rows
cols = len(df1.columns) + (len(df2.columns) - 1)  # 7 (join column counted once)
required_memory = (rows * cols) * np.dtype(np.float64).itemsize  # 10361685040 bytes (~10.4 GB)

At this point it would already be clear that the amount of memory required for the merge far outweighs the amount available to the system, and a MemoryError can be raised early without causing instability in the application or the surrounding system.
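
A minimal sketch of what such an early check might look like, built on the merge_size helper above. The issue does not prescribe how available memory would be measured; psutil is used here purely as an assumed way to read it:

import psutil  # assumed third-party dependency, used only to read available memory

def check_merge_memory(left, right, on, itemsize=np.dtype(np.float64).itemsize):
    """Raise MemoryError before attempting a merge predicted not to fit in RAM."""
    rows = merge_size(left, right, on)
    cols = len(left.columns) + (len(right.columns) - 1)  # join column counted once
    required = rows * cols * itemsize
    available = psutil.virtual_memory().available
    if required > available:
        raise MemoryError('merge needs ~{:,} bytes but only {:,} bytes are available'
                          .format(required, available))
    return required

Called before df1.merge(df2, on='name', how='outer'), this fails fast instead of exhausting memory mid-merge.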

Output of pd.show_versions()

>>> pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.2.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.19.0
nose: 1.3.7
pip: 9.0.1
setuptools: 20.10.1
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

@jreback
Contributor

jreback commented Jan 5, 2017

we have a doc warning for this. https://github.com/pandas-dev/pandas/pull/14788/files

I suppose this could be added, though you would have to recast the above to use internal functions, and it would have to raise only in the case of a huge number (we don't directly sample memory).

@jreback added the Performance and Reshaping labels on Jan 5, 2017
@IshankGulati

@jreback @mproffitt Can I work on this issue?

@jreback
Contributor

jreback commented Jan 12, 2017

sure

@mproffitt
Author

@IshankGulati I too am happy for you to pick this up.

For information purposes, I did put together a more complete version of the merge_size function, which also calculates sizes for inner / left / right merges; I posted it on Stack Overflow last week.

The more complete version is:

def merge_size(left_frame, right_frame, group_by, how='inner'):
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    intersection = right_keys & left_keys
    left_diff = left_keys - intersection
    right_diff = right_keys - intersection

    # NaN values in the joining column are missed by the set intersection but
    # are present in the merge, so they are counted separately
    left_nan = len(left_frame[left_frame[group_by] != left_frame[group_by]])
    right_nan = len(right_frame[right_frame[group_by] != right_frame[group_by]])
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan

    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_nan * right_nan]

    # rows contributed by keys present on only one side (kept for left/right/outer merges)
    left_size = [left_groups[group_name] for group_name in left_diff]
    right_size = [right_groups[group_name] for group_name in right_diff]
    if how == 'inner':
        return sum(sizes)
    elif how == 'left':
        return sum(sizes + left_size)
    elif how == 'right':
        return sum(sizes + right_size)
    return sum(sizes + left_size + right_size)

As mentioned there, this function doesn't handle a multi-column group_by very well; instead you would need to calculate this as:

min([merge_size(df1, df2, label, how) for label in group_by])

I am aware that this is a very inefficient way of calculating sizes for multi-index. If that can be solved, it would remove a headache.
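
One possible way around that (a sketch only, which ignores NaN keys for brevity) is to group on all key columns jointly, which gives an exact count for a multi-column key using the same formula:

def merge_size_multi(left_frame, right_frame, group_by, how='inner'):
    # group_by may be a single column name or a list of columns;
    # grouping on the full key gives exact per-key row counts
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()

    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    intersection = left_keys & right_keys

    # keys on both sides multiply; one-sided keys contribute their own counts
    both = sum(left_groups[k] * right_groups[k] for k in intersection)
    left_only = sum(left_groups[k] for k in left_keys - intersection)
    right_only = sum(right_groups[k] for k in right_keys - intersection)

    if how == 'inner':
        return both
    elif how == 'left':
        return both + left_only
    elif how == 'right':
        return both + right_only
    return both + left_only + right_only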

Cheers

@IshankGulati

@mproffitt
If I understand correctly, we should be able to pass the list of columns the dataframes are merged on as group_by to the merge_size function, and it should give the correct size of the final dataframe rather than the sum of merge_size for each individual column.

@mproffitt
Author

@IshankGulati That is the desired behaviour, yes, but in practice passing a list of columns as group_by returns the sum of the sizes calculated for each column in group_by as if each were a separate merge.

As an example, if group_by = 'a' gives 68 rows and group_by = 'b' gives 900 rows, then group_by = ['a', 'b'] currently returns 968 when in fact it should return 68. This is the reason for the min([...]) reduction.

@IshankGulati

IshankGulati commented Jan 19, 2017

@jreback I was going through the code. From what I understand, the columns on which the merge is applied are stored in right_join_keys and left_join_keys. So should I use them as group_by in the merge_size method above?

@jreback
Contributor

jreback commented Jan 19, 2017

yes, this would be a computation that occurs late in the initialization phase (e.g. we already know what keys we have, so you can feed the 'magic function' and get an answer). We would then add an option on merge controlling what to do: errors='raise', 'ignore', 'warn', something like that.
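
A rough illustration of the kind of hook described above; the helper name, its parameters, and the row limit are purely hypothetical and are not actual pandas internals:

import warnings

def _maybe_check_merge_size(predicted_rows, row_limit, errors='warn'):
    # decide what to do once the predicted result size is known,
    # without sampling actual memory usage
    if predicted_rows <= row_limit:
        return
    msg = ('merge is predicted to produce {:,} rows, exceeding the limit of {:,}'
           .format(predicted_rows, row_limit))
    if errors == 'raise':
        raise MemoryError(msg)
    elif errors == 'warn':
        warnings.warn(msg)
    # errors == 'ignore': do nothing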

@mroeschke
Member

Looks like this idea never took off. Additionally, it would be hard to know when to raise, since a user's available memory can vary wildly, so closing for now.
