
Pre-compute the size of an outer merge and raise MemoryError if merge will be too large #15068


Closed
mproffitt opened this issue Jan 5, 2017 · 9 comments
Labels
Enhancement · Error Reporting · Performance · Reshaping

Comments

@mproffitt

When performing an outer merge on large dataframes, pandas tries to compute the merge, and a MemoryError is only thrown once the system has already run out of memory; in more extreme cases, the kernel simply takes over and kills the application.

In the more extreme cases, this makes it impossible to handle the error and / or switch to an alternate approach.

Testing:

import math
import pandas as pd
import numpy as np

def create_table(rows, cols, name_len=2):
    alphabet = list('abcdefghijklmnopqrstuvwxyz')

    # initial table: (cols - 1) numeric columns plus a single-character 'name' key
    table = pd.DataFrame(np.random.rand(rows, cols - 1))
    table = table.assign(name=[np.random.choice(alphabet) for _ in range(len(table))])

    # extra rows whose keys (a character repeated name_len times) can be unique to each frame
    extra = math.ceil(len(table) / 12)
    c = pd.DataFrame(np.random.rand(extra, cols - 1))
    col = []
    for _ in range(extra):
        ch = np.random.choice(alphabet)
        col.append(ch * name_len)  # e.g. 'aaa' when name_len == 3
    table = table.append(c.assign(name=col))
    return table.reset_index(drop=True)

df1 = create_table(13000, 4)
df2 = create_table(370000, 4, name_len=3)
df1.merge(df2, on='name', how='outer')  # Killed

Problem description

The real issue at hand here is that memory exhaustion is what ultimately determines whether a merge can be carried out. Because merges are computationally heavy operations, the system can quickly become unstable while the merge runs, leaving no memory available for other processes.

Secondly, because a MemoryError is not raised until the system has already run out of memory, it is not always possible to handle this within the application.

By first looking at the intersection of groups within the dataframes being merged, it is possible to predict how many rows will be contained within the final merged dataframe. This can be achieved with:

def merge_size(left_frame, right_frame, group_by):
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    intersection = right_keys & left_keys
    left_diff = left_keys - intersection
    right_diff = right_keys - intersection
    
    # if the joining column contains np.nan values, these get missed by the intersections
    # but are present in the merge. These need to be added separately.
    left_nan = len(left_frame.query('{0} != {0}'.format(group_by)))
    right_nan = len(right_frame.query('{0} != {0}'.format(group_by)))
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan
    
    # keys present on both sides multiply (every left row pairs with every right row
    # in that group); keys unique to one side contribute their own row counts
    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_groups[group_name] for group_name in left_diff]
    sizes += [right_groups[group_name] for group_name in right_diff]
    sizes += [left_nan * right_nan]
    return sum(sizes)

Using this method, it is then possible to calculate how much memory will be required by the final merge, such that:

rows = merge_size(df1, df2, 'name')  # merging the 14084-row and 400834-row frames -> 185030090 rows
cols = len(df1.columns) + (len(df2.columns) - 1)  # 7 (join column counted once)
required_memory = (rows * cols) * np.dtype(np.float64).itemsize  # 10361685040 bytes (~10.4 GB)

At this point it would already be clear that the amount of memory required for the merge far outweighs the amount available to the system, and a MemoryError can be raised early without causing instability in the application or the surrounding system.
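
A minimal sketch of what such an early check might look like, built on the merge_size helper above. The issue does not prescribe how available memory would be measured; psutil is used here purely as an assumed way to read it:

import psutil  # assumed third-party dependency, used only to read available memory

def check_merge_memory(left, right, on, itemsize=np.dtype(np.float64).itemsize):
    """Raise MemoryError before attempting a merge predicted not to fit in RAM."""
    rows = merge_size(left, right, on)
    cols = len(left.columns) + (len(right.columns) - 1)  # join column counted once
    required = rows * cols * itemsize
    available = psutil.virtual_memory().available
    if required > available:
        raise MemoryError('merge needs ~{:,} bytes but only {:,} bytes are available'
                          .format(required, available))
    return required

Called before df1.merge(df2, on='name', how='outer'), this fails fast instead of exhausting memory mid-merge.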

Output of pd.show_versions()

>>> pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.2.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.19.0
nose: 1.3.7
pip: 9.0.1
setuptools: 20.10.1
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

@jreback
Contributor

jreback commented Jan 5, 2017

we have a doc warning for this. https://github.com/pandas-dev/pandas/pull/14788/files

I suppose this could be added, though you would have to recast the above to use internal functions, and it would have to raise only in the case of a huge number (we don't directly sample memory).

@jreback added the Performance and Reshaping labels on Jan 5, 2017
@IshankGulati

@jreback @mproffitt Can I work on this issue?

@jreback
Contributor

jreback commented Jan 12, 2017

sure

@mproffitt
Author

@IshankGulati I too am happy for you to pick this up.

For information purposes, I did put together a more complete version of the merge_size function, which also calculates sizes for inner / left / right merges; I posted it on Stack Overflow last week.

The more complete version is:

def merge_size(left_frame, right_frame, group_by, how='inner'):
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    intersection = right_keys & left_keys
    left_diff = left_keys - intersection
    right_diff = right_keys - intersection

    # NaN values in the joining column are missed by the set intersection but
    # are present in the merge, so they are counted separately
    left_nan = len(left_frame[left_frame[group_by] != left_frame[group_by]])
    right_nan = len(right_frame[right_frame[group_by] != right_frame[group_by]])
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan

    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_nan * right_nan]

    # rows contributed by keys present on only one side (kept for left/right/outer merges)
    left_size = [left_groups[group_name] for group_name in left_diff]
    right_size = [right_groups[group_name] for group_name in right_diff]
    if how == 'inner':
        return sum(sizes)
    elif how == 'left':
        return sum(sizes + left_size)
    elif how == 'right':
        return sum(sizes + right_size)
    return sum(sizes + left_size + right_size)

As mentioned there, this function doesn't handle a multi-column group_by very well; instead you would need to calculate this as:

min([merge_size(df1, df2, label, how) for label in group_by])

I am aware that this is a very inefficient way of calculating sizes for multi-index. If that can be solved, it would remove a headache.
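
One possible way around that (a sketch only, which ignores NaN keys for brevity) is to group on all key columns jointly, which gives an exact count for a multi-column key using the same formula:

def merge_size_multi(left_frame, right_frame, group_by, how='inner'):
    # group_by may be a single column name or a list of columns;
    # grouping on the full key gives exact per-key row counts
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()

    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    intersection = left_keys & right_keys

    # keys on both sides multiply; one-sided keys contribute their own counts
    both = sum(left_groups[k] * right_groups[k] for k in intersection)
    left_only = sum(left_groups[k] for k in left_keys - intersection)
    right_only = sum(right_groups[k] for k in right_keys - intersection)

    if how == 'inner':
        return both
    elif how == 'left':
        return both + left_only
    elif how == 'right':
        return both + right_only
    return both + left_only + right_only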

Cheers

@IshankGulati

@mproffitt
If I understand correctly, we should be able to pass the list of columns the dataframes are merged on as group_by to the merge_size function, and it should give the correct size of the final dataframe rather than the sum of merge_size for each individual column.

@mproffitt
Author

@IshankGulati That is the desired behaviour, yes, but in practice passing a list of columns as group_by returns the sum of the sizes calculated for each column in group_by as if each were a separate merge.

As an example, if group_by = 'a' gives 68 rows and group_by = 'b' gives 900 rows, then group_by = ['a', 'b'] currently returns 968 when in fact it should return 68. This is the reason for the min([...]) reduction.

@IshankGulati

IshankGulati commented Jan 19, 2017

@jreback I was going through the code. From what I understand, the columns on which the merge is applied are stored in right_join_keys and left_join_keys. So should I use them as group_by in the merge_size method above?

@jreback
Contributor

jreback commented Jan 19, 2017

yes, this would be a computation that occurs late in the initialization phase (e.g. we already know what keys we have, so you can feed the 'magic function' and get an answer). We would then add an option on merge controlling what to do: errors='raise', 'ignore', 'warn', something like that.
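
A rough illustration of the kind of hook described above; the helper name, its parameters, and the row limit are purely hypothetical and are not actual pandas internals:

import warnings

def _maybe_check_merge_size(predicted_rows, row_limit, errors='warn'):
    # decide what to do once the predicted result size is known,
    # without sampling actual memory usage
    if predicted_rows <= row_limit:
        return
    msg = ('merge is predicted to produce {:,} rows, exceeding the limit of {:,}'
           .format(predicted_rows, row_limit))
    if errors == 'raise':
        raise MemoryError(msg)
    elif errors == 'warn':
        warnings.warn(msg)
    # errors == 'ignore': do nothing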

@mroeschke
Member

Looks like this idea never took off. Additionally, it would be hard to know when to raise, since a user's available memory can vary wildly, so closing for now.
