Skip to content

validate ='m:m' in dataframe.merge should have a meaningful purpose #27430

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
gitgithan opened this issue Jul 17, 2019 · 5 comments
Closed

validate ='m:m' in dataframe.merge should have a meaningful purpose #27430

gitgithan opened this issue Jul 17, 2019 · 5 comments
Labels
API Design Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@gitgithan
Copy link

Code Sample, a copy-pastable example if possible

Currently, in https://github.com/pandas-dev/pandas/blob/master/pandas/core/reshape/merge.py

elif validate in ["many_to_many", "m:m"]:
            pass

Problem description

I was expecting m:m to give me a warning/error if either the left or right merge keys are unique.
Isn't that the purpose of specifying m:m? That is to ensure it is really a many to many merge.

Note: We receive a lot of issues on our GitHub tracker, so it is very possible that your issue has been posted before. Please check first before submitting so that we do not have to handle and close duplicates!

Note: Many problems can be resolved by simply upgrading pandas to the latest version. Before submitting, please check if that solution works for you. If possible, you may want to check if master addresses this issue, but that is not necessary.

For documentation-related issues, you can check the latest versions of the docs on master here:

https://pandas-docs.github.io/pandas-docs-travis/

If the issue has not been resolved there, go ahead and file it in the issue tracker.

Expected Output

Expected an Error that keys are unique in either left/right/both and it is not a many-to-many merge when validate=m:m is specified.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: 4.3.1
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.4.0
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.2
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.1
pymysql: None
psycopg2: 2.8.3 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@TomAugspurger
Copy link
Contributor

Do you have a minimal example showing this? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@gitgithan
Copy link
Author

# both left right keys unique
df1 = pd.DataFrame({'lkey': ['foo', 'bar'],
                    'value': [1, 2]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar'],
                    'value2': [5, 6]})

df1.merge(df2, left_on='lkey', right_on='rkey',validate='m:m')


# right key unique
df3 = pd.DataFrame({'lkey': ['foo', 'foo'],
                    'value': [1, 2]})
df4 = pd.DataFrame({'rkey': ['foo', 'bar'],
                    'value2': [5, 6]})

df3.merge(df4, left_on='lkey', right_on='rkey',validate='m:m')


# left key unique
df5 = pd.DataFrame({'lkey': ['foo', 'bar'],
                    'value': [1, 2]})
df6 = pd.DataFrame({'rkey': ['bar', 'bar'],
                    'value2': [5, 6]})

df5.merge(df6, left_on='lkey', right_on='rkey',validate='m:m')


# both left right keys not unique
df7 = pd.DataFrame({'lkey': ['foo', 'foo'],
                    'value': [1, 2]})
df8 = pd.DataFrame({'rkey': ['foo', 'foo'],
                    'value2': [5, 6]})

df7.merge(df8, left_on='lkey', right_on='rkey',validate='m:m')

My issue is not that their output is wrong, what i was hoping for is that the 1st three merges will generate a warning/error telling me that it is not a many to many merge as i specified in validate='m:m' and only the 4th merge run with no issue. This will make the behaviour of validate consistent in the sense that if i specify a 1:1, 1:m, m:1, or m:m, i expect an alert if the data in the joining columns are not that way. Currently, the 1st three merges do not generate any alert.
Having this will save the user some non-uniqueness(if it is desired) checking.

@mroeschke mroeschke added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Error Reporting Incorrect or improved errors from pandas labels Nov 2, 2019
@simonjayhawkins
Copy link
Member

I was expecting m:m to give me a warning/error if either the left or right merge keys are unique.
Isn't that the purpose of specifying m:m? That is to ensure it is really a many to many merge.

The expected behaviour is clearly documented.

from https://pandas.pydata.org/pandas-docs/version/1.1.0/reference/api/pandas.merge.html

validate : str, optional
If specified, checks if merge is of specified type.

“one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.

“one_to_many” or “1:m”: check if merge keys are unique in left dataset.

“many_to_one” or “m:1”: check if merge keys are unique in right dataset.

“many_to_many” or “m:m”: allowed, but does not result in checks.

Changing this is an breaking API change. Would first need a deprecation. FWIW i'm -1 on changing this.

@simonjayhawkins simonjayhawkins added API Design Needs Discussion Requires discussion from core team before further action and removed Error Reporting Incorrect or improved errors from pandas labels Jul 22, 2020
@simonjayhawkins
Copy link
Member

The current behaviour was a conscious design choice #16275 (comment)

@mroeschke
Copy link
Member

Thanks for the suggestion but agreed that changing the behavior is probably not worth the churn since it was intentionally designed this way. Closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants