Skip to content

ENH: option to check merge is one-to-one, many-to-one, one-to-many, or many-to-many #16270

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
nickeubank opened this issue May 6, 2017 · 4 comments · Fixed by #16275
Closed
Labels
Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@nickeubank
Copy link
Contributor

In my experience, there are few places in data work where problems with data are more evident than when merging datasets -- something that is both a problem (if you think you're doing a one-to-one merge and one of the keys isn't unique in one dataset, you can introduce huge problems) and an opportunity (checking that a merge works as expected is a great way to catch problems).

With that in mind, I'd like to propose adding a check_merge argument that accepts strings one_to_one, one_to_many, may_to_one, and many_to_many. If one of those strings is passed, the merge function would then check uniqueness of merge variables before running the merge, and if for example the merge key for the first dataset has duplicates in a one_to_many merge, it would throw an (informative) exception.

(Stata made a similar move to bake this functionality into its merge command around Stata 12 using 1:1, 1:m, m:1 syntax, which I'm also open to).

Though this functionality can be replicated with user tests, it gets tiring to write them every time...

@jreback
Copy link
Contributor

jreback commented May 6, 2017

see also #15068

@jreback
Copy link
Contributor

jreback commented May 6, 2017

I like the stat syntax, so maybe for readability accept both. Also call this check= I think. or maybe validate=

@jreback jreback added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode Difficulty Intermediate labels May 6, 2017
@jreback jreback added this to the Next Major Release milestone May 6, 2017
@nickeubank
Copy link
Contributor Author

Sounds good. Will pull together a PR and ping for input.

@nickeubank
Copy link
Contributor Author

Thoughts on whether it should check that keys AREN'T unique for the many tests?

i.e. if someone runs merge(left, right, on='a', validate="many_to_one") and it turns out that a is unique in left, should that throw an exception? I don't think Stata does, and my inclination is to say no, but open to input

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.21.0, Next Major Release May 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants