Skip to content

BUG: qcut can fail for highly discontinuous data distributions #31626

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

puneet29
Copy link

@puneet29 puneet29 commented Feb 3, 2020

Fixes #15069. Needs refactoring

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Needs refactoring
@pep8speaks
Copy link

pep8speaks commented Feb 3, 2020

Hello @puneet29! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-02 21:13:47 UTC

@puneet29 puneet29 requested a review from jreback February 3, 2020 17:23
@MarcoGorelli
Copy link
Member

Thanks @puneet29

However, before this PR can be accepted, you need to write tests / make sure you don't break existing ones - see contributing to the code base:

pandas is serious about testing and strongly encourages contributors to embrace test-driven development (TDD). This development process “relies on the repetition of a very short development cycle: first the developer writes an (initially failing) automated test case that defines a desired improvement or new function, then produces the minimum amount of code to pass that test.” So, before actually writing any code, you should write your tests. Often the test can be taken from the original GitHub issue. However, it is always worth considering additional use cases and writing corresponding tests.

Adding tests is one of the most common requests after code is pushed to pandas. Therefore, it is worth getting in the habit of writing tests ahead of time so this is never an issue.

@puneet29
Copy link
Author

puneet29 commented Feb 4, 2020

Hey @MarcoGorelli ! Kindly guide me on what to do with the duplicates parameter, because this PR deals with duplicate values in bins. Can you review the code? Thanks a lot.

@jreback jreback changed the title Fixed #15069 BUG: qcut can fail for highly discontinuous data distributions Feb 5, 2020
@jbrockmendel jbrockmendel added the cut cut, qcut label Mar 2, 2020
@jbrockmendel
Copy link
Member

@puneet29 this needs tests so we know that the bug is fixed. it will probably go in pandas.tests.reshape.test_qcut

@WillAyd
Copy link
Member

WillAyd commented Apr 2, 2020

I think this is stale so closing for now, but ping @puneet29 if you'd like to pick back up

@WillAyd WillAyd closed this Apr 2, 2020
@puneet29
Copy link
Author

puneet29 commented Apr 2, 2020

Hi, I'm sorry. I will take it up again. I was quite busy before. 😅

@WillAyd WillAyd reopened this Apr 2, 2020
@WillAyd
Copy link
Member

WillAyd commented Apr 2, 2020

No worries - thanks!

@puneet29
Copy link
Author

puneet29 commented Apr 2, 2020

Hi @jbrockmendel @MarcoGorelli
I was having trouble with what output do we expect from discontinuous data. Can you please check and verify if I have implemented it correctly? Thanks! :)

@MarcoGorelli
Copy link
Member

Kindly guide me on what to do with the duplicates parameter

From what I've understood reading through the original issue, the output should not change if duplicates is 'drop' - it's in the case when the duplicates keyword hasn't been set that the expected output should change.

Anyway, if you write a test case, then it'll be easier for anyone reviewing to tell how the output has changed.

@jreback
Copy link
Contributor

jreback commented Jun 14, 2020

closing as stale if you want to continue, please open a new PR.

@jreback jreback closed this Jun 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cut cut, qcut
Projects
None yet
Development

Successfully merging this pull request may close these issues.

qcut can fail for highly discontinuous data distributions
6 participants