Skip to content

Solving the Top k most frequent words problem using a max-heap #8125

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 0 commits into from

Conversation

aparibocci
Copy link
Contributor

@aparibocci aparibocci commented Feb 7, 2023

Describe your change:

This PR aims to add an algorithm to identify the top k most frequent strings given a provided string list of elements.
To do this, the algorithm is using a max-heap implementation already existing in this repository (a generic type was introduced to allow the usage).

Time complexity is O(n), where n is the number of words:

  • O(n) for building the max-heap
  • k*O(logn) for extracting the k most frequent strings
  • Add an algorithm?
  • Fix a bug or typo in an existing algorithm?
  • Documentation change?

Checklist:

  • I have read CONTRIBUTING.md.
  • This pull request is all my own work -- I have not plagiarized.
  • I know that pull requests will not be merged if they fail the automated tests.
  • This PR only changes one algorithm file. To ease review, please open separate PRs for separate algorithms.
  • All new Python files are placed inside an existing directory.
  • All filenames are in all lowercase characters with no spaces or dashes.
  • All functions and variable names follow Python naming conventions.
  • All function parameters and return values are annotated with Python type hints.
  • All functions have doctests that pass the automated testing.
  • All new algorithms include at least one URL that points to Wikipedia or another similar explanation.
  • If this pull request resolves one or more open issues then the commit message contains Fixes: #{$ISSUE_NO}.

@algorithms-keeper algorithms-keeper bot added awaiting reviews This PR is ready to be reviewed enhancement This PR modified some existing files require descriptive names This PR needs descriptive function and/or variable names require tests Tests [doctest/unittest/pytest] are required require type hints https://docs.python.org/3/library/typing.html labels Feb 7, 2023
Copy link

@algorithms-keeper algorithms-keeper bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Click here to look at the relevant links ⬇️

🔗 Relevant Links

Repository:

Python:

Automated review generated by algorithms-keeper. If there's any problem regarding this review, please open an issue about it.

algorithms-keeper commands and options

algorithms-keeper actions can be triggered by commenting on this PR:

  • @algorithms-keeper review to trigger the checks for only added pull request files
  • @algorithms-keeper review-all to trigger the checks for all the pull request files, including the modified files. As we cannot post review comments on lines not part of the diff, this command will post all the messages in one comment.

NOTE: Commands are in beta and so this feature is restricted only to a member or owner of the organization.

@algorithms-keeper algorithms-keeper bot removed require descriptive names This PR needs descriptive function and/or variable names require tests Tests [doctest/unittest/pytest] are required require type hints https://docs.python.org/3/library/typing.html labels Feb 7, 2023
Copy link
Contributor

@CaedenPH CaedenPH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very well written algorithm, great stuff!

@aparibocci
Copy link
Contributor Author

Hello @cclauss, I'm tagging you as I saw your involvement in several PRs.
Do you know who can I ask to for a review? Is there anything I should still change in the PR?
Thank you!

@cclauss
Copy link
Member

cclauss commented Apr 23, 2023

All of this seems like it could be done in just a few lines using only a collections.Counter so what are the advantages of this approach?

@aparibocci
Copy link
Contributor Author

All of this seems like it could be done in just a few lines using only a collections.Counter so what are the advantages of this approach?

Thanks for the feedback.
I had a look to collections.Counter and noticed that, indeed, most_common solves exactly this problem: https://github.com/python/cpython/blob/3.11/Lib/collections/__init__.py#L608

This is using heapq.nlargest behind the scenes: https://github.com/python/cpython/blob/3.11/Lib/heapq.py#L523

There is already a Heap class in this repository, I imagined it would be useful to show a typical usage of heaps (i.e. finding out order statistics). Do you think I should close this PR entirely, or do you have other suggestions?

Thank you

@cclauss
Copy link
Member

cclauss commented Apr 23, 2023

from collections import Counter
def top_k_frequent_words(words, k_value):
    return [x[0] for x in Counter(words).most_common(k_value)]

@cclauss
Copy link
Member

cclauss commented Apr 23, 2023

Please put the text of our last two messages into the file’s docstring and then we can merge this. That will show why this work was done and that there also that the Python standard library provides a far more straightforward way to solve the same problem.

@aparibocci
Copy link
Contributor Author

Please put the text of our last two messages into the file’s docstring and then we can merge this. That will show why this work was done and that there also that the Python standard library provides a far more straightforward way to solve the same problem.

Thanks for the feedback.
I have updated the docstring mentioning the (preferable) Python standard library solution.

@algorithms-keeper algorithms-keeper bot added the tests are failing Do not merge until tests pass label Apr 23, 2023
Copy link
Member

@cclauss cclauss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting reviews This PR is ready to be reviewed enhancement This PR modified some existing files tests are failing Do not merge until tests pass
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants