CLN: Use defaultdict for minor optimization #32209

Closed
wants to merge 3 commits

Conversation

jaketae
Contributor

@jaketae jaketae commented Feb 23, 2020

Edit _from_nested_dict() by using defaultdict for marginal optimization

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Edit `_from_nested_dict` by using `defaultdict` for marginal optimization
@jaketae jaketae changed the title CLN: Use defaultdict for minor optimization CLN: Use defaultdict for minor optimization Feb 23, 2020
@jbrockmendel
Member

What's the optimization here? Is performance improved?

@jaketae
Contributor Author

jaketae commented Feb 24, 2020

What's the optimization here? Is performance improved?

@jbrockmendel Albeit marginal, yes. Instead of using new_data.get(col, {}), it is more efficient performance-wise to initialize new_data as a defaultdict(dict), which handles missing keys automatically.
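As an editorial aside, the difference between the two approaches can be sketched as follows (the names here are illustrative, not from the PR):

```python
from collections import defaultdict

# With a plain dict, each missing column needs an explicit default lookup
# plus a reassignment back into the dict.
plain = {}
plain["col"] = plain.get("col", {})
plain["col"]["row"] = 1

# With defaultdict(dict), the missing-key case is handled by __missing__,
# which inserts a fresh dict automatically on first access.
auto = defaultdict(dict)
auto["col"]["row"] = 1  # single indexing step, no .get needed

assert plain == dict(auto)
```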

@jbrockmendel
Member

can you provide a %timeit measure showing the improvement

@jaketae
Contributor Author

jaketae commented Feb 24, 2020

@jbrockmendel Here is the code for a dummy test I ran to compare the runtime.

import time
from collections import defaultdict

test_dict = {}
test_defaultdict = defaultdict(dict)
dict_time = []
defaultdict_time = []

# Time 1000 lookups using dict.get with an explicit default
for _ in range(10000):
    start_time = time.time()
    test_lst = [test_dict.get(x, {}) for x in range(1000)]
    dict_time.append(time.time() - start_time)

# Time the same lookups going through defaultdict's __missing__
for _ in range(10000):
    start_time = time.time()
    test_lst = [test_defaultdict[x] for x in range(1000)]
    defaultdict_time.append(time.time() - start_time)

print(sum(dict_time) / len(dict_time))
print(sum(defaultdict_time) / len(defaultdict_time))

I got the following results.

0.000193774962425 
9.86933469772e-05 

Not sure if this is quite what you wanted, but there is a difference in runtime.


EDIT: After increasing the number of iterations to 10000, I got the following results.

0.00218405880928
0.00110734097958

On average, defaultdict appears to be about twice as fast as a vanilla dictionary combined with the get() method.
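For what it's worth, the stdlib timeit module gives less noisy numbers than summing per-loop time.time() deltas; a sketch of the same comparison (an editorial aside, not from the thread):

```python
import timeit

# Both statements run against mappings defined in the setup string;
# dd[x] populates the defaultdict on the first pass, after which
# every access is a plain lookup, mirroring the test above.
setup = "from collections import defaultdict; d = {}; dd = defaultdict(dict)"

t_get = timeit.timeit("[d.get(x, {}) for x in range(1000)]", setup=setup, number=1000)
t_dd = timeit.timeit("[dd[x] for x in range(1000)]", setup=setup, number=1000)

print(t_get, t_dd)
```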

@jbrockmendel
Member

That's a good start. What I had in mind was something like:

In [1]: import pandas as pd
In [2]: data = something_that_would_go_through_the_code_affected_by_this_PR
In [3]: %timeit pd.DataFrame(data)

and then results both before and after this PR
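As an editorial note: assuming the affected helper is reached through DataFrame.from_dict with orient="index" (this entry point is an assumption; the exact call path may differ), a benchmark input could look like:

```python
import pandas as pd

# Hypothetical benchmark input: a nested dict mapping row labels to
# {column: value} dicts. With orient="index", the outer keys become
# the index and the inner keys become the columns.
data = {f"row{i}": {f"col{j}": i * j for j in range(100)} for i in range(1000)}

df = pd.DataFrame.from_dict(data, orient="index")
print(df.shape)  # -> (1000, 100)
```

Running `%timeit pd.DataFrame.from_dict(data, orient="index")` before and after the change would then give the before/after numbers requested.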

@jaketae
Contributor Author

jaketae commented Feb 24, 2020

@jbrockmendel I wasn't quite sure how to test this through the pd.DataFrame() initialization as you suggested, so I conducted a direct side-by-side comparison of the two methods instead.

Here is the setup:

import random
from collections import defaultdict

FACTOR, DIM = 0.1, 10000

# Randomly populate a nested dictionary: each column maps to values at
# a 10% sample of the row labels.
data = {
    f"col {i}": {f"index {j}": "val" for j in random.sample(range(DIM), int(FACTOR * DIM))}
    for i in range(DIM)
}

defaultdict_data = defaultdict(dict, data)  # equivalent data as a defaultdict

def _from_nested_dict(data): # original method
    new_data = {}
    for index, s in data.items():
        for col, v in s.items():
            new_data[col] = new_data.get(col, {})
            new_data[col][index] = v
    return new_data

def _from_nested_dict_PR(data): # PR
    new_data = defaultdict(dict)
    for index, s in data.items():
        for col, v in s.items():
            new_data[col][index] = v
    return new_data

Here are the results:

>>> %timeit _from_nested_dict(data)
6.99 s ± 33.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit _from_nested_dict_PR(defaultdict_data)
4.88 s ± 32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

There seems to be a non-negligible boost in performance, and we would only expect this gap to grow with larger inputs.

Member

@WillAyd WillAyd left a comment


Lgtm

@jaketae
Contributor Author

jaketae commented Feb 24, 2020

@WillAyd @jbrockmendel I'm wondering if we need to cast data to a dictionary as the PR currently does (return dict(data)), or if we can return data as is, keeping its type as defaultdict. As I see it, returning a defaultdict should be fine, since isinstance(data, dict) will return True. Or am I missing something?
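An editorial note on this question: a defaultdict does pass the isinstance check, but unlike a plain dict it silently inserts a value on any lookup of a missing key, which downstream callers may not expect:

```python
from collections import defaultdict

dd = defaultdict(dict, {"a": {"x": 1}})
assert isinstance(dd, dict)  # True: defaultdict subclasses dict

_ = dd["missing"]            # no KeyError; inserts {} as a side effect
assert "missing" in dd       # the lookup mutated the mapping

plain = dict(dd)             # casting drops the default factory
try:
    plain["other"]
except KeyError:
    pass                     # a plain dict raises as expected
```

Whether that side effect matters depends on how the caller uses the returned mapping, which is presumably why checking the internal call sites was suggested.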

@WillAyd
Member

WillAyd commented Feb 24, 2020

Can you check where the function is called internally? If it doesn't change anything, then sure, returning a dict might be unnecessary.

Should annotate the return type of the function in either case

Verified that casting `new_data` with `dict()` is unnecessary; returning a `defaultdict` should be fine.
@jaketae
Contributor Author

jaketae commented Feb 24, 2020

@WillAyd I'm wondering whether an annotation is necessary for _from_nested_dict, as it is a private helper. If it is, I will add the return type annotation as per your suggestion.

@WillAyd
Member

WillAyd commented Feb 24, 2020 via email

@WillAyd
Member

WillAyd commented Feb 26, 2020

@jaketae can you merge master? Should fix CI problem

@jaketae
Contributor Author

jaketae commented Feb 26, 2020

@WillAyd I'm new to Git, and I'm not exactly sure this was the right way to merge. The CI problem seems to have been fixed, though. If this wasn't quite right, I'll close this PR, fork and clone again, and start afresh with a new PR.

Contributor

@jreback jreback left a comment


use git merge upstream/master
and push again

@jreback jreback added the Performance Memory or execution speed performance label Feb 26, 2020
@jaketae
Contributor Author

jaketae commented Feb 26, 2020

I'll open a new PR after setting up a clean fork with a proper branch, and I'll reference this PR in the new one before closing this one. Thank you for your patience and help.
