CLN: Use defaultdict for minor optimization #32209

Closed
wants to merge 3 commits

Conversation

jaketae
Contributor

@jaketae jaketae commented Feb 23, 2020

Edit _from_nested_dict() by using defaultdict for marginal optimization

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Edit `_from_nested_dict` by using `defaultdict` for marginal optimization
@jaketae jaketae changed the title CLN: Use defaultdict for minor optimization CLN: Use defaultdict for minor optimization Feb 23, 2020
@jbrockmendel
Member

What's the optimization here? Is performance improved?

@jaketae
Contributor Author

jaketae commented Feb 24, 2020

What's the optimization here? Is performance improved?

@jbrockmendel Albeit marginal, yes. Instead of using new_data.get(col, {}), it is more efficient performance-wise to initialize new_data as a defaultdict(dict), which handles missing keys automatically.
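As an editorial aside, the difference between the two approaches can be sketched as follows (the names here are illustrative, not from the PR):

```python
from collections import defaultdict

# With a plain dict, each missing column needs an explicit default lookup
# plus a reassignment back into the dict.
plain = {}
plain["col"] = plain.get("col", {})
plain["col"]["row"] = 1

# With defaultdict(dict), the missing-key case is handled by __missing__,
# which inserts a fresh dict automatically on first access.
auto = defaultdict(dict)
auto["col"]["row"] = 1  # single indexing step, no .get needed

assert plain == dict(auto)
```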

@jbrockmendel
Member

can you provide a %timeit measure showing the improvement

@jaketae
Contributor Author

jaketae commented Feb 24, 2020

@jbrockmendel Here is the code for a dummy test I ran to compare the runtime.

import time
from collections import defaultdict

test_dict = {}
test_defaultdict = defaultdict(dict)
dict_time = []
defaultdict_time = []

# Time 1000 lookups using dict.get with an explicit default
for _ in range(10000):
    start_time = time.time()
    test_lst = [test_dict.get(x, {}) for x in range(1000)]
    dict_time.append(time.time() - start_time)

# Time the same lookups going through defaultdict's __missing__
for _ in range(10000):
    start_time = time.time()
    test_lst = [test_defaultdict[x] for x in range(1000)]
    defaultdict_time.append(time.time() - start_time)

print(sum(dict_time) / len(dict_time))
print(sum(defaultdict_time) / len(defaultdict_time))

I got the following results.

0.000193774962425 
9.86933469772e-05 

Not sure if this is quite what you wanted, but there is a difference in runtime.


EDIT: After increasing the number of iterations to 10000, I got the following results.

0.00218405880928
0.00110734097958

On average, defaultdict appears to be about twice as fast as a vanilla dictionary combined with the get() method.
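For what it's worth, the stdlib timeit module gives less noisy numbers than summing per-loop time.time() deltas; a sketch of the same comparison (an editorial aside, not from the thread):

```python
import timeit

# Both statements run against mappings defined in the setup string;
# dd[x] populates the defaultdict on the first pass, after which
# every access is a plain lookup, mirroring the test above.
setup = "from collections import defaultdict; d = {}; dd = defaultdict(dict)"

t_get = timeit.timeit("[d.get(x, {}) for x in range(1000)]", setup=setup, number=1000)
t_dd = timeit.timeit("[dd[x] for x in range(1000)]", setup=setup, number=1000)

print(t_get, t_dd)
```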

@jbrockmendel
Member

That's a good start. What I had in mind was something like:

In [1]: import pandas as pd
In [2]: data = something_that_would_go_through_the_code_affected_by_this_PR
In [3]: %timeit pd.DataFrame(data)

and then results both before and after this PR
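As an editorial note: assuming the affected helper is reached through DataFrame.from_dict with orient="index" (this entry point is an assumption; the exact call path may differ), a benchmark input could look like:

```python
import pandas as pd

# Hypothetical benchmark input: a nested dict mapping row labels to
# {column: value} dicts. With orient="index", the outer keys become
# the index and the inner keys become the columns.
data = {f"row{i}": {f"col{j}": i * j for j in range(100)} for i in range(1000)}

df = pd.DataFrame.from_dict(data, orient="index")
print(df.shape)  # -> (1000, 100)
```

Running `%timeit pd.DataFrame.from_dict(data, orient="index")` before and after the change would then give the before/after numbers requested.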

@jaketae
Contributor Author

jaketae commented Feb 24, 2020

@jbrockmendel I wasn't quite sure how to test this through the pd.DataFrame() initialization as you suggested, so I conducted a direct side-by-side comparison of the two methods instead.

Here is the setup:

import random
from collections import defaultdict

FACTOR, DIM = 0.1, 10000

# Randomly populate a nested dictionary: each column maps to values at
# a 10% sample of the row labels.
data = {
    f"col {i}": {f"index {j}": "val" for j in random.sample(range(DIM), int(FACTOR * DIM))}
    for i in range(DIM)
}

defaultdict_data = defaultdict(dict, data)  # equivalent data as a defaultdict

def _from_nested_dict(data): # original method
    new_data = {}
    for index, s in data.items():
        for col, v in s.items():
            new_data[col] = new_data.get(col, {})
            new_data[col][index] = v
    return new_data

def _from_nested_dict_PR(data): # PR
    new_data = defaultdict(dict)
    for index, s in data.items():
        for col, v in s.items():
            new_data[col][index] = v
    return new_data

Here are the results:

>>> %timeit _from_nested_dict(data)
6.99 s ± 33.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit _from_nested_dict_PR(defaultdict_data)
4.88 s ± 32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

There seems to be a non-negligible boost in performance, and we would only expect this gap to grow with larger inputs.

Member

@WillAyd WillAyd left a comment


Lgtm

@jaketae
Contributor Author

jaketae commented Feb 24, 2020

@WillAyd @jbrockmendel I'm wondering if we need to cast data to a dictionary as the PR currently does (return dict(data)), or if we can return data as is, keeping its type as defaultdict. As I see it, returning a defaultdict should be fine, since isinstance(data, dict) will return True. Or am I missing something?
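An editorial note on this question: a defaultdict does pass the isinstance check, but unlike a plain dict it silently inserts a value on any lookup of a missing key, which downstream callers may not expect:

```python
from collections import defaultdict

dd = defaultdict(dict, {"a": {"x": 1}})
assert isinstance(dd, dict)  # True: defaultdict subclasses dict

_ = dd["missing"]            # no KeyError; inserts {} as a side effect
assert "missing" in dd       # the lookup mutated the mapping

plain = dict(dd)             # casting drops the default factory
try:
    plain["other"]
except KeyError:
    pass                     # a plain dict raises as expected
```

Whether that side effect matters depends on how the caller uses the returned mapping, which is presumably why checking the internal call sites was suggested.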

@WillAyd
Member

WillAyd commented Feb 24, 2020

Can you check where the function is called internally? If it doesn't change anything, then sure, returning a dict might be unnecessary.

Should annotate the return type of the function in either case

Verified that casting `new_data` with `dict()` is unnecessary; returning a `defaultdict` should be fine.
@jaketae
Contributor Author

jaketae commented Feb 24, 2020

@WillAyd I'm wondering whether an annotation is necessary for _from_nested_dict, as it is a private helper. If it is, I will add the return type annotation as per your suggestion.

@WillAyd
Member

WillAyd commented Feb 24, 2020 via email

@WillAyd
Member

WillAyd commented Feb 26, 2020

@jaketae can you merge master? Should fix CI problem

@jaketae
Contributor Author

jaketae commented Feb 26, 2020

@WillAyd I'm new to Git, and I'm not exactly sure this was the right way to merge. The CI problem seems to have been fixed, though. If this wasn't quite right, I'll close this PR, fork and clone again, and start afresh with a new PR.

Contributor

@jreback jreback left a comment


use git merge upstream/master
and push again

@jreback jreback added the Performance Memory or execution speed performance label Feb 26, 2020
@jaketae
Contributor Author

jaketae commented Feb 26, 2020

I'll open a new PR after setting up a clean fork with a proper branch, and I'll reference this PR in the new one before closing this one. Thank you for your patience and help.
