-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
CLN: Use defaultdict for minor optimization #32209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Edit `_from_nested_dict` by using `defaultdict` for marginal optimization
defaultdict
for minor optimization
whats the optimization here? performance improved? |
@jbrockmendel Albeit marginal, yes. Instead of using the |
can you provide a %timeit measure showing the improvement |
@jbrockmendel Here is the code for a dummy test I ran to compare the runtime. test_dict = {}
test_defaultdict = defaultdict(dict)
dict_time = []
defaultdict_time = []
for _ in range(10000):
start_time = time.time()
test_lst = [test_dict.get(x, {}) for x in range(1000)]
dict_time.append(time.time() - start_time)
for _ in range(10000):
start_time = time.time()
test_lst = [test_defaultdict[x] for x in range(1000)]
defaultdict_time.append(time.time() - start_time)
print(sum(dict_time)/len(dict_time))
print(sum(defaultdict_time)/len(defaultdict_time)) I got the following results.
Not sure if this is quite what you wanted, but there is a difference in runtime. EDIT: After increasing the number of iterations to 10000, I got the following results.
On average, it seems like |
Thats a good start. What i had in mind was something like:
and then results both before and after this PR |
@jbrockmendel I wasn't quite sure on how to test this using the Here is the setup: FACTOR, DIM = 0.1, 10000
data = {f'col {i}':
{f'index {j}': 'val' for j in random.sample(range(DIM), int(FACTOR * DIM))}
for i in range(DIM)} # randomly populate nested dictionary
defaultdict_data = defaultdict(dict, data) # equivalent data in defaultdict
def _from_nested_dict(data): # original method
new_data = {}
for index, s in data.items():
for col, v in s.items():
new_data[col] = new_data.get(col, {})
new_data[col][index] = v
return new_data
def _from_nested_dict_PR(data): # PR
new_data = defaultdict(dict)
for index, s in data.items():
for col, v in s.items():
new_data[col][index] = v
return new_data Here are the results:
There seems to be a non-negligible boost in performance, and we would only expect this gap to grow with larger inputs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm
@WillAyd @jbrockmendel I'm wondering if we need need cast |
Can you check where function is called internally? If it doesn’t change anything then sure returning as dict might be unnecessary Should annotate return type of function in either case |
Verified that casting `new_data` with `dict()` is unnecessary, returning `defaultdict` should be fine.
@WillAyd I'm wondering if annotation is necessary for |
Yea still helpful for internal funcs
…Sent from my iPhone
On Feb 24, 2020, at 1:13 PM, Jake Tae ***@***.***> wrote:
@WillAyd I'm wondering if annotation is necessary for _from_nested_dict as it is a hidden method. If it is, I will add docstrings as per your suggestion.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
@jaketae can you merge master? Should fix CI problem |
@WillAyd I'm new to Git and I'm not exactly sure if this was the right way to merge. The CI problem seems to have been fixed though. If this wasn't quite right, I'll close this PR, fork and clone again and start afresh with a new PR after closing this one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use git merge upstream/master
and push again
I'll open a new PR after setting up a clean fork with a proper branch. Will make sure to reference this PR in the new PR, after which this PR will be closed. Thank you for the patience and help. |
Edit
_from_nested_dict()
by usingdefaultdict
for marginal optimizationblack pandas
git diff upstream/master -u -- "*.py" | flake8 --diff