-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
PERF: speed-up DateFrame.itertuples() with namedtuples #11625
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
you would have to add a benchmark for this |
As fas as I can see,
I disagree, to me this is exactly what the
|
pass | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are missing the point here; this should NOT create a named tuple if it fails the try, that is the point; and this code exists below, you are duplicating code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for correcting you, but the code does exactly what you say is to be achieved.
The try ... except statement has an optional else clause, which, when present,
must follow all except clauses. It is useful for code that must be executed
if the try clause does not raise an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are duplicating code, see the return statement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I do not see this. Can you be a little more explicit.
There are two return statements just like in the current version in master, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the statement you added duplicates the other return statement. you don't need the else clause at all. if you think that the _make
actually speeds things, then put it there.
- would need a benchmark as well
- need a test for
name=None
if this actually changed anything (does it?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify: I moved the first return statement inside the else-clause and modified it to use _make
(i.e. I did not add a statement and not introduce (new) duplication).
If you remove the else
-block (with the contained return) then namedtuples will never be returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, this is just for name tuples, ok, then, but just put it where it was. I think else clauses make this very unreadable. No reason not to put it in the try
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the else as discussed above and put it in the try
Here are some simple timings: import collections
import pandas as pd
from pandas.compat import map, zip
class DataFrame(pd.DataFrame):
def itertuples_new(self, index=True, name="Pandas"):
(...)
else:
return (itertuple(*row) for row in zip(*arrays))
# fallback to regular tuples
return zip(*arrays)
def itertuples_make(self, index=True, name="Pandas"):
(...)
else:
return map(itertuple._make, zip(*arrays))
# fallback to regular tuples
return zip(*arrays)
df = DataFrame({'A': 'spam', 'B': range(1000), 'C': None,
'D': range(1000), 'E': range(1000), 'F': range(1000)})
%timeit list(df.itertuples_new())
100 loops, best of 3: 3.04 ms per loop
%timeit list(df.itertuples_make())
100 loops, best of 3: 2.68 ms per loop
%timeit list(df.itertuples_make(name=None))
1000 loops, best of 3: 1.17 ms per loop |
pls add a benchmark to the asv suite (make about 10x bigger). |
$ asv continuous master HEAD -b
frame_methods.frame_itertuples
· Creating environments
· Discovering benchmarks
·· Uninstalling from py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-sci
py-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Building for py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sq
lalchemy-xlrd-xlsxwriter-xlwt
·· Installing into py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy
-sqlalchemy-xlrd-xlsxwriter-xlwt
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For pandas commit hash 2238f73e:
[ 0.00%] ·· Building for py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytable
s-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 0.00%] ·· Benchmarking py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytable
s-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples 10.03ms
[ 50.00%] · For pandas commit hash e29bf614:
[ 50.00%] ·· Building for py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytable
s-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ·· Benchmarking py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytable
s-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· Running ...thods.frame_itertuples.time_frame_itertuples 12.24msB
ENCHMARKS NOT SIGNIFICANTLY CHANGED. |
class frame_itertuples(object): | ||
|
||
def setup(self): | ||
self.df = DataFrame(np.random.randn(5000, 10)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this bigger (50k)
|
@@ -5545,6 +5545,8 @@ def test_itertuples(self): | |||
dfaa = df[['a', 'a']] | |||
self.assertEqual(list(dfaa.itertuples()), [(0, 1, 1), (1, 2, 2), (2, 3, 3)]) | |||
|
|||
self.assertEqual(type(next(df.itertuples(name=None))), tuple) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not good enough as a NamedTuple will match this as well.. Construct an exact tuple like above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for having to correct you again but this is not true:
>>> import collections
>>> spam = collections.namedtuple('spam', 'spam eggs')(1, 2)
>>> type(spam) == tuple
False
>>> spam == (1, 2)
True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't use type(...)
is not idiomatic. use isinstance
and see why what you are doing is wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree: testing is exactly the context where type(spam) is eggs
is what you do if you want to check for exact type and not for ancestry. I would use self.assertIs
but AFAIR this was added in Python 2.7.
What do you propose to use as check here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
read the original PR we already had this discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why test for type when you can explicitly test for object equality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MaximilianR: Thanks for the hint, pandas.common.core.is_named_tuple()
uses hasattr(..., '_fields')
. Is that to be prefered?
@kawochen: See above, the test (as I understand it) is to check the returned values are plain tuples, but namedtuple('spam', 'spam eggs')(1, 2) == (1, 2)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks. that is quite annoying. perhaps we need more utilities for dealing with these, and type
does seem like the way to test for this. IMO you still need to test for object equality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- If you want to test if an instance is a namedtuple use
is_named_tuple
- If you want to test if an instance is a specific namedtuple, I would check the
repr
matches therepr
you expect. Also OK if you additionally check the instances' equality / if the expected instance is a namedtuple.
Type won't help us here @kawochen, because namedtuple's type is just a tuple - it's actually a factory function rather than a type (although there's much debate over whether this is good)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to repr
checking. Note that grepping tests
for Equal(type(
there are some more of such explicit type comparisons (which IMHO this is fine).
@xflr6 pls just follow my directions. It has nothing to do with whether I like it or not. Its not consistent at all in the code base. |
Performance comparison with the regular tuple returning branch:
|
ok, add this issue number onto where #11269 is in whatsnew/v0.17.0 |
There is no issue for this (only the PR), should I open one? |
no use the pr number |
merged via 4ffc3ef thanks! |
Also:
except:
to avoid catchingSystemExit
andKeyboardInterrupt
return
from thetry
-clause to anelse
name=None
(docs?)