ENH: change classification of performance related warnings to a side note #54336
Comments
Does `warnings.simplefilter` not work?
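For concreteness, the filter alluded to here would presumably look something like this sketch, using the stdlib warnings module and the public `pandas.errors.PerformanceWarning` class:

```python
import warnings

import pandas as pd

# Silence pandas performance warnings process-wide. PerformanceWarning
# is exposed publicly under pandas.errors.
warnings.simplefilter("ignore", category=pd.errors.PerformanceWarning)
```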
Hi, I am not sharing this proposal only to end the discussion with a filter; I hope we can dig a little deeper. I would like to elaborate further on the current status of this "warning". I assume it was implemented a while ago, and its occurrence has become much worse with pandas 2+. Thus, my aim is to open a discussion about whether the pandas package should be warning users who are using the correct function in the right manner and under appropriate circumstances. Is that the purpose of the warning? And what is the fix for such a warning - and by fix I do not mean filtering it. Even "stupidly" asking ChatGPT to fix it, the response is basically to ignore it, as it is not a real warning, and no real fix is offered. Isn't it concerning that a package's warning and recommendation is treated like that by the community and by AI models? Shouldn't a warning be a warning? You know how it goes: inflation is everywhere. And if these types of warnings grow, people will just start ignoring all pandas warnings. I know that is a bit extreme, but I am reasoning ad absurdum here, as an academic thought process for questioning the existence of an element. The obvious fix for this warning is to not use pandas for data operations and only use it as a final wrapper after all operations. But if that is so, then I ask myself: what is the purpose of pandas? I am not being mean here; I am just expressing my thoughts and questioning myself, because if we shouldn't use pandas for data manipulation (as it ends with a performance warning), then we end up not using it at all, and that is not the solution (I hope), as pandas is an amazing tool. So my question is: what exactly is the purpose of this warning in the pandas package? Every time I've tried to start a discussion about it, do a code revision, or run performance tests, it ends with "just ignore it, it is nonsense." Thanks for your opinion and clarification.
Sorry, I don't think I was clear before, but I was referring to the part about changing the warning to a "side note".
The PerformanceWarning predates my contributions to pandas, and personally I haven't encountered it, but it's generally used to alert you to the fact that unexpected performance degradation might be happening. I know this isn't very specific, but this warning can be raised from a lot of places (e.g. you are missing an optional dependency, the internal dataframe structure is in a weird state, etc.), so without knowing too much about your use case, I can't really be more helpful here. I do agree that it can be confusing at times, and the heuristic for raising PerformanceWarnings is probably not bug-free, so I'd recommend opening a new issue with a Minimal Reproducible Example. (We can always revisit this issue later, depending on whether that one is a false positive or not.) Apart from that, I don't think there's anything I can do or infer based on the anecdotes provided so far.
Hello @lithomas1, my apologies for the delay; summer can be quite busy. :) I've checked the pandas source code (source), and the PerformanceWarning appears to be straightforward: it triggers once a DataFrame accumulates more than 100 internal blocks, e.g. when columns are added one at a time. Here's a simple example:
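The snippet itself did not survive the thread export; below is a minimal sketch of the V1/V2 pattern described here, with the column and row counts (200 and 10) chosen only for illustration:

```python
import numpy as np
import pandas as pd

N_COLS, N_ROWS = 200, 10

# V1: assign columns one at a time. Each assignment inserts a new block,
# so once the frame holds more than 100 blocks pandas emits
# "PerformanceWarning: DataFrame is highly fragmented...".
df_v1 = pd.DataFrame(index=range(N_ROWS))
for i in range(N_COLS):
    df_v1[f"col{i}"] = np.random.randint(0, 100, N_ROWS)

# V2: concat a one-column frame on every iteration. This copies the data
# repeatedly but never calls insert, so no PerformanceWarning is raised.
df_v2 = pd.DataFrame(index=range(N_ROWS))
for i in range(N_COLS):
    new_col = pd.DataFrame({f"col{i}": np.random.randint(0, 100, N_ROWS)})
    df_v2 = pd.concat([df_v2, new_col], axis=1)
```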
Even though V1 performs better than V2, V1 will produce a PerformanceWarning for every column added past the 100-block point, while V2 will not. Is there any particular reason for this warning, or is it simply a relic from older versions of pandas?
No worries. I think the issue here is that you are calling concat repeatedly in a for loop; one concat at the end should be enough. If I add a V3 like this,
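The V3 snippet is also missing from the export; here is a minimal sketch of the single-concat pattern being described, reusing the assumed sizes from above:

```python
import numpy as np
import pandas as pd

# V3: build each column separately, then join everything with a single
# concat at the end. One copy, no fragmentation, no warning.
pieces = [
    pd.DataFrame({f"col{i}": np.random.randint(0, 100, 10)})
    for i in range(200)
]
df_v3 = pd.concat(pieces, axis=1)
```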
and do timings, I get something like this
so it's around a 10-20% perf penalty in this case.
Hi, that is a relevant note for a comparison, even though in practice it does not hold. The reason is that when an ML team works extensively with a large dataset, there is a core dataset that we work with and from which features are derived. The feature-engineering derivation and meta-storage is indeed done via V3; however, plugging derived features back into the main dataset is done via V2. Doing these kinds of steps many times - deriving features from previously derived features, etc. - is a combination of the V2 and V3 approaches, or it can simply be V1. The point of our discussion and comparison is to show that the PerformanceWarning is a false positive in 99% of real-world scenarios, as this example shows:
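Only the `import pandas as pd` line of the benchmark survived the export; the rest below is a hedged reconstruction of the kind of V1-vs-V3 comparison being made, with row and column counts assumed rather than taken from the original:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N_ROWS, N_COLS = 100_000, 150

# V1: derive features and assign them straight onto the main frame
# (warns once the frame exceeds ~100 blocks).
def v1() -> pd.DataFrame:
    df = pd.DataFrame({"base": rng.random(N_ROWS)})
    for i in range(N_COLS):
        df[f"feat{i}"] = df["base"] * i
    return df

# V3: derive features into a dict, then attach them with one concat.
def v3() -> pd.DataFrame:
    df = pd.DataFrame({"base": rng.random(N_ROWS)})
    feats = {f"feat{i}": df["base"] * i for i in range(N_COLS)}
    return pd.concat([df, pd.DataFrame(feats)], axis=1)
```

Timed with `%timeit v1()` and `%timeit v3()`, the reported result was: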
V1: 151 ms ± 841 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Thus my idea still holds: is it correct to show the warning for V1 when it performs just about the same as your preferred approach?
To provide a clearer idea of what I meant above and to transition from an academic discussion to a more real-world scenario:
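The code for this scenario was likewise lost in the export; the following is a sketch, under assumed sizes, of the chained-derivation pattern being described, where each new feature is computed from the previous one and stored straight on the main frame (the V1 style):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# V1 keeps every intermediate feature on the main frame, so each new
# feature can be derived directly from the previous one.
df = pd.DataFrame({"feat0": rng.random(1_000_000)})
for i in range(1, 150):
    df[f"feat{i}"] = df[f"feat{i-1}"] * 0.5 + 1.0
```

With V3, one would instead accumulate the series in a dict and concat once at the end, at the cost of not having intermediate features available on the frame itself.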
V1: 13.2 s ± 51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
V1 has a desirable property, at least from our perspective: it adds newly generated data directly to the main dataset, which has some positive coding implications. However, it's worth noting that V1 will produce warnings whereas V3 will not, even though there are relevant situations where V1 is the preferred approach in code construction and data manipulation, and it might be faster (from a practical point of view, V1 has the same performance as V3).
Hello @mroeschke, is there any kind of 'self-healing' function that can fix this? Let's assume we've performed thousands of operations and have started seeing the PerformanceWarning. Since the warning's proposed solution, df.copy(), doesn't work (the warning is still present), I was thinking that some kind of df.heal_structure() would be a great tool if the pandas team considers this warning a potentially serious issue. Does such a function exist, or is there a reference for a healing process after extensive data manipulation, when df.copy() doesn't fix it?
I would have thought `copy` would consolidate the blocks?
copy should consolidate, yes. I think to merit a healing-process method we'd need a good collection of things besides consolidation that need healing. Off the top of my head:
|
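As a quick check of the consolidation point above, one can inspect the internal block count before and after a copy. This is a sketch only: `_mgr` and `nblocks` are private, unstable pandas internals, used here purely for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(10))
for i in range(150):
    df[f"col{i}"] = np.random.randint(0, 100, 10)  # fragments the frame

print(df._mgr.nblocks)       # many blocks -> "highly fragmented"
healed = df.copy()           # the fix suggested by the warning itself
print(healed._mgr.nblocks)   # expected to drop back toward 1 if copy consolidates
```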
In pandas 3.0, |
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
Hello,
I would like to propose changing the classification of performance-related warnings to a side note. The main reasoning is that the performance warning is not a warning; it is an educational note to the user. And sadly, in real-world usage by expert users, it is a false positive 99.9% of the time.
From a philosophical point of view, the very purpose of pandas is to manipulate data conveniently. Surely there are more efficient ways to do certain operations, for instance using C instead of Python or SQL instead of pandas, but that is not why the DS and ML world uses pandas.
Below is a share from one of many discussions around this topic, as that case is identical to ours: a team of PhDs with code highly optimized for the Python environment, yet receiving thousands upon thousands of irrelevant educational warnings.
An excerpt from the discussion (link below):
Aware that this might be a reply that some will find highly controversial, I'm still posting my opinion here...
Proposed answer: Ignore the warning. If the user thinks/observes that the code suffers from poor performance, it's the user's responsibility to fix it, not the module's responsibility to propose code refactoring steps.
Rationale for this harsh reply: now that I have migrated to pandas v2.0.0, I am seeing this warning in many different places. The reason is that, at multiple places in the script, I remove and add records from dataframes, using many calls to .loc[] and .concat().
Now, given that I am pretty savvy with vectorization, we perform these operations with performance in mind (e.g., never inside a for loop, but rather ripping out an entire block of records, such as overwriting some "inner 20%" of the dataframe after multiple pd.merge() operations - think of it as ETL operations on a database implemented in pandas instead of SQL). We see that the application runs incredibly fast, even though some dataframes contain ~4.5 million records. More specifically: for one script, I get >50 of these warnings logged in <0.3 seconds, which I, subjectively, don't perceive as particularly "poor performance" (running a serial application with PyCharm in debugging mode - so not exactly a setup in which you would expect the best performance in the first place).
So, I conclude:
- The code ran with pandas <2.0.0 and never raised a warning.
- The performance is excellent.
- We have multiple colleagues with a PhD in high-performance computing working on the code, and they believe it's fine.
- Module warning messages should not be abused for 'tutorials' or 'educational purposes' (even if well intended). This is different from, for example, the "setting with copy" warning, where chances are very high that the functional behavior of the module leads to incorrect output. Here, it's a 100% educational warning that deserves, if anything, the logger level "info" (if not "debug"), certainly not "warning".
- We get an incredibly dirty stdout log, for no reason.
- The warning itself is highly misleading: we don't have a single call to .insert() across the entire ecosystem. The fragmentation that we do have in our dataframes comes from many iterative, but fast, updates - so thanks for sending us down the wrong path.
- We will certainly not refactor code that shows excellent performance and has been tested and validated over and over again, just because someone from the pandas team wants to educate us about stuff we already know :/ If at least the performance were poor, I would welcome this message as a suggestion for improvement (even then: not a warning, but an 'info'). But given its current indiscriminate way of popping up: for once, it's actually the module that's the problem, not the user.
Edit: This is 100% the same issue as the following warning: "PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance." - which, despite warning me about "performance", pops up 28 times (!) in less than 3 seconds - again, in PyCharm's debugging mode. I'm pretty sure removing the warning alone would improve performance by 20% (or 20 ms per operation ;)). It also started bothering me as of pandas v2.0.0 and should be removed from the module altogether.
https://stackoverflow.com/questions/68292862/performancewarning-dataframe-is-highly-fragmented-this-is-usually-the-result-o/76306267#76306267
Feature Description
Please remove educational warnings for good, or move them to the level of a note for users who might be interested.
Alternative Solutions
Introduce a global parameter allowing users to turn off all educational warnings.
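For illustration only, such a switch might look like the sketch below; the option name `mode.educational_warnings` is hypothetical and does not exist in pandas:

```python
import warnings

import pandas as pd

# Hypothetical API sketch: "mode.educational_warnings" is NOT a real
# pandas option, so the call below would raise OptionError today; it
# only illustrates the shape of the proposed global switch.
# pd.set_option("mode.educational_warnings", False)

# Until such an option exists, the closest approximation is a stdlib
# warnings filter against the public PerformanceWarning class:
warnings.simplefilter("ignore", category=pd.errors.PerformanceWarning)
```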
Additional Context
No response