ENH: change classification of performance related warnings to a side note #54336


Closed
2 of 3 tasks
MichalRIcar opened this issue Jul 31, 2023 · 12 comments
Labels: Enhancement, Needs Triage

Comments

@MichalRIcar

MichalRIcar commented Jul 31, 2023

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Hello,

I would like to propose changing the classification of performance-related warnings to a side note. The main reasoning is that a performance warning is not really a warning; it is an educational note to the user. And sadly, it is a false positive in 99.9% of real-world usage by expert users.

From a philosophical point of view, the very reason pandas exists is to manipulate data conveniently. Surely there are more efficient ways to do certain operations, for instance using C instead of Python or SQL instead of pandas, but that is not why the DS and ML world uses pandas.

Below is a share from one of many discussions around this topic, as their case is identical to ours: a team of PhDs with code highly optimized for the Python environment, yet we receive thousands upon thousands of irrelevant educational warnings.

An excerpt from the discussion (link below):
Aware that this might be a reply that some will find highly controversial, I'm still posting my opinion here...

Proposed answer: Ignore the warning. If the user thinks/observes that the code suffers from poor performance, it's the user's responsibility to fix it, not the module's responsibility to propose code refactoring steps.

Rationale for this harsh reply: I am seeing this warning now that I have migrated to pandas v2.0.0 at many different places. Reason is that, at multiple places in the script, I remove and add records from dataframes, using many calls to .loc[] and .concat().

Now, given I am pretty savvy in vectorization, we perform these operations with performance in mind (e.g., never inside a for loop, but maybe ripping out an entire block of records, such as overwriting some "inner 20%" of the dataframe, after multiple pd.merge() operations - think of it as ETL operations on a database implemented in pandas instead of SQL). We see that the application runs incredibly fast, even though some dataframes contain ~4.5 mn records. More specifically: For one script, I get >50 of these warnings logged out in <0.3 seconds.. which I, subjectively, don't perceive as particularly "poor performance" (running a serial application with PyCharm in 'debugging' mode - so not exactly a setup in which you would expect best performance in the first place).

So, I conclude:

  • The code ran with pandas <2.0.0, and never raised a warning
  • The performance is excellent
  • We have multiple colleagues with a PhD in high-performance computing working on the code, and they believe it's fine
  • Module warning messages should not be abused for 'tutorials' or 'educational purposes' (even if well intended) - this is different from, for example, the "setting on a copy of a dataframe" warning, where chances are very high that the functional behavior of the module leads to incorrect output. Here, it's a 100% educational warning that deserves, if anything, the logger level "info" (if not "debug"), certainly not "warning"
  • We get an incredibly dirty stdout log, for no reason
  • The warning itself is highly misleading - we don't have a single call to .insert() across the entire ecosystem - the fragmentation that we do have in our dataframes comes from many iterative, but fast, updates - so thanks for sending us down the wrong path
  • We will certainly not refactor code that shows excellent performance, and has been tested and validated over and over again, just because someone from the pandas team wants to educate us about stuff we know :/ If at least the performance were poor, I would welcome this message as a suggestion for improvement (even then: not a warning, but an 'info') - but given its current indiscriminate way of popping up: for once, it's actually the module that's the problem, not the user.

Edit: This is 100% the same issue as the following warning: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance. - which, despite warning me about "performance", pops up 28 times (!) in less than 3 seconds - again, in debugging mode of PyCharm. I'm pretty sure removing the warning alone would improve performance by 20% (or, 20 ms per operation ;)). It also only starts appearing as of pandas v2.0.0 and should be removed from the module altogether.

https://stackoverflow.com/questions/68292862/performancewarning-dataframe-is-highly-fragmented-this-is-usually-the-result-o/76306267#76306267
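
A side note on the non-lexsorted MultiIndex variant quoted above: that particular warning can usually be avoided by lexsorting the index once up front, so that later label-based operations take the fast path. A minimal sketch, with made-up level and column names:

import numpy as np
import pandas as pd

# Made-up frame with a deliberately non-lexsorted MultiIndex
idx = pd.MultiIndex.from_product([["b", "a"], [2, 1]], names=["grp", "num"])
df = pd.DataFrame({"val": np.arange(4)}, index=idx)
print(df.index.is_monotonic_increasing)  # False: the index is not lexsorted

# Sorting once up front lets subsequent label-based operations
# (.loc[], .drop(..., level=...)) take the fast, warning-free path
df = df.sort_index()
print(df.index.is_monotonic_increasing)  # True
sub = df.loc["a"]  # fast label-based selection on the sorted index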

Feature Description

Please remove educational warnings for good, or move them down to the level of a note for users who might be interested.

Alternative Solutions

Introduce a global parameter that allows turning off all educational warnings.

Additional Context

No response

MichalRIcar added the Enhancement and Needs Triage labels on Jul 31, 2023
@lithomas1
Member

Does warnings.simplefilter not work?

@MichalRIcar
Author

MichalRIcar commented Aug 1, 2023

Hi, I am not sharing this proposal only to end the discussion with a filter; I hope we can dig a little deeper.

I would like to elaborate further on the current status of this "warning." I assume that this "warning" was implemented a while ago, and its occurrence has become much worse with pandas 2+.

Thus, my aim is to open a discussion about whether the pandas package should be warning users when they are using the correct function in the right manner and under appropriate circumstances.

Is that the purpose of the warning? And what is the fix for such a warning - and by fix I do not mean filtering it?
We discussed it with colleagues across departments and, simply put, we consider it pure noise and a redundant message.

Even naively asking ChatGPT to fix it, the response is basically to ignore it, as it is not a real warning... and a real fix is not offered. Isn't it concerning that a package's warning and recommendation is treated like that by the community and by AI models? Shouldn't a warning be a warning? You know how it goes: inflation is everywhere. If these types of warnings keep growing, people will simply start ignoring all pandas warnings. I know that is a bit extreme, but I am reasoning ad absurdum here, as an academic thought exercise about whether this element should exist.

The obvious fix for this warning is to not use pandas for data operations and only use it as a final wrapper after all operations are done. Well, if that is so, then I am asking myself: what is the purpose of pandas? I am not being mean here, I am just expressing my thoughts and questioning myself: if we shouldn't use pandas for data manipulation (as it ends up with a performance warning), then we end up not using it at all, and that cannot be the solution (I hope), as pandas is an amazing tool.

So my question is: what exactly is the purpose of this warning in the pandas package? Every time I've tried to start a discussion about it, do a code review, or run performance tests, it ends with "just ignore it, it is nonsense."

Thanks for your opinion and clarification.

@lithomas1
Member

Sorry, I don't think I was clear before, but I was referring to the part about changing the warning to a "side note".
Personally, I think the warning level is fine and I would be opposed to changing it unless there is a problem with simplefilter.
(The warning is a custom warning class in pandas, so it is pretty easy to filter out).
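
For reference, since the class is exposed as pandas.errors.PerformanceWarning, silencing it takes one line; a minimal sketch:

import warnings

import pandas as pd

# Suppress only pandas' PerformanceWarning; all other warnings still surface
warnings.simplefilter("ignore", pd.errors.PerformanceWarning)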

> Hi, I am not sharing this proposal only to end the discussion with a filter; I hope we can dig a little deeper.
>
> I would like to elaborate further on the current status of this "warning." I assume that this "warning" was implemented a while ago, and its occurrence has become much worse with pandas 2+.
>
> Thus, my aim is to open a discussion about whether the pandas package should be warning users when they are using the correct function in the right manner and under appropriate circumstances.

The PerformanceWarning predates my contributions to pandas, and personally I haven't encountered it, but it's generally used to alert you to the fact that unexpected performance degradation might be happening.

I know this isn't very specific, but this warning can be raised from a lot of places (e.g. you are missing an optional dependency, the internal dataframe structure is in a weird state, etc.), so without knowing too much about your use case, I can't really be more helpful here.

I do agree that it can be confusing at times, and the heuristic for raising PerformanceWarnings is probably not bug-free, so I'd recommend opening a new issue with a Minimal Reproducible Example.

(We can always revisit this issue later, depending on whether that one is a false positive or not)

Apart from that, I don't think there's anything I can do or infer based on the anecdotes provided so far.

@MichalRIcar
Author

Hello @lithomas1,

My apologies for the delay; summer can be quite busy. :)

I've checked the pandas source code (source), and the PerformanceWarning appears to be straightforward: it triggers once a DataFrame accumulates more than 100 internal blocks, e.g. when columns are inserted one at a time. Here's a simple example:

    import pandas as pd

    def V1():
        # Insert columns one at a time; each insert creates a new internal block
        X = pd.DataFrame()
        for i in range(101):
            y = pd.DataFrame({f'X{i}': [i] * 10000})
            X[f'X{i}'] = y
        return X

    def V2():
        # Grow the frame by repeated concat instead
        X = pd.DataFrame()
        for i in range(101):
            y = pd.DataFrame({f'X{i}': [i] * 10000})
            X = pd.concat([X, y], axis=1)
        return X

Even though V1 performs better than V2:

V1: 157 ms ± 887 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
V2: 219 ms ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

V1 will produce a PerformanceWarning for every column added past the 100th, while V2 will not.

Is there any particular reason for this warning, or is it simply a relic from older versions of Pandas?

@lithomas1
Member

lithomas1 commented Sep 6, 2023

No worries, I think the issue here is that you are calling concat repeatedly in a for loop. One concat at the end should be enough.

If I add a V3 like this,

def V3():
    # Collect the pieces in a list and concatenate once at the end
    dfs = []
    for i in range(101):
        dfs.append(pd.DataFrame({f'X{i}': [i] * 10000}))
    X = pd.concat(dfs, axis=1)
    return X

and time them, I get something like this:

V1: 2.1140471610124223
V2: 8.657369335996918
V3: 1.777801971998997

so V1 carries roughly a 10-20% performance penalty relative to V3 in this case.

@MichalRIcar
Author

MichalRIcar commented Sep 6, 2023

Hi, that is a relevant note for the comparison, even though in practice it does not hold. The reason is that when an ML team works extensively with a large data set, there is a core data set that we work with and from which features are derived.

The feature-engineering derivation and meta-storage is indeed done via V3; however, plugging the derived features into the main dataset is done via V2. Doing these kinds of steps many times - deriving features from previously derived features, etc. - is a combination of the V2 and V3 approaches, or it can simply be V1.

The point of our discussion and comparison is to show that the PerformanceWarning is a false positive in 99% of real-world scenarios, as this example shows:

import pandas as pd

def V1():
    X = pd.DataFrame()
    for i in range(101):
        y = pd.DataFrame({f'X{i}': [i] * 10000})
        X[f'X{i}'] = y
    return X

def V2():
    X = pd.DataFrame()
    for i in range(101):
        y = pd.DataFrame({f'X{i}': [i] * 10000})
        X = pd.concat([X, y], axis=1)
    return X

def V3():
    X = []
    for i in range(101):
        y = pd.DataFrame({f'X{i}': [i] * 10000})
        X.append(y)
    X = pd.concat(X, axis=1)
    return X

V1: 151 ms ± 841 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
V2: 233 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
V3: 136 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Thus my idea still holds - is it correct to show the warning for V1 when it performs just about the same as your preferred approach?

@MichalRIcar
Author

MichalRIcar commented Sep 6, 2023

To provide a clearer idea of what I meant above and to transition from an academic discussion to a more real-world scenario:

import pandas as pd

def V1():
    X = pd.DataFrame()
    for z in range(10):
        Z, V = [], []
        for i in range(101):
            v = f'X{z}_{i}'
            Z.append(pd.DataFrame({v: [i] * 100000}))
            V.append(v)
        X.loc[:, V] = Z
    return X

def V3():
    X = pd.DataFrame()
    for z in range(10):
        Z = []
        for i in range(101):
            v = f'X{z}_{i}'
            Z.append(pd.DataFrame({v: [i] * 100000}))
        Z = pd.concat(Z, axis=1)
        X = pd.concat([X, Z], axis=1)
    return X

V1: 13.2 s ± 51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
V3: 14.7 s ± 94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

V1 has a desirable property, at least from our perspective. It directly adds newly generated data to the main dataset, which has some positive coding implications.

However, it's worth noting that V1 will produce warnings whereas V3 will not, even though there are relevant situations where V1 is the preferred approach for code construction and data manipulation, and it might even be faster (from a practical point of view, V1 has the same performance as V3).

@mroeschke
Member

The PerformanceWarning also implies that subsequent operations with the V1 dataframe may be unexpectedly slow, because it is fragmented.
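
To make that fragmentation visible, one can peek at the number of internal blocks; a small sketch for illustration only, noting that _mgr.nblocks is a private, unstable attribute:

import pandas as pd

X = pd.DataFrame()
for i in range(101):
    X[f'X{i}'] = [i] * 10000  # each insert can add another internal block

print(X._mgr.nblocks)  # private API, illustration only: many blocks = fragmented
Y = X.copy()           # copy() consolidates the blocks into contiguous arrays
print(Y._mgr.nblocks)  # typically 1 after consolidation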

@MichalRIcar
Author

Hello @mroeschke, is there any type of 'self-healing' function that can fix this?

Let's assume we've performed thousands of operations and have started seeing the PerformanceWarning. As the warning's proposed solution, df.copy(), doesn't work (the warning is still present), I was thinking that some kind of df.heal_structure() would be a great tool if the pandas team considers this warning a potentially serious issue.

Does such a function exist, or is there a reference for a healing process after extensive data manipulations when df.copy() doesn't fix it?

@mroeschke
Member

I would have thought copy would have worked here, as the warning message suggests. cc @jbrockmendel

@jbrockmendel
Member

copy should consolidate, yes. I think to merit a healing-process method we'd need a good collection of things besides consolidation that need healing. Off the top of my head (a rough sketch combining some of these follows after the list):

  • consolidation
  • too many pyarrow chunks (there's an issue for this, too lazy to get the GH ref, i think jeff opened it)
  • infer_objects
  • swap F/C contiguity (depends on what you'll be doing next)
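
For illustration only, here is a minimal sketch of what such a hypothetical heal() helper might combine using public API; the name heal and the selection of steps are assumptions, not anything pandas provides (the pyarrow rechunking and F/C contiguity items above are omitted, as they have no simple public-API equivalent):

import numpy as np
import pandas as pd

def heal(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical post-manipulation cleanup pass (illustrative sketch only)."""
    df = df.copy()           # consolidation: rebuilds into as few blocks as possible
    df = df.infer_objects()  # re-infer proper dtypes for leftover object columns
    return df

# Usage on a deliberately fragmented frame
frag = pd.DataFrame()
for i in range(101):
    frag[f'X{i}'] = np.arange(10000)
healed = heal(frag)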

@mroeschke
Member

In pandas 3.0, PerformanceWarning can be disabled by a global option (see #56920), so closing as implemented.
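
If I read #56920 correctly, the new option is mode.performance_warnings; a hedged sketch of its use (verify the exact option name against your pandas version):

import pandas as pd

# Assumed option name from #56920, available in pandas 3.0+
pd.set_option("mode.performance_warnings", False)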
