BUG: Issues with groupby ewm and times #40951

Closed
3 tasks done
stevenschaerer opened this issue Apr 14, 2021 · 0 comments · Fixed by #40952
Labels: Bug, Window (rolling, ewma, expanding)

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

This refers to the code on master at 84d9c5e (2021-04-14). The latest released version of pandas is also affected, although the issues show up differently there.

import pandas as pd

halflife = "23 days"
baseline_df = pd.DataFrame(
    {
        "A": ["a", "b", "a", "b", "a", "b"],
        "B": [0, 0, 1, 1, 2, 2],
        "C": pd.to_datetime(
            [
                "2020-01-01",
                "2020-01-01",
                "2020-01-10",
                "2020-01-02",
                "2020-01-23",
                "2020-01-03",
            ]
        )
    }
)

# Default (cython) engine
cython_result = baseline_df.groupby("A").ewm(halflife=halflife, times="C").mean()
print("cython")
print(cython_result)
# numba engine (requires numba to be installed)
numba_result = baseline_df.groupby("A").ewm(halflife=halflife, times="C").mean(engine="numba")
print("numba")
print(numba_result)

# Reference values: run ewm on each group separately, with that group's own times
expected_result_a = pd.DataFrame([0, 1, 2]).ewm(
    halflife=halflife, times=pd.to_datetime(["2020-01-01", "2020-01-10", "2020-01-23"])
).mean()
expected_result_b = pd.DataFrame([0, 1, 2]).ewm(
    halflife=halflife, times=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"])
).mean()
print("expected")
print("  group a")
print(expected_result_a)
print("  group b")
print(expected_result_b)

Output:

cython
            B
A            
a 0  0.000000
  2  0.500000
  4  1.094088
b 1  0.000000
  3  0.500000
  5  1.094088
numba
            B
A            
a 0  0.000000
  2  0.666667
  4  1.428571
b 1  0.000000
  3  0.666667
  5  1.428571
expected
  group a
          0
0  0.000000
1  0.567395
2  1.221209
  group b
          0
0  0.000000
1  0.507534
2  1.020088

Problem description

The current groupby ewm implementation has three problems when times is not None:

  1. numba implementation: ignores the times entirely.
  2. cython implementation: does not use the correct times/deltas in aggregations.pyx when there are multiple groups.
  3. If the groups are non-trivial, the time vector and the values become out of sync.

I have a branch that fixes these issues and will link to it shortly.
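
To make the three points concrete, here is a minimal pure-Python sketch of a time-weighted, adjusted EWM mean. It assumes each past weight is halved for every elapsed halflife and does no NaN handling. It is not the pandas implementation, and the function names (ewm_mean_from_deltas, halflife_deltas, grouped_ewm_mean) are hypothetical helpers introduced only for illustration. The last two calls feed in deliberately wrong deltas, to show which inputs are consistent with the numbers printed above.

```python
from datetime import datetime, timedelta


def ewm_mean_from_deltas(values, deltas):
    """Adjusted EWM mean; deltas[i - 1] is the gap before values[i], in half-lives."""
    avg = float(values[0])
    old_wt = 1.0
    out = [avg]
    for cur, delta in zip(values[1:], deltas):
        old_wt *= 0.5 ** delta  # decay the accumulated weight by the elapsed time
        avg = (old_wt * avg + cur) / (old_wt + 1.0)
        old_wt += 1.0  # adjusted form: the new observation has weight 1
        out.append(avg)
    return out


def halflife_deltas(times, halflife):
    """Gaps between consecutive timestamps, expressed in half-lives."""
    return [(b - a) / halflife for a, b in zip(times, times[1:])]


def grouped_ewm_mean(keys, values, times, halflife):
    """Split values and times together so they cannot drift out of sync."""
    groups = {}
    for key, value, time in zip(keys, values, times):
        vals, ts = groups.setdefault(key, ([], []))
        vals.append(value)
        ts.append(time)
    return {
        key: ewm_mean_from_deltas(vals, halflife_deltas(ts, halflife))
        for key, (vals, ts) in groups.items()
    }


halflife = timedelta(days=23)
keys = ["a", "b", "a", "b", "a", "b"]
values = [0, 0, 1, 1, 2, 2]
times = [datetime(2020, 1, day) for day in (1, 1, 10, 2, 23, 3)]

# Per-group deltas: what the issue asks for.
per_group = grouped_ewm_mean(keys, values, times, halflife)

# Deltas taken from the ungrouped time column and reused for a group.
ungrouped = ewm_mean_from_deltas([0, 1, 2], halflife_deltas(times, halflife)[:2])

# Times ignored: every gap is treated as exactly one half-life.
untimed = ewm_mean_from_deltas([0, 1, 2], [1.0, 1.0])

for name, result in [
    ("group a", per_group["a"]),
    ("group b", per_group["b"]),
    ("ungrouped deltas", ungrouped),
    ("times ignored", untimed),
]:
    print(name, [round(x, 6) for x in result])
```

Under these assumptions, the per-group results should agree with the "expected" tables, the "ungrouped deltas" line with the cython output, and the "times ignored" line with the numba output. That agreement is consistent with points 1 and 2 but is only a numerical observation, not a statement about the pandas internals.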

Expected Output

cython
            B
A            
a 0  0.000000
  2  0.567395
  4  1.221209
b 1  0.000000
  3  0.507534
  5  1.020088
numba
            B
A            
a 0  0.000000
  2  0.567395
  4  1.221209
b 1  0.000000
  3  0.507534
  5  1.020088
expected
  group a
          0
0  0.000000
1  0.567395
2  1.221209
  group b
          0
0  0.000000
1  0.507534
2  1.020088

Output of pd.show_versions()

@stevenschaerer added the Bug and Needs Triage labels Apr 14, 2021
@mroeschke added the Window (rolling, ewma, expanding) label and removed the Needs Triage label Apr 14, 2021
@mroeschke added this to the 1.3 milestone Apr 16, 2021