DataFrame.apply adds a frequency to a freq=None DatetimeIndex as a side-effect #22150

Closed
dycw opened this issue Aug 1, 2018 · 8 comments · Fixed by #22561
Labels: Apply (Apply, Aggregate, Transform, Map), Datetime (Datetime data dtype), Frequency (DateOffsets)
@dycw

dycw commented Aug 1, 2018

Code Sample, a copy-pastable example if possible

import numpy as np, pandas as pd

def sudden_frequency(num_columns):
    index = pd.DatetimeIndex(["1950-06-30", "1952-10-24", "1953-05-29"])
    columns = list(range(num_columns))
    df = pd.DataFrame(np.random.random((len(index), num_columns)), index, columns)
    df.apply(lambda sr: sr)
    return index

for num_columns in range(5):
    print(num_columns, "--", sudden_frequency(num_columns))

Output:

0 -- DatetimeIndex(['1950-06-30', '1952-10-24', '1953-05-29'], dtype='datetime64[ns]', freq=None)
1 -- DatetimeIndex(['1950-06-30', '1952-10-24', '1953-05-29'], dtype='datetime64[ns]', freq=None)
2 -- DatetimeIndex(['1950-06-30', '1952-10-24', '1953-05-29'], dtype='datetime64[ns]', freq='WOM-4FRI')
3 -- DatetimeIndex(['1950-06-30', '1952-10-24', '1953-05-29'], dtype='datetime64[ns]', freq='WOM-4FRI')
4 -- DatetimeIndex(['1950-06-30', '1952-10-24', '1953-05-29'], dtype='datetime64[ns]', freq='WOM-4FRI')

Problem description

This particular index (found by hypothesis) suddenly gains a frequency when it is used as the index of a DataFrame with >= 2 columns on which ".apply" is then called.

Expected Output

n -- DatetimeIndex(['1950-06-30', '1952-10-24', '1953-05-29'], dtype='datetime64[ns]', freq=None)

for all n.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.16.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C
LOCALE: None.None

pandas: 0.23.3
pytest: 3.6.4
pip: 18.0
setuptools: 39.2.0
Cython: None
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.10
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Aug 1, 2018

Can you provide a more minimal example to reproduce the issue?

@WillAyd added the "Needs Info" label (Clarification about behavior needed to assess issue) on Aug 1, 2018
@dycw
Author

dycw commented Aug 3, 2018

Yes. I have reduced the conditions to the following:

  1. The DatetimeIndex is of length >= 3.
  2. The DatetimeIndex has an inferrable frequency.
  3. The DataFrame has >= 2 columns.

from hypothesis import given
from hypothesis.strategies import composite, dates, integers, sampled_from
from pandas import DataFrame, DatetimeIndex, Timestamp, date_range


@composite
def indices(draw, max_length=5):
    date = draw(
        dates(
            min_value=Timestamp.min.ceil("D").to_pydatetime().date(),
            max_value=Timestamp.max.floor("D").to_pydatetime().date(),
        ).map(Timestamp)
    )
    periods = draw(integers(0, max_length))
    freq = draw(sampled_from(list("BDHTS")))
    dr = date_range(date, periods=periods, freq=freq)
    return DatetimeIndex(list(dr))


@given(index=indices(5), num_columns=integers(0, 5))
def test_main(index, num_columns):
    original = index.copy()
    df = DataFrame(True, index=index, columns=range(num_columns))
    df.apply(lambda x: x)
    assert index.freq == original.freq

One example is

index = DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05'], dtype='datetime64[ns]', freq='D'), num_columns = 2

    @given(index=indices(5), num_columns=integers(0, 5))
    def test_main(index, num_columns):
        original = index.copy()
        df = DataFrame(True, index=index, columns=range(num_columns))
        df.apply(lambda x: x)
>       assert index.freq == original.freq
E       AssertionError: assert <Day> == None
E        +  where <Day> = DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05'], dtype='datetime64[ns]', freq='D').freq
E        +  and   None = DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05'], dtype='datetime64[ns]', freq=None).freq
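
For reference, the same check can be sketched without hypothesis. This is only a minimal sketch: the helper name `freq_survives_apply` is made up for illustration, and the dates are the ones from the failing example above. On an affected version (such as 0.23.3) it would be expected to report False once there are at least 2 columns.

```python
import pandas as pd


def freq_survives_apply(index, num_columns):
    """Return True if index.freq is the same before and after DataFrame.apply.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    freq_before = index.freq
    df = pd.DataFrame(True, index=index, columns=range(num_columns))
    df.apply(lambda col: col)  # identity function; should have no side effects
    return index.freq == freq_before


for num_columns in (1, 2, 3):
    # Build a fresh index from a list each time, so it starts with freq=None
    # even though its values are evenly spaced (daily).
    index = pd.DatetimeIndex(list(pd.date_range("2000-01-03", periods=3, freq="D")))
    print(num_columns, freq_survives_apply(index, num_columns))
```

The index is rebuilt on every iteration because, if the side effect occurs, the mutated index would otherwise carry its new freq into the next run.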

@mroeschke
Member

mroeschke commented Aug 3, 2018

Thanks @dycw. I can reproduce with a similar example:

In [1]: index = pd.DatetimeIndex(["1950-06-30", "1952-10-24", "1953-05-29"])

In [2]: df = pd.DataFrame(1, index=index, columns=range(2))

In [3]: index
Out[3]: DatetimeIndex(['1950-06-30', '1952-10-24', '1953-05-29'], dtype='datetime64[ns]', freq=None)

In [4]: df.apply(lambda x: x)
Out[4]:
            0  1
1950-06-30  1  1
1952-10-24  1  1
1953-05-29  1  1

# Gains a frequency
In [5]: index
Out[5]: DatetimeIndex(['1950-06-30', '1952-10-24', '1953-05-29'], dtype='datetime64[ns]', freq='WOM-4FRI')

In [6]: pd.__version__
Out[6]: '0.23.3'

#14927 may be playing a role here somewhere. Investigation and PRs are always welcome!

@mroeschke added the "Datetime" (Datetime data dtype), "Apply" (Apply, Aggregate, Transform, Map), and "Frequency" (DateOffsets) labels and removed the "Needs Info" (Clarification about behavior needed to assess issue) label on Aug 3, 2018
@HannahFerch
Contributor

I would like to look into this, but I am quite new to open source. Can I ask questions here if I get stuck, or would another place (e.g. Gitter) be better?

@WillAyd
Member

WillAyd commented Aug 10, 2018

@HannahFerch for high-level questions you can ask here or on Gitter. For detailed code review, it is easiest to just push a PR and get feedback directly on that.

@HannahFerch
Contributor

@WillAyd Makes sense. Thanks!

@HannahFerch
Contributor

I have been looking at @mroeschke's example. The frequency is set in pandas.core.apply.FrameRowApply when the results are wrapped by wrap_results_for_axis(). This calls self.obj._constructor, which returns a result object with freq='WOM-4FRI' instead of the original freq=None that went in.
Should the frequency be prevented from being set at this point, or should it be reset to the original None later on, before the df is returned?
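
As a rough illustration of where a value like 'WOM-4FRI' can come from, the sketch below compares the stored `freq` of the index with what pandas infers from its values. It assumes, in line with the "inferrable frequency" condition earlier in the thread, that the frequency attached during `apply` is the inferred one; that has not been confirmed against the pandas source here, and the inferred string may vary between pandas versions.

```python
import pandas as pd

index = pd.DatetimeIndex(["1950-06-30", "1952-10-24", "1953-05-29"])

# freq is what the index was constructed with (None for a plain list of dates).
print("freq:         ", index.freq)

# inferred_freq is computed from the values; reading it should leave
# index.freq untouched.
print("inferred_freq:", index.inferred_freq)

# pd.infer_freq is the function form of the same inference.
print("infer_freq:   ", pd.infer_freq(index))

# Reading the inferred frequency is not expected to change the stored one.
print("freq after:   ", index.freq)
```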

@mroeschke
Member

mroeschke commented Aug 24, 2018

It would be preferable to prevent self.obj._constructor from setting a new freq.

@jreback jreback added this to the 0.24.0 milestone Sep 4, 2018
HannahFerch added a commit to HannahFerch/pandas that referenced this issue Sep 9, 2018
HannahFerch added a commit to HannahFerch/pandas that referenced this issue Sep 16, 2018
# Conflicts:
#	doc/source/whatsnew/v0.24.0.txt
HannahFerch added a commit to HannahFerch/pandas that referenced this issue Sep 16, 2018