Skip to content

DatetimeIndex union fails in 0.19rc1 when constructed from differences in DatetimeIndexes #14323

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Liam3851 opened this issue Sep 29, 2016 · 6 comments
Labels
Bug Datetime Datetime data dtype Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@Liam3851
Copy link
Contributor

When constructing a union of 2 DatetimeIndex objects that themselves were constructed from differences from a third DatetimeIndex, the union operator is ignored. This appears to be new behavior in 0.19rc1; the code functioned correctly under 0.18.1.

x = pd.DatetimeIndex(['20160923', '20160924'])                                                                                                                                                 
y = pd.DatetimeIndex(['20160923', '20160925'])                                             
z = pd.date_range("20160921", "20160930")                                                  
a = z.difference(x)                                                                        
b = z.difference(y)                                                                        

In [59]: a.union(b)                                                                                 
Out[59]:                                                                                            
DatetimeIndex(['2016-09-21', '2016-09-22', '2016-09-25', '2016-09-26',                              
               '2016-09-27', '2016-09-28', '2016-09-29', '2016-09-30'],                             
              dtype='datetime64[ns]', freq='D')                                                     

In [60]: a.union(b).difference(a)                                                                   
Out[60]: Index([], dtype='object')

#### Expected Output

# a.union(b) should contain 2016-09-24
# ('2016-09-24' exists in b, since it was in z and not in y,
#  and thus '2016-09-24' should be in a.union(b)).

In [59]: a.union(b)                                                                                 
Out[59]:                                                                                            
DatetimeIndex(['2016-09-21', '2016-09-22', '2016-09-24', '2016-09-25',                              
               '2016-09-26', '2016-09-27', '2016-09-28', '2016-09-29',                              
               '2016-09-30'],                                                                       
              dtype='datetime64[ns]', freq=None, tz=None)

In [60]: a.union(b).difference(a)                                                                   
Out[60]: DatetimeIndex(['2016-09-24'], dtype='datetime64[ns]', freq=None, tz=None)       

Output of pd.show_versions()

## INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.0rc1+0.g497a3bc.dirty
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: 0.9.2
apiclient: 1.4.2
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: 0.2.1

@jorisvandenbossche jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Sep 29, 2016
@jorisvandenbossche
Copy link
Member

@Liam3851 Thanks for the report! I can confirm it is indeed a bug/regression.

Seems it has something to do with offset being now defined:

In [36]: pd.__version__
Out[36]: '0.19.0rc1+34.gc128626'

In [37]: a
Out[37]: 
DatetimeIndex(['2016-09-21', '2016-09-22', '2016-09-25', '2016-09-26',
               '2016-09-27', '2016-09-28', '2016-09-29', '2016-09-30'],
              dtype='datetime64[ns]', freq='D')

In [38]: a.offset
Out[38]: <Day>

vs

In [12]: pd.__version__
Out[12]: '0.18.1'

In [13]: a
Out[13]: 
DatetimeIndex(['2016-09-21', '2016-09-22', '2016-09-25', '2016-09-26',
               '2016-09-27', '2016-09-28', '2016-09-29', '2016-09-30'],
              dtype='datetime64[ns]', freq=None)

In [15]: a.offset is None
Out[15]: True

@jorisvandenbossche jorisvandenbossche added Bug Datetime Datetime data dtype labels Sep 29, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.19.0, 0.19.1 Sep 29, 2016
@Liam3851
Copy link
Contributor Author

Liam3851 commented Oct 3, 2016

It looks to me like this was introduced with #13514. Index.difference now returns a shallow copy of the original index with the differenced values:

return this._shallow_copy(the_diff, name=result_name)

Before, this code returned a new DatetimeIndex:

return Index(theDiff, name=result_name)

I'm not too familiar with this code, but it looks to me like _shallow_copy is copying all the attributes, including the freq (which is no longer valid after the differencing operation). The Index constructor version would have re-computed the frequency during the construction of the new Index (thus determining there was no valid frequency).

@jorisvandenbossche
Copy link
Member

@Liam3851 I think that is a perfect assessment of the situation. We could use _simple_new instead of _shallow_copy (which does not retain the attributes), but then eg the timezone of a DatetimeIndex would get lost.

@sinhrks do you think of a generic approach without adding code to specifically invalidate the freq in case of a DatetimeIndex

@jreback
Copy link
Contributor

jreback commented Oct 4, 2016

you can just pass freq=None to _shallow_copy
don't directly use _simple_new

@jorisvandenbossche
Copy link
Member

Ah, I thought that would not work for PeriodIndex (which wants to keep its freq), but apparently it does.

@Liam3851 Wants to do a PR for this change?

@Liam3851
Copy link
Contributor Author

Liam3851 commented Oct 4, 2016

I'm doing a PR (sorry, still a newb at Git and pandas building). Just want to get the requirements straight for the unit tests:

  1. DatetimeIndex and TimedeltaIndex should be None freq after the differencing, not an inferred frequency (it appears this was the old behavior, counter to what I said above).
  2. PeriodIndex should retain its frequency after differencing, even without the missing datapoints (because the Periods themselves have frequency)
  3. All three should have a union of differences matching a union of indexes.

That about right?

jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this issue Nov 2, 2016
… incorrect

closes pandas-dev#14323

Sets freq to None when doing a difference operation on a DatetimeIndex
or TimedeltaIndex, rather than retaining the frequency (which can
cause  problems with downstream operations). Frequency of PeriodIndex
is retained.

Author: David Krych <[email protected]>

Closes pandas-dev#14346 from Liam3851/dtind_diff_14323 and squashes the following commits:

1dbf582 [David Krych] BUG: GH14323 Union of differences from DatetimeIndex incorrect

(cherry picked from commit bee90a7)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants