DatetimeIndex union fails in 0.19rc1 when constructed from differences in DatetimeIndexes #14323

Liam3851 · 2016-09-29T16:24:34Z

When constructing a union of 2 DatetimeIndex objects that themselves were constructed from differences from a third DatetimeIndex, the union operator is ignored. This appears to be new behavior in 0.19rc1; the code functioned correctly under 0.18.1.

x = pd.DatetimeIndex(['20160923', '20160924'])                                                                                                                                                 
y = pd.DatetimeIndex(['20160923', '20160925'])                                             
z = pd.date_range("20160921", "20160930")                                                  
a = z.difference(x)                                                                        
b = z.difference(y)                                                                        

In [59]: a.union(b)                                                                                 
Out[59]:                                                                                            
DatetimeIndex(['2016-09-21', '2016-09-22', '2016-09-25', '2016-09-26',                              
               '2016-09-27', '2016-09-28', '2016-09-29', '2016-09-30'],                             
              dtype='datetime64[ns]', freq='D')                                                     

In [60]: a.union(b).difference(a)                                                                   
Out[60]: Index([], dtype='object')

#### Expected Output

# a.union(b) should contain 2016-09-24
# ('2016-09-24' exists in b, since it was in z and not in y,
#  and thus '2016-09-24' should be in a.union(b)).

In [59]: a.union(b)                                                                                 
Out[59]:                                                                                            
DatetimeIndex(['2016-09-21', '2016-09-22', '2016-09-24', '2016-09-25',                              
               '2016-09-26', '2016-09-27', '2016-09-28', '2016-09-29',                              
               '2016-09-30'],                                                                       
              dtype='datetime64[ns]', freq=None, tz=None)

In [60]: a.union(b).difference(a)                                                                   
Out[60]: DatetimeIndex(['2016-09-24'], dtype='datetime64[ns]', freq=None, tz=None)

Output of `pd.show_versions()`

## INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.0rc1+0.g497a3bc.dirty
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: 0.9.2
apiclient: 1.4.2
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: 0.2.1

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2016-09-29T17:55:46Z

@Liam3851 Thanks for the report! I can confirm it is indeed a bug/regression.

Seems it has something to do with offset being now defined:

In [36]: pd.__version__
Out[36]: '0.19.0rc1+34.gc128626'

In [37]: a
Out[37]: 
DatetimeIndex(['2016-09-21', '2016-09-22', '2016-09-25', '2016-09-26',
               '2016-09-27', '2016-09-28', '2016-09-29', '2016-09-30'],
              dtype='datetime64[ns]', freq='D')

In [38]: a.offset
Out[38]: <Day>

vs

In [12]: pd.__version__
Out[12]: '0.18.1'

In [13]: a
Out[13]: 
DatetimeIndex(['2016-09-21', '2016-09-22', '2016-09-25', '2016-09-26',
               '2016-09-27', '2016-09-28', '2016-09-29', '2016-09-30'],
              dtype='datetime64[ns]', freq=None)

In [15]: a.offset is None
Out[15]: True

Liam3851 · 2016-10-03T17:29:40Z

It looks to me like this was introduced with #13514. Index.difference now returns a shallow copy of the original index with the differenced values:

return this._shallow_copy(the_diff, name=result_name)

Before, this code returned a new DatetimeIndex:

return Index(theDiff, name=result_name)

I'm not too familiar with this code, but it looks to me like _shallow_copy is copying all the attributes, including the freq (which is no longer valid after the differencing operation). The Index constructor version would have re-computed the frequency during the construction of the new Index (thus determining there was no valid frequency).

jorisvandenbossche · 2016-10-04T13:44:58Z

@Liam3851 I think that is a perfect assessment of the situation. We could use _simple_new instead of _shallow_copy (which does not retain the attributes), but then eg the timezone of a DatetimeIndex would get lost.

@sinhrks do you think of a generic approach without adding code to specifically invalidate the freq in case of a DatetimeIndex

jreback · 2016-10-04T13:51:27Z

you can just pass freq=None to _shallow_copy
don't directly use _simple_new

jorisvandenbossche · 2016-10-04T14:00:36Z

Ah, I thought that would not work for PeriodIndex (which wants to keep its freq), but apparently it does.

@Liam3851 Wants to do a PR for this change?

Liam3851 · 2016-10-04T18:29:17Z

I'm doing a PR (sorry, still a newb at Git and pandas building). Just want to get the requirements straight for the unit tests:

DatetimeIndex and TimedeltaIndex should be None freq after the differencing, not an inferred frequency (it appears this was the old behavior, counter to what I said above).
PeriodIndex should retain its frequency after differencing, even without the missing datapoints (because the Periods themselves have frequency)
All three should have a union of differences matching a union of indexes.

That about right?

… incorrect closes pandas-dev#14323 Sets freq to None when doing a difference operation on a DatetimeIndex or TimedeltaIndex, rather than retaining the frequency (which can cause problems with downstream operations). Frequency of PeriodIndex is retained. Author: David Krych <[email protected]> Closes pandas-dev#14346 from Liam3851/dtind_diff_14323 and squashes the following commits: 1dbf582 [David Krych] BUG: GH14323 Union of differences from DatetimeIndex incorrect (cherry picked from commit bee90a7)

jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Sep 29, 2016

jorisvandenbossche added Bug Datetime Datetime data dtype labels Sep 29, 2016

jorisvandenbossche modified the milestones: 0.19.0, 0.19.1 Sep 29, 2016

jreback added Difficulty Intermediate labels Oct 3, 2016

Liam3851 mentioned this issue Oct 4, 2016

BUG: GH14323 Union of differences from DatetimeIndex incorrect #14346

Closed

4 tasks

jreback closed this as completed in bee90a7 Oct 24, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DatetimeIndex union fails in 0.19rc1 when constructed from differences in DatetimeIndexes #14323

DatetimeIndex union fails in 0.19rc1 when constructed from differences in DatetimeIndexes #14323

Liam3851 commented Sep 29, 2016

jorisvandenbossche commented Sep 29, 2016

Liam3851 commented Oct 3, 2016

jorisvandenbossche commented Oct 4, 2016

jreback commented Oct 4, 2016

jorisvandenbossche commented Oct 4, 2016

Liam3851 commented Oct 4, 2016

DatetimeIndex union fails in 0.19rc1 when constructed from differences in DatetimeIndexes #14323

DatetimeIndex union fails in 0.19rc1 when constructed from differences in DatetimeIndexes #14323

Comments

Liam3851 commented Sep 29, 2016

Output of pd.show_versions()

jorisvandenbossche commented Sep 29, 2016

Liam3851 commented Oct 3, 2016

jorisvandenbossche commented Oct 4, 2016

jreback commented Oct 4, 2016

jorisvandenbossche commented Oct 4, 2016

Liam3851 commented Oct 4, 2016

Output of `pd.show_versions()`