
BUG: Concatenation of TZ-aware dataframes (#12396) (#18447) #19327


Closed
wants to merge 6 commits

Conversation

@paul-mannino (Contributor) commented Jan 20, 2018

closes #12396
closes #18447

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@paul-mannino (Contributor, Author) commented Jan 20, 2018

Looking for guidance on this issue related to #12396 that I still haven't fixed.

In [6]: df2 = pd.DataFrame([[pd.Timestamp('2015/01/01', tz='UTC')], [pd.Timestamp('2016/01/01', tz='US/Eastern')]])

In [7]: pd.concat([df1, df2]).dtypes
Out[7]: 
0    object
dtype: object

In [8]: pd.concat([df1, df2])
Out[8]: 
                           0
0                        NaN
1                        NaN
0  2015-01-01 00:00:00+00:00
1  2016-01-01 00:00:00-05:00

You said

This is incorrect; we are coercing to object (correct), but the NaT are getting incorrectly coerced to nan here.

What's the best approach for this? Right now, an object block will never have a property that indicates it contains only datelike data. So going forward, should a Block be able to have self.is_datetime/self.is_datetimetz/etc. = True? Or should this be determined when we decide on the fill_value for a concat operation?
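
A minimal sketch of the behaviour under discussion, using only public API; the construction of df1 is assumed here (an all-NaT frame), since it is not shown in the session above:

import pandas as pd

# assumed construction of df1 (all-NaT), matching the case under discussion
df1 = pd.DataFrame([[pd.NaT], [pd.NaT]])
df2 = pd.DataFrame([[pd.Timestamp('2015/01/01', tz='UTC')],
                    [pd.Timestamp('2016/01/01', tz='US/Eastern')]])

result = pd.concat([df1, df2])

# Mixed timezones force coercion to an object column (correct), but the
# missing values coming from df1 should survive as pd.NaT rather than being
# turned into float NaN, which is the incorrect coercion described above.
print(result[0].iloc[0])  # currently prints nan; NaT is the desired value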

@jreback jreback added Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 21, 2018
second = pd.DataFrame([[pd.NaT]])

result = pd.concat([first, second], axis=0)
assert_frame_equal(result, expected, check_datetimelike_compat=True)

Reviewer comment (Contributor):

you don't want to pass this flag on the comparison; that will allow comparisons of actual datetimes in an object column to succeed (not expected here, but we still want to be strict)

first = pd.DataFrame([[pd.NaT], [pd.NaT]], dtype=dtype)

result = pd.concat([first, second], axis=0)
# upcasts for mixed case

Reviewer comment (Contributor):

put a blank line before comments

result = pd.concat([first, second], axis=1)
assert_frame_equal(result, expected, check_datetimelike_compat=True)

# both sides timezone-aware

Reviewer comment (Contributor):

add the comment right below to this, otherwise it looks like 2 cases

def test_concat_NaT_dataframes_mixed_timestamps_and_NaT(self):
# GH 12396

# non-timezone aware

Reviewer comment (Contributor):

tz-naive

assert_frame_equal(result, expected, check_datetimelike_compat=True)

# one side timezone-aware
dtype = DatetimeTZDtype('ns', tz='UTC')

Reviewer comment (Contributor):

I don't really like this use of DatetimeTZDtype, you can just do an explicit astype here

second = second.apply(lambda x: x.astype(dtype))

result = pd.concat([first, second], axis=0)
expected = expected.apply(lambda x: x.astype(dtype))

Reviewer comment (Contributor):

same don't use apply, rather be explicit

result = pd.concat([first, second], axis=1)
assert_frame_equal(result, expected, check_datetimelike_compat=True)

# one side timezone-aware

Reviewer comment (Contributor):

I don't like using DatetimeTZDtype explicitly like this, it's not user facing; instead:

In [10]: pd.DataFrame(pd.Series([pd.NaT, pd.NaT]).dt.tz_localize('UTC'))
Out[10]: 
    0
0 NaT
1 NaT

In [11]: pd.DataFrame(pd.Series([pd.NaT, pd.NaT]).dt.tz_localize('UTC')).dtypes
Out[11]: 
0    datetime64[ns, UTC]
dtype: object

result = pd.concat([first, second], axis=0)
assert_frame_equal(result, expected, check_datetimelike_compat=True)

def test_concat_empty_datetime_series(self):

Reviewer comment (Contributor):

see if there are other very similar tests in this file, and if so move them next to these (or put these tests next to them). We have to split this file as it's getting too big anyhow.

@@ -5598,8 +5598,10 @@ def get_reindexed_values(self, empty_dtype, upcasted_na):
        if len(values) and values[0] is None:
            fill_value = None

        if getattr(self.block, 'is_datetimetz', False):
            pass
        if getattr(self.block, 'is_datetimetz', False) or \

Reviewer comment (Contributor):

so I would like to fix this more generally.

define something like this on the Block. This should just work generally

def empty(self):
    """ return a same shape block filled with the empty value for this block """
    arr = np.full(np.prod(self.shape), self._na_value)
    arr = _block_shape(arr, self.ndim)
    return self.make_block_same_klass(arr)

if we can get this to work, then we can clean up a bunch of additional code.

@paul-mannino (Contributor, Author) commented Jan 29, 2018

Not sure how this simplifies. Won't you have to cast the result to the empty_dtype anyway, which doesn't necessarily match the dtype of the block?

@TomAugspurger (Contributor) commented Apr 27, 2018

What would this do for non-NA holding blocks? Raise a TypeError probably?

IOW, what's more important: that you know you'll get a result, or that you know the result will be the same block type? I'd vote for same block type.


Reviewer comment (Contributor):

And should we pass through placement?


Reviewer comment (Contributor):

Actually, perhaps something like

def empty(self, dtype=None):
    ...

Then you get back a block for dtype. By default you get the same type.

This solves two problems:

  1. you can do IntBlock.empty(dtype=float) just fine. (will still raise if that type can't hold NA)
  2. As @paul-mannino indicated, we have to cast to empty_dtype anyway.
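
A rough sketch of what the empty(dtype=None) variant discussed above might look like; it reuses only the names from the snippet earlier in this thread (_na_value, _block_shape, make_block_same_klass) and is not the actual pandas implementation:

def empty(self, dtype=None):
    """Return a same-shape block filled with this block's NA value; if
    `dtype` is given, cast the fill values first (this will still raise
    if `dtype` cannot hold NA, e.g. a plain integer dtype)."""
    arr = np.full(np.prod(self.shape), self._na_value)
    if dtype is not None:
        # open question from the thread: should this return a block of
        # dtype's class rather than the same class as self?
        arr = arr.astype(dtype)
    arr = _block_shape(arr, self.ndim)
    return self.make_block_same_klass(arr)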

    # only fill if we are passing a non-None fill_value
    if allow_fill and fill_value is not None:
        if (indices < -1).any():
            msg = ('When allow_fill=True and fill_value is not None, '
                   'all indices must be >= -1')
            raise ValueError(msg)
    if values.size == 0:

Reviewer comment (Contributor):

Why only do this when fill_value is not None? Is that what we want? (I'm not sure). If so, maybe add a comment explaining it.
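
To make the convention concrete, here is a minimal standalone sketch of the take-and-fill logic being discussed (take_with_fill is a hypothetical helper, not the pandas implementation):

import numpy as np

def take_with_fill(values, indices, allow_fill=True, fill_value=np.nan):
    indices = np.asarray(indices)
    # only validate and fill when a non-None fill_value is passed,
    # mirroring the condition questioned in the review comment above
    if allow_fill and fill_value is not None:
        if (indices < -1).any():
            raise ValueError('When allow_fill=True and fill_value is not '
                             'None, all indices must be >= -1')
    taken = values.take(indices)  # plain numpy take: -1 wraps to the end
    if allow_fill and fill_value is not None:
        taken[indices == -1] = fill_value
    return taken

take_with_fill(np.array([10.0, 20.0, 30.0]), [0, -1, 2])
# -> array([10., nan, 30.])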

@jreback (Contributor) commented Feb 24, 2018

can you rebase

@TomAugspurger (Contributor) commented:

@paul-mannino do you have time to update?

@TomAugspurger (Contributor) commented:

ping @paul-mannino if you have time to update.

@jreback jreback added this to the 0.23.0 milestone Apr 24, 2018
@jreback (Contributor) commented Apr 24, 2018

I'll fix this up if @paul-mannino doesn't have time in the next day or two

@paul-mannino (Contributor, Author) commented:

I'll pick this up tomorrow evening

@codecov bot commented Apr 26, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@648ca95).
The diff coverage is 100%.


@@            Coverage Diff            @@
##             master   #19327   +/-   ##
=========================================
  Coverage          ?   91.77%           
=========================================
  Files             ?      153           
  Lines             ?    49316           
  Branches          ?        0           
=========================================
  Hits              ?    45262           
  Misses            ?     4054           
  Partials          ?        0
Flag        Coverage Δ
#multiple   90.17% <100%> (?)
#single     41.88% <0%> (?)

Impacted Files                Coverage Δ
pandas/core/internals.py      95.59% <100%> (ø)
pandas/core/indexes/base.py   96.63% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 648ca95...4323f5e.

    mask = indices == -1
    if mask.any():
        taken[mask] = na_value
    if mask.all():

Reviewer comment (Contributor):

@TomAugspurger did we move the empty check into take (the new helper)?


Reviewer comment (Contributor):

Yeah, algos.take simplifies this. The only thing it doesn't do is the check on lines 2174-2178, where we raise a ValueError when

  • indices has negative values
  • allow_fill is true
  • fill_value is specified

which seems like a strange set of conditions. Going to see what triggers it.

@@ -1865,6 +1865,135 @@ def test_concat_tz_series_tzlocal(self):
        tm.assert_series_equal(result, pd.Series(x + y))
        assert result.dtype == 'datetime64[ns, tzlocal()]'

    def test_concat_NaT_dataframes_all_NaT_axis_0(self):

Reviewer comment (Contributor):

can you parameterize these tests, this is too much copy-paste
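
A rough sketch of the parametrization being asked for; the tz values, axis values, and test name are placeholders rather than the exact cases added in this PR:

import pandas as pd
import pytest

@pytest.mark.parametrize('tz', [None, 'UTC', 'US/Eastern'])
@pytest.mark.parametrize('axis', [0, 1])
def test_concat_nat_dataframes(tz, axis):
    # GH 12396: one all-NaT frame concatenated with a timestamp frame
    first = pd.DataFrame([[pd.NaT], [pd.NaT]])
    second = pd.DataFrame([[pd.Timestamp('2015/01/01', tz=tz)],
                           [pd.Timestamp('2016/01/01', tz=tz)]])
    result = pd.concat([first, second], axis=axis)
    assert result.shape == ((4, 1) if axis == 0 else (2, 2))
    # build the expected frame per (tz, axis) combination and compare, e.g.
    # tm.assert_frame_equal(result, expected)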

    mask = indices == -1
    if mask.any():
        taken[mask] = na_value
    if mask.all():

Reviewer comment (Contributor):

can add a comment here. is this hit by the tests?

            pass
        if getattr(self.block, 'is_datetimetz', False) or \
                is_datetimetz(empty_dtype):
            missing_arr = np.full(np.prod(self.shape), fill_value)

Reviewer comment (Contributor):

move this logic to the Block as I indicated above.

@paul-mannino (Contributor, Author) commented Apr 27, 2018

Didn't realize how much refactoring there was left to do. It seems like you want this done ASAP. I don't have time to do this right now--would be a week or two at least. Feel free to take it over.

@jreback jreback modified the milestone: 0.23.0 Apr 27, 2018
@TomAugspurger (Contributor) commented:

No worries. I'll have some time to work on it in ~3 hours.

@TomAugspurger (Contributor) commented:

No luck on the internals refactoring yet. I'm getting tripped up in the concat plan merging.

This reverts commit cf618db.
@TomAugspurger (Contributor) commented:

I won't have time to get to this before the RC. Reverted my changes @paul-mannino.

@TomAugspurger TomAugspurger modified the milestones: 0.23.0, 0.23.1 Apr 28, 2018
@jreback jreback removed this from the 0.23.1 milestone May 10, 2018
@jreback jreback added this to the 0.23.0 milestone May 10, 2018
@jreback (Contributor) commented May 11, 2018

I refactored the tests and found another case, so still WIP.

@jreback (Contributor) commented May 11, 2018

superseded by #21014

@jreback jreback closed this May 11, 2018