BUG: Concatenation of TZ-aware dataframes (#12396) (#18447) #19327


Closed
wants to merge 6 commits into from
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.23.0.txt
@@ -1370,6 +1370,8 @@ Reshaping
- Bug in :meth:`DataFrame.astype` where column metadata is lost when converting to categorical or a dictionary of dtypes (:issue:`19920`)
- Bug in :func:`cut` and :func:`qcut` where timezone information was dropped (:issue:`19872`)
- Bug in :class:`Series` constructor with a ``dtype=str``, previously raised in some cases (:issue:`19853`)
- Bug in :func:`concat` which raised an error when concatenating TZ-aware DataFrames with all-``NaT`` DataFrames (:issue:`12396`)
- Bug in :func:`concat` which raised an error when concatenating empty TZ-aware :class:`Series` (:issue:`18447`)
- Bug in :func:`get_dummies`, and :func:`select_dtypes`, where duplicate column names caused incorrect behavior (:issue:`20848`)
- Bug in :func:`isna`, which cannot handle ambiguous typed lists (:issue:`20675`)
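As a quick illustration of the :issue:`18447` entry, a minimal sketch (assumes a pandas build containing this fix; the integer column labels come from the default ``RangeIndex``):

```python
import pandas as pd

# Concatenating an empty tz-aware Series with a non-empty Series
# previously raised; with the fix the empty side aligns and fills with NaT.
first = pd.Series([], dtype='datetime64[ns]').dt.tz_localize('UTC')
second = pd.Series([1, 2, 3])
result = pd.concat([first, second], axis=1)
```

Column 0 ends up all-``NaT`` but keeps the ``datetime64[ns, UTC]`` dtype rather than raising.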

8 changes: 6 additions & 2 deletions pandas/core/dtypes/concat.py
@@ -465,8 +465,12 @@ def convert_to_pydatetime(x, axis):
    if _contains_datetime:

        if 'datetime' in typs:
-           new_values = np.concatenate([x.view(np.int64) for x in
-                                        to_concat], axis=axis)
+           to_concat = [np.array(x, copy=False).view(np.int64)
+                        for x in to_concat]
+           if axis == 1:
+               to_concat = [np.atleast_2d(x) for x in to_concat]
+
+           new_values = np.concatenate(to_concat, axis=axis)
            return new_values.view(_NS_DTYPE)
        else:
            # when to_concat has different tz, len(typs) > 1.
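Standalone, the hunk's i8-view round-trip can be sketched like this (hypothetical function name; the real code returns `new_values.view(_NS_DTYPE)`):

```python
import numpy as np

def concat_datetime(to_concat, axis=0):
    # View each datetime64 chunk as int64, promote 1-D chunks to 2-D when
    # concatenating along axis=1, then view the result back as datetime64.
    to_concat = [np.array(x, copy=False).view(np.int64) for x in to_concat]
    if axis == 1:
        to_concat = [np.atleast_2d(x) for x in to_concat]
    return np.concatenate(to_concat, axis=axis).view('M8[ns]')

a = np.array(['2015-01-01', '2016-01-01'], dtype='M8[ns]')
b = np.array(['NaT'], dtype='M8[ns]')
out = concat_datetime([a, b])           # 1-D result of length 3
wide = concat_datetime([a, a], axis=1)  # 2-D result of shape (1, 4)
```

The `np.atleast_2d` promotion is what lets an all-NaT 1-D chunk participate in an axis=1 concatenation without shape errors.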
10 changes: 6 additions & 4 deletions pandas/core/indexes/base.py
@@ -2183,17 +2183,19 @@ def _assert_take_fillable(self, values, indices, allow_fill=True,
                              fill_value=None, na_value=np.nan):
        """ Internal method to handle NA filling of take """
        indices = _ensure_platform_int(indices)

        # only fill if we are passing a non-None fill_value
        if allow_fill and fill_value is not None:
            if (indices < -1).any():
                msg = ('When allow_fill=True and fill_value is not None, '
                       'all indices must be >= -1')
                raise ValueError(msg)
-           taken = values.take(indices)
            mask = indices == -1
-           if mask.any():
-               taken[mask] = na_value
+           if mask.all():
Contributor:
@TomAugspurger did we move the empty check into take (the new helper)?

Contributor:
Yeah, algos.take simplifies this. The only thing it doesn't do is the check on lines 2174-2178, where we raise a ValueError when

  • indices has negative values
  • allow_fill is true
  • fill_value is specified

which seems like a strange set of conditions. Going to see what triggers it.

Contributor:
can add a comment here. is this hit by the tests?

+               taken = np.full(indices.shape, fill_value=na_value)
+           else:
+               taken = values.take(indices)
+               if mask.any():
+                   taken[mask] = na_value
        else:
            taken = values.take(indices)
        return taken
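The reworked branch above can be sketched as a standalone helper (hypothetical name, NumPy only; the real method also runs the allow_fill/fill_value validation shown in the hunk):

```python
import numpy as np

def take_fillable(values, indices, na_value=np.nan):
    # -1 marks positions to be filled with na_value
    indices = np.asarray(indices, dtype=np.intp)
    mask = indices == -1
    if mask.all():
        # every position is missing: never call take on `values`, which
        # may be empty (the GH 12396 / GH 18447 failure mode)
        taken = np.full(indices.shape, fill_value=na_value)
    else:
        taken = values.take(indices)
        if mask.any():
            taken[mask] = na_value
    return taken

out = take_fillable(np.array([10.0, 20.0, 30.0]), [0, -1, 2])
empty = take_fillable(np.array([], dtype='f8'), [-1, -1])
```

Checking `mask.all()` first is the fix: the old code always called `values.take(indices)`, which fails when `values` is empty even though every requested position is a fill.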
6 changes: 4 additions & 2 deletions pandas/core/internals.py
@@ -5835,8 +5835,10 @@ def get_reindexed_values(self, empty_dtype, upcasted_na):
        if len(values) and values[0] is None:
            fill_value = None

-       if getattr(self.block, 'is_datetimetz', False):
-           pass
+       if getattr(self.block, 'is_datetimetz', False) or \
Contributor:
So I would like to fix this more generally.

Define something like this on the Block. This should just work generally:

def empty(self):
    """ return a same shape block filled with the empty value for this block """
    arr = np.full(np.prod(self.shape), self._na_value)
    arr = _block_shape(arr, self.ndim)
    return self.make_block_same_klass(arr)

If we can get this to work, then we can clean up a bunch of additional code.

Contributor Author (@paul-mannino, Jan 29, 2018):
Not sure how this simplifies. Won't you have to cast the result to the empty_dtype anyway, which doesn't necessarily match the dtype of the block?

Contributor (@TomAugspurger, Apr 27, 2018):

What would this do for non-NA holding blocks? Raise a TypeError probably?

IOW, what's more important: that you know you'll get a result, or that you know the result will be the same block type? I'd vote for same block type.

Contributor:

And should we pass through placement?

Contributor:

Actually, perhaps something like

def empty(self, dtype=None):
    ...

Then you get back a block for dtype. By default you get the same type.

This solves two problems:

  1. you can do IntBlock.empty(dtype=float) just fine. (will still raise if that type can't hold NA)
  2. As @paul-mannino indicated, we have to cast to empty_dtype anyway.

+               is_datetimetz(empty_dtype):
+           missing_arr = np.full(np.prod(self.shape), fill_value)
Contributor:
move this logic to the Block as I indicated above.

+           return DatetimeIndex(missing_arr, dtype=empty_dtype)
        elif getattr(self.block, 'is_categorical', False):
            pass
        elif getattr(self.block, 'is_sparse', False):
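In isolation, the tz-aware fill this hunk introduces behaves like the sketch below (the dtype and shape are illustrative stand-ins for `empty_dtype` and `self.shape`):

```python
import numpy as np
import pandas as pd

# Materialize the all-missing chunk as a DatetimeIndex instead of a plain
# ndarray so the timezone of the joined dtype survives.
empty_dtype = pd.DatetimeTZDtype(tz='UTC')  # stand-in for the joined dtype
shape = (3,)                                # stand-in for self.shape
missing_arr = np.full(np.prod(shape), pd.NaT)
filled = pd.DatetimeIndex(missing_arr, dtype=empty_dtype)
```

A plain `np.full` of ``NaT`` would lose the timezone; routing it through `DatetimeIndex` with the joined dtype keeps the result tz-aware.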
101 changes: 101 additions & 0 deletions pandas/tests/reshape/test_concat.py
@@ -1917,6 +1917,92 @@ def test_concat_tz_series_tzlocal(self):
tm.assert_series_equal(result, pd.Series(x + y))
assert result.dtype == 'datetime64[ns, tzlocal()]'

@pytest.mark.parametrize('tz1', [None, 'UTC'])
@pytest.mark.parametrize('tz2', [None, 'UTC'])
@pytest.mark.parametrize('s', [pd.NaT, pd.Timestamp('20150101')])
def test_concat_NaT_dataframes_all_NaT_axis_0(self, tz1, tz2, s):
# GH 12396

# tz-naive
first = pd.DataFrame([[pd.NaT], [pd.NaT]]).apply(
lambda x: x.dt.tz_localize(tz1))
second = pd.DataFrame([s]).apply(lambda x: x.dt.tz_localize(tz2))

# we are all NaT so this is ok
if tz1 is None:
tz = tz2
elif tz2 is None:
tz = tz1
elif tz1 == tz2:
tz = tz1
else:
tz = None

result = pd.concat([first, second], axis=0)
expected = pd.DataFrame(pd.Series(
[pd.NaT, pd.NaT, s], index=[0, 1, 0]))
expected = expected.apply(lambda x: x.dt.tz_localize(tz))
assert_frame_equal(result, expected)
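The if/elif ladder above encodes the expected timezone of the result; as a standalone sketch (hypothetical helper name):

```python
def resolve_tz(tz1, tz2):
    # An all-NaT tz-naive side adopts the other side's timezone; matching
    # zones are kept; mismatched zones fall back to tz-naive.
    if tz1 is None:
        return tz2
    if tz2 is None:
        return tz1
    if tz1 == tz2:
        return tz1
    return None
```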

@pytest.mark.parametrize('tz1', [None, 'UTC'])
@pytest.mark.parametrize('tz2', [None, 'UTC'])
def test_concat_NaT_dataframes_all_NaT_axis_1(self, tz1, tz2):
# GH 12396

first = pd.DataFrame(pd.Series([pd.NaT, pd.NaT]).dt.tz_localize(tz1))
second = pd.DataFrame(pd.Series(
[pd.NaT]).dt.tz_localize(tz2), columns=[1])
expected = pd.DataFrame(
{0: pd.Series([pd.NaT, pd.NaT]).dt.tz_localize(tz1),
1: pd.Series([pd.NaT, pd.NaT]).dt.tz_localize(tz2)}
)
result = pd.concat([first, second], axis=1)
assert_frame_equal(result, expected)

@pytest.mark.parametrize('tz1', [None, 'UTC'])
@pytest.mark.parametrize('tz2', [None, 'UTC'])
def test_concat_NaT_series_dataframe_all_NaT(self, tz1, tz2):
# GH 12396

# tz-naive
first = pd.Series([pd.NaT, pd.NaT]).dt.tz_localize(tz1)
second = pd.DataFrame([[pd.Timestamp('2015/01/01', tz=tz2)],
[pd.Timestamp('2016/01/01', tz=tz2)]],
index=[2, 3])

if tz1 is None and tz2 is None:
tz = None

# we are all NaT so this is ok
elif tz1 is None:
tz = tz2
elif tz1 == tz2:
tz = tz1
else:
tz = None
expected = pd.DataFrame([pd.NaT, pd.NaT,
pd.Timestamp('2015/01/01', tz=tz),
pd.Timestamp('2016/01/01', tz=tz)])

result = pd.concat([first, second])
assert_frame_equal(result, expected)

@pytest.mark.parametrize('tz', [None, 'UTC'])
def test_concat_NaT_dataframes(self, tz):
# GH 12396

first = pd.DataFrame([[pd.NaT], [pd.NaT]])
first = first.apply(lambda x: x.dt.tz_localize(tz))
second = pd.DataFrame([[pd.Timestamp('2015/01/01', tz=tz)],
[pd.Timestamp('2016/01/01', tz=tz)]],
index=[2, 3])
expected = pd.DataFrame([pd.NaT, pd.NaT,
pd.Timestamp('2015/01/01', tz=tz),
pd.Timestamp('2016/01/01', tz=tz)])

result = pd.concat([first, second], axis=0)
assert_frame_equal(result, expected)

def test_concat_period_series(self):
x = Series(pd.PeriodIndex(['2015-11-01', '2015-12-01'], freq='D'))
y = Series(pd.PeriodIndex(['2015-10-01', '2016-01-01'], freq='D'))
@@ -1978,6 +2064,21 @@ def test_concat_empty_series(self):
columns=['x', 0])
tm.assert_frame_equal(res, exp)

@pytest.mark.parametrize('tz', [None, 'UTC'])
@pytest.mark.parametrize('values', [[], [1, 2, 3]])
def test_concat_empty_series_timelike(self, tz, values):
# GH 18447

first = Series([], dtype='M8[ns]').dt.tz_localize(tz)
second = Series(values)
expected = DataFrame(
{0: pd.Series([pd.NaT] * len(values),
dtype='M8[ns]'
).dt.tz_localize(tz),
1: values})
result = concat([first, second], axis=1)
assert_frame_equal(result, expected)

def test_default_index(self):
# is_series and ignore_index
s1 = pd.Series([1, 2, 3], name='x')