
BUG: Concatenation of TZ-aware dataframes (#12396) (#18447) #19327


Closed
wants to merge 6 commits

Conversation

@paul-mannino (Contributor) commented Jan 20, 2018

closes #12396
closes #18447

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@paul-mannino (Contributor, Author) commented Jan 20, 2018

Looking for guidance on this issue related to #12396 that I still haven't fixed.

In [6]: df2 = pd.DataFrame([[pd.Timestamp('2015/01/01', tz='UTC')], [pd.Timestamp('2016/01/01', tz='US/Eastern')]])

In [7]: pd.concat([df1, df2]).dtypes
Out[7]: 
0    object
dtype: object

In [8]: pd.concat([df1, df2])
Out[8]: 
                           0
0                        NaN
1                        NaN
0  2015-01-01 00:00:00+00:00
1  2016-01-01 00:00:00-05:00

You said

This is incorrect; we are coercing to object (correct), but the NaT are getting incorrectly coerced to nan here.

What's the best approach for this? Right now, an object block will never have a property that indicates it contains only datelike data. So going forward, should a Block be able to have self.is_datetime/self.is_datetimetz/etc. = True? Or should this be determined when we decide on the fill_value for a concat operation?
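
A minimal sketch of the behaviour under discussion, using only public API; the construction of df1 is assumed here (an all-NaT frame), since it is not shown in the session above:

import pandas as pd

# assumed construction of df1 (all-NaT), matching the case under discussion
df1 = pd.DataFrame([[pd.NaT], [pd.NaT]])
df2 = pd.DataFrame([[pd.Timestamp('2015/01/01', tz='UTC')],
                    [pd.Timestamp('2016/01/01', tz='US/Eastern')]])

result = pd.concat([df1, df2])

# Mixed timezones force coercion to an object column (correct), but the
# missing values coming from df1 should survive as pd.NaT rather than being
# turned into float NaN, which is the incorrect coercion described above.
print(result[0].iloc[0])  # currently prints nan; NaT is the desired value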

@jreback jreback added Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 21, 2018
second = pd.DataFrame([[pd.NaT]])

result = pd.concat([first, second], axis=0)
assert_frame_equal(result, expected, check_datetimelike_compat=True)

Reviewer comment (Contributor):

you don't want to pass this flag on the comparison; that will allow comparisons of actual datetimes in an object column to succeed (not expected here, but we still want to be strict)

first = pd.DataFrame([[pd.NaT], [pd.NaT]], dtype=dtype)

result = pd.concat([first, second], axis=0)
# upcasts for mixed case

Reviewer comment (Contributor):

put a blank line before comments

result = pd.concat([first, second], axis=1)
assert_frame_equal(result, expected, check_datetimelike_compat=True)

# both sides timezone-aware

Reviewer comment (Contributor):

add the comment right below to this, otherwise it looks like 2 cases

def test_concat_NaT_dataframes_mixed_timestamps_and_NaT(self):
# GH 12396

# non-timezone aware

Reviewer comment (Contributor):

tz-naive

assert_frame_equal(result, expected, check_datetimelike_compat=True)

# one side timezone-aware
dtype = DatetimeTZDtype('ns', tz='UTC')

Reviewer comment (Contributor):

I don't really like this use of DatetimeTZDtype, you can just do an explicit astype here

second = second.apply(lambda x: x.astype(dtype))

result = pd.concat([first, second], axis=0)
expected = expected.apply(lambda x: x.astype(dtype))

Reviewer comment (Contributor):

same don't use apply, rather be explicit

result = pd.concat([first, second], axis=1)
assert_frame_equal(result, expected, check_datetimelike_compat=True)

# one side timezone-aware

Reviewer comment (Contributor):

I don't like using DatetimeTZDtype explicitly like this, it's not user facing; instead:

In [10]: pd.DataFrame(pd.Series([pd.NaT, pd.NaT]).dt.tz_localize('UTC'))
Out[10]: 
    0
0 NaT
1 NaT

In [11]: pd.DataFrame(pd.Series([pd.NaT, pd.NaT]).dt.tz_localize('UTC')).dtypes
Out[11]: 
0    datetime64[ns, UTC]
dtype: object

result = pd.concat([first, second], axis=0)
assert_frame_equal(result, expected, check_datetimelike_compat=True)

def test_concat_empty_datetime_series(self):

Reviewer comment (Contributor):

see if there are other very similar tests in this file, and if so move them next to these (or put these tests next to them). We have to split this file as it's getting too big anyhow.

@@ -5598,8 +5598,10 @@ def get_reindexed_values(self, empty_dtype, upcasted_na):
        if len(values) and values[0] is None:
            fill_value = None

        if getattr(self.block, 'is_datetimetz', False):
            pass
        if getattr(self.block, 'is_datetimetz', False) or \

Reviewer comment (Contributor):

so I would like to fix this more generally.

define something like this on the Block. This should just work generally

def empty(self):
    """ return a same shape block filled with the empty value for this block """
    arr = np.full(np.prod(self.shape), self._na_value)
    arr = _block_shape(arr, self.ndim)
    return self.make_block_same_klass(arr)

if we can get this to work, then we can clean up a bunch of additional code.

@paul-mannino (Contributor, Author) commented Jan 29, 2018

Not sure how this simplifies. Won't you have to cast the result to the empty_dtype anyway, which doesn't necessarily match the dtype of the block?

@TomAugspurger (Contributor) commented Apr 27, 2018

What would this do for non-NA holding blocks? Raise a TypeError probably?

IOW, what's more important: that you know you'll get a result, or that you know the result will be the same block type? I'd vote for same block type.


Reviewer comment (Contributor):

And should we pass through placement?


Reviewer comment (Contributor):

Actually, perhaps something like

def empty(self, dtype=None):
    ...

Then you get back a block for dtype. By default you get the same type.

This solves two problems:

  1. you can do IntBlock.empty(dtype=float) just fine. (will still raise if that type can't hold NA)
  2. As @paul-mannino indicated, we have to cast to empty_dtype anyway.
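
A rough sketch of what the empty(dtype=None) variant discussed above might look like; it reuses only the names from the snippet earlier in this thread (_na_value, _block_shape, make_block_same_klass) and is not the actual pandas implementation:

def empty(self, dtype=None):
    """Return a same-shape block filled with this block's NA value; if
    `dtype` is given, cast the fill values first (this will still raise
    if `dtype` cannot hold NA, e.g. a plain integer dtype)."""
    arr = np.full(np.prod(self.shape), self._na_value)
    if dtype is not None:
        # open question from the thread: should this return a block of
        # dtype's class rather than the same class as self?
        arr = arr.astype(dtype)
    arr = _block_shape(arr, self.ndim)
    return self.make_block_same_klass(arr)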

    # only fill if we are passing a non-None fill_value
    if allow_fill and fill_value is not None:
        if (indices < -1).any():
            msg = ('When allow_fill=True and fill_value is not None, '
                   'all indices must be >= -1')
            raise ValueError(msg)
    if values.size == 0:

Reviewer comment (Contributor):

Why only do this when fill_value is not None? Is that what we want? (I'm not sure). If so, maybe add a comment explaining it.
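
To make the convention concrete, here is a minimal standalone sketch of the take-and-fill logic being discussed (take_with_fill is a hypothetical helper, not the pandas implementation):

import numpy as np

def take_with_fill(values, indices, allow_fill=True, fill_value=np.nan):
    indices = np.asarray(indices)
    # only validate and fill when a non-None fill_value is passed,
    # mirroring the condition questioned in the review comment above
    if allow_fill and fill_value is not None:
        if (indices < -1).any():
            raise ValueError('When allow_fill=True and fill_value is not '
                             'None, all indices must be >= -1')
    taken = values.take(indices)  # plain numpy take: -1 wraps to the end
    if allow_fill and fill_value is not None:
        taken[indices == -1] = fill_value
    return taken

take_with_fill(np.array([10.0, 20.0, 30.0]), [0, -1, 2])
# -> array([10., nan, 30.])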

@jreback (Contributor) commented Feb 24, 2018

can you rebase

@TomAugspurger (Contributor) commented:

@paul-mannino do you have time to update?

@TomAugspurger (Contributor) commented:

ping @paul-mannino if you have time to update.

@jreback jreback added this to the 0.23.0 milestone Apr 24, 2018
@jreback (Contributor) commented Apr 24, 2018

I'll fix this up if @paul-mannino doesn't have time in the next day or two

@paul-mannino (Contributor, Author) commented:

I'll pick this up tomorrow evening

@codecov bot commented Apr 26, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@648ca95).
The diff coverage is 100%.


@@            Coverage Diff            @@
##             master   #19327   +/-   ##
=========================================
  Coverage          ?   91.77%           
=========================================
  Files             ?      153           
  Lines             ?    49316           
  Branches          ?        0           
=========================================
  Hits              ?    45262           
  Misses            ?     4054           
  Partials          ?        0
Flag        Coverage Δ
#multiple   90.17% <100%> (?)
#single     41.88% <0%> (?)

Impacted Files                Coverage Δ
pandas/core/internals.py      95.59% <100%> (ø)
pandas/core/indexes/base.py   96.63% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 648ca95...4323f5e.

    mask = indices == -1
    if mask.any():
        taken[mask] = na_value
    if mask.all():

Reviewer comment (Contributor):

@TomAugspurger did we move the empty check into take (the new helper)?


Reviewer comment (Contributor):

Yeah, algos.take simplifies this. The only thing it doesn't do is the check on lines 2174-2178, where we raise a ValueError when

  • indices has negative values
  • allow_fill is true
  • fill_value is specified

which seems like a strange set of conditions. Going to see what triggers it.

@@ -1865,6 +1865,135 @@ def test_concat_tz_series_tzlocal(self):
        tm.assert_series_equal(result, pd.Series(x + y))
        assert result.dtype == 'datetime64[ns, tzlocal()]'

    def test_concat_NaT_dataframes_all_NaT_axis_0(self):

Reviewer comment (Contributor):

can you parameterize these tests, this is too much copy-paste
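
A rough sketch of the parametrization being asked for; the tz values, axis values, and test name are placeholders rather than the exact cases added in this PR:

import pandas as pd
import pytest

@pytest.mark.parametrize('tz', [None, 'UTC', 'US/Eastern'])
@pytest.mark.parametrize('axis', [0, 1])
def test_concat_nat_dataframes(tz, axis):
    # GH 12396: one all-NaT frame concatenated with a timestamp frame
    first = pd.DataFrame([[pd.NaT], [pd.NaT]])
    second = pd.DataFrame([[pd.Timestamp('2015/01/01', tz=tz)],
                           [pd.Timestamp('2016/01/01', tz=tz)]])
    result = pd.concat([first, second], axis=axis)
    assert result.shape == ((4, 1) if axis == 0 else (2, 2))
    # build the expected frame per (tz, axis) combination and compare, e.g.
    # tm.assert_frame_equal(result, expected)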

    mask = indices == -1
    if mask.any():
        taken[mask] = na_value
    if mask.all():

Reviewer comment (Contributor):

can add a comment here. is this hit by the tests?

            pass
        if getattr(self.block, 'is_datetimetz', False) or \
                is_datetimetz(empty_dtype):
            missing_arr = np.full(np.prod(self.shape), fill_value)

Reviewer comment (Contributor):

move this logic to the Block as I indicated above.

@paul-mannino (Contributor, Author) commented Apr 27, 2018

Didn't realize how much refactoring there was left to do. It seems like you want this done ASAP. I don't have time to do this right now--would be a week or two at least. Feel free to take it over.

@jreback jreback modified the milestone: 0.23.0 Apr 27, 2018
@TomAugspurger (Contributor) commented:

No worries. I'll have some time to work on it in ~3 hours.

@TomAugspurger (Contributor) commented:

No luck on the internals refactoring yet. I'm getting tripped up in the concat plan merging.

This reverts commit cf618db.
@TomAugspurger (Contributor) commented:

I won't have time to get to this before the RC. Reverted my changes @paul-mannino.

@TomAugspurger TomAugspurger modified the milestones: 0.23.0, 0.23.1 Apr 28, 2018
@jreback jreback removed this from the 0.23.1 milestone May 10, 2018
@jreback jreback added this to the 0.23.0 milestone May 10, 2018
@jreback (Contributor) commented May 11, 2018

I refactored the tests and found another case, so still WIP.

@jreback (Contributor) commented May 11, 2018

superseded by #21014

@jreback jreback closed this May 11, 2018