-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Concatentation of TZ-aware dataframes (#12396) (#18447) #19327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Looking for guidance on this issue related to #12396 that I still haven't fixed.
You said
What's the best approach for this? Right now, an object block will never have a property that indicates it contains only datelike data. So going forward, should a Block be able to have self.is_datetime/self.is_datetimtz/etc. = True? Or should this be determined when we decide on the fill_value for a concat operation? |
pandas/tests/reshape/test_concat.py
Outdated
second = pd.DataFrame([[pd.NaT]]) | ||
|
||
result = pd.concat([first, second], axis=0) | ||
assert_frame_equal(result, expected, check_datetimelike_compat=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't want to pass this flag on the comparion, that will allow comparisons of actual datetimes in an object column to succeed (not expected here, but still want to be strict)
pandas/tests/reshape/test_concat.py
Outdated
first = pd.DataFrame([[pd.NaT], [pd.NaT]], dtype=dtype) | ||
|
||
result = pd.concat([first, second], axis=0) | ||
# upcasts for mixed case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put a blank line before comments
pandas/tests/reshape/test_concat.py
Outdated
result = pd.concat([first, second], axis=1) | ||
assert_frame_equal(result, expected, check_datetimelike_compat=True) | ||
|
||
# both sides timezone-aware |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the comment right below to this, otherwise it looks like 2 cases
pandas/tests/reshape/test_concat.py
Outdated
def test_concat_NaT_dataframes_mixed_timestamps_and_NaT(self): | ||
# GH 12396 | ||
|
||
# non-timezone aware |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tz-naive
pandas/tests/reshape/test_concat.py
Outdated
assert_frame_equal(result, expected, check_datetimelike_compat=True) | ||
|
||
# one side timezone-aware | ||
dtype = DatetimeTZDtype('ns', tz='UTC') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really like this use of DatetimeTZDtyep, you can just do an explict astype here
pandas/tests/reshape/test_concat.py
Outdated
second = second.apply(lambda x: x.astype(dtype)) | ||
|
||
result = pd.concat([first, second], axis=0) | ||
expected = expected.apply(lambda x: x.astype(dtype)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same don't use apply, rather be explicit
pandas/tests/reshape/test_concat.py
Outdated
result = pd.concat([first, second], axis=1) | ||
assert_frame_equal(result, expected, check_datetimelike_compat=True) | ||
|
||
# one side timezone-aware |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't like using DatetimeTZDtype explicity like this its not user facing, instead
In [10]: pd.DataFrame(pd.Series([pd.NaT, pd.NaT]).dt.tz_localize('UTC'))
Out[10]:
0
0 NaT
1 NaT
In [11]: pd.DataFrame(pd.Series([pd.NaT, pd.NaT]).dt.tz_localize('UTC')).dtypes
Out[11]:
0 datetime64[ns, UTC]
dtype: object
pandas/tests/reshape/test_concat.py
Outdated
result = pd.concat([first, second], axis=0) | ||
assert_frame_equal(result, expect, check_datetimelike_compat=True) | ||
|
||
def test_concat_empty_datetime_series(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see if there are other very similar tests in this file, if so move them next to these (or put these tests next to them). we have to split this file as its getting too big anyhow.
@@ -5598,8 +5598,10 @@ def get_reindexed_values(self, empty_dtype, upcasted_na): | |||
if len(values) and values[0] is None: | |||
fill_value = None | |||
|
|||
if getattr(self.block, 'is_datetimetz', False): | |||
pass | |||
if getattr(self.block, 'is_datetimetz', False) or \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I would like to fix this more generally.
define something like this on the Block. This should just work generally
def empty(self):
""" return a same shape block filled with the empty value for this block"""
arr = np.full(np.prod(self.shape), self._na_value)
arr = _block_shape(arr, self.ndim)
return self.make_block_same_klass(arr)
if can get this to work , then can clean up a bunch of additional code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how this simplifies. Won't you have to cast the result to the empty_dtype
anyway, which doesn't necessarily match the dtype of the block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would this do for non-NA holding blocks? Raise a TypeError probably?
IOW, what's more important: that you know you'll get a result, or that you know the result will be the same block type? I'd vote for same block type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And should we pass through placement
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, perhaps something like
def empty(self, dtype=None):
...
Then you get back a block for dtype
. By defualt you get the same type.
This solves two problems:
- you can do
IntBlock.empty(dtype=float)
just fine. (will still raise if that type can't hold NA) - As @paul-mannino indicated, we have to cast to empty_dtype anyway.
pandas/core/indexes/base.py
Outdated
# only fill if we are passing a non-None fill_value | ||
if allow_fill and fill_value is not None: | ||
if (indices < -1).any(): | ||
msg = ('When allow_fill=True and fill_value is not None, ' | ||
'all indices must be >= -1') | ||
raise ValueError(msg) | ||
if values.size == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only do this when fill_value
is not None? Is that what we want? (I'm not sure). If so, maybe add a comment explaining it.
can you rebase |
@paul-mannino do you have time to update? |
ping @paul-mannino if you have time to update. |
I'll fix this up if @paul-mannino doesn't have time in next day or 2 |
I'll pick this up tomorrow evening |
Codecov Report
@@ Coverage Diff @@
## master #19327 +/- ##
=========================================
Coverage ? 91.77%
=========================================
Files ? 153
Lines ? 49316
Branches ? 0
=========================================
Hits ? 45262
Misses ? 4054
Partials ? 0
Continue to review full report at Codecov.
|
mask = indices == -1 | ||
if mask.any(): | ||
taken[mask] = na_value | ||
if mask.all(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger did we move the empty check into take
(the new helper)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, algos.take
simplifies this. The only thing it doesn't do is the check on lines 2174-2178, where we raise a ValueError when
- indices has negative values
- allow_fill is true
- fill_value is specified
which seems like a strange set of conditions. Going to see what triggers it.
pandas/tests/reshape/test_concat.py
Outdated
@@ -1865,6 +1865,135 @@ def test_concat_tz_series_tzlocal(self): | |||
tm.assert_series_equal(result, pd.Series(x + y)) | |||
assert result.dtype == 'datetime64[ns, tzlocal()]' | |||
|
|||
def test_concat_NaT_dataframes_all_NaT_axis_0(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you parameterize these tests, this is too much copy-paste
mask = indices == -1 | ||
if mask.any(): | ||
taken[mask] = na_value | ||
if mask.all(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can add a comment here. is this hit by the tests?
pass | ||
if getattr(self.block, 'is_datetimetz', False) or \ | ||
is_datetimetz(empty_dtype): | ||
missing_arr = np.full(np.prod(self.shape), fill_value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this logic to the Block as I indicated above.
Didn't realize how much refactoring there was left to do. It seems like you want this done ASAP. I don't have time to do this right now--would be a week or two at least. Feel free to take it over. |
No worries. I'll have some time to work on it in ~3 hours. |
No luck on the internals refactoring yet. I'm getting tripped up in the concat plan merging. |
This reverts commit cf618db.
I won't have time to get to this before the RC. Reverted my changes @paul-mannino. |
i refactored the tests and found another case, so still WIP. |
superseded by #21014 |
closes #12396
closes #18447
git diff upstream/master -u -- "*.py" | flake8 --diff