-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: DataFrame.append with empty DataFrame and Series with tz-aware datetime value allocated object column #35038
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…atetime value allocated object column
Linux py37_np_dev test failures unrelated, xref #35041 |
pandas/tests/reshape/test_concat.py
Outdated
expected["a"] = expected["a"].astype(float) | ||
expected["b"] = expected["b"].astype(float) | ||
expected["date"] = pd.to_datetime(expected["date"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you just fix expected to do this directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh, the comment # These columns get cast to object after append now makes more sense.
pandas/tests/reshape/test_concat.py
Outdated
expected["a"] = expected["a"].astype(float) | ||
expected["b"] = expected["b"].astype(float) | ||
expected["date"] = pd.to_datetime(expected["date"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add tests for empty EA types (period, interval, categorical), we might already have them but this seems like a good place to put them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as additional values in the Series? a scalar Period and scalar Interval. not sure what to do for categorical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't work for interval or period. will raise an issue. This was fixed for datetime64[ns, tzutc()] 'by accident', when ensuring consistency between np.concatenate and concat_compat, see #35032 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a different issue for period and interval since concatenating list of Series doesn't work either, whereas for tz-aware datetime works on master. see OP
on master
>>> interval = pd.Interval(0, 1, closed="right")
>>> interval
Interval(0, 1, closed='right')
>>>
>>> period = pd.Period("2012", freq="A-DEC")
>>> period
Period('2012', 'A-DEC')
>>>
>>> s = pd.Series({"period": period, "interval": interval})
>>> s
period 2012
interval (0, 1]
dtype: object
>>>
>>> df = pd.DataFrame(columns=["c"])
>>> df
Empty DataFrame
Columns: [c]
Index: []
>>>
>>> result = df.append([s, s], ignore_index=True)
>>> result
c period interval
0 NaN 2012 (0, 1]
1 NaN 2012 (0, 1]
>>>
>>> result.dtypes
c object
period object
interval object
dtype: object
>>>
>>> df = pd.DataFrame([s, s])
>>> df
period interval
0 2012 (0, 1]
1 2012 (0, 1]
>>>
>>> df.dtypes
period period[A-DEC]
interval interval[int64]
dtype: object
>>>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for period and interval, the issue is not limited to an empty dataframe
master
>>> df = pd.DataFrame([1, 2], columns=["c"])
>>> df
c
0 1
1 2
>>>
>>> result = df.append([s], ignore_index=True)
>>> result
c period interval
0 1.0 NaN NaN
1 2.0 NaN NaN
2 NaN 2012 (0, 1]
>>>
>>> result.dtypes
c float64
period object
interval object
dtype: object
>>>
>>> result = df.append([s, s], ignore_index=True)
>>> result
c period interval
0 1.0 NaN NaN
1 2.0 NaN NaN
2 NaN 2012 (0, 1]
3 NaN 2012 (0, 1]
>>>
>>> result.dtypes
c float64
period object
interval object
dtype: object
>>>
will raise an issue.
don't think necessary. I think covered by #22994 and #22957 and others linked in those issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kk, yeah sometimes we have scattered tests and hard to tell whether we have exhaustive cases.
[[np.nan, np.nan, 1.0, 2.0, date]], | ||
columns=["c", "d", "a", "b", "date"], | ||
dtype=object, | ||
[[np.nan, np.nan, 1.0, 2.0, date]], columns=["c", "d", "a", "b", "date"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this fully replicate the OP test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extended test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need a whatsnew note? was this a change from 1.0.x?
I didn't add one since no issue had been raised. I will add a release note and reference this PR? |
yep that's good (and generally policy) |
release note added |
@jreback anything more to do here. as an aside and not wanting to weigh in on the 2d EA debate; in order to have a Categorical scalar, would we also need a 0d EA? I wondered this before when working on #33846 where IIRC I though at the time the special casing would not have been necessary with 0d EA. Also, while working on #35032, I think it is starting to make sense to not restrict EAs to 1d. |
thanks @simonjayhawkins |
can-o-worms. well i don't actually think you need a 0-d categorical scalar, a scalar will just do, of course then you cannot infer this to a categorical w/o context but that's ok. |
@@ -152,11 +152,11 @@ def is_nonempty(x) -> bool: | |||
target_dtype = find_common_type([x.dtype for x in to_concat]) | |||
to_concat = [_cast_to_common_type(arr, target_dtype) for arr in to_concat] | |||
|
|||
if isinstance(to_concat[0], ExtensionArray): | |||
if isinstance(to_concat[0], ExtensionArray) and axis == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simonjayhawkins do you remember why is was needed to add this axis=0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see #35032 (comment)
broken off #35032
This PR fixes situation where appending a series consecutively to a empty dataframe produces different result from appending the series in one step.
on master