BUG: DataFrame.append with empty DataFrame and Series with tz-aware datetime value allocates object column #35038


Merged: 9 commits merged into pandas-dev:master on Jul 16, 2020

Conversation

simonjayhawkins (Member):

broken off #35032

This PR fixes a situation where appending a Series consecutively to an empty DataFrame produces a different result from appending the Series in one step.

on master

>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1974.g0159cba6e'
>>>
>>> import dateutil
>>>
>>> date = pd.Timestamp("2018-10-24 07:30:00", tz=dateutil.tz.tzutc())
>>> date
Timestamp('2018-10-24 07:30:00+0000', tz='tzutc()')
>>>
>>> s = pd.Series({"date": date, "a": 1.0, "b": 2.0})
>>> s
date    2018-10-24 07:30:00+00:00
a                               1
b                               2
dtype: object
>>>
>>> df = pd.DataFrame(columns=["c", "d"])
>>> df
Empty DataFrame
Columns: [c, d]
Index: []
>>>
>>> result_a = df.append(s, ignore_index=True)
>>> result_a
     c    d    a    b                       date
0  NaN  NaN  1.0  2.0  2018-10-24 07:30:00+00:00
>>>
>>> result_a.dtypes
c        object
d        object
a       float64
b       float64
date     object
dtype: object
>>>
>>> result_b = result_a.append(s, ignore_index=True)
>>> result_b
     c    d    a    b                       date
0  NaN  NaN  1.0  2.0  2018-10-24 07:30:00+00:00
1  NaN  NaN  1.0  2.0  2018-10-24 07:30:00+00:00
>>>
>>> result_b.dtypes
c        object
d        object
a       float64
b       float64
date     object
dtype: object
>>>
>>> result = df.append([s, s], ignore_index=True)
>>> result
     c    d                      date    a    b
0  NaN  NaN 2018-10-24 07:30:00+00:00  1.0  2.0
1  NaN  NaN 2018-10-24 07:30:00+00:00  1.0  2.0
>>>
>>> result.dtypes
c                        object
d                        object
date    datetime64[ns, tzutc()]
a                       float64
b                       float64
dtype: object
>>>

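A side note for readers on current pandas: `DataFrame.append` was deprecated in 1.4 and removed in 2.0, so the one-step form of the example above is now spelled with `pd.concat`. A minimal sketch (not this PR's code):

```python
import pandas as pd

# Rebuild the objects from the example above.
date = pd.Timestamp("2018-10-24 07:30:00", tz="UTC")
s = pd.Series({"date": date, "a": 1.0, "b": 2.0})
df = pd.DataFrame(columns=["c", "d"])

# pd.DataFrame([s, s]) infers a tz-aware dtype for "date" column-by-column,
# matching the df.append([s, s]) result shown above.
result = pd.concat([df, pd.DataFrame([s, s])], ignore_index=True)
print(result.dtypes)
```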
@simonjayhawkins added the Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), and Timezones (Timezone data dtype) labels on Jun 28, 2020
@simonjayhawkins (Member, Author):

Linux py37_np_dev test failures are unrelated; xref #35041

expected["a"] = expected["a"].astype(float)
expected["b"] = expected["b"].astype(float)
expected["date"] = pd.to_datetime(expected["date"])
Contributor:

can you just fix `expected` to do this directly?
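One hedged sketch of fixing `expected` directly (values taken from the snippet above; not necessarily the PR's final test code) is to construct each column with its target dtype up front:

```python
import numpy as np
import pandas as pd

date = pd.Timestamp("2018-10-24 07:30:00", tz="UTC")

# Build `expected` with the intended dtypes directly, instead of casting
# object columns afterwards with astype/to_datetime.
expected = pd.DataFrame(
    {
        "c": pd.Series([np.nan], dtype=object),
        "d": pd.Series([np.nan], dtype=object),
        "a": [1.0],
        "b": [2.0],
        "date": pd.Series([date]),  # inferred as datetime64[ns, UTC]
    }
)
print(expected.dtypes)
```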

Member Author:

ahh, the comment `# These columns get cast to object after append` now makes more sense.

expected["a"] = expected["a"].astype(float)
expected["b"] = expected["b"].astype(float)
expected["date"] = pd.to_datetime(expected["date"])
Contributor:

can you add tests for empty EA types (period, interval, categorical)? we might already have them, but this seems like a good place to put them.

Member Author:

as additional values in the Series? a scalar Period and a scalar Interval. not sure what to do for categorical.

Member Author:

doesn't work for interval or period; will open an issue. This was fixed for `datetime64[ns, tzutc()]` 'by accident' when ensuring consistency between np.concatenate and concat_compat; see #35032 (comment)

Member Author:

it's a different issue for period and interval, since concatenating a list of Series doesn't work either, whereas tz-aware datetime works on master. see OP.

on master

>>> interval = pd.Interval(0, 1, closed="right")
>>> interval
Interval(0, 1, closed='right')
>>>
>>> period = pd.Period("2012", freq="A-DEC")
>>> period
Period('2012', 'A-DEC')
>>>
>>> s = pd.Series({"period": period, "interval": interval})
>>> s
period        2012
interval    (0, 1]
dtype: object
>>>
>>> df = pd.DataFrame(columns=["c"])
>>> df
Empty DataFrame
Columns: [c]
Index: []
>>>
>>> result = df.append([s, s], ignore_index=True)
>>> result
     c period interval
0  NaN   2012   (0, 1]
1  NaN   2012   (0, 1]
>>>
>>> result.dtypes
c           object
period      object
interval    object
dtype: object
>>>
>>> df = pd.DataFrame([s, s])
>>> df
  period interval
0   2012   (0, 1]
1   2012   (0, 1]
>>>
>>> df.dtypes
period        period[A-DEC]
interval    interval[int64]
dtype: object
>>>

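The inference difference can be reproduced at the Series level; a small sketch (using freq="Y" rather than the older "A-DEC" alias):

```python
import pandas as pd

period = pd.Period("2012", freq="Y")
interval = pd.Interval(0, 1, closed="right")

# Homogeneous scalars infer the extension dtype at construction time...
assert isinstance(pd.Series([period, period]).dtype, pd.PeriodDtype)
assert isinstance(pd.Series([interval, interval]).dtype, pd.IntervalDtype)

# ...but a mixed-type Series stays object, which is why appending the
# mixed Series `s` leaves the period/interval columns as object dtype.
s = pd.Series({"period": period, "interval": interval})
assert s.dtype == object
```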
Member Author:

for period and interval, the issue is not limited to an empty dataframe

master

>>> df = pd.DataFrame([1, 2], columns=["c"])
>>> df
   c
0  1
1  2
>>>
>>> result = df.append([s], ignore_index=True)
>>> result
     c period interval
0  1.0    NaN      NaN
1  2.0    NaN      NaN
2  NaN   2012   (0, 1]
>>>
>>> result.dtypes
c           float64
period       object
interval     object
dtype: object
>>>
>>> result = df.append([s, s], ignore_index=True)
>>> result
     c period interval
0  1.0    NaN      NaN
1  2.0    NaN      NaN
2  NaN   2012   (0, 1]
3  NaN   2012   (0, 1]
>>>
>>> result.dtypes
c           float64
period       object
interval     object
dtype: object
>>>

will open an issue.

don't think that's necessary; I think it's covered by #22994 and #22957 and others linked in those issues.

Contributor:

kk, yeah, sometimes we have scattered tests and it's hard to tell whether we have exhaustive cases.

[[np.nan, np.nan, 1.0, 2.0, date]],
columns=["c", "d", "a", "b", "date"],
dtype=object,
[[np.nan, np.nan, 1.0, 2.0, date]], columns=["c", "d", "a", "b", "date"]
Contributor:

does this fully replicate the OP test?

Member Author:

extended test
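For context, a hedged sketch of what the two variants in the test fragment above construct; dropping `dtype=object` lets the constructor infer the tz-aware dtype for "date":

```python
import numpy as np
import pandas as pd

date = pd.Timestamp("2018-10-24 07:30:00", tz="UTC")

# With dtype=object, every column (including "date") stays object and
# needs the astype/to_datetime fix-ups shown earlier.
as_object = pd.DataFrame(
    [[np.nan, np.nan, 1.0, 2.0, date]],
    columns=["c", "d", "a", "b", "date"],
    dtype=object,
)

# Without it, the constructor infers float64 for "a"/"b" and a tz-aware
# datetime dtype for "date".
inferred = pd.DataFrame(
    [[np.nan, np.nan, 1.0, 2.0, date]], columns=["c", "d", "a", "b", "date"]
)
print(as_object.dtypes)
print(inferred.dtypes)
```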

@jreback (Contributor) left a comment:

does this need a whatsnew note? was this a change from 1.0.x?

@simonjayhawkins (Member, Author):

I didn't add one since no issue had been raised. Shall I add a release note and reference this PR?

@jreback (Contributor) commented Jun 30, 2020:

> I didn't add one since no issue had been raised. Shall I add a release note and reference this PR?

yep that's good (and generally policy)

@simonjayhawkins (Member, Author):

release note added

@simonjayhawkins (Member, Author):

@jreback anything more to do here?

as an aside, and not wanting to weigh in on the 2d EA debate: in order to have a Categorical scalar, would we also need a 0d EA? I wondered this before when working on #33846, where IIRC I thought at the time that the special casing would not have been necessary with a 0d EA. Also, while working on #35032, I think it is starting to make sense to not restrict EAs to 1d.

@jreback jreback added this to the 1.1 milestone Jul 16, 2020
@jreback jreback merged commit 6509028 into pandas-dev:master Jul 16, 2020
@jreback (Contributor) commented Jul 16, 2020:

thanks @simonjayhawkins

@jreback (Contributor) commented Jul 16, 2020:

> @jreback anything more to do here?
>
> as an aside, and not wanting to weigh in on the 2d EA debate: in order to have a Categorical scalar, would we also need a 0d EA? I wondered this before when working on #33846, where IIRC I thought at the time that the special casing would not have been necessary with a 0d EA. Also, while working on #35032, I think it is starting to make sense to not restrict EAs to 1d.

can of worms. well, I don't actually think you need a 0-d categorical scalar; a plain scalar will do. of course, then you cannot infer this to a categorical without context, but that's ok.

@@ -152,11 +152,11 @@ def is_nonempty(x) -> bool:
target_dtype = find_common_type([x.dtype for x in to_concat])
to_concat = [_cast_to_common_type(arr, target_dtype) for arr in to_concat]

if isinstance(to_concat[0], ExtensionArray):
if isinstance(to_concat[0], ExtensionArray) and axis == 0:
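One possible reading, as a sketch (my illustration, not the PR's code): ExtensionArrays are one-dimensional, so the EA shortcut can only be valid when concatenating along axis 0. `_concat_same_type` below is part of the documented EA subclass interface:

```python
import pandas as pd

# ExtensionArrays have ndim == 1, so there is no axis=1 to concatenate
# along at the array level; column-wise concat has to go through 2-D blocks.
arr = pd.array(["2018-10-24 07:30:00"], dtype="datetime64[ns, UTC]")
assert arr.ndim == 1

# Axis-0 concatenation can stay within the extension type:
combined = type(arr)._concat_same_type([arr, arr])
assert len(combined) == 2
assert combined.dtype == arr.dtype
```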
Member:

@simonjayhawkins do you remember why it was needed to add this `axis == 0`?

