DF.__setitem__ creates extension column when given extension scalar #34875

Merged
merged 42 commits into from
Jul 11, 2020
Changes from 20 commits
Commits
0ec5911
Bugfix to make DF.__setitem__ create extension column instead of obje…
justinessert Jun 19, 2020
9336955
removed bad whitespace
justinessert Jun 19, 2020
01fb076
Apply suggestions from code review
justinessert Jun 22, 2020
5c8b356
added missing :
justinessert Jun 22, 2020
2c1f640
modified cast_extension_scalar_to_array test to include an Interval type
justinessert Jun 22, 2020
d509bf4
added user-facing test for extension type bug
justinessert Jun 22, 2020
e231bb1
fixed pep8 issues
justinessert Jun 22, 2020
18ed043
added note about bug in setting series to scalar extension type
justinessert Jun 22, 2020
a6b18f4
corrected order of imports
justinessert Jun 22, 2020
cbc29be
corrected order of imports
justinessert Jun 22, 2020
2f79822
fixed black formatting errors
justinessert Jun 22, 2020
0f9178e
removed extra comma
justinessert Jun 22, 2020
bfa18fb
updated cast_scalar_to_arr to support tuple shape for extension dtype
justinessert Jun 23, 2020
e7e9a48
removed unneeded code
justinessert Jun 23, 2020
291eb2d
added coverage for datetime with timezone in extension_array test
justinessert Jun 23, 2020
3a788ed
added TODO
justinessert Jun 23, 2020
38d7ce5
correct line that was too long
justinessert Jun 23, 2020
a5e8df5
fixed dtype issue with tz test
justinessert Jun 23, 2020
5e439bd
creating distinct arrays for each column
justinessert Jun 24, 2020
6cc7959
resolving mypy error
justinessert Jun 24, 2020
7e27a6e
added docstring info and test
justinessert Jun 24, 2020
90a8570
removed unneeded import
justinessert Jun 24, 2020
39b2984
flattened else case in init
justinessert Jun 26, 2020
7a01041
refactored extension type column fix
justinessert Jun 26, 2020
03e528b
reverted docstring changes
justinessert Jun 26, 2020
7bb9553
reverted docstring changes
justinessert Jun 26, 2020
a3be9a6
removed unneeded imports
justinessert Jun 26, 2020
3a92164
reverted test changes
justinessert Jun 26, 2020
c93a847
fixed construct_1d_arraylike bug
justinessert Jun 26, 2020
966283a
reorganized if statements
justinessert Jun 30, 2020
f2aea7b
moved what's new statement to correct file
justinessert Jun 30, 2020
6495a36
created new test for period df construction
justinessert Jun 30, 2020
42e7afa
added assert_frame_equal to period_data test
justinessert Jun 30, 2020
8343df3
Using pandas array instead of df constructor for better test
justinessert Jul 7, 2020
a50a42c
changed wording
justinessert Jul 7, 2020
3452c20
Merge branch 'master' of https://github.com/justinessert/pandas
justinessert Jul 7, 2020
6f3fb51
pylint fixes
justinessert Jul 7, 2020
b95cdfc
parameterized test and added comment
justinessert Jul 8, 2020
6830fde
removed extra comma
justinessert Jul 8, 2020
6653ef8
Merge branch 'master' into master
justinessert Jul 10, 2020
c73a2de
parameterized test
justinessert Jul 10, 2020
100f334
renamed test
justinessert Jul 10, 2020
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.0.0.rst
@@ -1259,6 +1259,7 @@ ExtensionArray
 - Bug in :class:`arrays.PandasArray` when setting a scalar string (:issue:`28118`, :issue:`28150`).
 - Bug where nullable integers could not be compared to strings (:issue:`28930`)
 - Bug where :class:`DataFrame` constructor raised ``ValueError`` with list-like data and ``dtype`` specified (:issue:`30280`)
+- Bug where :class:`Series` set to scalar extension type was considered an object type rather than the extension type (:issue:`34832`)
Contributor

Since we had to change some other tests, I think we need to break this out into a new section and show the changes from before. E.g., is construction of a multi-column df now object if we don't have uniform datetimes? (We need to be very clear about what the change is here, since we had to change some tests.)

Contributor Author

I realized that I added this to the wrong file, so I'm moving this addition to v1.1.0.rst. But can you clarify what you would like me to do here? I'm not sure from your comment.

I think that this line correctly describes the change. Are you asking me to also include an example, such as:

# This used to create an object type column, now it creates a Period type column
DataFrame(index=[0, 1], columns=["a"], data=pd.Period("2020-01"))

The example you gave, where a datetime column has multiple different time zones, was always an object column.
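
The two cases discussed in this comment can be checked directly. This is a minimal sketch, assuming a pandas release that includes this change (v1.1.0 or later, going by the whatsnew file named above); the variable names are illustrative only.

```python
import pandas as pd

# A scalar extension value broadcast by the constructor: per this PR,
# the column should take the Period dtype instead of object.
df = pd.DataFrame(index=[0, 1], columns=["a"], data=pd.Period("2020-01"))
print(df["a"].dtype)

# Mixed time zones in one column cannot share a single tz-aware dtype,
# so such a column is stored as object.
mixed = pd.Series(
    [
        pd.Timestamp("2020-01-01", tz="UTC"),
        pd.Timestamp("2020-01-01", tz="US/Eastern"),
    ]
)
print(mixed.dtype)
```

The second case is included only to show that it is separate from the scalar-broadcast path that this PR touches.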

Contributor

OK, tell you what: let's create an issue to refactor this (the frame constructor). That's what I mainly have a problem with; we are adding if/then branches all over the place for extension types rather than doing a proper refactor.

So I'm OK with merging this (with just one small change to the naming in the tests). Please create an issue as a follow-up (and if you want to and can do the refactor, that would be great).


Other
^^^^^
12 changes: 11 additions & 1 deletion pandas/core/dtypes/cast.py
@@ -1505,10 +1505,20 @@ def cast_scalar_to_array(shape, value, dtype: Optional[DtypeObj] = None) -> np.n

     """
     if dtype is None:
-        dtype, fill_value = infer_dtype_from_scalar(value)
+        dtype, fill_value = infer_dtype_from_scalar(value, pandas_dtype=True)
     else:
         fill_value = value

+    # TODO: Update this function to add support for 3rd party extension types
+    # Issue #34959
+    if is_extension_array_dtype(dtype):
+        if isinstance(shape, int):
+            shape = (shape, 1)
+        return [
+            construct_1d_arraylike_from_scalar(value, shape[0], dtype)
+            for _ in range(shape[1])
+        ]
+
     values = np.empty(shape, dtype=dtype)
     values.fill(fill_value)

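
The extension-dtype branch in this hunk can be mimicked without pandas. The following is a toy sketch only: `cast_scalar_to_columns` is a hypothetical stand-in, and plain Python lists play the role of extension arrays. It illustrates the two things the hunk relies on: an integer `shape` is treated as a single column, and every column gets its own freshly built container.

```python
from typing import Any, List, Tuple, Union


def cast_scalar_to_columns(
    shape: Union[int, Tuple[int, int]], value: Any
) -> List[List[Any]]:
    """Hypothetical stand-in for the extension branch of cast_scalar_to_array."""
    # An int shape means "n rows, one column", mirroring the isinstance check.
    if isinstance(shape, int):
        shape = (shape, 1)
    n_rows, n_cols = shape
    # Build a separate container per column so columns never share storage.
    return [[value] * n_rows for _ in range(n_cols)]


single = cast_scalar_to_columns(3, "2020-01")
grid = cast_scalar_to_columns((3, 2), "2020-01")
print(len(single), len(single[0]))
print(len(grid), grid[0] is grid[1])
```

Returning a list rather than a 2-D array means callers have to handle two return types, which is the branching the frame.py changes in this PR add.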
18 changes: 15 additions & 3 deletions pandas/core/frame.py
@@ -76,6 +76,7 @@
     cast_scalar_to_array,
     coerce_to_dtypes,
     find_common_type,
+    infer_dtype_from_array,
     infer_dtype_from_scalar,
     invalidate_string_dtypes,
     maybe_cast_to_datetime,
@@ -528,9 +529,15 @@ def __init__(
                 values = cast_scalar_to_array(
                     (len(index), len(columns)), data, dtype=dtype
                 )
-                mgr = init_ndarray(
-                    values, index, columns, dtype=values.dtype, copy=False
-                )
+                if isinstance(values, list):
+                    # Case 1: values is a list of extension arrays
+                    dtype, _ = infer_dtype_from_array(values[0], pandas_dtype=True)
+                    mgr = arrays_to_mgr(values, columns, index, columns, dtype=dtype)
+                else:
+                    # Case 2: values is a numpy array
+                    mgr = init_ndarray(
+                        values, index, columns, dtype=values.dtype, copy=False
+                    )
             else:
                 raise ValueError("DataFrame constructor not properly called!")

@@ -3731,6 +3738,11 @@ def reindexer(value):

             # upcast
             value = cast_scalar_to_array(len(self.index), value)
+
+            # if extension dtype, value will be a list of length 1
+            if isinstance(value, list):
+                value = value[0]
+
             value = maybe_cast_to_datetime(value, infer_dtype)

         # return internal types directly
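
From the user's side, the constructor branch above should mean that a scalar extension value broadcast over several columns gives each column its own extension array. This is a hedged sketch, assuming a pandas release that includes this change; the column names and values are made up for illustration.

```python
import pandas as pd

# Scalar Interval broadcast over two columns by the constructor.
df = pd.DataFrame(index=[0, 1, 2], columns=["a", "b"], data=pd.Interval(0, 5))
print(df.dtypes)

# Overwrite one cell in "a"; "b" should be unaffected if the columns
# are backed by separate arrays.
df.loc[0, "a"] = pd.Interval(1, 2)
print(df)
```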
28 changes: 26 additions & 2 deletions pandas/tests/dtypes/cast/test_infer_dtype.py
@@ -9,6 +9,7 @@
     infer_dtype_from_scalar,
 )
 from pandas.core.dtypes.common import is_dtype_equal
+from pandas.core.dtypes.dtypes import DatetimeTZDtype, IntervalDtype, PeriodDtype

 from pandas import (
     Categorical,
@@ -187,14 +188,37 @@ def test_infer_dtype_from_array(arr, expected, pandas_dtype):
         (1.1, np.float64),
         (Timestamp("2011-01-01"), "datetime64[ns]"),
         (Timestamp("2011-01-01", tz="US/Eastern"), object),
-        (Period("2011-01-01", freq="D"), object),
     ],
 )
-def test_cast_scalar_to_array(obj, dtype):
+def test_cast_scalar_to_numpy_array(obj, dtype):
     shape = (3, 2)

     exp = np.empty(shape, dtype=dtype)
     exp.fill(obj)

     arr = cast_scalar_to_array(shape, obj, dtype=dtype)
     tm.assert_numpy_array_equal(arr, exp)
+
+
+@pytest.mark.parametrize(
+    "obj,dtype",
+    [
+        (Period("2011-01-01", freq="D"), PeriodDtype("D")),
+        (Interval(left=0, right=5), IntervalDtype("int64")),
+        (
+            Timestamp("2011-01-01", tz="US/Eastern"),
+            DatetimeTZDtype(unit="ns", tz="US/Eastern"),
+        ),
+    ],
+)
+def test_cast_scalar_to_extension_array(obj, dtype):
+    # GH: 34832
+    shape = 3
+
+    exp = dtype.construct_array_type()._from_sequence([obj] * shape)
+
+    arr = cast_scalar_to_array(shape, obj, dtype=dtype)
+    tm.assert_extension_array_equal(arr[0], exp)
+
+    arr = cast_scalar_to_array(shape, obj, dtype=None)
+    tm.assert_extension_array_equal(arr[0], exp)
32 changes: 31 additions & 1 deletion pandas/tests/frame/indexing/test_setitem.py
@@ -1,7 +1,18 @@
 import numpy as np
 import pytest

-from pandas import Categorical, DataFrame, Index, Series, Timestamp, date_range
+from pandas.core.dtypes.dtypes import IntervalDtype, PeriodDtype
+
+from pandas import (
+    Categorical,
+    DataFrame,
+    Index,
+    Interval,
+    Period,
+    Series,
+    Timestamp,
+    date_range,
+)
 import pandas._testing as tm
 from pandas.core.arrays import SparseArray

@@ -150,3 +161,22 @@ def test_setitem_dict_preserves_dtypes(self):
                 "c": float(b),
             }
         tm.assert_frame_equal(df, expected)
+
+    def test_setitem_extension_types(self):
+        # GH: 34832
+        period_val = Period("2020-01")
+        interval_val = Interval(left=0, right=5)
+
+        expected = DataFrame(
+            {
+                "idx": [1, 2, 3],
+                "period": Series([period_val] * 3, dtype=PeriodDtype("M")),
+                "interval": Series([interval_val] * 3, dtype=IntervalDtype("int64")),
+            }
+        )
+
+        df = DataFrame({"idx": [1, 2, 3]})
+        df["period"] = period_val
+        df["interval"] = interval_val
+
+        tm.assert_frame_equal(df, expected)
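
Outside the test suite, the behaviour exercised by `test_setitem_extension_types` can be reproduced with the public API alone. This is a small sketch, assuming a pandas release that includes this change; it inspects the resulting dtypes rather than comparing whole frames.

```python
import pandas as pd

df = pd.DataFrame({"idx": [1, 2, 3]})

# Assigning a scalar broadcasts it down the new column; with this fix the
# column should carry the scalar's extension dtype rather than object.
df["period"] = pd.Period("2020-01")
df["interval"] = pd.Interval(left=0, right=5)

print(df.dtypes)
```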
7 changes: 6 additions & 1 deletion pandas/tests/frame/methods/test_combine_first.py
@@ -199,12 +199,14 @@ def test_combine_first_timezone(self):
             columns=["UTCdatetime", "abc"],
             data=data1,
             index=pd.date_range("20140627", periods=1),
+            dtype="object",
         )
         data2 = pd.to_datetime("20121212 12:12").tz_localize("UTC")
         df2 = pd.DataFrame(
             columns=["UTCdatetime", "xyz"],
             data=data2,
             index=pd.date_range("20140628", periods=1),
+            dtype="object",
         )
         res = df2[["UTCdatetime"]].combine_first(df1)
         exp = pd.DataFrame(
@@ -217,10 +219,13 @@ def test_combine_first_timezone(self):
             },
             columns=["UTCdatetime", "abc"],
             index=pd.date_range("20140627", periods=2, freq="D"),
+            dtype="object",
         )
-        tm.assert_frame_equal(res, exp)
         assert res["UTCdatetime"].dtype == "datetime64[ns, UTC]"
         assert res["abc"].dtype == "datetime64[ns, UTC]"
+        # GH Issue 7509
+        res = res.astype("object")
+        tm.assert_frame_equal(res, exp)

         # see gh-10567
         dts1 = pd.date_range("2015-01-01", "2015-01-05", tz="UTC")
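
The `dtype="object"` arguments added in this hunk suggest that, at this revision, a tz-aware scalar passed to the constructor no longer yields object columns by default. The sketch below contrasts the two calls, assuming a pandas release that includes this change. The frame is a simplified stand-in for the test data, and the dtypes are printed rather than asserted here because the exact datetime unit may differ between pandas versions.

```python
import pandas as pd

# Hypothetical single-row frame, similar in shape to the test data above.
stamp = pd.Timestamp("2012-12-12 12:12", tz="UTC")
index = pd.date_range("20140628", periods=1)

# No dtype given: the tz-aware scalar is expected to yield tz-aware columns.
inferred = pd.DataFrame(columns=["UTCdatetime", "xyz"], data=stamp, index=index)

# Explicit dtype="object", as the test now passes, keeps plain object columns.
as_object = pd.DataFrame(
    columns=["UTCdatetime", "xyz"], data=stamp, index=index, dtype="object"
)

print(inferred.dtypes)
print(as_object.dtypes)
```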