Skip to content

BUG: preserve object-dtype index when accessing DataFrame column / PERF: improve perf of Series fastpath constructor #42950

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 12 additions & 8 deletions asv_bench/benchmarks/series_methods.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
import numpy as np

from pandas import (
Index,
NaT,
Series,
date_range,
Expand All @@ -12,20 +13,23 @@


class SeriesConstructor:

params = [None, "dict"]
param_names = ["data"]

def setup(self, data):
def setup(self):
self.idx = date_range(
start=datetime(2015, 10, 26), end=datetime(2016, 1, 1), freq="50s"
)
dict_data = dict(zip(self.idx, range(len(self.idx))))
self.data = None if data is None else dict_data
self.data = dict(zip(self.idx, range(len(self.idx))))
self.array = np.array([1, 2, 3])
self.idx2 = Index(["a", "b", "c"])

def time_constructor(self, data):
def time_constructor_dict(self):
Series(data=self.data, index=self.idx)

def time_constructor_no_data(self):
Series(data=None, index=self.idx)

def time_constructor_fastpath(self):
Series(self.array, index=self.idx, name="name", fastpath=True)


class NSort:

Expand Down
40 changes: 22 additions & 18 deletions pandas/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -441,7 +441,11 @@ def __init__(
data = SingleArrayManager.from_array(data, index)

generic.NDFrame.__init__(self, data)
self.name = name
if fastpath:
# skips validation of the name
object.__setattr__(self, "_name", name)
else:
self.name = name
self._set_axis(0, index, fastpath=True)

def _init_dict(self, data, index=None, dtype: Dtype | None = None):
Expand Down Expand Up @@ -531,23 +535,23 @@ def _set_axis(self, axis: int, labels, fastpath: bool = False) -> None:
if not fastpath:
labels = ensure_index(labels)

if labels._is_all_dates:
deep_labels = labels
if isinstance(labels, CategoricalIndex):
deep_labels = labels.categories

if not isinstance(
deep_labels, (DatetimeIndex, PeriodIndex, TimedeltaIndex)
):
try:
labels = DatetimeIndex(labels)
# need to set here because we changed the index
if fastpath:
self._mgr.set_axis(axis, labels)
except (tslibs.OutOfBoundsDatetime, ValueError):
# labels may exceeds datetime bounds,
# or not be a DatetimeIndex
pass
if labels._is_all_dates:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently this is being relied upon in the pytables code though ..

And looking into it further, the change I made here of course changes behaviour, as it no longer try to infer object dtype as datetime in the fastpath.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although that should probably be considered as a bug fix. With master:

In [1]: df = pd.DataFrame({'a': [1, 2, 3]}, index=pd.date_range("2012", periods=3).astype(object))

In [2]: df.index
Out[2]: Index([2012-01-01 00:00:00, 2012-01-02 00:00:00, 2012-01-03 00:00:00], dtype='object')

In [3]: df['a'].index
Out[3]: DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03'], dtype='datetime64[ns]', freq=None)

where the index of the column-as-Series becomes different ..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC we deprecated this, then un-deprecated bc we didn't have good warning handling/suppression, and were going to try to re-deprecate before long

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a reference (issue/PR) for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is #36697, but that's for the non-fastpath way, I think? While here it's specifically the fastpath that I am changing (which is not used when a user creates a Series directly themselves).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we did that i dont remember any distinction being made between fastpath vs non-fastpath

deep_labels = labels
if isinstance(labels, CategoricalIndex):
deep_labels = labels.categories

if not isinstance(
deep_labels, (DatetimeIndex, PeriodIndex, TimedeltaIndex)
):
try:
labels = DatetimeIndex(labels)
# need to set here because we changed the index
if fastpath:
self._mgr.set_axis(axis, labels)
except (tslibs.OutOfBoundsDatetime, ValueError):
# labels may exceeds datetime bounds,
# or not be a DatetimeIndex
pass

object.__setattr__(self, "_index", labels)
if not fastpath:
Expand Down