Skip to content

REGR: Series.__repr__ is broken for SparseDtype("datetime64[ns]") #35843

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
dsaxton opened this issue Aug 21, 2020 · 17 comments · Fixed by #38332
Closed

REGR: Series.__repr__ is broken for SparseDtype("datetime64[ns]") #35843

dsaxton opened this issue Aug 21, 2020 · 17 comments · Fixed by #38332
Labels
Datetime Datetime data dtype Output-Formatting __repr__ of pandas objects, to_string Regression Functionality that used to work in a prior pandas version Sparse Sparse Data Type
Milestone

Comments

@dsaxton
Copy link
Member

dsaxton commented Aug 21, 2020

Regression due to 1fa0635 :

import pandas as pd

print(pd.__version__)

values = [pd.Timestamp('2012-05-01T01:00:00.000000'), pd.Timestamp('2016-05-01T01:00:00.000000')]
arr = pd.arrays.SparseArray(values)
ser = pd.Series(arr)
ser
1.2.0.dev0+147.g07983803b
Out[1]: ---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/opt/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/opt/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    392                         if cls is not object \
    393                                 and callable(cls.__dict__.get('__repr__')):
--> 394                             return _repr_pprint(obj, self, cycle)
    395
    396             return _default_pprint(obj, self, cycle)

~/opt/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    698     """A pprint that just redirects to the normal repr function."""
    699     # Find newlines and replace them with p.break_()
--> 700     output = repr(obj)
    701     lines = output.splitlines()
    702     with p.group():

~/pandas/pandas/core/series.py in __repr__(self)
   1315         show_dimensions = get_option("display.show_dimensions")
   1316
-> 1317         self.to_string(
   1318             buf=buf,
   1319             name=self.name,

~/pandas/pandas/core/series.py in to_string(self, buf, na_rep, float_format, header, index, length, dtype, name, max_rows, min_rows)
   1386             max_rows=max_rows,
   1387         )
-> 1388         result = formatter.to_string()
   1389
   1390         # catch contract violations

~/pandas/pandas/io/formats/format.py in to_string(self)
    356
    357         fmt_index, have_header = self._get_formatted_index()
--> 358         fmt_values = self._get_formatted_values()
    359
    360         if self.truncate_v:

~/pandas/pandas/io/formats/format.py in _get_formatted_values(self)
    341
    342     def _get_formatted_values(self) -> List[str]:
--> 343         return format_array(
    344             self.tr_series._values,
    345             None,

~/pandas/pandas/io/formats/format.py in format_array(values, formatter, float_format, na_rep, digits, space, justify, decimal, leading_space, quoting)
   1179     )
   1180
-> 1181     return fmt_obj.get_result()
   1182
   1183

~/pandas/pandas/io/formats/format.py in get_result(self)
   1210
   1211     def get_result(self) -> List[str]:
-> 1212         fmt_values = self._format_strings()
   1213         return _make_fixed_width(fmt_values, self.justify)
   1214

~/pandas/pandas/io/formats/format.py in _format_strings(self)
   1467
   1468         if not isinstance(values, DatetimeIndex):
-> 1469             values = DatetimeIndex(values)
   1470
   1471         if self.formatter is not None and callable(self.formatter):

~/pandas/pandas/core/indexes/datetimes.py in __new__(cls, data, freq, tz, normalize, closed, ambiguous, dayfirst, yearfirst, dtype, copy, name)
    269         name = maybe_extract_name(name, data, cls)
    270
--> 271         dtarr = DatetimeArray._from_sequence(
    272             data,
    273             dtype=dtype,

~/pandas/pandas/core/arrays/datetimes.py in _from_sequence(cls, data, dtype, copy, tz, freq, dayfirst, yearfirst, ambiguous)
    314         freq, freq_infer = dtl.maybe_infer_freq(freq)
    315
--> 316         subarr, tz, inferred_freq = sequence_to_dt64ns(
    317             data,
    318             dtype=dtype,

~/pandas/pandas/core/arrays/datetimes.py in sequence_to_dt64ns(data, dtype, copy, tz, dayfirst, yearfirst, ambiguous)
   1957     if is_datetime64tz_dtype(data_dtype):
   1958         # DatetimeArray -> ndarray
-> 1959         tz = _maybe_infer_tz(tz, data.tz)
   1960         result = data._data
   1961

AttributeError: 'SparseArray' object has no attribute 'tz'

xref #35762

@dsaxton dsaxton added Datetime Datetime data dtype Regression Functionality that used to work in a prior pandas version Sparse Sparse Data Type labels Aug 21, 2020
@dsaxton dsaxton added this to the 1.1.2 milestone Aug 21, 2020
@simonjayhawkins simonjayhawkins added the Output-Formatting __repr__ of pandas objects, to_string label Aug 24, 2020
@simonjayhawkins
Copy link
Member

cc @jbrockmendel

@jbrockmendel
Copy link
Member

jbrockmendel commented Aug 24, 2020

A few possible fixes:

  1. revert the dtype.kind == "M" check
  2. change SparseDtype.kind to something other than "M" in this case
  3. fix later-in-the-process code to accept the SparseDtype.

\1) is certainly the easiest change; 3) might be nice to get in place longer term

@dsaxton
Copy link
Member Author

dsaxton commented Aug 24, 2020

A few possible fixes:

  1. revert the dtype.kind == "M" check
  2. change SparseDtype.kind to something other than "M" in this case
  3. fix later-in-the-process code to accept the SparseDtype.
  4. is certainly the easiest change; 3) might be nice to get in place longer term

Option 2 seems to be along the lines of @jorisvandenbossche suggestion of potentially treating sparse dtypes as "just sparse" (a similar question came up in regard to problems with considering sparse categorical to be categorical). You would presumably still have SparseDtype.subtype.kind == "M" if it's needed.

@simonjayhawkins simonjayhawkins modified the milestones: 1.1.2, 1.1.3 Sep 7, 2020
@simonjayhawkins
Copy link
Member

moved off 1.1.2 milestone (scheduled for this week) as no PRs to fix in the pipeline

@simonjayhawkins
Copy link
Member

Regression due to 1fa0635 :

#33400

@simonjayhawkins simonjayhawkins modified the milestones: 1.1.3, 1.1.4 Oct 5, 2020
@simonjayhawkins
Copy link
Member

moved off 1.1.3 milestone (overdue) as no PRs to fix in the pipeline

@simonjayhawkins simonjayhawkins modified the milestones: 1.1.4, 1.1.5 Oct 29, 2020
@simonjayhawkins
Copy link
Member

moved off 1.1.4 milestone (scheduled for release tomorrow) as no PRs to fix in the pipeline

@jreback jreback modified the milestones: 1.1.5, Contributions Welcome Nov 25, 2020
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Dec 3, 2020
@simonjayhawkins
Copy link
Member

A few possible fixes:

  1. revert the dtype.kind == "M" check
  2. change SparseDtype.kind to something other than "M" in this case
  3. fix later-in-the-process code to accept the SparseDtype.
  4. is certainly the easiest change; 3) might be nice to get in place longer term

Option 2 seems to be along the lines of @jorisvandenbossche suggestion of potentially treating sparse dtypes as "just sparse" (a similar question came up in regard to problems with considering sparse categorical to be categorical). You would presumably still have SparseDtype.subtype.kind == "M" if it's needed.

we want to avoid undoing the performance gains, so ideally we don't want to revert #33400

option 2 does not sound like the correct solution for a regression fix, as SparseDtype.kind was also 'M' on 1.0.5, and changing that may introduce other regressions, but agree that we should perhaps move in that direction on master.

on 1.0.5 both is_datetime64_dtype and is_datetime64_any_dtype were True for both the array and dtype before the fastpaths were added.

However, is_datetime64tz_dtype was False for both the array and dtype on 1.0.5, but for 1.1.4 is still False for the array, but now True for the dtype.

on 1.1.4

>>> pd.__version__
'1.1.4'
>>> values = [
...     pd.Timestamp("2012-05-01T01:00:00.000000"),
...     pd.Timestamp("2016-05-01T01:00:00.000000"),
... ]
>>> arr = pd.arrays.SparseArray(values)
>>> arr
[2012-05-01 01:00:00, 2016-05-01 01:00:00]
Fill: NaT
IntIndex
Indices: array([0, 1])

>>> arr.kind
'integer'
>>> arr.dtype.kind
'M'
>>> pd.core.dtypes.common.is_datetime64_dtype(arr)
True
>>> pd.core.dtypes.common.is_datetime64_dtype(arr.dtype)
True
>>> pd.core.dtypes.common.is_datetime64_any_dtype(arr)
True
>>> pd.core.dtypes.common.is_datetime64_any_dtype(arr.dtype)
True
>>> pd.core.dtypes.common.is_datetime64tz_dtype(arr)
False
>>> pd.core.dtypes.common.is_datetime64tz_dtype(arr.dtype)
True

on 1.0.5

>>> pd.__version__
'1.0.5'
>>>
>>> values = [
...     pd.Timestamp("2012-05-01T01:00:00.000000"),
...     pd.Timestamp("2016-05-01T01:00:00.000000"),
... ]
>>> arr = pd.arrays.SparseArray(values)
>>> arr
[2012-05-01 01:00:00, 2016-05-01 01:00:00]
Fill: NaT
IntIndex
Indices: array([0, 1])

>>> arr.kind
'integer'
>>> arr.dtype.kind
'M'
>>> pd.core.dtypes.common.is_datetime64_dtype(arr)
True
>>> pd.core.dtypes.common.is_datetime64_dtype(arr.dtype)
True
>>> pd.core.dtypes.common.is_datetime64_any_dtype(arr)
True
>>> pd.core.dtypes.common.is_datetime64_any_dtype(arr.dtype)
True
>>> pd.core.dtypes.common.is_datetime64tz_dtype(arr)
False
>>> pd.core.dtypes.common.is_datetime64tz_dtype(arr.dtype)
False

so it appears that only is_datetime64tz_dtype needs to be changed.

@simonjayhawkins
Copy link
Member

so it appears that only is_datetime64tz_dtype needs to be changed.

so we could just remove

    if isinstance(arr_or_dtype, ExtensionDtype):
        # GH#33400 fastpath for dtype object
        return arr_or_dtype.kind == "M"

from is_datetime64tz_dtype and the test suite will pass and this issue would be fixed.

since this pattern (for EA checks) is also added for is_period_dtype, is_interval_dtype and is_categorical_dtype then I suspect there would still be latent bugs with SparseArray with these subtypes if these are not changed also.

@simonjayhawkins
Copy link
Member

In the formatting, rightly or wrongly, the SparseArray is converted to a DatetimeIndex and this is where the failure occurs.

so the underlying issue here is that creating an Index from a SparseDtype("datetime64[ns]") is broken from 1.0.5 to master.

on master

>>> pd.__version__
'1.2.0.dev0+1470.g22dbef1c71'
>>>
>>> values = [
...     pd.Timestamp("2012-05-01T01:00:00.000000"),
...     pd.Timestamp("2016-05-01T01:00:00.000000"),
... ]
>>> arr = pd.arrays.SparseArray(values)
>>> pd.Index(arr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\simon\pandas\pandas\core\indexes\base.py", line 295, in __new__
    return _maybe_asobject(dtype, DatetimeIndex, data, copy, name, **kwargs)
  File "C:\Users\simon\pandas\pandas\core\indexes\base.py", line 6168, in _maybe_asobject
    return klass(data, dtype=dtype, copy=copy, name=name, **kwargs)
  File "C:\Users\simon\pandas\pandas\core\indexes\datetimes.py", line 307, in __new__
    dtarr = DatetimeArray._from_sequence_not_strict(
  File "C:\Users\simon\pandas\pandas\core\arrays\datetimes.py", line 325, in _from_sequence_not
_strict
    subarr, tz, inferred_freq = sequence_to_dt64ns(
  File "C:\Users\simon\pandas\pandas\core\arrays\datetimes.py", line 1984, in sequence_to_dt64n
s
    tz = _maybe_infer_tz(tz, data.tz)
AttributeError: 'SparseArray' object has no attribute 'tz'
>>>

on 1.0.5

>>> pd.__version__
'1.0.5'
>>>
>>> values = [
...     pd.Timestamp("2012-05-01T01:00:00.000000"),
...     pd.Timestamp("2016-05-01T01:00:00.000000"),
... ]
>>> arr = pd.arrays.SparseArray(values)
>>> pd.Index(arr)
DatetimeIndex(['2012-05-01 01:00:00', '2016-05-01 01:00:00'], dtype='datetime64[ns]', freq=None
)
>>>

if we don't want to revert parts of #33400, we could special case for SparseArray.

I'm not a fan of special casing, but special casing was used to fix Series construction in #35838, a regression also caused by #33400

If we want to do similar here to fix the Index construction case, we could patch as follows...

diff --git a/pandas/core/arrays/datetimes.py b/pandas/core/arrays/datetimes.py
index ce70f929c..67b57851a 100644
--- a/pandas/core/arrays/datetimes.py
+++ b/pandas/core/arrays/datetimes.py
@@ -38,6 +38,7 @@ from pandas.core.dtypes.common import (
     is_float_dtype,
     is_object_dtype,
     is_period_dtype,
+    is_sparse,
     is_string_dtype,
     is_timedelta64_dtype,
     pandas_dtype,
@@ -1921,7 +1922,6 @@ def sequence_to_dt64ns(
     ------
     TypeError : PeriodDType data is passed
     """
-
     inferred_freq = None
 
     dtype = _validate_dt64_dtype(dtype)
@@ -1956,7 +1956,11 @@ def sequence_to_dt64ns(
     data, copy = maybe_convert_dtype(data, copy)
     data_dtype = getattr(data, "dtype", None)
 
-    if is_object_dtype(data_dtype) or is_string_dtype(data_dtype):
+    if (
+        is_object_dtype(data_dtype)
+        or is_string_dtype(data_dtype)
+        or is_sparse(data_dtype)
+    ):
         # TODO: We do not have tests specific to string-dtypes,
         #  also complex or categorical or other extension
         copy = False

@jreback jreback modified the milestones: Contributions Welcome, 1.1.5 Dec 6, 2020
@jbrockmendel
Copy link
Member

    if isinstance(arr_or_dtype, ExtensionDtype):
        # GH#33400 fastpath for dtype object
        return arr_or_dtype.kind == "M"

instead of checking .kind, could directly check isinstance(arr_or_dtype, Datetime64TZDtype)?

@simonjayhawkins
Copy link
Member

If you want to make changes to the is_foo_dtype checks, fine with me. we won't need #38332 and can maybe revert #35838 too.

rather than papering over the cracks, I would also be happy to revert #33400 for now.

@jbrockmendel
Copy link
Member

rather than papering over the cracks

you've given more thought than i have to what is proper solution vs what constitutes crack-papering. If reverting #33400 is necessary, i'll choose correctness over performance every time.

@simonjayhawkins
Copy link
Member

A proper solution is probably not suitable for regression fixes. IIUC the proper solution has been discussed elsewhere, #35843 (comment)

@jreback
Copy link
Contributor

jreback commented Dec 6, 2020

yeah let's just do what @simonjayhawkins proposed for expediency

we need to release

if want to fix differently for 1.2 ok with that

@simonjayhawkins
Copy link
Member

by 'papering over the cracks', i was referring to the PRs adding the special casing for SparseArray.

yeah let's just do what @simonjayhawkins proposed for expediency

is that revert #33400 (and maybe #35838 too) or merge and backport #38332 to just fix the known issue.

@simonjayhawkins
Copy link
Member

#34986 related

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Output-Formatting __repr__ of pandas objects, to_string Regression Functionality that used to work in a prior pandas version Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants