BUG: preserve freq in DTI/TDI factorize #33836

jbrockmendel · 2020-04-28T00:58:46Z

…eq-factorize

jbrockmendel · 2020-04-28T03:12:44Z

@mroeschke do we need to pin a numba dep?

mroeschke · 2020-04-28T03:43:48Z

The test failures don't look numba related, but I may need to silence more warnings in some tests

EDIT: I have this filterwarnings mark on all numba tests that use parallel. Locally it's filtering the warning; not sure why it's raising here.

@pytest.mark.filterwarnings("ignore:\\nThe keyword argument")
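
For readers unfamiliar with the filter string: the part after `ignore:` is a regular expression matched against the start of the warning message, so `\\n` matches a message that begins with a newline. Below is a minimal sketch of the same filter using only the standard-library `warnings` module. The warning text and the helper names (`emit`, `count_warnings`) are placeholders for illustration, not the actual numba message or pandas test code.

```python
import warnings

# Placeholder message: assumed to begin with a newline followed by
# "The keyword argument"; the rest of the text is made up for this sketch.
MESSAGE = "\nThe keyword argument 'parallel=True' was specified but ..."


def emit():
    """Raise the placeholder warning."""
    warnings.warn(MESSAGE, UserWarning)


def count_warnings(install_filter):
    """Return how many warnings escape, with or without the ignore filter."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        if install_filter:
            # Same pattern as in the pytest mark: the regex \n matches the
            # leading newline, and matching is anchored at the message start.
            warnings.filterwarnings("ignore", message="\\nThe keyword argument")
        emit()
    return len(caught)


print(count_warnings(install_filter=False))
print(count_warnings(install_filter=True))
```

If the real message starts differently (for example, without the leading newline), the pattern would not match and the warning would still be raised.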

pandas/core/algorithms.py

jorisvandenbossche · 2020-04-28T11:00:11Z

pandas/core/indexes/datetimelike.py

+            return codes, self[:]
+            # TODO: In the sort=True case we could check for monotonic_decreasing
+            #  and operate on self[::-1]
+        return super().factorize(sort=sort, na_sentinel=na_sentinel)


This seems to be duplicating the array version, can't that be reused? (directly, or by having pd.factorize call the array version)

Can you check this comment?

It's close; the main difference is that the Index version returns an Index for uniques.

But that is handled in factorize (depending on the input, it will wrap the uniques in an Index or not)?

I'm not clear on what you're suggesting. That we don't need to override this here at all?

Well, you might need to update factorize to preserve freq (it's using _shallow_copy right now)

(or we could decide that factorize is not an operation that should preserve the freq)

> Well, you might need to update factorize to preserve freq (it's using _shallow_copy right now)

If we're getting into Index-subclass-specific logic, I think that belongs on the Index subclass.

> (or we could decide that factorize is not an operation that should preserve the freq)

I'd be OK with this

> If we're getting into Index-subclass-specific logic, I think that belongs on the Index subclass.

Not necessarily, I would say, as it already has index-specific handling as well.
But there is no "shallow_copy"-like method that preserves attributes like freq?

> But there is no "shallow_copy"-like method that preserves attributes like freq?

There used to be, but it was being used in places where freq shouldn't be retained, so it was changed because it was a footgun.

pandas/core/algorithms.py

…eq-factorize

pandas/core/arrays/datetimelike.py

jorisvandenbossche · 2020-04-28T19:01:55Z

pandas/core/indexes/datetimelike.py

+            return codes, self[:]
+            # TODO: In the sort=True case we could check for monotonic_decreasing
+            #  and operate on self[::-1]
+        return super().factorize(sort=sort, na_sentinel=na_sentinel)


Can you check this comment?

pandas/core/base.py

…eq-factorize

jbrockmendel · 2020-04-29T00:45:31Z

hmm ATM this fixes dti.factorize() but not pd.factorize(dti)

Update: fixed.
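
For reference, a small sketch of the two entry points being compared, `dti.factorize()` and `pd.factorize(dti)`. Whether the returned uniques keep `freq` depends on the pandas version in use, so the sketch only prints it and makes no claim about the result; the variable names are illustrative.

```python
import pandas as pd

# A regular daily index; date_range sets freq on the result.
dti = pd.date_range("2020-01-01", periods=3, freq="D")

# Method form and top-level function form of the same operation.
codes_method, uniques_method = dti.factorize()
codes_func, uniques_func = pd.factorize(dti)

# Whether freq survives is version-dependent, so it is printed rather
# than asserted here.
print("input freq:            ", dti.freq)
print("dti.factorize() freq:  ", uniques_method.freq)
print("pd.factorize(dti) freq:", uniques_func.freq)
```

Both calls should agree on the codes and on the values of the uniques; the discussion in this PR is only about the `freq` attribute of the uniques.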

jreback

Can you explain why we are exposing .factorize() as a public method on Index? I don't think we really want this (a private method that pd.factorize calls on EAs generally is OK).

jbrockmendel · 2020-04-30T14:40:00Z

> can you explain why we are exposing .factorize() as a public method on Index?

No idea, it's there on both Index and Series in the status quo.

jbrockmendel · 2020-05-03T00:59:09Z

Gentle ping, this is a blocker for freq-check in assert-index-equal

jreback · 2020-05-03T01:30:34Z

you haven’t answered my question above

jbrockmendel · 2020-05-03T02:32:54Z

> you haven’t answered my question above

You asked why we are exposing Index.factorize, and I answered here
that I am not aware of a good reason, but we do it in master.

jreback · 2020-05-10T14:46:50Z

pandas/core/arrays/datetimelike.py

@@ -437,6 +437,13 @@ def _with_freq(self, freq):
        arr._freq = freq
        return arr

+    def factorize(self, na_sentinel=-1):


we don't want to expose factorize on Index at all.

It is already there in the status quo. If you want to remove it, that needs a deprecation cycle and is a separate issue.

/Users/jreback/pandas bash-3.2$ grep -r factorize pandas/core/indexes/
pandas/core/indexes//multi.py:from pandas.core.arrays.categorical import factorize_from_iterables
pandas/core/indexes//multi.py: indexer_from_factorized,
pandas/core/indexes//multi.py: codes, levels = factorize_from_iterables(arrays)
pandas/core/indexes//multi.py: codes, levels = factorize_from_iterables(iterables)
pandas/core/indexes//multi.py: codes, uniques = algos.factorize(indexer, sort=True)
pandas/core/indexes//multi.py: ok_codes, uniques = algos.factorize(indexer[mask], sort=True)
pandas/core/indexes//multi.py: indexer = indexer_from_factorized(primary, primshp, compress=False)
Binary file pandas/core/indexes//__pycache__/multi.cpython-38.pyc matches

pls show where

Hmm, I guess it's in the base class.

In any event, I don't think we actually want to change this.

Comment in existing tests seems to think this is already the behavior: https://github.com/pandas-dev/pandas/pull/33836/files#diff-8cf55ac38b6988b09a7f7f5d7280eb0fL360

Also should improve perf since this short-circuits the expensive part of factorize
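
To illustrate the short-circuit being described, here is a minimal pure-Python sketch. It is not the pandas implementation: `ToyIndex`, `generic_factorize`, and the integer `freq` are hypothetical stand-ins, and the `sort` and `na_sentinel` arguments are left out. The idea is that an index with a non-None freq is regularly spaced and therefore has no duplicates, so the codes are simply 0..n-1 and the uniques are a copy of the input that keeps its freq.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ToyIndex:
    """Hypothetical stand-in for a DatetimeIndex/TimedeltaIndex."""

    values: Tuple[int, ...]
    freq: Optional[int] = None  # fixed non-zero step between values, or None


def generic_factorize(values: Tuple[int, ...]) -> Tuple[List[int], Tuple[int, ...]]:
    """Order-of-appearance factorization via a lookup table (the slow path)."""
    seen = {}
    codes = []
    for value in values:
        if value not in seen:
            seen[value] = len(seen)
        codes.append(seen[value])
    # dicts preserve insertion order, so the keys are the uniques in order
    return codes, tuple(seen)


def factorize(index: ToyIndex) -> Tuple[List[int], ToyIndex]:
    if index.freq is not None:
        # Regular spacing implies uniqueness: skip the lookup table entirely
        # and return a copy that retains freq.
        codes = list(range(len(index.values)))
        return codes, ToyIndex(index.values, index.freq)
    # No freq: fall back to the generic path; the result carries no freq.
    codes, uniques = generic_factorize(index.values)
    return codes, ToyIndex(uniques, None)


regular = ToyIndex((0, 10, 20, 30), freq=10)
irregular = ToyIndex((5, 5, 7))
regular_codes, regular_uniques = factorize(regular)
irregular_codes, irregular_uniques = factorize(irregular)
```

In this sketch the freq branch does no hashing at all, which is where the performance benefit mentioned above would come from.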

jbrockmendel · 2020-06-11T17:27:06Z

Mothballing. If no one else cares about these bugs, not worth pursuing.

jbrockmendel added 3 commits April 27, 2020 17:57

BUG: preserve freq in DTI/TDI factorize

6d741b1

Merge branch 'master' of https://github.com/pandas-dev/pandas into fr…

e466d71

…eq-factorize

mypy fixup

a553174

jorisvandenbossche requested changes Apr 28, 2020

jbrockmendel added 4 commits April 28, 2020 07:38

dummy commit to force CI

23911ef

Merge branch 'master' of https://github.com/pandas-dev/pandas into fr…

516d232

…eq-factorize

refactor per joris suggestion

0e51930

32bit compat

678251d

jorisvandenbossche reviewed Apr 28, 2020

jbrockmendel added 2 commits April 28, 2020 12:19

return copy

abb5913

Merge branch 'master' of https://github.com/pandas-dev/pandas into fr…

c96c1ac

…eq-factorize

preserve freq in pd.factorize

7c66389

jreback requested changes Apr 30, 2020

jreback added the ExtensionArray (Extending pandas with custom dtypes or arrays.) and Frequency (DateOffsets) labels Apr 30, 2020

jbrockmendel mentioned this pull request May 6, 2020

TST: check freq on series.index in assert_series_equal #33815

Merged

jreback requested changes May 10, 2020

jbrockmendel closed this Jun 11, 2020

jbrockmendel added the Mothballed (Temporarily-closed PR the author plans to return to) label Jun 11, 2020
