-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: preserve freq in DTI/TDI factorize #33836
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@mroeschke do we need to pin a numba dep? |
The test failures don't look numba related, but I may need to silence more warnings in some tests EDIT: I have this
|
return codes, self[:] | ||
# TODO: In the sort=True case we could check for montonic_decreasing | ||
# and operate on self[::-1] | ||
return super().factorize(sort=sort, na_sentinel=na_sentinel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be duplicating the array version, can't that be reused? (directly, or by having pd.factorize
call the array version)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check this comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its close; the main difference is that the Index version returns an Index for uniques
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that is handled in factorize
(depending on the input, it will wrap the uniques in an Index or not)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im not clear on what youre suggesting? that we dont need to override this here at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, you might need to update factorize
to preserve freq
(it's using _shallow_copy
right now)
(or we could decide that factorize
is not an operation that should preserve the freq)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, you might need to update factorize to preserve freq (it's using _shallow_copy right now)
If we're getting into Index-subclass-specific logic, I think that belongs on the Index subclass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(or we could decide that factorize is not an operation that should preserve the freq)
I'd be OK with this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're getting into Index-subclass-specific logic, I think that belongs on the Index subclass.
Not necessarily, I would say, as it already has index-specific handling as well.
But there is no "shallow_copy"-like method that preserves attributes like freq
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But there is no "shallow_copy"-like method that preserves attributes like freq ?
There used to be, but it was being used in places where freq shouldnt be retained, so was changed as it was a footgun.
return codes, self[:] | ||
# TODO: In the sort=True case we could check for montonic_decreasing | ||
# and operate on self[::-1] | ||
return super().factorize(sort=sort, na_sentinel=na_sentinel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check this comment?
hmm ATM this fixes dti.factorize() but not pd.factorize(dti) update fixed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you explain why we are exposing .factorize()
as a public method on Index? I don't think we really want this (private method that pd.factorize
calls on EA generally is ok.
No idea, its there on both Index and Series in the status quo. |
Gentle ping, this is a blocker for freq-check in assert-index-equal |
you haven’t answered my question above |
you asked why we are exposing Index.factorize, and I answered here |
@@ -437,6 +437,13 @@ def _with_freq(self, freq): | |||
arr._freq = freq | |||
return arr | |||
|
|||
def factorize(self, na_sentinel=-1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't want to expose factorize on Index at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is already there in the status quo. If you want to remove it, that needs a deprecation cycle, is a separate issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/Users/jreback/pandas
bash-3.2$ grep -r factorize pandas/core/indexes/
pandas/core/indexes//multi.py:from pandas.core.arrays.categorical import factorize_from_iterables
pandas/core/indexes//multi.py: indexer_from_factorized,
pandas/core/indexes//multi.py: codes, levels = factorize_from_iterables(arrays)
pandas/core/indexes//multi.py: codes, levels = factorize_from_iterables(iterables)
pandas/core/indexes//multi.py: codes, uniques = algos.factorize(indexer, sort=True)
pandas/core/indexes//multi.py: ok_codes, uniques = algos.factorize(indexer[mask], sort=True)
pandas/core/indexes//multi.py: indexer = indexer_from_factorized(primary, primshp, compress=False)
Binary file pandas/core/indexes//__pycache__/multi.cpython-38.pyc matches
pls show where
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm i guess its in the base class.
In any event. I don't think we actually want to change this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment in existing tests seems to think this is already the behavior: https://github.com/pandas-dev/pandas/pull/33836/files#diff-8cf55ac38b6988b09a7f7f5d7280eb0fL360
Also should improve perf since this short-circuits the expensive part of factorize
Mothballing. If no one else cares about these bugs, not worth pursuing. |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
xref #33830