-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
CLN: Move capitalize_first_letter to where it's used #57096
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/dtypes/dtypes.py
Outdated
@@ -1087,7 +1085,7 @@ def na_value(self) -> NaTType: | |||
|
|||
def __eq__(self, other: object) -> bool: | |||
if isinstance(other, str): | |||
return other in [self.name, capitalize_first_letter(self.name)] | |||
return other in {self.name, self.name.title()} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The equivalent of the deleted function is str.capitalize()
. str.title()
makes the first letter of every word uppercase, not only the first letter of the whole string. Not sure if in this case it makes a difference, but maybe worth keeping the same exact behavior.
return other in {self.name, self.name.title()} | |
return other in {self.name, self.name.capitalize()} |
In any case, nice clean up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like neither is correct
In [2]: "period[D]".capitalize()
Out[2]: 'Period[d]' # we don't want to lowercase the [D]
In [4]: "Period[us]".title()
Out[4]: 'Period[Us]' # we don't want to uppercase the [Us]
I found string.capwords
works for what we want to compare here so I'll use that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the current implementation better. But I think it simply does a case insensitive comparison, which is fine. But if that's what we really want, I'd just apply .lower()
to both sides. The code with string.capwords
does exactly the same I think, but when reading it feels like there must be a reason to use string.capwords
in particular, so I find it a bit misleading.
@MarcoGorelli @natmokval any thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string.capwords
allows a case insensitive match for the first letter only which is what we want. lower
would do a case insensitive match on the whole string which is incorrect because we don't want to do that for the frequency in the brackets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like, if we replace string.capwords
with .lower
all tests pass, but I prefer not to change the frequency in the brackets.
Thouhg for dtype
we use both "period[D]"
and "Period[D]"
, it's better to keep the correct case for the frequency.
The reason: currently we are working on creating a consistent naming for aliases (xref #56346).
Deprecated lowercase strings in favour of uppercase strings denoting:
- period aliases daily, weekly, monthly and bigger frequency
- offsets aliases representing DateOffset subclasses that are a week or bigger (Week, MonthBegin, MonthEnd, etc.)
- timedelta units for week, month or bigger
Deprecated uppercase strings in favour of lowercase strings denoting:
- period aliases hourly, minutely or smaller frequency
- offsets aliases representing DateOffset subclasses that are an hour or smaller (Hour, Minute, etc.)
- timedelta units for hour, minute or smaller
For some frequencies the deprecation is done, for example we show FutureWarning
if freq="q"
>>> pd.Period("2015-01-01", freq="q")
<stdin>:1: FutureWarning: 'q' is deprecated and will be removed in a future version, please use 'Q' instead.
For others, for example Day/Daily we can use both 'D'
and 'd'
>>> pd.Period("2015-01-01", freq="d")
Period('2015-01-01', 'D')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to insist @mroeschke, but I don't think the code in the PR does what you think it does (or maybe I'm missing something, but I can't see what).
string.capwords
is making the comparison fully case insensitive:
>>> string.capwords('Period[US]') == string.capwords('Period[us]')
True
Any case variation of period[us]
would be converted to Period[us]
and the comparison will be true regardless on which part of the string has different capitalization.
If the resulting string is going to be used for anything else than comparing, I agree the exact function used is relevant. But in this case, any of .upper()
, .lower()
, .title()
, string.capwords()
... would be equivalent, and I think .lower()
is the one that makes the code clearer.
Or maybe we know that one of the sides of the comparison is in the Period[ms]
or period[ms]
capitalization? Then we should remove the call to string.capwords
in that side of the comparison, and string.capwords
will do what we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK thanks for talking that through. I understand the situation clearer now. I think with the requirement to keep the frequency in the brackets unmodified, I think the only solution is to use the original helper function to the latest commit changed it back to that
pandas/core/dtypes/dtypes.py
Outdated
@@ -1087,7 +1090,7 @@ def na_value(self) -> NaTType: | |||
|
|||
def __eq__(self, other: object) -> bool: | |||
if isinstance(other, str): | |||
return other in [self.name, capitalize_first_letter(self.name)] | |||
return other in {self.name, capitalize_first_letter(self.name)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm still a bit confused.
First thing is that self.name
is always going to be period[...
in lower case, this is the implementation (just two methods above this part):
@property
def name(self) -> str_type:
return f"period[{self._freqstr}]"
Then, if we want to compare strings, why not simply:
return other in {self.name, capitalize_first_letter(self.name)} | |
return other in {self.name, capitalize_first_letter(self.name)} | |
return other[0].lower() + other[1:] == self.name |
This will make:
assert PeriodDtype('3D') == 'Period[3D]'
assert PeriodDtype('3D') == 'period[3D]'
assert PeriodDtype('3D') != 'Period[3d]'
Which for what understand is what we want, no?
Then, there is another case to consider:
>>> PeriodDtype(freq='1s1ms') == PeriodDtype(freq='1001ms')
True
>>> PeriodDtype(freq='1s1ms') == '1001ms'
False
Is this what we want? I personally think this is a bug. I'd say that a better implementation here is simply:
return other in {self.name, capitalize_first_letter(self.name)} | |
other = self.__class__(other) |
So the comparison with string is consistent with the comparison with the types directly. If we want to do this, we probably don't want to raise when the string is not a valid frequency string, so I guess this makes more sense:
return other in {self.name, capitalize_first_letter(self.name)} | |
try: | |
other = self.__class__(other) | |
except ValueError: | |
return False |
If we go with this option, we probably should add a test, as this would be a bug fix, not a clean up anymore.
What do you think? I think it's a quick change, but if you prefer that I open a new PR for this I can do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can use pd.PeriodDtype(freq='1s1ms').freq.freqstrt
to make the comparison
>>> pd.PeriodDtype(freq='1s1ms') == pd.PeriodDtype(freq='1001ms')
True
>>> pd.PeriodDtype(freq='1s1ms').freq.freqstr == '1001ms'
True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this PR to use return other[:1].lower() + other[1:] == self.name
For the PeriodDtype(freq='1s1ms') == '1001ms'
, I would prefer not to make a call about this case in this PR. My gut would prefer to be stricter in eq and require period[...]
, while just the frequency string can be compared exactly to freqstr
as @natmokval suggested. For example, 1001ms
could be "timedelta-like" for a user which shouldn't be made equal to a "period-like`"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Nice clean up, thanks for taking care of this.
I don't think the other option I gave would equal to true when the dtype is a timedelta or anything that it's not a PeriodDtype
, but happy to get this merge, and we can consider the other case separately if it makes sense.
@natmokval if you want to have a look here and add any comment before we merge...
lgtm, thanks @mroeschke for the clean up. |
No description provided.