Skip to content

CLN: Move capitalize_first_letter to where it's used #57096

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Feb 2, 2024

Conversation

mroeschke
Copy link
Member

No description provided.

@@ -1087,7 +1085,7 @@ def na_value(self) -> NaTType:

def __eq__(self, other: object) -> bool:
if isinstance(other, str):
return other in [self.name, capitalize_first_letter(self.name)]
return other in {self.name, self.name.title()}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The equivalent of the deleted function is str.capitalize(). str.title() makes the first letter of every word uppercase, not only the first letter of the whole string. Not sure if in this case it makes a difference, but maybe worth keeping the same exact behavior.

Suggested change
return other in {self.name, self.name.title()}
return other in {self.name, self.name.capitalize()}

In any case, nice clean up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like neither is correct

In [2]: "period[D]".capitalize()
Out[2]: 'Period[d]'   # we don't want to lowercase the [D]

In [4]: "Period[us]".title()
Out[4]: 'Period[Us]'  # we don't want to uppercase the [Us]

I found string.capwords works for what we want to compare here so I'll use that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the current implementation better. But I think it simply does a case insensitive comparison, which is fine. But if that's what we really want, I'd just apply .lower() to both sides. The code with string.capwords does exactly the same I think, but when reading it feels like there must be a reason to use string.capwords in particular, so I find it a bit misleading.

@MarcoGorelli @natmokval any thoughts on this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string.capwords allows a case insensitive match for the first letter only which is what we want. lower would do a case insensitive match on the whole string which is incorrect because we don't want to do that for the frequency in the brackets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like, if we replace string.capwords with .lower all tests pass, but I prefer not to change the frequency in the brackets.
Thouhg for dtype we use both "period[D]" and "Period[D]", it's better to keep the correct case for the frequency.

The reason: currently we are working on creating a consistent naming for aliases (xref #56346).
Deprecated lowercase strings in favour of uppercase strings denoting:

  • period aliases daily, weekly, monthly and bigger frequency
  • offsets aliases representing DateOffset subclasses that are a week or bigger (Week, MonthBegin, MonthEnd, etc.)
  • timedelta units for week, month or bigger

Deprecated uppercase strings in favour of lowercase strings denoting:

  • period aliases hourly, minutely or smaller frequency
  • offsets aliases representing DateOffset subclasses that are an hour or smaller (Hour, Minute, etc.)
  • timedelta units for hour, minute or smaller

For some frequencies the deprecation is done, for example we show FutureWarning if freq="q"

>>> pd.Period("2015-01-01", freq="q")
<stdin>:1: FutureWarning: 'q' is deprecated and will be removed in a future version, please use 'Q' instead.

For others, for example Day/Daily we can use both 'D' and 'd'

>>> pd.Period("2015-01-01", freq="d")
Period('2015-01-01', 'D')

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to insist @mroeschke, but I don't think the code in the PR does what you think it does (or maybe I'm missing something, but I can't see what).

string.capwords is making the comparison fully case insensitive:

>>> string.capwords('Period[US]') == string.capwords('Period[us]')
True

Any case variation of period[us] would be converted to Period[us] and the comparison will be true regardless on which part of the string has different capitalization.

If the resulting string is going to be used for anything else than comparing, I agree the exact function used is relevant. But in this case, any of .upper(), .lower(), .title(), string.capwords()... would be equivalent, and I think .lower() is the one that makes the code clearer.

Or maybe we know that one of the sides of the comparison is in the Period[ms] or period[ms] capitalization? Then we should remove the call to string.capwords in that side of the comparison, and string.capwords will do what we want.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK thanks for talking that through. I understand the situation clearer now. I think with the requirement to keep the frequency in the brackets unmodified, I think the only solution is to use the original helper function to the latest commit changed it back to that

@mroeschke mroeschke changed the title CLN: Use str.captialize instead of helper CLN: Use string.capwords instead of helper Jan 29, 2024
@mroeschke mroeschke added this to the 3.0 milestone Jan 29, 2024
@mroeschke mroeschke changed the title CLN: Use string.capwords instead of helper CLN: Move capitalize_first_letter to where it's used Jan 31, 2024
@@ -1087,7 +1090,7 @@ def na_value(self) -> NaTType:

def __eq__(self, other: object) -> bool:
if isinstance(other, str):
return other in [self.name, capitalize_first_letter(self.name)]
return other in {self.name, capitalize_first_letter(self.name)}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm still a bit confused.

First thing is that self.name is always going to be period[... in lower case, this is the implementation (just two methods above this part):

    @property
    def name(self) -> str_type:
        return f"period[{self._freqstr}]"

Then, if we want to compare strings, why not simply:

Suggested change
return other in {self.name, capitalize_first_letter(self.name)}
return other in {self.name, capitalize_first_letter(self.name)}
return other[0].lower() + other[1:] == self.name

This will make:

assert PeriodDtype('3D') == 'Period[3D]'
assert PeriodDtype('3D') == 'period[3D]'
assert PeriodDtype('3D') != 'Period[3d]'

Which for what understand is what we want, no?

Then, there is another case to consider:

>>> PeriodDtype(freq='1s1ms') == PeriodDtype(freq='1001ms')
True
>>> PeriodDtype(freq='1s1ms') == '1001ms'
False

Is this what we want? I personally think this is a bug. I'd say that a better implementation here is simply:

Suggested change
return other in {self.name, capitalize_first_letter(self.name)}
other = self.__class__(other)

So the comparison with string is consistent with the comparison with the types directly. If we want to do this, we probably don't want to raise when the string is not a valid frequency string, so I guess this makes more sense:

Suggested change
return other in {self.name, capitalize_first_letter(self.name)}
try:
other = self.__class__(other)
except ValueError:
return False

If we go with this option, we probably should add a test, as this would be a bug fix, not a clean up anymore.

What do you think? I think it's a quick change, but if you prefer that I open a new PR for this I can do it.

Copy link
Contributor

@natmokval natmokval Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can use pd.PeriodDtype(freq='1s1ms').freq.freqstrt to make the comparison

>>> pd.PeriodDtype(freq='1s1ms') == pd.PeriodDtype(freq='1001ms')
True
>>> pd.PeriodDtype(freq='1s1ms').freq.freqstr == '1001ms'
True

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this PR to use return other[:1].lower() + other[1:] == self.name

For the PeriodDtype(freq='1s1ms') == '1001ms', I would prefer not to make a call about this case in this PR. My gut would prefer to be stricter in eq and require period[...], while just the frequency string can be compared exactly to freqstr as @natmokval suggested. For example, 1001ms could be "timedelta-like" for a user which shouldn't be made equal to a "period-like`"

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Nice clean up, thanks for taking care of this.

I don't think the other option I gave would equal to true when the dtype is a timedelta or anything that it's not a PeriodDtype, but happy to get this merge, and we can consider the other case separately if it makes sense.

@natmokval if you want to have a look here and add any comment before we merge...

@natmokval
Copy link
Contributor

lgtm, thanks @mroeschke for the clean up.

@datapythonista datapythonista merged commit 680b215 into pandas-dev:main Feb 2, 2024
@mroeschke mroeschke deleted the cln/helper branch February 2, 2024 17:42
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants