Skip to content

CLN: Move capitalize_first_letter to where it's used #57096

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Feb 2, 2024
4 changes: 1 addition & 3 deletions pandas/core/dtypes/dtypes.py
Original file line number Diff line number Diff line change
Expand Up @@ -62,8 +62,6 @@
is_list_like,
)

from pandas.util import capitalize_first_letter

if not pa_version_under10p1:
import pyarrow as pa

Expand Down Expand Up @@ -1087,7 +1085,7 @@ def na_value(self) -> NaTType:

def __eq__(self, other: object) -> bool:
if isinstance(other, str):
return other in [self.name, capitalize_first_letter(self.name)]
return other in {self.name, self.name.title()}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The equivalent of the deleted function is str.capitalize(). str.title() makes the first letter of every word uppercase, not only the first letter of the whole string. Not sure if in this case it makes a difference, but maybe worth keeping the same exact behavior.

Suggested change
return other in {self.name, self.name.title()}
return other in {self.name, self.name.capitalize()}

In any case, nice clean up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like neither is correct

In [2]: "period[D]".capitalize()
Out[2]: 'Period[d]'   # we don't want to lowercase the [D]

In [4]: "Period[us]".title()
Out[4]: 'Period[Us]'  # we don't want to uppercase the [Us]

I found string.capwords works for what we want to compare here so I'll use that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the current implementation better. But I think it simply does a case insensitive comparison, which is fine. But if that's what we really want, I'd just apply .lower() to both sides. The code with string.capwords does exactly the same I think, but when reading it feels like there must be a reason to use string.capwords in particular, so I find it a bit misleading.

@MarcoGorelli @natmokval any thoughts on this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string.capwords allows a case insensitive match for the first letter only which is what we want. lower would do a case insensitive match on the whole string which is incorrect because we don't want to do that for the frequency in the brackets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like, if we replace string.capwords with .lower all tests pass, but I prefer not to change the frequency in the brackets.
Thouhg for dtype we use both "period[D]" and "Period[D]", it's better to keep the correct case for the frequency.

The reason: currently we are working on creating a consistent naming for aliases (xref #56346).
Deprecated lowercase strings in favour of uppercase strings denoting:

  • period aliases daily, weekly, monthly and bigger frequency
  • offsets aliases representing DateOffset subclasses that are a week or bigger (Week, MonthBegin, MonthEnd, etc.)
  • timedelta units for week, month or bigger

Deprecated uppercase strings in favour of lowercase strings denoting:

  • period aliases hourly, minutely or smaller frequency
  • offsets aliases representing DateOffset subclasses that are an hour or smaller (Hour, Minute, etc.)
  • timedelta units for hour, minute or smaller

For some frequencies the deprecation is done, for example we show FutureWarning if freq="q"

>>> pd.Period("2015-01-01", freq="q")
<stdin>:1: FutureWarning: 'q' is deprecated and will be removed in a future version, please use 'Q' instead.

For others, for example Day/Daily we can use both 'D' and 'd'

>>> pd.Period("2015-01-01", freq="d")
Period('2015-01-01', 'D')

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to insist @mroeschke, but I don't think the code in the PR does what you think it does (or maybe I'm missing something, but I can't see what).

string.capwords is making the comparison fully case insensitive:

>>> string.capwords('Period[US]') == string.capwords('Period[us]')
True

Any case variation of period[us] would be converted to Period[us] and the comparison will be true regardless on which part of the string has different capitalization.

If the resulting string is going to be used for anything else than comparing, I agree the exact function used is relevant. But in this case, any of .upper(), .lower(), .title(), string.capwords()... would be equivalent, and I think .lower() is the one that makes the code clearer.

Or maybe we know that one of the sides of the comparison is in the Period[ms] or period[ms] capitalization? Then we should remove the call to string.capwords in that side of the comparison, and string.capwords will do what we want.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK thanks for talking that through. I understand the situation clearer now. I think with the requirement to keep the frequency in the brackets unmodified, I think the only solution is to use the original helper function to the latest commit changed it back to that


return super().__eq__(other)

Expand Down
4 changes: 0 additions & 4 deletions pandas/util/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,3 @@ def __getattr__(key: str):
return cache_readonly

raise AttributeError(f"module 'pandas.util' has no attribute '{key}'")


def capitalize_first_letter(s: str) -> str:
return s[:1].upper() + s[1:]