-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
CLN avoid some upcasting when its not the purpose of the test #50493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLN avoid some upcasting when its not the purpose of the test #50493
Conversation
414010b
to
159e4d9
Compare
df = DataFrame(np.random.randn(n, 3)).astype({0: object}) | ||
df["dates1"] = date_range("1/1/2012", periods=n) | ||
df["dates3"] = date_range("1/1/2014", periods=n) | ||
df.iloc[0, 0] = pd.NaT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here, df[0]
is first created of int
dtype, and then df.iloc[0, 0] = pd.NaT
upcasts it to object
. Might as well create it of dtype object
in the first place, as the purpose of this test comes in the lines below (df.query(...
)
df = DataFrame(np.random.randn(n, 3)) | ||
df = DataFrame(np.random.randn(n, 3)).astype({0: object}) | ||
df["dates1"] = date_range("1/1/2012", periods=n) | ||
df["dates3"] = date_range("1/1/2014", periods=n) | ||
df.iloc[0, 0] = pd.NaT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
# GH 9201 | ||
df1 = DataFrame(np.random.randn(5, 3), columns=["foo", "bar", "baz"]) | ||
df1 = DataFrame(np.random.randn(5, 3), columns=["foo", "bar", "baz"]).astype( | ||
{"foo": object} | ||
) | ||
# set one entry to a number in str format | ||
df1.loc[0, "foo"] = "100" | ||
|
||
df2 = DataFrame(np.random.randn(5, 3), columns=["foo", "bar", "baz"]) | ||
df2 = DataFrame(np.random.randn(5, 3), columns=["foo", "bar", "baz"]).astype( | ||
{"foo": object} | ||
) | ||
# set one entry to a non-number str | ||
df2.loc[0, "foo"] = "a" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in both df1
and df2
, column 'foo'
is initially created to be of dtype 'int'
, and then by setting an element to a string it is upcasting to object
. This is done on purpose to check the numeric_only
flag below
Might as well declare column 'foo'
to be of dtype object
from the start then
@@ -102,7 +102,7 @@ def test_groupby_with_timegrouper(self): | |||
index=date_range( | |||
"20130901", "20131205", freq="5D", name="Date", inclusive="left" | |||
), | |||
) | |||
).astype({"Buyer": object}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
column 'Buyer'
is originally set to 0
, and then some strings are set (thus upcasting to object
). Might as well construct it as object
from the start - furthermore, this is the expected
dataframe, so it doesn't change what's being tested
df = pd.DataFrame(np.zeros((3, 3))) | ||
df = pd.DataFrame(np.zeros((3, 3))).astype({2: object}) | ||
df.iloc[2, 2] = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the third column df[2]
is initially made up of ints, and then an element is set to ' '
(thus upcasting to object)
might as well set to object
from the start, as the purpose of the test is on the line below df.replace(...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good - I would recommend adding comments as otherwise it seems easy for a contributor to see thiese tests and think the astype
is unnecessary.
159e4d9
to
d4ecbc8
Compare
good call, thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Generally, tests follow the pattern:
There are a few places when, whilst creating the test input, some column is created of some dtype (e.g. 'int'), and then some incompatible type is set (e.g. a string
'foo'
), thus upcasting the column to dtype object. After that, this input is put through the function which the test is meant to test.Given that the upcasting isn't the purpose of these tests, it'd be cleaner to just create the input right away with object dtype.
The "avoid upcasting to object" part of #50424 does seem uncontroversial, so this would help reduce the diff when a set of PRs to address that will be raised