Skip to content

CLN avoid some upcasting when its not the purpose of the test #50493

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Dec 29, 2022

Generally, tests follow the pattern:

  • create input
  • do something to it
  • check that it matches expected output

There are a few places when, whilst creating the test input, some column is created of some dtype (e.g. 'int'), and then some incompatible type is set (e.g. a string 'foo'), thus upcasting the column to dtype object. After that, this input is put through the function which the test is meant to test.

Given that the upcasting isn't the purpose of these tests, it'd be cleaner to just create the input right away with object dtype.

The "avoid upcasting to object" part of #50424 does seem uncontroversial, so this would help reduce the diff when a set of PRs to address that will be raised

@MarcoGorelli MarcoGorelli changed the title avoid some upcasting when its not the purpose of the test CLN avoid some upcasting when its not the purpose of the test Dec 29, 2022
@MarcoGorelli MarcoGorelli force-pushed the remove-upcasting-from-some-tests branch from 414010b to 159e4d9 Compare December 29, 2022 21:21
Comment on lines +451 to 455
df = DataFrame(np.random.randn(n, 3)).astype({0: object})
df["dates1"] = date_range("1/1/2012", periods=n)
df["dates3"] = date_range("1/1/2014", periods=n)
df.iloc[0, 0] = pd.NaT
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here, df[0] is first created of int dtype, and then df.iloc[0, 0] = pd.NaT upcasts it to object. Might as well create it of dtype object in the first place, as the purpose of this test comes in the lines below (df.query(...)

Comment on lines 811 to 816
df = DataFrame(np.random.randn(n, 3))
df = DataFrame(np.random.randn(n, 3)).astype({0: object})
df["dates1"] = date_range("1/1/2012", periods=n)
df["dates3"] = date_range("1/1/2014", periods=n)
df.iloc[0, 0] = pd.NaT
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Comment on lines 450 to 461
# GH 9201
df1 = DataFrame(np.random.randn(5, 3), columns=["foo", "bar", "baz"])
df1 = DataFrame(np.random.randn(5, 3), columns=["foo", "bar", "baz"]).astype(
{"foo": object}
)
# set one entry to a number in str format
df1.loc[0, "foo"] = "100"

df2 = DataFrame(np.random.randn(5, 3), columns=["foo", "bar", "baz"])
df2 = DataFrame(np.random.randn(5, 3), columns=["foo", "bar", "baz"]).astype(
{"foo": object}
)
# set one entry to a non-number str
df2.loc[0, "foo"] = "a"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in both df1 and df2, column 'foo' is initially created to be of dtype 'int', and then by setting an element to a string it is upcasting to object. This is done on purpose to check the numeric_only flag below

Might as well declare column 'foo' to be of dtype object from the start then

@@ -102,7 +102,7 @@ def test_groupby_with_timegrouper(self):
index=date_range(
"20130901", "20131205", freq="5D", name="Date", inclusive="left"
),
)
).astype({"Buyer": object})
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

column 'Buyer' is originally set to 0, and then some strings are set (thus upcasting to object). Might as well construct it as object from the start - furthermore, this is the expected dataframe, so it doesn't change what's being tested

Comment on lines 19 to 21
df = pd.DataFrame(np.zeros((3, 3)))
df = pd.DataFrame(np.zeros((3, 3))).astype({2: object})
df.iloc[2, 2] = ""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the third column df[2] is initially made up of ints, and then an element is set to ' ' (thus upcasting to object)

might as well set to object from the start, as the purpose of the test is on the line below df.replace(...

@MarcoGorelli MarcoGorelli marked this pull request as ready for review December 29, 2022 21:26
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - I would recommend adding comments as otherwise it seems easy for a contributor to see thiese tests and think the astype is unnecessary.

@MarcoGorelli MarcoGorelli force-pushed the remove-upcasting-from-some-tests branch from 159e4d9 to d4ecbc8 Compare December 30, 2022 09:00
@MarcoGorelli
Copy link
Member Author

good call, thanks

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants