A few issues encountered when teaching #71

brunosmaniotto · 2025-03-01T15:35:37Z

Notebook 1 (regression):

data.corr() needs to be replaced by data.corr(numeric_only = True)
mean_squared_error(y_true,y_pred,squared = False) needs to be replaced by root_mean_squared_error. This includes importing the function, and replacing all instances.

Notebook 2 (regularization):

In the part "Low Number of Samples: The most common scenario where you might overfit is when you have many features, but not many samples. Adding the penalty term stabilizes the model in these scenarios. There's not a great intuition for this without diving into the math, so you can just take it at face value.", I think it could be added that part of the intuition is that the variance of the parameter estimates is higher. Something like this: "The idea here is that when sample sizes are low, there is a higher probability of drawing an unrepresentative sample, and thus the parameter estimates can fluctuate more from sample to sample. By applying regularization, we prevent the model from fitting too closely to any anomalies, making it more robust."
mean_squared_error(y_true,y_pred,squared = False) needs to be replaced by root_mean_squared_error. This includes importing the function, and replacing all instances.
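
The variance intuition could also be shown with a small simulation. The sketch below is not from the notebook: it uses a closed-form ridge solution in NumPy, and the sample size, number of features, and penalty strength `alpha` are arbitrary assumptions chosen to mimic the "many features, few samples" case.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_trials = 15, 10, 500  # few samples, many features
alpha = 5.0                                    # hypothetical penalty strength
true_coef = np.ones(n_features)


def fit(X, y, alpha):
    """Closed-form ridge fit without intercept; alpha=0 is ordinary least squares."""
    penalty = alpha * np.eye(X.shape[1])
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)


ols_coefs, ridge_coefs = [], []
for _ in range(n_trials):
    # Draw a fresh small sample from the same data-generating process.
    X = rng.normal(size=(n_samples, n_features))
    y = X @ true_coef + rng.normal(size=n_samples)
    ols_coefs.append(fit(X, y, 0.0))
    ridge_coefs.append(fit(X, y, alpha))

# Average variance of the coefficient estimates across the repeated samples.
ols_var = np.var(ols_coefs, axis=0).mean()
ridge_var = np.var(ridge_coefs, axis=0).mean()
print(f"OLS coefficient variance:   {ols_var:.3f}")
print(f"Ridge coefficient variance: {ridge_var:.3f}")
```

Under these assumptions the ridge estimates fluctuate less across samples than the unpenalized ones. The price is some bias, since the coefficients are shrunk toward zero.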

Notebook 3 (preprocessing):

OneHotEncoder(categories='auto', drop='first', sparse=False) should be replaced by OneHotEncoder(categories='auto', drop='first', sparse_output=False)
I think it would be useful to mention at some point that we plan to predict the species of the penguins, which is why we drop the "species" column and don't preprocess it.
"Island where the penguin was found [Torgersen, Biscoe]" should be "Island where the penguin was found [Torgersen, Biscoe, Dream]"
For the solutions, I think it would be helpful to discuss data leakage a little bit more. For example: "One example would be imputing missing values with the mean of the entire dataset, rather than the mean of the training set only. By including the test set, we are using information from it to create the estimate. As a result, our estimate is tailored to this dataset, and we overestimate how well the model can generalize to new data."
Also, there are no solutions to the bonus questions and challenge 3 - but I'm not sure if that's on purpose or not.

Notebook 4 (classification):

All instances of float.round(3) should be replaced by round(float,3)
I'm getting different values for the solution of Challenge 1.
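
A small sketch of why the `round` replacement matters. It assumes the value being rounded is a plain Python float, which I have not verified for every cell in the notebook. The variable names and the value are made up.

```python
import numpy as np

score = 0.123456              # plain Python float (hypothetical value)
np_score = np.float64(score)  # NumPy scalar holding the same value

# The built-in round() works for both kinds of value.
print(round(score, 3))
print(round(np_score, 3))

# Only the NumPy scalar has a .round() method; a plain Python float does not,
# so score.round(3) would raise AttributeError.
print(hasattr(score, "round"))
print(hasattr(np_score, "round"))
```

Using the built-in `round` avoids depending on whether a metric returns a NumPy scalar or a Python float.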

Overall: I think this workshop could be divided into four parts of 1 to 1.5 hours each. The problem isn't the total time needed to teach the workshop, but that it is too much information for just two meetings. We talked with students and they thought the same, especially about the first day. Also, 3 consecutive hours is hard to fit into a schedule, so a few people had to leave before the workshop was over or came in late.

brunosmaniotto added the bug and enhancement labels on Mar 1, 2025
