A few issues encountered when teaching #71

brunosmaniotto opened this issue Mar 1, 2025 · 0 comments

Notebook 1 (regression):

  1. data.corr() needs to be replaced by data.corr(numeric_only = True)
  2. mean_squared_error(y_true,y_pred,squared = False) needs to be replaced by root_mean_squared_error. This includes importing the function, and replacing all instances.
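
For reference, a minimal sketch of both Notebook 1 fixes. The toy DataFrame and its column names are hypothetical, not the workshop's dataset. As far as I know, `root_mean_squared_error` was added in scikit-learn 1.4, so the sketch falls back to a square root of `mean_squared_error` on older installs.

```python
import numpy as np
import pandas as pd

try:
    # Available in scikit-learn 1.4 and later.
    from sklearn.metrics import root_mean_squared_error
except ImportError:
    # Fallback for older installs: same quantity, computed by hand.
    from sklearn.metrics import mean_squared_error

    def root_mean_squared_error(y_true, y_pred):
        return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Hypothetical toy data with one non-numeric column.
data = pd.DataFrame({
    "horsepower": [130.0, 165.0, 150.0, 140.0],
    "mpg": [18.0, 15.0, 18.0, 16.0],
    "name": ["a", "b", "c", "d"],
})

# numeric_only=True skips the string column instead of raising an error.
corr = data.corr(numeric_only=True)
print(corr)

# RMSE without the old squared=False argument.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])
rmse = root_mean_squared_error(y_true, y_pred)
print(rmse)
```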

Notebook 2 (regularization):

  1. In the part "Low Number of Samples: The most common scenario where you might overfit is when you have many features, but not many samples. Adding the penalty term stabilizes the model in these scenarios. There's not a great intuition for this without diving into the math, so you can just take it at face value." I think it could be added here that part of the intuition is that the variance of the estimates is higher. Something like this: "The idea here is that when sample sizes are low, there is a higher probability of drawing an unrepresentative sample, and thus the parameter estimates can fluctuate more from sample to sample. By applying regularization, we prevent the model from fitting too closely to any anomalies, making it more robust."

  2. mean_squared_error(y_true,y_pred,squared = False) needs to be replaced by root_mean_squared_error. This includes importing the function, and replacing all instances.
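
The variance intuition in item 1 could be illustrated with a small simulation. This is only a sketch under assumptions I made up: the sample size, number of features, noise level, and penalty `alpha` are all hypothetical, and ridge is computed with its closed-form solution in NumPy rather than with the notebook's own code. It refits both models on many small samples and compares how much the coefficient estimates move around.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: few samples relative to the number of features.
n_samples, n_features, n_repeats = 15, 10, 500
alpha = 5.0  # hypothetical penalty strength
true_coef = np.ones(n_features)

ols_coefs = []
ridge_coefs = []
for _ in range(n_repeats):
    # Draw a fresh small sample each time.
    X = rng.normal(size=(n_samples, n_features))
    y = X @ true_coef + rng.normal(scale=2.0, size=n_samples)

    # Ordinary least squares (no penalty, no intercept).
    ols = np.linalg.lstsq(X, y, rcond=None)[0]
    # Ridge closed form: (X'X + alpha * I)^-1 X'y.
    ridge = np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

    ols_coefs.append(ols)
    ridge_coefs.append(ridge)

# Average standard deviation of each coefficient across the repeated samples.
ols_spread = np.std(ols_coefs, axis=0).mean()
ridge_spread = np.std(ridge_coefs, axis=0).mean()
print(f"OLS coefficient spread:   {ols_spread:.3f}")
print(f"Ridge coefficient spread: {ridge_spread:.3f}")
```

The ridge estimates should fluctuate less across samples, at the cost of being shrunk toward zero.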

Notebook 3 (preprocessing):

  1. OneHotEncoder(categories='auto', drop='first', sparse=False) should be replaced by OneHotEncoder(categories='auto', drop='first', sparse_output=False)
  2. I think it would be useful to state at some point that the goal is to predict the species of the penguins. That's why we drop the "species" column and don't preprocess it.
  3. "Island where the penguin was found [Torgersen, Biscoe]" should be "Island where the penguin was found [Torgersen, Biscoe, Dream]"
  4. For the solutions, I think it would be helpful to discuss data leakage a little bit more. For example: "One example would be imputing missing values with the mean of the entire dataset, rather than just the mean of the training dataset. By including the test set, we are using information from the test data to create the estimate. As a result, our estimate is tailored to this particular dataset, and we overestimate how well the model generalizes to new data."
  5. Also, there are no solutions for the bonus questions or Challenge 3. I'm not sure whether that's on purpose.

Notebook 4 (classification):

  1. All instances of float.round(3) should be replaced by round(float,3)
  2. I'm getting different values for the solution of Challenge 1.
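
A small illustration of item 1. A plain Python `float` has no `.round()` method, so the method call only works on NumPy scalars; my guess (not confirmed) is that some function in the notebook now returns a built-in float. The value below is made up.

```python
# Hypothetical metric value stored as a built-in Python float.
score = 0.87654321

# Built-in floats do not expose a .round() method.
has_round_method = hasattr(score, "round")
print(has_round_method)

# The built-in round() function works on plain floats.
rounded = round(score, 3)
print(rounded)
```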

Overall: I think this workshop could be divided into four parts of 1 to 1.5 hours each. The problem isn't the total time needed to teach the workshop, but that it is too much information for just two meetings. We talked with students and they thought the same, especially about the first day. Also, 3 consecutive hours are hard to fit into a schedule, so a few people had to leave before the workshop was over, or came in late.
