Chapter 9 production polish #141
Conversation
source/clustering.md (Outdated)
```python
ks = range(1, 10)
for k in ks:
    # Save the computed inertia for each k
    inertias.append(KMeans(n_clusters=k).fit(standardized_data).inertia_)
```
I think this is more straightforward than the current approach with `apply`. Since we are already teaching for loops in chapter 10, I thought it made sense to have one here too. We can of course discuss removing them from both places if we decide we don't need them.
see my other comment, but yeah unfortunately we cannot cover for loops
source/clustering.md (Outdated)
```markdown
However,
it is possible to have an elbow plot
where the WSSD increases at one of the steps,
causing a small bump in the line.
This is because K-means can get "stuck" in a bad solution
as we mentioned earlier in the chapter.

> **Note:** It is rare that the KMeans function from scikit-learn
> gets stuck in a bad solution,
> because the selection of the centroid starting points
> is optimized to prevent this from happening.
> If you still find yourself in a situation where you have a bump in the elbow plot,
> you can increase the `n_init` parameter above the default value of 10
> to try more different starting points for the centroids.
> The larger the value the better from an analysis perspective,
> but there is a trade-off that doing many clusterings could take a long time.
```
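For concreteness, a minimal sketch of what the quoted note suggests (assuming the chapter's `standardized_data` and three clusters; the variable names are illustrative):

```python
from sklearn.cluster import KMeans

# Run K-means with more random restarts than the default of 10;
# scikit-learn keeps the run with the lowest inertia (total WSSD).
penguin_clust = KMeans(n_clusters=3, n_init=50).fit(standardized_data)
penguin_clust.inertia_  # total WSSD of the best of the 50 runs
```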
Since the default in sklearn is that everything works as expected, I thought it was better to move this to a note instead of leading with an odd parameter configuration that we wouldn't actually use often.
> Since the default in sklearn is that everything works as expected

I don't know what that means. Kmeans in sklearn is initialized using Kmeans++, which is not bullet-proof -- you can still end up in local optima; it's just less likely. But it's still important to explain to students (in simpler terms) that Kmeans is a local optimizer: it doesn't find the global optimum and can output different answers each time.
There's no need to replicate the R book exactly (in terms of code), though.
Honestly it might be neat to cover Kmeans++ in an optional section (but that's for the future -- no need to do it now. Just open an issue to remind us about the idea later.)
The point I am trying to make is that with sklearn's default KMeans parameters of Kmeans++ initialization and 10 runs, it is very unlikely to get stuck in a local minimum. In fact, I tried seed-hacking with the first 10,000 seeds and not a single one resulted in a bump on the chart.
I agree with you that it is important to mention that K-means is a local optimizer, but the example used was contrived and not a realistic scenario of parameter combinations that the students are ever likely to use. That is why I removed the example and still explained the signs of a sub-optimal local solution in the added text.
Of note, scikit-learn is changing the default number of runs for kmeans++ to 1 in version 1.4 (I ran with `n_init=1` through the first 10,000 seeds here as well and only 3.5% of cases resulted in a bump on the chart, but possibly when we upgrade to 1.4 we could seed-hack our way to a bump):
Code for reference:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Count how many of the first 10,000 seeds produce a "bump" in the elbow plot,
# i.e. an increase in inertia (total WSSD) when going from k to k + 1.
# standardized_data is the scaled penguin data from the chapter.
sum([
    (pd.Series([
        KMeans(n_clusters=k, random_state=seed, n_init=1).fit(standardized_data).inertia_
        for k in range(1, 10)
    ]).diff() > 0).any()
    for seed in range(10_000)
])
```
Looks pretty good! Some changes requested
source/clustering.md (Outdated)
- small flipper length and small bill length (<font color="#D55E00">orange cluster</font>), | ||
- small flipper length and large bill length (<font color="#0072B2">blue cluster</font>). | ||
- and large flipper length and large bill length (<font color="#F0E442">yellow cluster</font>). | ||
- small flipper length and small bill length (<font color="#4c78a8">blue cluster</font>), |
I don't think these labels are correct -- the randomness changed.
Good catch! Forgot to update them after changing this plot to be consistent with the clustering plot further down.
source/clustering.md (Outdated)
```python
scaler = StandardScaler()
standardized_data = pd.DataFrame(
    scaler.fit_transform(not_standardized_data), columns = ['bill_length_mm', 'flipper_length_mm'])
    scaler.fit_transform(not_standardized_data),
```
See the new chapters 5, 6, 7, and 8 and try to make the use of `StandardScaler` / preprocessing follow the same pattern as there.
We use `make_column_transformer` there, if I recall.
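For reference, a rough sketch of what following that pattern might look like here (column names taken from the snippet above; `penguins` is a hypothetical input data frame):

```python
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Standardize only the two columns used for clustering,
# mirroring the preprocessing pattern from the earlier chapters.
preprocessor = make_column_transformer(
    (StandardScaler(), ["bill_length_mm", "flipper_length_mm"]),
)
standardized_data = preprocessor.fit_transform(penguins)
```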
```python
predictions = penguin_clust.predict(standardized_data)
predictions

labels = penguin_clust.labels_
```
This does not print -- we should make it print; otherwise it's not clear what it is.
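For example (a sketch; either echoing the variable as the last expression of the cell or calling `print` would work):

```python
labels = penguin_clust.labels_
labels  # as the last expression in a notebook cell, this displays the cluster labels
```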
source/clustering.md (Outdated)
```markdown
we can find the best value for K
by finding where the "elbow" occurs in the plot of total WSSD versus the number of clusters.
The total WSSD is stored in the `.inertia_` attribute
of the clustering object ("inertia" is another term for WSSD).
```
I have never heard anyone use the name "inertia" -- so I might change this wording to say something like "this is what scikit-learn calls wssd"
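For context, a small sketch showing that the two names refer to the same quantity (assuming `standardized_data` from the chapter):

```python
import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3).fit(standardized_data)
# "Inertia" is just the total within-cluster sum of squared distances (WSSD):
wssd = np.sum((np.asarray(standardized_data) - km.cluster_centers_[km.labels_]) ** 2)
# wssd matches km.inertia_ up to floating point error
```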
source/clustering.md (Outdated)
```python
import numpy as np
penguin_clust_ks = pd.DataFrame({"k": np.array(range(1, 10)).transpose()})
numbers = range(1, 10)
for number in numbers:
```
We don't teach for loops in this class (in fact, we can't! we can talk offline about it if you want).
But to your point in the conversation thread: unfortunately we have to remove for loops from both chapters.
Try to find a reasonable way to accomplish the same thing without loops.
Changed to a list comprehension as in chapter 10. I think it is even more succinct and easier to understand than the for loops here!
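A sketch of the loop-free version being discussed (assuming `standardized_data` from the chapter; the column names are illustrative):

```python
import pandas as pd
from sklearn.cluster import KMeans

ks = range(1, 10)
# Build the elbow-plot data with a list comprehension instead of a for loop:
# one total WSSD (inertia) per candidate number of clusters.
penguin_clust_ks = pd.DataFrame({
    "k": ks,
    "inertia": [KMeans(n_clusters=k).fit(standardized_data).inertia_ for k in ks],
})
```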
````text
@@ -785,61 +726,27 @@ A plot showing the total WSSD versus the number of clusters.
```{index} K-means; init argument
```

It looks like 3 clusters is the right choice for this data.
But why is there a "bump" in the total WSSD plot here?
````
We should keep this example... I'm not sure why it was removed here
You can feel free to seed-hack (in a hidden cell) as much as you like to recreate the example
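For what it's worth, a rough sketch of what such a hidden seed-hacking cell might look like (assuming `standardized_data` from the chapter; it just searches for a seed whose elbow curve has a bump):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Find a random_state whose elbow curve shows a "bump",
# i.e. the total WSSD increases at some step, to reproduce the example.
for seed in range(10_000):
    wssds = pd.Series([
        KMeans(n_clusters=k, random_state=seed, n_init=1).fit(standardized_data).inertia_
        for k in range(1, 10)
    ])
    if (wssds.diff() > 0).any():
        print("Seed with a bump in the elbow plot:", seed)
        break
```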
See my reply to your next comment
I think it's OK to leave as-is for now, but at some point we should re-make them in altair. Open an issue for it please :)
No interactive figs -- this is going to go in a hardcopy book at some point most likely, so we want to keep things static as much as possible.
@trevorcampbell Thanks for the comments, ready for a second look!
I'm still not 100% happy with removing the bump example -- I don't buy your arguments regarding kmeans++. If it can happen 3.5% of the time when initializing once, that's absolutely frequent enough that I'd want to cover it. Even just keeping this in sync with the R version is enough of a reason for me.
But it's not worth delaying the PR for it. I may go in and edit it later on a second global pass through the book.
Not too much in this chapter. Are we changing the ggplot plots to altair or leaving them as is? We could make 9.10 an interactive figure instead of faceted to make it a bit less busy if we want, but I think it is good as is too.