Chapter 9 production polish #141


Merged: 18 commits into main on Jan 27, 2023

Conversation

joelostblom
Contributor

Not too much in this chapter. Are we changing ggplot plots to altair or leaving as is? We could make 9.10 an interactive figure instead of faceted to make it a bit less busy if we want, but I think it is good as is too.

Comment on lines 691 to 694
ks = range(1, 10)
for k in ks:
    # Save the computed inertia for each k
    inertias.append(KMeans(n_clusters=k).fit(standardized_data).inertia_)
Contributor Author

I think this is more straightforward than the current approach with apply. Since we are already teaching for loops in chapter 10, I thought it made sense to have one here too. We can of course discuss removing them in both places if we don't think we need them.

Contributor

See my other comment, but yeah, unfortunately we cannot cover for loops.

Comment on lines 734 to 749
However,
it is possible to have an elbow plot
where the WSSD increases at one of the steps,
causing a small bump in the line.
This is because K-means can get "stuck" in a bad solution
as we mentioned earlier in the chapter.

> **Note:** It is rare that the KMeans function from scikit-learn
> gets stuck in a bad solution,
> because the selection of the centroid starting points
> is optimized to prevent this from happening.
> If you still find yourself in a situation where you have a bump in the elbow plot,
> you can increase the `n_init` parameter above the default value of 10
> to try more different starting points for the centroids.
> The larger the value the better from an analysis perspective,
> but there is a trade-off that doing many clusterings could take a long time.
Contributor Author

Since the default in sklearn is that everything works as expected, I thought it was better to move this to a note instead of starting with an odd formulation of the params that we wouldn't actually use often.

Contributor

> Since the default in sklearn is that everything works as expected

I don't know what that means. Kmeans in sklearn is initialized using Kmeans++, which is not bullet-proof -- you can still end up in local optima. It's just less likely. But it's still important to explain to students (in simpler terms) that Kmeans is a local optimizer: it doesn't find the global optimum and can output different answers each time.

There's no need to replicate the R book exactly (in terms of code), though

Honestly it might be neat to cover Kmeans++ in an optional section (but that's for the future -- no need to do it now. Just open an issue to remind us about the idea later.)
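
For illustration, the initialization being discussed is controlled by the `init` and `n_init` arguments of scikit-learn's `KMeans`; a rough sketch (not code from the chapter, reusing its `standardized_data`):

```python
from sklearn.cluster import KMeans

# Default: k-means++ picks spread-out starting centroids,
# which makes a bad local optimum less likely (but still possible)
km_plus = KMeans(n_clusters=3, init="k-means++", n_init=10).fit(standardized_data)

# Plain random initialization with a single run is more prone to getting stuck
km_rand = KMeans(n_clusters=3, init="random", n_init=1).fit(standardized_data)

# Lower total WSSD (inertia) indicates a better clustering was found
km_plus.inertia_, km_rand.inertia_
```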

Contributor Author

The point I am trying to make is that with the default sklearn KMeans parameters (Kmeans++ initialization and 10 runs), it is very unlikely to get stuck in a local minimum. In fact, I tried seed-hacking with the first 10,000 seeds and not a single one resulted in a bump on the chart.

I agree with you that it is important to mention that K-means is a local optimizer, but the example used was contrived and not a realistic scenario of parameter combinations that the students are ever likely to use. That is why I removed the example and still explained the signs of sub-optimal local optimization in the added text.

Of note, scikit-learn is changing the default number of runs for kmeans++ to 1 in version 1.4 (I ran with n_init=1 through the first 10,000 seeds here as well, and still only 3.5% of cases resulted in a bump on the chart, but possibly when we upgrade to 1.4 we could seed-hack our way to a bump).


Code for reference:

import pandas as pd
from sklearn.cluster import KMeans

# Count how many of the first 10,000 seeds produce a bump in the elbow plot,
# i.e. an increase in inertia (total WSSD) at some step as k grows
sum([
    (pd.Series([
        KMeans(n_clusters=k, random_state=seed, n_init=1).fit(standardized_data).inertia_
        for k in range(1, 10)
    ]).diff() > 0).any()
    for seed in range(10_000)
])

Contributor

@trevorcampbell left a comment

Looks pretty good! Some changes requested

- small flipper length and small bill length (<font color="#D55E00">orange cluster</font>),
- small flipper length and large bill length (<font color="#0072B2">blue cluster</font>).
- and large flipper length and large bill length (<font color="#F0E442">yellow cluster</font>).
- small flipper length and small bill length (<font color="#4c78a8">blue cluster</font>),
Contributor

I don't think these labels are correct -- the randomness changed.

Contributor Author

Good catch! Forgot to update them after changing this plot to be consistent with the clustering plot further down.

scaler = StandardScaler()
standardized_data = pd.DataFrame(
    scaler.fit_transform(not_standardized_data),
    columns=['bill_length_mm', 'flipper_length_mm']
)
Contributor

See the new chapters 5, 6, 7, and 8 and try to make the use of StandardScaler / preprocessing follow the same pattern as there.

We use make_column_transformer there, if I recall.
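
For reference, a minimal sketch of that pattern (the column and variable names are carried over from the snippet above; the exact code in chapters 5-8 may differ):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Standardize the two numeric columns used for clustering
preprocessor = make_column_transformer(
    (StandardScaler(), ["bill_length_mm", "flipper_length_mm"]),
)
standardized_data = pd.DataFrame(
    preprocessor.fit_transform(not_standardized_data),
    columns=["bill_length_mm", "flipper_length_mm"],
)
```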

predictions = penguin_clust.predict(standardized_data)
predictions

labels = penguin_clust.labels_
Contributor

This does not print -- we should make it print; otherwise it's not clear what it is.

we can find the best value for K
by finding where the "elbow" occurs in the plot of total WSSD versus the number of clusters.
The total WSSD is stored in the `.inertia_` attribute
of the clustering object ("inertia" is another term for WSSD).
Contributor

I have never heard anyone use the name "inertia" -- so I might change this wording to say something like "this is what scikit-learn calls wssd"
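
For context, a minimal sketch of where that attribute lives (reusing the `standardized_data` and three-cluster fit from the chapter; not the book's actual code):

```python
from sklearn.cluster import KMeans

penguin_clust = KMeans(n_clusters=3).fit(standardized_data)
# Total within-cluster sum of squared distances;
# "inertia" is scikit-learn's name for this quantity
penguin_clust.inertia_
```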

import numpy as np
penguin_clust_ks = pd.DataFrame({"k": np.array(range(1, 10)).transpose()})
numbers = range(1, 10)
for number in numbers:
Contributor

We don't teach for loops in this class (in fact, we can't! We can talk offline about it if you want).

But to your point in the conversation thread: unfortunately we have to remove for loops from both chapters.

Try to find a reasonable way to accomplish the same thing without loops.

Contributor Author

Changed to a list comprehension as in chapter 10. I think it is even more succinct and easier to understand than for loops here!
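
Presumably something along these lines (a sketch only, reusing the chapter's `standardized_data`; the actual code and column names in the book may differ):

```python
import pandas as pd
from sklearn.cluster import KMeans

ks = range(1, 10)
# Total WSSD (inertia) for each candidate number of clusters, without a for loop
penguin_clust_ks = pd.DataFrame({
    "k": ks,
    "inertia": [KMeans(n_clusters=k).fit(standardized_data).inertia_ for k in ks],
})
```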

@@ -785,61 +726,27 @@ A plot showing the total WSSD versus the number of clusters.
```{index} K-means; init argument
```

It looks like 3 clusters is the right choice for this data.
But why is there a "bump" in the total WSSD plot here?
Contributor

We should keep this example... I'm not sure why it was removed here

You can feel free to seed-hack (in a hidden cell) as much as you like to recreate the example

Contributor Author

See my reply to your next comment

@trevorcampbell
Contributor

> Are we changing ggplot plots to altair or leaving as is?

I think it's OK to leave as-is for now, but at some point we should re-make them in altair. Open an issue for it please :)

> We could make 9.10 an interactive figure instead of faceted to make it a bit less busy if we want, but I think it is good as is too.

No interactive figs -- this is most likely going into a hardcopy book at some point, so we want to keep things static as much as possible.

Contributor Author

@joelostblom left a comment

@trevorcampbell Thanks for the comments, ready for a second look!

@trevorcampbell self-requested a review on January 27, 2023, 19:32
Contributor

@trevorcampbell left a comment

I'm still not 100% happy with removing the bump example -- I don't buy your arguments regarding kmeans++. If it can happen 3.5% of the time when initializing once, that's absolutely frequent enough that I'd want to cover it. Even just keeping this in sync with the R version is enough of a reason for me.

But it's not worth delaying the PR for it. I may go in and edit it later on a second global pass through the book.

@trevorcampbell merged commit 9255d54 into main on Jan 27, 2023