Chapter 9 production polish #141


Merged: 18 commits into main on Jan 27, 2023

Conversation

joelostblom
Contributor

Not too much in this chapter. Are we changing ggplot plots to altair or leaving as is? We could make 9.10 an interactive figure instead of faceted to make it a bit less busy if we want, but I think it is good as is too.

Comment on lines 691 to 694
ks = range(1, 10)
for k in ks:
    # Save the computed inertia for each k
    inertias.append(KMeans(n_clusters=k).fit(standardized_data).inertia_)
Contributor Author

I think this is more straightforward than the current approach with apply. Since we are already teaching for loops in chapter 10, I thought it made sense to have one here too. We can of course discuss removing them in both places if we don't think we need them.

Contributor

See my other comment, but yeah, unfortunately we cannot cover for loops.

Comment on lines 734 to 749
However,
it is possible to have an elbow plot
where the WSSD increases at one of the steps,
causing a small bump in the line.
This is because K-means can get "stuck" in a bad solution
as we mentioned earlier in the chapter.

> **Note:** It is rare that the KMeans function from scikit-learn
> gets stuck in a bad solution,
> because the selection of the centroid starting points
> is optimized to prevent this from happening.
> If you still find yourself in a situation where you have a bump in the elbow plot,
> you can increase the `n_init` parameter above the default value of 10
> to try more different starting points for the centroids.
> The larger the value the better from an analysis perspective,
> but there is a trade-off that doing many clusterings could take a long time.
Contributor Author

Since the default in sklearn is that everything works as expected, I thought it was better to move this to a note instead of starting with an odd formulation of the params that we wouldn't actually use often.

Contributor

> Since the default in sklearn is that everything works as expected

I don't know what that means. Kmeans in sklearn is initialized using Kmeans++, which is not bullet-proof -- you can still end up in local optima. It's just less likely. But it's still important to explain to students (in simpler terms) that Kmeans is a local optimizer: it doesn't find the global optimum and can output different answers each time.

There's no need to replicate the R book exactly (in terms of code), though

Honestly it might be neat to cover Kmeans++ in an optional section (but that's for the future -- no need to do it now. Just open an issue to remind us about the idea later.)
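
For illustration, the initialization being discussed is controlled by the `init` and `n_init` arguments of scikit-learn's `KMeans`; a rough sketch (not code from the chapter, reusing its `standardized_data`):

```python
from sklearn.cluster import KMeans

# Default: k-means++ picks spread-out starting centroids,
# which makes a bad local optimum less likely (but still possible)
km_plus = KMeans(n_clusters=3, init="k-means++", n_init=10).fit(standardized_data)

# Plain random initialization with a single run is more prone to getting stuck
km_rand = KMeans(n_clusters=3, init="random", n_init=1).fit(standardized_data)

# Lower total WSSD (inertia) indicates a better clustering was found
km_plus.inertia_, km_rand.inertia_
```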

Contributor Author

The point I am trying to make is that with the default sklearn KMeans parameters (Kmeans++ initialization and 10 runs), it is very unlikely to get stuck in a local minimum. In fact, I tried seed-hacking with the first 10,000 seeds and not a single one resulted in a bump on the chart.

I agree with you that it is important to mention that K-means is a local optimizer, but the example used was contrived and not a realistic scenario of parameter combinations that the students are ever likely to use. That is why I removed the example and still explained the signs of sub-optimal local optimization in the added text.

Of note, scikit-learn is changing the default number of runs for kmeans++ to 1 in version 1.4 (I ran with n_init=1 through the first 10,000 seeds here as well, and still only 3.5% of cases resulted in a bump on the chart, but possibly when we upgrade to 1.4 we could seed-hack our way to a bump).


Code for reference:

import pandas as pd
from sklearn.cluster import KMeans

# Count how many of the first 10,000 seeds produce a bump in the elbow plot,
# i.e. an increase in inertia (total WSSD) at some step as k grows
sum([
    (pd.Series([
        KMeans(n_clusters=k, random_state=seed, n_init=1).fit(standardized_data).inertia_
        for k in range(1, 10)
    ]).diff() > 0).any()
    for seed in range(10_000)
])

Contributor

@trevorcampbell left a comment

Looks pretty good! Some changes requested

- small flipper length and small bill length (<font color="#D55E00">orange cluster</font>),
- small flipper length and large bill length (<font color="#0072B2">blue cluster</font>).
- and large flipper length and large bill length (<font color="#F0E442">yellow cluster</font>).
- small flipper length and small bill length (<font color="#4c78a8">blue cluster</font>),
Contributor

I don't think these labels are correct -- the randomness changed.

Contributor Author

Good catch! Forgot to update them after changing this plot to be consistent with the clustering plot further down.

scaler = StandardScaler()
standardized_data = pd.DataFrame(
    scaler.fit_transform(not_standardized_data),
    columns=['bill_length_mm', 'flipper_length_mm']
)
Contributor

See the new chapters 5, 6, 7, and 8 and try to make the use of StandardScaler / preprocessing follow the same pattern as there.

We use make_column_transformer there, if I recall.
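
For reference, a minimal sketch of that pattern (the column and variable names are carried over from the snippet above; the exact code in chapters 5-8 may differ):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Standardize the two numeric columns used for clustering
preprocessor = make_column_transformer(
    (StandardScaler(), ["bill_length_mm", "flipper_length_mm"]),
)
standardized_data = pd.DataFrame(
    preprocessor.fit_transform(not_standardized_data),
    columns=["bill_length_mm", "flipper_length_mm"],
)
```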

predictions = penguin_clust.predict(standardized_data)
predictions

labels = penguin_clust.labels_
Contributor

This does not print -- we should make it print; otherwise it's not clear what it is.

we can find the best value for K
by finding where the "elbow" occurs in the plot of total WSSD versus the number of clusters.
The total WSSD is stored in the `.inertia_` attribute
of the clustering object ("inertia" is another term for WSSD).
Contributor

I have never heard anyone use the name "inertia" -- so I might change this wording to say something like "this is what scikit-learn calls wssd"
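
For context, a minimal sketch of where that attribute lives (reusing the `standardized_data` and three-cluster fit from the chapter; not the book's actual code):

```python
from sklearn.cluster import KMeans

penguin_clust = KMeans(n_clusters=3).fit(standardized_data)
# Total within-cluster sum of squared distances;
# "inertia" is scikit-learn's name for this quantity
penguin_clust.inertia_
```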

import numpy as np
penguin_clust_ks = pd.DataFrame({"k": np.array(range(1, 10)).transpose()})
numbers = range(1, 10)
for number in numbers:
Contributor

We don't teach for loops in this class (in fact, we can't! We can talk offline about it if you want).

But to your point in the conversation thread: unfortunately we have to remove for loops from both chapters.

Try to find a reasonable way to accomplish the same thing without loops.

Contributor Author

Changed to a list comprehension as in chapter 10. I think it is even more succinct and easier to understand than for loops here!
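
Presumably something along these lines (a sketch only, reusing the chapter's `standardized_data`; the actual code and column names in the book may differ):

```python
import pandas as pd
from sklearn.cluster import KMeans

ks = range(1, 10)
# Total WSSD (inertia) for each candidate number of clusters, without a for loop
penguin_clust_ks = pd.DataFrame({
    "k": ks,
    "inertia": [KMeans(n_clusters=k).fit(standardized_data).inertia_ for k in ks],
})
```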

@@ -785,61 +726,27 @@ A plot showing the total WSSD versus the number of clusters.
```{index} K-means; init argument
```

It looks like 3 clusters is the right choice for this data.
But why is there a "bump" in the total WSSD plot here?
Contributor

We should keep this example... I'm not sure why it was removed here

You can feel free to seed-hack (in a hidden cell) as much as you like to recreate the example

Contributor Author

See my reply to your next comment

@trevorcampbell
Contributor

> Are we changing ggplot plots to altair or leaving as is?

I think it's OK to leave as-is for now, but at some point we should re-make them in altair. Open an issue for it please :)

> We could make 9.10 an interactive figure instead of faceted to make it a bit less busy if we want, but I think it is good as is too.

No interactive figs -- this is most likely going into a hardcopy book at some point, so we want to keep things static as much as possible.

Contributor Author

@joelostblom left a comment

@trevorcampbell Thanks for the comments, ready for a second look!

@trevorcampbell self-requested a review on January 27, 2023, 19:32
Contributor

@trevorcampbell left a comment

I'm still not 100% happy with removing the bump example -- I don't buy your arguments regarding kmeans++. If it can happen 3.5% of the time when initializing once, that's absolutely frequent enough that I'd want to cover it. Even just keeping this in sync with the R version is enough of a reason for me.

But it's not worth delaying the PR for it. I may go in and edit it later on a second global pass through the book.

@trevorcampbell merged commit 9255d54 into main on Jan 27, 2023