Chapter 10 production polish #142


Merged
merged 31 commits into main on Jan 27, 2023

Conversation

joelostblom
Contributor

@joelostblom joelostblom commented Jan 24, 2023

Phew, this chapter was definitely more work! The details of each change are in the commit messages. The big ones are changing to use list comprehensions instead of for loops, removing unnecessary resampling to speed things up, being consistent in using sample instead of switching to resample, using value_counts instead of manual calculations, and big simplifications to the visualization syntax. I also simplified quite a bit and removed a fair amount of verbose code.

@@ -41,6 +39,7 @@ populations and then introduce two common techniques in statistical inference:
*point estimation* and *interval estimation*.

## Chapter learning objectives

By the end of the chapter, readers will be able to do the following:

* Describe real-world examples of questions that can be answered with statistical inference.
Contributor Author

We don't say anything about sd in the chapter, but it is mentioned in the line below.

-    samples.append(sample)
-samples = pd.concat([samples[i] for i in range(len(samples))])
+samples = pd.concat([airbnb.sample(40).assign(replicate=n) for n in range(20_000)])
Contributor Author

This is what I suggest we use instead of the loop. I considered making the replicate numbers start at 1 as in the R book, but decided it was unnecessary visual noise just to make it more like R.

Contributor

I like it. I also think it's unnecessary to have "1 - 20,000" like in R. 0 - 19,999 is fine.
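For anyone following along, the replacement comprehension can be sketched on toy data. The airbnb data frame and the 20,000-replicate count come from the book; the made-up frame and the reduced replicate count here are just to keep the sketch quick:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the book's airbnb data frame
airbnb = pd.DataFrame(
    {"price": np.random.default_rng(1234).uniform(50, 300, 500)}
)

# Same shape as the replacement in the PR, with 200 replicates instead of
# 20,000 for speed; replicate numbers are 0-based, as discussed above
samples = pd.concat(
    [airbnb.sample(40).assign(replicate=n) for n in range(200)]
)
```

Each element of the list is one 40-row sample tagged with its replicate number, and pd.concat stacks them into a single long data frame.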

@ivanistheone Jan 25, 2023

So the original is:

samples = []
for rep in range(20000):
    sample = airbnb.sample(40)
    sample = sample.assign(replicate=rep)
    samples.append(sample)
samples = pd.concat([samples[i] for i in range(len(samples))])

and the replacement is very good:

samples = pd.concat([airbnb.sample(40).assign(replicate=n) for n in range(20_000)])

but could we maybe split onto two lines as in:

rows = [airbnb.sample(40).assign(replicate=n) for n in range(20_000)]
samples = pd.concat(rows)

While I was thinking about this, I wondered if for loops are not allowed,
then maybe we can simplify the accidental complexity by using a function:

def sample(data, n):
    return data.sample(n)

samples_list = [sample(airbnb, 40).assign(replicate=rep) for rep in range(20000)]
samples = pd.concat(samples_list)

and while we're at it, maybe make the function do the whole infer-like chunk-of-dataframe generation:

def getsample(data, n, rep):
    sample = data.sample(n)
    samplerows = sample.assign(replicate=rep)
    return samplerows

samples = pd.concat(getsample(airbnb, 40, rep) for rep in range(20000))

To be clear, I think the current re-write is fine, just throwing ideas out there for possible simplifications.


Oh yeah, and hello, I'm Ivan. I recently learned about the UBC data science courses and I think y'all are doing a great job. I'm particularly interested in the Python parts (because I'm only a beginner in R). I was following this repo and thought I'd jump in to offer some suggestions.

I'm dealing with similar issues around using Python to teach stats with the modern curriculum (simulation, permutation tests, and bootstrap estimation), while also not trying to teach all of Python (because students don't identify as "coders", so we don't want to alienate them from learning stats).

I'm starting to think learning Python is inevitable, though... It seems to me that it will take longer to teach learner X statistics directly than to offer the same learner a Python tutorial, after which they can learn modern statistics based on computational methods built on Python+pandas+scipy.stats+viz (alt, sns, plotnine).

Contributor Author

Hello @ivanistheone, thank you for taking the time to chime in here. We don't teach functions in this course, so we can't use those suggestions, unfortunately. Regarding breaking the comprehension up into two steps, I think that is a good idea when we first introduce list comprehensions in the previous chapter, which I am going to review now; I'll keep your suggestion in mind if a similar situation occurs there (I also added a sentence to clarify here). I wouldn't personally break such a big comprehension into two steps, to avoid storing the data twice, but I think it is good in a toy example for educational purposes.

Thanks again for your comments and kind words about the UBC data science courses!

Contributor

+1 -- thanks for your input @ivanistheone !

Comment on lines 350 to 362
-samples.query("room_type == 'Entire home/apt'")
-.groupby("replicate")["room_type"]
-.count()
-.reset_index()
-.rename(columns={"room_type": "counts"})
+samples
+.groupby('replicate')
+['room_type']
+.value_counts(normalize=True)
+.reset_index(name='sample_proportion')
+.query('room_type=="Entire home/apt"')
 )

-# calculate the proportion
-sample_estimates["sample_proportion"] = sample_estimates["counts"] / 40
-
-# drop the count column
-sample_estimates = sample_estimates.drop(columns=["counts"])

Contributor Author

While I find that this section is cleaner than it used to be, I wish it could be even simpler. Especially the index collision and the need to use name= are unfortunate, since I don't think we mention that in the wrangling chapter, so we might want to add a sentence if we stick with it here.
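A toy sketch of the chain being discussed (the data frame is made up; the method chain mirrors the one in the diff):

```python
import pandas as pd

df = pd.DataFrame({
    "replicate": [0, 0, 1, 1],
    "room_type": ["Entire home/apt", "Private room",
                  "Entire home/apt", "Entire home/apt"],
})

# value_counts(normalize=True) gives per-replicate proportions directly;
# because 'room_type' is both the Series name and an index level, the
# reset_index call needs name= (the collision mentioned above) to label
# the values column without clashing
props = (
    df.groupby("replicate")["room_type"]
    .value_counts(normalize=True)
    .reset_index(name="sample_proportion")
    .query('room_type == "Entire home/apt"')
)
```

Here replicate 0 has one of two rows matching (proportion 0.5) and replicate 1 has both (proportion 1.0).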

Contributor

yep see my earlier comment

Contributor Author

I couldn't find a good way to do this that didn't include something new, such as value_counts on a data frame instead of a series, or using as_index=False with groupby (both would also require an extra rename step). The current behavior is listed as a bug with a pending PR in pandas, so in 2.0 it will work without the name= in the way we have written it now.

y=alt.value(-10),
text=alt.datum(f'97.5th percentile ({ci_bounds[0.975].round(1)})')
),
width=500
)
```

Contributor Author

The additional resources at the end of the chapter are all R books.

Contributor

@trevorcampbell trevorcampbell left a comment

Very nice! Minor changes requested.

Also, one more thing that I couldn't point out in my review: at the end of the chapter, in the Exercises section, make sure to fix the "worksheets repository" link to point to the Python stuff. You can get the URL from any of the earlier chapters I edited.


['room_type']
.value_counts(normalize=True)
.reset_index(name='sample_proportion')
.query('room_type=="Entire home/apt"')
Contributor

I don't think we use query much anywhere in the book, and we only really briefly mention it in wrangling. I would prefer one of our more common ways of filtering here if possible
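For reference, the query call in the diff and the boolean-mask style used more often in the book select the same rows (toy data frame for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
    "price": [150, 80, 200],
})

# query-style filtering, as written in the diff
via_query = df.query('room_type == "Entire home/apt"')

# boolean-mask filtering, the book's more common style
via_mask = df[df["room_type"] == "Entire home/apt"]
```

Both return the two "Entire home/apt" rows, so swapping one for the other is purely a consistency choice.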

.groupby('replicate')
['room_type']
.value_counts(normalize=True)
.reset_index(name='sample_proportion')
Contributor

This use of reset_index needs to be explained. It would be nice if it were not used at all, but for a first version I'm OK with it as long as it's described well
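For context, a minimal sketch (made-up data) of what reset_index does after a groupby aggregation: it promotes the group labels from the index back to an ordinary column.

```python
import pandas as pd

df = pd.DataFrame({"replicate": [0, 0, 1], "price": [100.0, 200.0, 300.0]})

# The grouped mean is a Series indexed by replicate ...
means = df.groupby("replicate")["price"].mean()

# ... and reset_index turns it back into a data frame with
# 'replicate' as a regular column alongside the means
tidy = means.reset_index()
```

Without the reset_index, later steps that expect a data frame with a replicate column (plotting, renaming) would not work.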


glue("sample_propotion_center", round(sample_estimates["sample_proportion"].mean(), 1))
glue("sample_propotion_min", round(sample_estimates["sample_proportion"].min(), 1))
glue("sample_propotion_max", round(sample_estimates["sample_proportion"].max(), 1))
glue("sample_proportion_center", round(sample_estimates["sample_proportion"].mean(), 1))
Contributor

I think the rounding here is too aggressive -- it says the plot is "centered around 0.7" but to me it looks more like 0.78 or so.

Also I would say not to use the min/max here for the upper/lower bound -- I would say like 5th / 95th or 10th / 90th percentile would be better
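To illustrate both suggestions on toy numbers (the values here are made up): rounding to two decimals keeps a center like 0.78 from collapsing to 0.8, and quantile gives percentile bounds instead of min/max.

```python
import pandas as pd

# Hypothetical sample proportions
props = pd.Series([0.70, 0.75, 0.78, 0.80, 0.85])

# Two decimals preserves the center instead of over-rounding to 0.8
center = round(props.mean(), 2)

# 5th / 95th percentiles are less sensitive to extremes than min / max
lower, upper = props.quantile([0.05, 0.95])
```

With the default linear interpolation, the percentile bounds sit just inside the min and max rather than on them.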

@@ -1276,7 +1024,7 @@ our point estimate to behave if we took another sample.

```{code-cell} ipython3
boot20000_means = boot20000.groupby("replicate")["price"].mean().reset_index().rename(
Contributor

probably worth explaining this line more carefully (and formatting it better)

Contributor Author

Agree, I reformatted this and added an explanation in the previous code cell which is where this occurs for the first time.

Contributor Author

@joelostblom joelostblom left a comment

@trevorcampbell Thanks for the comments, ready for another look!



Contributor

@trevorcampbell trevorcampbell left a comment

One remaining bugfix needed

.groupby("replicate")
["price"]
.mean()
.rename(columns={"price": "sample_mean"})
Contributor

@joelostblom I'm seeing a stack trace here and another nearby (which then causes more downstream errors)

Please fix + verify and let me know!

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[39], line 2
      1 (
----> 2     six_bootstrap_samples
      3     .groupby("replicate")
      4     ["price"]
      5     .mean()
      6     .rename(columns={"price": "sample_mean"})
      7     .reset_index()
      8 )

TypeError: Series.rename() got an unexpected keyword argument 'columns'
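The traceback comes from calling rename(columns=...) on the Series that groupby(...).mean() returns; Series.rename has no columns= argument. A minimal sketch of the working order of operations on a toy data frame puts reset_index before the rename:

```python
import pandas as pd

df = pd.DataFrame({"replicate": [0, 0, 1], "price": [10.0, 20.0, 30.0]})

# reset_index first converts the grouped Series into a DataFrame,
# so rename(columns=...) is then valid
boot_means = (
    df.groupby("replicate")["price"]
    .mean()
    .reset_index()
    .rename(columns={"price": "sample_mean"})
)
```

Swapping the last two steps back into this order is exactly the kind of fix that resolves the error above.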

Contributor

(aside from that the PR looks good!)

Contributor Author

Whoops, I swapped the last two lines and forgot to swap them back; fixed.

@trevorcampbell trevorcampbell merged commit 8289d51 into main Jan 27, 2023