Chapter 10 production polish #142


Merged
merged 31 commits into main on Jan 27, 2023

Conversation

joelostblom
Contributor

@joelostblom joelostblom commented Jan 24, 2023

Phew, this chapter was definitely more work! The details of each change are in the commit messages. The big ones are changing to use list comprehensions instead of for loops, removing unnecessary resampling to speed things up, being consistent in using sample instead of switching to resample, using value_counts instead of manual calculations, and big simplifications to the visualization syntax. I also simplified quite a bit and removed a fair amount of verbose code.

@@ -41,6 +39,7 @@ populations and then introduce two common techniques in statistical inference:
*point estimation* and *interval estimation*.

## Chapter learning objectives

By the end of the chapter, readers will be able to do the following:

* Describe real-world examples of questions that can be answered with statistical inference.
Contributor Author

We don't say anything about sd in the chapter, but it is mentioned in the line below.

-    samples.append(sample)
-samples = pd.concat([samples[i] for i in range(len(samples))])
+samples = pd.concat([airbnb.sample(40).assign(replicate=n) for n in range(20_000)])
Contributor Author

This is what I suggest we use instead of the loop. I considered making the replicate numbers start at 1 as in the R book, but decided it was unnecessary visual noise just to make it more like R.

Contributor

I like it. I also think it's unnecessary to have "1 - 20,000" like in R. 0 - 19,999 is fine.
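For anyone following along, the replacement comprehension can be sketched on toy data. The airbnb data frame and the 20,000-replicate count come from the book; the made-up frame and the reduced replicate count here are just to keep the sketch quick:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the book's airbnb data frame
airbnb = pd.DataFrame(
    {"price": np.random.default_rng(1234).uniform(50, 300, 500)}
)

# Same shape as the replacement in the PR, with 200 replicates instead of
# 20,000 for speed; replicate numbers are 0-based, as discussed above
samples = pd.concat(
    [airbnb.sample(40).assign(replicate=n) for n in range(200)]
)
```

Each element of the list is one 40-row sample tagged with its replicate number, and pd.concat stacks them into a single long data frame.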

@ivanistheone Jan 25, 2023

So the original is:

samples = []
for rep in range(20000):
    sample = airbnb.sample(40)
    sample = sample.assign(replicate=rep)
    samples.append(sample)
samples = pd.concat([samples[i] for i in range(len(samples))])

and the replacement is very good:

samples = pd.concat([airbnb.sample(40).assign(replicate=n) for n in range(20_000)])

but could we maybe split onto two lines as in:

rows = [airbnb.sample(40).assign(replicate=n) for n in range(20_000)]
samples = pd.concat(rows)

While I was thinking about this, I wondered if for loops are not allowed,
then maybe we can simplify the accidental complexity by using a function:

def sample(data, n):
    return data.sample(n)

samples_list = [sample(airbnb, 40).assign(replicate=rep) for rep in range(20000)]
samples = pd.concat(samples_list)

and while we're at it, maybe make the function do the whole infer-like chunk-of-dataframe generation:

def getsample(data, n, rep):
    sample = data.sample(n)
    samplerows = sample.assign(replicate=rep)
    return samplerows

samples = pd.concat(getsample(airbnb, 40, rep) for rep in range(20000))

To be clear, I think the current re-write is fine, just throwing ideas out there for possible simplifications.


Oh yeah, and hello, I'm Ivan. I recently learned about the UBC data science courses and I think y'all are doing a great job. I'm particularly interested in the Python parts (because I'm only a beginner in R). I was following this repo and thought I'd jump in to offer some suggestions.

I'm dealing with similar issues around using Python to teach stats with the modern curriculum (simulation, permutation tests, and bootstrap estimation), while also not trying to teach all of Python (because students don't identify as "coders", so we don't want to alienate them from learning stats).

I'm starting to think learning Python is inevitable, though... It seems to me that it will take longer to teach learner X statistics directly than to offer the same learner a Python tutorial, after which they can learn modern statistics based on computational methods built on Python+pandas+scipy.stats+viz (alt, sns, plotnine).

Contributor Author

Hello @ivanistheone, thank you for taking the time to chime in here. We don't teach functions in this course, so we can't use those suggestions, unfortunately. Regarding breaking the comprehension up into two steps, I think that is a good idea when we first introduce list comprehensions in the previous chapter, which I am going to review now; I'll keep your suggestion in mind if a similar situation occurs there (I also added a sentence to clarify here). I wouldn't personally break such a big comprehension into two steps, to avoid storing the data twice, but I think it is good in a toy example for educational purposes.

Thanks again for your comments and kind words about the UBC data science courses!

Contributor

+1 -- thanks for your input @ivanistheone !

Comment on lines 350 to 362
-samples.query("room_type == 'Entire home/apt'")
-.groupby("replicate")["room_type"]
-.count()
-.reset_index()
-.rename(columns={"room_type": "counts"})
+samples
+.groupby('replicate')
+['room_type']
+.value_counts(normalize=True)
+.reset_index(name='sample_proportion')
+.query('room_type=="Entire home/apt"')
 )

-# calculate the proportion
-sample_estimates["sample_proportion"] = sample_estimates["counts"] / 40
-
-# drop the count column
-sample_estimates = sample_estimates.drop(columns=["counts"])

Contributor Author

While I find that this section is cleaner than it used to be, I wish it could be even simpler. Especially the index collision and the need to use name= are unfortunate, since I don't think we mention that in the wrangling chapter, so we might want to add a sentence if we stick with it here.
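A toy sketch of the chain being discussed (the data frame is made up; the method chain mirrors the one in the diff):

```python
import pandas as pd

df = pd.DataFrame({
    "replicate": [0, 0, 1, 1],
    "room_type": ["Entire home/apt", "Private room",
                  "Entire home/apt", "Entire home/apt"],
})

# value_counts(normalize=True) gives per-replicate proportions directly;
# because 'room_type' is both the Series name and an index level, the
# reset_index call needs name= (the collision mentioned above) to label
# the values column without clashing
props = (
    df.groupby("replicate")["room_type"]
    .value_counts(normalize=True)
    .reset_index(name="sample_proportion")
    .query('room_type == "Entire home/apt"')
)
```

Here replicate 0 has one of two rows matching (proportion 0.5) and replicate 1 has both (proportion 1.0).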

Contributor

yep see my earlier comment

Contributor Author

I couldn't find a good way to do this that didn't include something new, such as value_counts on a data frame instead of a series, or using as_index=False with groupby (both would also require an extra rename step). The current behavior is listed as a bug with a pending PR in pandas, so in 2.0 it will work without the name= in the way we have written it now.

y=alt.value(-10),
text=alt.datum(f'97.5th percentile ({ci_bounds[0.975].round(1)})')
),
width=500
)
```

Contributor Author

The additional resources at the end of the chapter are all R books.

Contributor

@trevorcampbell trevorcampbell left a comment

Very nice! Minor changes requested.

Also, one more thing that I couldn't point out in my review: at the end of the chapter, in the Exercises section, make sure to fix the "worksheets repository" link to point to the Python stuff. You can get the URL from any of the earlier chapters I edited.


['room_type']
.value_counts(normalize=True)
.reset_index(name='sample_proportion')
.query('room_type=="Entire home/apt"')
Contributor

I don't think we use query much anywhere in the book, and we only really briefly mention it in wrangling. I would prefer one of our more common ways of filtering here if possible
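For reference, the query call in the diff and the boolean-mask style used more often in the book select the same rows (toy data frame for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
    "price": [150, 80, 200],
})

# query-style filtering, as written in the diff
via_query = df.query('room_type == "Entire home/apt"')

# boolean-mask filtering, the book's more common style
via_mask = df[df["room_type"] == "Entire home/apt"]
```

Both return the two "Entire home/apt" rows, so swapping one for the other is purely a consistency choice.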

.groupby('replicate')
['room_type']
.value_counts(normalize=True)
.reset_index(name='sample_proportion')
Contributor

This use of reset_index needs to be explained. It would be nice if it were not used at all, but for a first version I'm OK with it as long as it's described well
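For context, a minimal sketch (made-up data) of what reset_index does after a groupby aggregation: it promotes the group labels from the index back to an ordinary column.

```python
import pandas as pd

df = pd.DataFrame({"replicate": [0, 0, 1], "price": [100.0, 200.0, 300.0]})

# The grouped mean is a Series indexed by replicate ...
means = df.groupby("replicate")["price"].mean()

# ... and reset_index turns it back into a data frame with
# 'replicate' as a regular column alongside the means
tidy = means.reset_index()
```

Without the reset_index, later steps that expect a data frame with a replicate column (plotting, renaming) would not work.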


glue("sample_propotion_center", round(sample_estimates["sample_proportion"].mean(), 1))
glue("sample_propotion_min", round(sample_estimates["sample_proportion"].min(), 1))
glue("sample_propotion_max", round(sample_estimates["sample_proportion"].max(), 1))
glue("sample_proportion_center", round(sample_estimates["sample_proportion"].mean(), 1))
Contributor

I think the rounding here is too aggressive -- it says the plot is "centered around 0.7" but to me it looks more like 0.78 or so.

Also I would say not to use the min/max here for the upper/lower bound -- I would say like 5th / 95th or 10th / 90th percentile would be better
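To illustrate both suggestions on toy numbers (the values here are made up): rounding to two decimals keeps a center like 0.78 from collapsing to 0.8, and quantile gives percentile bounds instead of min/max.

```python
import pandas as pd

# Hypothetical sample proportions
props = pd.Series([0.70, 0.75, 0.78, 0.80, 0.85])

# Two decimals preserves the center instead of over-rounding to 0.8
center = round(props.mean(), 2)

# 5th / 95th percentiles are less sensitive to extremes than min / max
lower, upper = props.quantile([0.05, 0.95])
```

With the default linear interpolation, the percentile bounds sit just inside the min and max rather than on them.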

@@ -1276,7 +1024,7 @@ our point estimate to behave if we took another sample.

```{code-cell} ipython3
boot20000_means = boot20000.groupby("replicate")["price"].mean().reset_index().rename(
Contributor

probably worth explaining this line more carefully (and formatting it better)

Contributor Author

Agree, I reformatted this and added an explanation in the previous code cell which is where this occurs for the first time.

Contributor Author

@joelostblom joelostblom left a comment

@trevorcampbell Thanks for the comments, ready for another look!



Contributor

@trevorcampbell trevorcampbell left a comment

One remaining bugfix needed

.groupby("replicate")
["price"]
.mean()
.rename(columns={"price": "sample_mean"})
Contributor

@joelostblom I'm seeing a stack trace here and another nearby (which then causes more downstream errors)

Please fix + verify and let me know!

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[39], line 2
      1 (
----> 2     six_bootstrap_samples
      3     .groupby("replicate")
      4     ["price"]
      5     .mean()
      6     .rename(columns={"price": "sample_mean"})
      7     .reset_index()
      8 )

TypeError: Series.rename() got an unexpected keyword argument 'columns'
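The traceback comes from calling rename(columns=...) on the Series that groupby(...).mean() returns; Series.rename has no columns= argument. A minimal sketch of the working order of operations on a toy data frame puts reset_index before the rename:

```python
import pandas as pd

df = pd.DataFrame({"replicate": [0, 0, 1], "price": [10.0, 20.0, 30.0]})

# reset_index first converts the grouped Series into a DataFrame,
# so rename(columns=...) is then valid
boot_means = (
    df.groupby("replicate")["price"]
    .mean()
    .reset_index()
    .rename(columns={"price": "sample_mean"})
)
```

Swapping the last two steps back into this order is exactly the kind of fix that resolves the error above.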

Contributor

(aside from that the PR looks good!)

Contributor Author

Whoops, I swapped the last two lines and forgot to swap them back; fixed.

@trevorcampbell trevorcampbell merged commit 8289d51 into main Jan 27, 2023