|
362 | 362 | "source": [
|
363 | 363 | "### Lowercasing\n",
|
364 | 364 | "\n",
|
365 |
| - "While we acknowledge that the **casing** of words is informative, we often don't work in contexts where we can properly utilize this information.\n", |
| 365 | + "While we acknowledge that a word's casing is informative, we often don't work in contexts where we can properly utilize this information.\n", |
366 | 366 | "\n",
|
367 | 367 | "More often, the subsequent analysis we perform is **case-insensitive**. For instance, in frequency analysis, we want to account for various forms of the same word. Lowercasing the text data aids in this process and simplifies our analysis.\n",
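A minimal sketch of why lowercasing matters for frequency analysis (using a small hypothetical list of tokens, not the workshop data):

```python
from collections import Counter

# Example tokens with mixed casing (hypothetical data).
tokens = ["Apple", "apple", "APPLE", "Banana", "banana"]

# Without lowercasing, casing variants of the same word are counted separately.
raw_counts = Counter(tokens)

# Lowercasing collapses the variants into a single form.
lower_counts = Counter(t.lower() for t in tokens)

print(raw_counts["apple"])    # → 1
print(lower_counts["apple"])  # → 3
```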
|
368 | 368 | "\n",
|
|
435 | 435 | "\n",
|
436 | 436 | "Our goal in this workshop is not to provide a deep (or even shallow) dive into regex; instead, we want to expose you to them so that you are better prepared to do deep dives in the future!\n",
|
437 | 437 | "\n",
|
438 |
| - "The following example is a poem by William Wordsworth. Like many poems, the text may contain extra line breaks (i.e., newline characters, `\\n`) that we want to remove.\n", |
439 |
| - "\n", |
440 |
| - "Let's read the data in!" |
| 438 | + "The following example is a poem by William Wordsworth. Like many poems, the text may contain extra line breaks (i.e., newline characters, `\\n`) that we want to remove." |
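As a sketch of the kind of cleanup described above, `re.sub` can collapse runs of newline characters into single spaces (the excerpt below is a short hypothetical string, not the actual poem file read in the workshop):

```python
import re

# A short excerpt with hard line breaks (hypothetical text).
poem = "I wandered lonely as a cloud\nThat floats on high\n"

# Replace one or more consecutive newline characters with a single space,
# then strip any leading/trailing whitespace.
cleaned = re.sub(r"\n+", " ", poem).strip()

print(cleaned)  # → "I wandered lonely as a cloud That floats on high"
```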
441 | 439 | ]
|
442 | 440 | },
|
443 | 441 | {
|
|
1096 | 1094 | "\n",
|
1097 | 1095 | "The first package we'll be using is called **Natural Language Toolkit**, or `nltk`. \n",
|
1098 | 1096 | "\n",
|
1099 |
| - "Let's install a couple modules within the package." |
| 1097 | + "Let's install a couple of modules from the package." |
1100 | 1098 | ]
|
1101 | 1099 | },
|
1102 | 1100 | {
|
|
1841 | 1839 | "\n",
|
1842 | 1840 | "In this section, we will demonstrate tokenization in **BERT** (Bidirectional Encoder Representations from Transformers), which utilizes a tokenization algorithm called [**WordPiece**](https://huggingface.co/learn/nlp-course/en/chapter6/6). \n",
|
1843 | 1841 | "\n",
|
1844 |
| - "We will load the tokenizer of BERT from the package `transformers`, which hosts a number of Transformer-based LLMs (e.g., GPT-2). We won't go into the architecture of Transformer in this workshop, but feel free to check out the D-lab workshop on [GPT Fundamentals](https://github.com/dlab-berkeley/GPT-Fundamentals)!" |
| 1842 | + "We will load BERT's tokenizer from the package `transformers`, which hosts a number of Transformer-based LLMs. We won't go into the Transformer architecture in this workshop, but feel free to check out the D-lab workshop on [GPT Fundamentals](https://github.com/dlab-berkeley/GPT-Fundamentals)!" |
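The core idea behind WordPiece is greedy longest-match splitting of a word into known subword units, with continuation pieces prefixed by `##` as in BERT's real vocabulary. A toy sketch of that matching loop (using a tiny hypothetical vocabulary, not BERT's actual ~30k-token vocabulary or the `transformers` implementation):

```python
# Toy WordPiece-style tokenizer: greedy longest-match against a tiny,
# hypothetical vocabulary. Continuation pieces carry the "##" prefix.
vocab = {"token", "word", "##ization", "##piece", "##s"}

def wordpiece(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: emit an unknown token
        pieces.append(piece)
        start = end
    return pieces

print(wordpiece("tokenization", vocab))  # → ['token', '##ization']
print(wordpiece("wordpieces", vocab))    # → ['word', '##piece', '##s']
```

The real BERT tokenizer adds pretokenization, special tokens like `[CLS]` and `[SEP]`, and a learned vocabulary, but the greedy longest-match step is the same idea.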
1845 | 1843 | ]
|
1846 | 1844 | },
|
1847 | 1845 | {
|
|