From cd2dee5ff655d0fc0aa53285a4ce0d27a5cb862b Mon Sep 17 00:00:00 2001 From: Emuide Date: Mon, 24 Mar 2025 03:02:25 +0000 Subject: [PATCH 1/8] --- lessons/01_preprocessing.ipynb | 104 ++++++++++++++++++++++----------- 1 file changed, 69 insertions(+), 35 deletions(-) diff --git a/lessons/01_preprocessing.ipynb b/lessons/01_preprocessing.ipynb index de33786..8a1bb5b 100644 --- a/lessons/01_preprocessing.ipynb +++ b/lessons/01_preprocessing.ipynb @@ -28,9 +28,9 @@ "1. [Preprocessing](#section1)\n", "2. [Tokenization](#section2)\n", "\n", - "In this three-part workshop series, we'll learn the building blocks for performing text analysis in Python. These techniques lie in the domain of Natural Language Processing (NLP). NLP is a field that deals with identifying and extracting patterns of language, primarily in written texts. Throughout the workshop series, we'll interact with various packages for performing text analysis: starting from simple string methods to specific NLP packages, such as `nltk`, `spaCy`, and more recent ones on Large Language Models (`BERT`).\n", + "En esta serie de tres talleres, aprenderemos los fundamentos para realizar análisis de texto en Python. Estas técnicas pertenecen al Procesamiento del Lenguaje Natural (NLP), un campo que se enfoca en identificar y extraer patrones lingüísticos en textos escritos. A lo largo del taller, usaremos diversos paquetes, desde métodos simples de strings hasta bibliotecas específicas como nltk, spaCy y otras relacionadas con LLMs (BERT, etc.).\n", "\n", - "Now, let's have these packages properly installed before diving into the materials." + "Instalemos los paquetes necesarios antes de comenzar." ] }, { @@ -56,14 +56,19 @@ "\n", "# Preprocessing\n", "\n", - "In Part 1 of this workshop, we'll address the first step of text analysis. Our goal is to convert the raw, messy text data into a consistent format. 
This process is often called **preprocessing**, **text cleaning**, or **text normalization**.\n", + "Preprocesamiento\n", "\n", - "You'll notice that at the end of preprocessing, our data is still in a format that we can read and understand. In Parts 2 and 3, we will begin our foray into converting the text data into a numerical representation—a format that can be more readily handled by computers. \n", + "En la Parte 1 de este taller, abordaremos el primer paso del análisis de texto: convertir datos de texto crudos y desorganizados en un formato consistente. Este proceso se conoce como preprocesamiento, limpieza de texto o normalización de texto.\n", "\n", - "🔔 **Question**: Let's pause for a minute to reflect on **your** previous experiences working on text data. \n", - "- What is the format of the text data you have interacted with (plain text, CSV, or XML)?\n", - "- Where does it come from (structured corpus, scraped from the web, survey data)?\n", - "- Is it messy (i.e., is the data formatted consistently)?" + "Al final del preprocesamiento, los datos siguen siendo legibles. En las Partes 2 y 3, convertiremos el texto en representaciones numéricas, más manejables por computadoras.\n", + "\n", + "Pregunta: Reflexiona sobre tu experiencia previa con datos de texto:\n", + "\n", + "- ¿Qué formato tenían (texto plano, CSV, XML)?\n", + "\n", + "- ¿De dónde provenían (corpus estructurado, web, encuestas)?\n", + "\n", + "- ¿Estaban desorganizados?" ] }, { @@ -71,21 +76,25 @@ "id": "4b35911a-3b3f-4a48-a7d1-9882aab04851", "metadata": {}, "source": [ - "## Common Processes\n", + "## Procesos Comunes\n", "\n", - "Preprocessing is not something we can accomplish with a single line of code. We often start by familiarizing ourselves with the data, and along the way, we gain a clearer understanding of the granularity of preprocessing we want to apply.\n", + "El preprocesamiento no es algo que podamos lograr con una sola línea de código. 
Normalmente comenzamos familiarizándonos con los datos y, en el camino, obtenemos una comprensión más clara del nivel de detalle que queremos aplicar.\n", "\n", - "Typically, we begin by applying a set of commonly used processes to clean the data. These operations don't substantially alter the form or meaning of the data; they serve as a standardized procedure to reshape the data into a consistent format.\n", + "Por lo general, empezamos aplicando un conjunto de procesos comunes para limpiar los datos. Estas operaciones no alteran sustancialmente la forma o el significado de los datos, sino que sirven como un procedimiento estandarizado para darles un formato consistente.\n", "\n", - "The following processes, for examples, are commonly applied to preprocess English texts of various genres. These operations can be done using built-in Python functions, such as `string` methods, and Regular Expressions. \n", - "- Lowercase the text\n", - "- Remove punctuation marks\n", - "- Remove extra whitespace characters\n", - "- Remove stop words\n", + "Los siguientes procesos, por ejemplo, se aplican comúnmente para preprocesar textos en inglés de diversos géneros. Estas operaciones pueden realizarse utilizando funciones integradas de Python, como métodos de `strings` y expresiones regulares:\n", "\n", - "After the initial processing, we may choose to perform task-specific processes, the specifics of which often depend on the downstream task we want to perform and the nature of the text data (i.e., its stylistic and linguistic features). \n", + "- Convertir el texto a minúsculas\n", "\n", - "Before we jump into these operations, let's take a look at our data!" + "- Eliminar signos de puntuación\n", + "\n", + "- Eliminar espacios en blanco adicionales\n", + "\n", + "- Eliminar palabras vacías (stop words)\n", + "\n", + "Después del procesamiento inicial, podemos aplicar procesos específicos según la tarea. 
Estos dependen del objetivo final que queremos lograr y de las características del texto (ej: su estilo y rasgos lingüísticos).\n", + "\n", + "¡Antes de profundizar en estas operaciones, veamos nuestros datos!" ] }, { @@ -93,11 +102,11 @@ "id": "ec5d7350-9a1e-4db9-b828-a87fe1676d8d", "metadata": {}, "source": [ - "### Import the Text Data\n", + "### Importar los datos de texto\n", "\n", - "The text data we'll be working with is a CSV file. It contains tweets about U.S. airlines, scrapped from Feb 2015. \n", + "Los datos de texto con los que trabajaremos son un archivo CSV. Este contiene tweets sobre aerolíneas estadounidenses, extraídos de febrero de 2015.\n", "\n", - "Let's read the file `airline_tweets.csv` into dataframe with `pandas`." + "Leamos el archivo `airline_tweets.csv` en un DataFrame utilizando `pandas`." ] }, { @@ -308,13 +317,19 @@ "id": "ae3b339f-45cf-465d-931c-05f9096fd510", "metadata": {}, "source": [ - "The dataframe has one row per tweet. The text of tweet is shown in the `text` column.\n", - "- `text` (`str`): the text of the tweet.\n", + "El DataFrame tiene una fila por cada tweet. El texto del tweet se muestra en la columna `text`.\n", + "\n", + "- `text (str)`: el texto del tweet.\n", + "\n", + "Otros metadatos de interés incluyen:\n", "\n", - "Other metadata we are interested in include: \n", - "- `airline_sentiment` (`str`): the sentiment of the tweet, labeled as \"neutral,\" \"positive,\" or \"negative.\"\n", - "- `airline` (`str`): the airline that is tweeted about.\n", - "- `retweet count` (`int`): how many times the tweet was retweeted." 
+ "- `airline_sentiment` (str): el sentimiento del tweet, etiquetado como \"neutral\", \"positive\" o \"negative\".\n", + "\n", + "- `airline` (str): la aerolínea mencionada en el tweet.\n", + "\n", + "- `retweet count` (int): la cantidad de veces que el tweet fue retuiteado.\n", + "\n", + "Echemos un vistazo a algunos de los tweets:" ] }, { @@ -352,7 +367,26 @@ "id": "8adc05fa-ad30-4402-ab56-086bcb09a166", "metadata": {}, "source": [ - "🔔 **Question**: What have you noticed? What are the stylistic features of tweets?" + "🔔 **Pregunta**: ¿Qué has observado? ¿Cuáles son las características estilísticas de los tweets?\n", + "\n", + "Respuesta:\n", + "Los tweets tienen características únicas:\n", + "\n", + "1- Brevedad: Límite de caracteres (antes 140, ahora 280).\n", + "\n", + "2- Elementos informales:\n", + "\n", + "- Hashtags (#ViajePerfecto).\n", + "\n", + "- Menciones (@Usuario).\n", + "\n", + "- URLs acortadas (http://t.co/...).\n", + "\n", + "- Contracciones (\"q\" por \"que\", \"xq\" por \"por qué\").\n", + "\n", + "3- Lenguaje coloquial: Emoticonos 😊, jerga, y errores tipográficos.\n", + "\n", + "4- Contexto temporal: Referencias a eventos o tendencias actuales." ] }, { @@ -360,15 +394,15 @@ "id": "c3460393-00a6-461c-b02a-9e98f9b5d1af", "metadata": {}, "source": [ - "### Lowercasing\n", + "### Convertir a minúsculas\n", "\n", - "While we acknowledge that a word's casing is informative, we often don't work in contexts where we can properly utilize this information.\n", + "Aunque reconocemos que las mayúsculas aportan información, en muchos contextos no podemos utilizarla adecuadamente.\n", "\n", - "More often, the subsequent analysis we perform is **case-insensitive**. For instance, in frequency analysis, we want to account for various forms of the same word. Lowercasing the text data aids in this process and simplifies our analysis.\n", + "Frecuentemente, el análisis posterior que realizamos es insensible a mayúsculas y minúsculas.
Por ejemplo, en un análisis de frecuencia, queremos considerar distintas formas de una misma palabra como equivalentes. Convertir el texto a minúsculas simplifica este proceso y facilita el análisis.\n", "\n", - "We can easily achieve lowercasing with the string method [`.lower()`](https://docs.python.org/3/library/stdtypes.html#str.lower); see [documentation](https://docs.python.org/3/library/stdtypes.html#string-methods) for more useful functions.\n", + "Podemos lograr esto fácilmente con el método de cadena [`.lower()`](https://docs.python.org/3/library/stdtypes.html#str.lower); consulte la [documentación](https://docs.python.org/3/library/stdtypes.html#string-methods) para más funciones útiles.\n", "\n", - "Let's apply it to the following example:" + "Apliquémoslo al siguiente ejemplo:" ] }, { @@ -2151,7 +2185,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "Python 3", "language": "python", "name": "python3" }, @@ -2165,7 +2199,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.4" + "version": "3.12.1" } }, "nbformat": 4, From f529981453a0b5d3f15d207100358bf9db46a669 Mon Sep 17 00:00:00 2001 From: cchicaizap Date: Wed, 26 Mar 2025 01:04:53 +0000 Subject: [PATCH 2/8] Primer avance --- lessons/01_preprocessing.ipynb | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/lessons/01_preprocessing.ipynb b/lessons/01_preprocessing.ipynb index de33786..bbacb2b 100644 --- a/lessons/01_preprocessing.ipynb +++ b/lessons/01_preprocessing.ipynb @@ -5,17 +5,24 @@ "id": "d3e7ea21-6437-48e8-a9e4-3bdc05f709c9", "metadata": {}, "source": [ - "# Python Text Analysis: Preprocessing\n", + "# Análisis de texto en Python: Preprocesamiento\n", "\n", "* * * \n", "\n", + "## Grupo 4\n", + "### Integrantes\n", + "* Carlos Chicaiza\n", + "* Emilio Mayorga\n", + "* Juan Vizuete\n", + "* Jessica Llumiguano\n", + "\n", "
\n", " \n", - "### Learning Objectives \n", + "### Objetivos de Aprendizaje\n", " \n", - "* Learn common steps for preprocessing text data, as well as specific operations for preprocessing Twitter data.\n", - "* Know commonly used NLP packages and what they are capable of.\n", - "* Understand tokenizers, and how they have changed since the advent of Large Language Models.\n", + "* Aprender cuáles son los pasos comunes para el preprocesamiento de datos de texto, así como también las operaciones específicas para el preprocesamiento de datos de Twitter.\n", + "* Conocer los paquetes de procesamiento de lenguaje natural más utilizados y sus capacidades.\n", + "* Entender los tokenizadores y cómo han cambiado desde la aparición de los modelos de lenguaje en gran escala.\n", "
\n", "\n", "### Icons Used in This Notebook\n", From a1c6a5fb967297f20f85c711db4f475953d22c6f Mon Sep 17 00:00:00 2001 From: juvizueteva Date: Wed, 26 Mar 2025 01:42:21 +0000 Subject: [PATCH 3/8] primer cambio --- lessons/02_bag_of_words.ipynb | 233 ++++++++++++++++++---------------- 1 file changed, 122 insertions(+), 111 deletions(-) diff --git a/lessons/02_bag_of_words.ipynb b/lessons/02_bag_of_words.ipynb index cbc9046..25868f6 100644 --- a/lessons/02_bag_of_words.ipynb +++ b/lessons/02_bag_of_words.ipynb @@ -75,7 +75,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 9, "id": "f3862ffd-918f-4184-8c90-8a39a8a2a069", "metadata": {}, "outputs": [], @@ -104,7 +104,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 19, "id": "4190e351-97b7-4c5b-866e-07aa6cbd42c2", "metadata": {}, "outputs": [], @@ -116,7 +116,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 20, "id": "79acbaf2-6625-4abb-b50f-97ea54ba0d11", "metadata": {}, "outputs": [ @@ -290,7 +290,7 @@ "4 2015-02-24 11:14:45 -0800 NaN Pacific Time (US & Canada) " ] }, - "execution_count": 3, + "execution_count": 20, "metadata": {}, "output_type": "execute_result" } @@ -316,7 +316,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 21, "id": "a1faaf90-8c01-4d25-9468-90c01823f0d5", "metadata": {}, "outputs": [], @@ -334,7 +334,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 22, "id": "438830e6-1064-47fe-b578-a1ca693a0ed0", "metadata": {}, "outputs": [ @@ -369,13 +369,13 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 23, "id": "01955158-6954-447a-acb6-2989d02a49c3", "metadata": {}, "outputs": [ { "data": { - "image/png": "", + "image/png": "", "text/plain": [ "
" ] @@ -404,7 +404,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 24, "id": "428ddde7-af73-4eb6-92c9-041a1791ca59", "metadata": {}, "outputs": [ @@ -417,7 +417,7 @@ "Name: retweet_count, dtype: float64" ] }, - "execution_count": 7, + "execution_count": 24, "metadata": {}, "output_type": "execute_result" } @@ -439,7 +439,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 25, "id": "12aa9f2d-d655-494a-bb72-08ad973518f3", "metadata": {}, "outputs": [ @@ -519,7 +519,7 @@ "Virgin America 0.543544 0.456456" ] }, - "execution_count": 8, + "execution_count": 25, "metadata": {}, "output_type": "execute_result" } @@ -581,19 +581,30 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 14, "id": "21738b02-9ab9-4a61-b41f-ff75888aa747", "metadata": { "tags": [] }, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/workspaces/Python-Text-Analysis_grupo_4/lessons/utils.py:4: SyntaxWarning: invalid escape sequence '\\d'\n", + " digit_pattern = '\\d+'\n", + "/workspaces/Python-Text-Analysis_grupo_4/lessons/utils.py:14: SyntaxWarning: invalid escape sequence '\\d'\n", + " digit_pattern = '\\d+'\n" + ] + } + ], "source": [ "from utils import placeholder" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 26, "id": "03569f0d-34ba-492d-aa1d-1dce9d34f792", "metadata": {}, "outputs": [], @@ -618,7 +629,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 27, "id": "8990cefd-5d04-46ba-ada2-29978c28cfe8", "metadata": {}, "outputs": [ @@ -628,7 +639,7 @@ "text": [ "lol @justinbeiber and @BillGates are like soo 2000 #yesterday #amiright saw it on https://twitter.com #yolo\n", "==================================================\n", - "lol USER and USER are like soo DIGIT HASHTAG HASHTAG saw it on URL HASHTAG\n" + "Ellipsis\n" ] } ], @@ -645,7 +656,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 
28, "id": "a5f7bb6a-f064-48cc-b650-12c4ef2fbb88", "metadata": { "scrolled": true @@ -654,15 +665,15 @@ { "data": { "text/plain": [ - "0 USER plus you've added commercials to the expe...\n", - "1 USER it's really aggressive to blast obnoxious...\n", - "2 USER and it's a really big bad thing about it\n", - "3 USER seriously would pay $ DIGIT a flight for ...\n", - "4 USER yes, nearly every time i fly vx this “ear...\n", + "0 Ellipsis\n", + "1 Ellipsis\n", + "2 Ellipsis\n", + "3 Ellipsis\n", + "4 Ellipsis\n", "Name: text_processed, dtype: object" ] }, - "execution_count": 12, + "execution_count": 28, "metadata": {}, "output_type": "execute_result" } @@ -687,17 +698,17 @@ "metadata": {}, "source": [ "\n", - "# The Bag-of-Words Representation\n", + "# La Representación Bag-of-Words\n", "\n", - "The idea of bag-of-words (BoW), as the name suggests, is quite intuitive: we take a document and toss it in a bag. The action of \"throwing\" the document in a bag disregards the relative position between words, so what is \"in the bag\" is essentially \"an unsorted set of words\" [(Jurafsky & Martin, 2024)](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf). In return, we have a list of unique words and the frequency of each of them. \n", + "La idea de bag-of-words (BoW), como sugiere el nombre, es bastante intuitiva: tomamos un documento y lo arrojamos en una bolsa. La acción de \"arrojar\" el documento en una bolsa ignora la posición relativa entre las palabras, por lo que lo que queda \"en la bolsa\" es esencialmente \"un conjunto desordenado de palabras\" [(Jurafsky & Martin, 2024)](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf). A cambio, obtenemos una lista de palabras únicas y la frecuencia de cada una de ellas. \n", "\n", - "For example, as shown in the following illustration, the word \"coffee\" appears twice. \n", + "Por ejemplo, como se muestra en la siguiente ilustración, la palabra \"coffee\" aparece dos veces. 
\n", "\n", "\"BoW-Part2\"\n", "\n", - "With a bag-of-words representation, we make heavy use of word frequency but not too much of word order. \n", + "Con una representación bag-of-words, hacemos un uso intensivo de la frecuencia de las palabras, pero no tanto del orden en que aparecen. \n", "\n", - "In the context of sentiment analysis, the sentiment of a tweet is conveyed more strongly by specific words. For example, if a tweet contains the word \"happy,\" it likely conveys positive sentiment, but not always (e.g., \"not happy\" denotes the opposite sentiment). When these words come up more often, they'll probably more strongly convey the sentiment." + "En el contexto del análisis de sentimiento, el sentimiento de un tweet se transmite más fuertemente a través de palabras específicas. Por ejemplo, si un tweet contiene la palabra \"happy\", es probable que transmita un sentimiento positivo, aunque no siempre (por ejemplo, \"not happy\" denota el sentimiento opuesto). Cuando estas palabras aparecen con mayor frecuencia, probablemente transmitirán el sentimiento con más fuerza.\n" ] }, { @@ -707,13 +718,13 @@ "source": [ "## Document Term Matrix\n", "\n", - "Now let's implement the idea of bag-of-words. Before we dive deeper, let's step back for a moment. In practice, text analysis often involves handling many documents; from now on, we use the term **document** to represent a piece of text on which we perform analysis. It could be a phrase, a sentence, a tweet, or any other text—as long as it can be represented by a string, the length dosen't really matter. \n", + "Ahora implementemos la idea de bag-of-words. Antes de profundizar, retrocedamos un momento. En la práctica, el análisis de texto a menudo implica manejar múltiples documentos; de ahora en adelante, utilizaremos el término **document** para representar un fragmento de texto sobre el cual realizamos análisis. 
Puede ser una frase, una oración, un tweet o cualquier otro texto—mientras pueda representarse como una cadena de caracteres, su longitud no es realmente un problema. \n", "\n", - "Imagine we have four documents (i.e., the four phrases shown above), and we toss them all in the bag. Instead of a word-frequency list, we'd expect a document-term matrix (DTM) in return. In a DTM, the word list is the **vocabulary** (V) that holds all unique words occur across the documents. For each **document** (D), we count the number of occurence of each word in the vocabulary, and then plug the number into the matrix. In other words, the DTM we will construct is a $D \\times V$ matrix, where each row corresponds to a document, and each column corresponds to a token (or \"term\").\n", + "Imagina que tenemos cuatro documentos (es decir, las cuatro frases mostradas anteriormente) y los arrojamos todos en la bolsa. En lugar de obtener una lista de frecuencias de palabras, obtendremos una document-term matrix (DTM). En una DTM, la lista de palabras constituye el **vocabulary** (V), que contiene todas las palabras únicas que aparecen en los documentos. Para cada **document** (D), contamos la cantidad de veces que aparece cada palabra en el vocabulario y luego colocamos ese número en la matriz. En otras palabras, la DTM que construiremos es una matriz $D \\times V$, donde cada fila corresponde a un documento y cada columna a un token (o \"término\"). \n", "\n", - "The unique tokens in this set of documents, arranged in alphabetical order, form the columns. For each document, we mark the occurence of each word present in the document. The numerical representation for each document is a row in the matrix. For example, the first document, \"the coffee roaster,\" has the numerical representation $[0, 1, 0, 0, 0, 1, 1, 0]$.\n", + "Los tokens únicos en este conjunto de documentos, organizados en orden alfabético, forman las columnas. 
Para cada documento, marcamos la frecuencia de cada palabra presente en el documento. La representación numérica de cada documento es una fila en la matriz. Por ejemplo, el primer documento, \"the coffee roaster\", tiene la representación numérica $[0, 1, 0, 0, 0, 1, 1, 0]$. \n", "\n", - "Note that the left index column now displays these documents as text, but typically we would just assign an index to each of them. \n", + "Nota que la columna de índices a la izquierda muestra estos documentos como texto, pero típicamente solo se les asignaría un número de índice. \n", "\n", "$$\n", "\\begin{array}{c|cccccccccccc}\n", @@ -725,12 +736,12 @@ "\\end{array}\n", "$$\n", "\n", - "To create a DTM, we will use `CountVectorizer` from the package `sklearn`." + "Para crear una DTM, utilizaremos `CountVectorizer` del paquete `sklearn`.\n" ] }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 1, "id": "cd2adf56-ba93-459d-8cfa-16ce8dc9284b", "metadata": {}, "outputs": [], @@ -743,11 +754,11 @@ "id": "4989781d-6b40-417a-be70-eeba05cd8a50", "metadata": {}, "source": [ - "The following illustration depicts the three-step workflow of creating a DTM with `CountVectorizr`.\n", + "La siguiente ilustración muestra el flujo de trabajo en tres pasos para crear una DTM con `CountVectorizer`.\n", "\n", "\"CountVectorizer\"\n", "\n", - "Let's walk through these steps with the toy example shown above." + "Repasemos estos pasos utilizando el ejemplo simple mostrado anteriormente." ] }, { @@ -755,12 +766,12 @@ "id": "34174034-46b9-43e2-a511-5972d378cb00", "metadata": {}, "source": [ - "### A Toy Example" + "### Un Ejemplo Sencillo\n" ] }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 2, "id": "4da2bd3d-0460-4b5f-9b9e-02940db0d7ca", "metadata": {}, "outputs": [], @@ -777,14 +788,14 @@ "id": "dff7c1d3-fcee-4e20-b9a7-17306ebd5fc2", "metadata": {}, "source": [ - "The first step is to initialize a `CountVectorizer` object. 
Within the round paratheses, we can specify parameter settings if desired. Let's take a look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and see what options are available. \n", + "El primer paso es inicializar un objeto `CountVectorizer`. Dentro de los paréntesis, podemos especificar parámetros de configuración si lo deseamos. Echemos un vistazo a la [documentación](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) para ver qué opciones están disponibles. \n", "\n", - "For now we can just leave it blank to use the default settings. " + "Por ahora, podemos dejarlo en blanco para usar la configuración predeterminada. " ] }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 3, "id": "9de3fe6a-9abf-4e11-aad1-e54c891567bb", "metadata": {}, "outputs": [], @@ -798,14 +809,14 @@ "id": "1b5a7d0d-0bfc-4fb9-8e5f-e91e39797fb5", "metadata": {}, "source": [ - "The second step is to `fit` this `CountVectorizer` object to the data, which means creating a vocabulary of tokens from the set of documents. Thirdly, we `transform` our data according to the \"fitted\" `CountVectorizer` object, which means taking each of the document and counting the occurrences of tokens according to the vocabulary established during the \"fitting\" step.\n", + "El segundo paso es aplicar `fit` al objeto `CountVectorizer` con los datos, lo que significa crear un vocabulario de tokens a partir del conjunto de documentos. Luego, en el tercer paso, usamos `transform` para procesar nuestros datos de acuerdo con el objeto `CountVectorizer` \"ajustado\". Esto implica tomar cada documento y contar la aparición de tokens según el vocabulario establecido durante el paso de \"ajuste\". \n", "\n", - "It may sound a bit complex but steps 2 and 3 can be done in one swoop using a `fit_transform` function." 
+ "Puede sonar un poco complejo, pero los pasos 2 y 3 pueden realizarse en una sola operación utilizando la función `fit_transform`. " ] }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 4, "id": "da1bbad4-bb1a-4b92-9096-6e17558b4a42", "metadata": {}, "outputs": [], @@ -819,25 +830,25 @@ "id": "324d3b65-4e98-48bf-87d2-399457f4939c", "metadata": {}, "source": [ - "The return of `fit_transform` is supposed to be the DTM. \n", + "El resultado de `fit_transform` debería ser la DTM. \n", "\n", - "Let's take a look at it!" + "¡Echemos un vistazo! " ] }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 5, "id": "cb044001-8eb2-4489-b025-2d8e2d4bfee2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "<4x8 sparse matrix of type ''\n", - "\twith 9 stored elements in Compressed Sparse Row format>" + "" ] }, - "execution_count": 17, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } @@ -851,14 +862,14 @@ "id": "f9817b09-a806-42c4-9436-822cc27a38b9", "metadata": {}, "source": [ - "Apparently we've got a \"sparse matrix\"—a matrix that contains a lot of zeros. This makes sense. For each document, there are words that don't occur at all, and these are counted as zero in the DTM. This sparse matrix is stored in a \"Compressed Sparse Row\" format, a memory-saving format designed for handling sparse matrices. \n", + "Aparentemente, hemos obtenido una \"sparse matrix\", es decir, una matriz que contiene muchos ceros. Esto tiene sentido: en cada documento, hay palabras que no aparecen en absoluto, y estas se registran como ceros en la DTM. Esta matriz dispersa se almacena en un formato \"Compressed Sparse Row\", un formato optimizado para ahorrar memoria al manejar matrices dispersas. \n", "\n", - "Let's convert it to a dense matrix, where those zeros are probably represented, as in a numpy array." 
+ "Convirtámosla en una matriz densa, donde esos ceros probablemente estén representados, como en un array de numpy. " ] }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 6, "id": "bb03a238-87d8-40c9-b20e-66e7c9b6576b", "metadata": {}, "outputs": [ @@ -871,7 +882,7 @@ " [0, 1, 0, 0, 0, 0, 0, 1]])" ] }, - "execution_count": 18, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } @@ -886,12 +897,12 @@ "id": "28b58a63-d7f6-4b9f-aadf-4d4fc7341336", "metadata": {}, "source": [ - "So this is our DTM! The matrix is the same as shown above. To make it more reader-friendly, let's convert it to a dataframe. The column names should be tokens in the vocabulary, which we can access with the `get_feature_names_out` function." + "¡Así que esta es nuestra DTM! La matriz es la misma que mostramos anteriormente. Para hacerla más fácil de leer, convirtámosla en un dataframe. Los nombres de las columnas deben ser los tokens del vocabulario, a los cuales podemos acceder con la función `get_feature_names_out`. " ] }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 7, "id": "714de5d3-e37d-4a19-9ade-3c6629e38d4e", "metadata": {}, "outputs": [ @@ -902,7 +913,7 @@ " 'time'], dtype=object)" ] }, - "execution_count": 19, + "execution_count": 7, "metadata": {}, "output_type": "execute_result" } @@ -914,7 +925,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 10, "id": "6a7729a2-ca2e-4de7-8795-74dfedb7a4d5", "metadata": {}, "outputs": [], @@ -929,12 +940,12 @@ "id": "781da407-f394-40f2-9d45-1fac39f02047", "metadata": {}, "source": [ - "Here it is! The DTM of our toy data is now a dataframe. The index of `test_dtm` corresponds to the position of each document in the `test` list. " + "¡Aquí está! La DTM de nuestros datos de ejemplo ahora es un dataframe. El índice de `test_dtm` corresponde a la posición de cada documento en la lista `test`. 
" ] }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 11, "id": "e41dd243-cd2e-43c3-80f8-5eaab6e64210", "metadata": {}, "outputs": [ @@ -1026,7 +1037,7 @@ "3 0 1 0 0 0 0 0 1" ] }, - "execution_count": 21, + "execution_count": 11, "metadata": {}, "output_type": "execute_result" } @@ -1040,20 +1051,20 @@ "id": "d59a03b4-94fa-4fe7-8f5d-7280e31b9bc4", "metadata": {}, "source": [ - "Hopefully this toy example provides a clear walkthrough of creating a DTM.\n", + "Esperamos que este ejemplo sencillo haya proporcionado una guía clara para crear una DTM.\n", "\n", - "Now it's time for our tweets data!\n", + "¡Ahora es el momento de trabajar con nuestros datos de tweets!\n", "\n", - "### DTM for Tweets\n", + "### DTM para Tweets\n", "\n", - "We'll begin by initializing a `CountVectorizer` object. In the following cell, we have included a few parameters that people often adjust. These parameters are currently set to their default values.\n", + "Comenzaremos inicializando un objeto `CountVectorizer`. En la siguiente celda, hemos incluido algunos parámetros que las personas ajustan con frecuencia. Estos parámetros están configurados actualmente con sus valores predeterminados.\n", "\n", - "When we construct a DTM, the default is to lowercase the input text. If nothing is provided for `stop_words`, the default is to keep them. The next three parameters are used to control the size of the vocabulary, which we'll return to in a minute." + "Cuando construimos una DTM, el valor predeterminado es convertir a minúsculas el texto de entrada. Si no se proporciona nada para `stop_words`, el valor predeterminado es mantenerlas. Los siguientes tres parámetros se usan para controlar el tamaño del vocabulario, sobre lo cual volveremos en un momento." 
] }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 12, "id": "783e44a4-4a22-4290-b222-282b02c080dc", "metadata": {}, "outputs": [], @@ -1068,22 +1079,27 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 29, "id": "f85e76ea-bc54-4775-bcda-432a03d2c96f", "metadata": { "scrolled": true }, "outputs": [ { - "data": { - "text/plain": [ - "<11541x8751 sparse matrix of type ''\n", - "\twith 191139 stored elements in Compressed Sparse Row format>" - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" + "ename": "AttributeError", + "evalue": "'ellipsis' object has no attribute 'lower'", + "output_type": "error", + "traceback": [ + "\u001b[31m---------------------------------------------------------------------------\u001b[39m", + "\u001b[31mAttributeError\u001b[39m Traceback (most recent call last)", + "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[29]\u001b[39m\u001b[32m, line 2\u001b[39m\n\u001b[32m 1\u001b[39m \u001b[38;5;66;03m# Fit and transform to create DTM\u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m2\u001b[39m counts = \u001b[43mvectorizer\u001b[49m\u001b[43m.\u001b[49m\u001b[43mfit_transform\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtweets\u001b[49m\u001b[43m[\u001b[49m\u001b[33;43m'\u001b[39;49m\u001b[33;43mtext_processed\u001b[39;49m\u001b[33;43m'\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 3\u001b[39m counts\n", + "\u001b[36mFile \u001b[39m\u001b[32m~/.local/lib/python3.12/site-packages/sklearn/base.py:1389\u001b[39m, in \u001b[36m_fit_context..decorator..wrapper\u001b[39m\u001b[34m(estimator, *args, **kwargs)\u001b[39m\n\u001b[32m 1382\u001b[39m estimator._validate_params()\n\u001b[32m 1384\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m config_context(\n\u001b[32m 1385\u001b[39m skip_parameter_validation=(\n\u001b[32m 1386\u001b[39m prefer_skip_nested_validation \u001b[38;5;129;01mor\u001b[39;00m global_skip_validation\n\u001b[32m 1387\u001b[39m 
)\n\u001b[32m 1388\u001b[39m ):\n\u001b[32m-> \u001b[39m\u001b[32m1389\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfit_method\u001b[49m\u001b[43m(\u001b[49m\u001b[43mestimator\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n", + "\u001b[36mFile \u001b[39m\u001b[32m~/.local/lib/python3.12/site-packages/sklearn/feature_extraction/text.py:1376\u001b[39m, in \u001b[36mCountVectorizer.fit_transform\u001b[39m\u001b[34m(self, raw_documents, y)\u001b[39m\n\u001b[32m 1368\u001b[39m warnings.warn(\n\u001b[32m 1369\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mUpper case characters found in\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 1370\u001b[39m \u001b[33m\"\u001b[39m\u001b[33m vocabulary while \u001b[39m\u001b[33m'\u001b[39m\u001b[33mlowercase\u001b[39m\u001b[33m'\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 1371\u001b[39m \u001b[33m\"\u001b[39m\u001b[33m is True. 
These entries will not\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 1372\u001b[39m \u001b[33m\"\u001b[39m\u001b[33m be matched with any documents\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 1373\u001b[39m )\n\u001b[32m 1374\u001b[39m \u001b[38;5;28;01mbreak\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m1376\u001b[39m vocabulary, X = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_count_vocab\u001b[49m\u001b[43m(\u001b[49m\u001b[43mraw_documents\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mfixed_vocabulary_\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1378\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m.binary:\n\u001b[32m 1379\u001b[39m X.data.fill(\u001b[32m1\u001b[39m)\n", + "\u001b[36mFile \u001b[39m\u001b[32m~/.local/lib/python3.12/site-packages/sklearn/feature_extraction/text.py:1263\u001b[39m, in \u001b[36mCountVectorizer._count_vocab\u001b[39m\u001b[34m(self, raw_documents, fixed_vocab)\u001b[39m\n\u001b[32m 1261\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m doc \u001b[38;5;129;01min\u001b[39;00m raw_documents:\n\u001b[32m 1262\u001b[39m feature_counter = {}\n\u001b[32m-> \u001b[39m\u001b[32m1263\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m feature \u001b[38;5;129;01min\u001b[39;00m \u001b[43manalyze\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdoc\u001b[49m\u001b[43m)\u001b[49m:\n\u001b[32m 1264\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m 1265\u001b[39m feature_idx = vocabulary[feature]\n", + "\u001b[36mFile \u001b[39m\u001b[32m~/.local/lib/python3.12/site-packages/sklearn/feature_extraction/text.py:104\u001b[39m, in \u001b[36m_analyze\u001b[39m\u001b[34m(doc, analyzer, tokenizer, ngrams, preprocessor, decoder, stop_words)\u001b[39m\n\u001b[32m 102\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m 103\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m preprocessor \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m 
\u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m104\u001b[39m doc = \u001b[43mpreprocessor\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdoc\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 105\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m tokenizer \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m 106\u001b[39m doc = tokenizer(doc)\n", + "\u001b[36mFile \u001b[39m\u001b[32m~/.local/lib/python3.12/site-packages/sklearn/feature_extraction/text.py:62\u001b[39m, in \u001b[36m_preprocess\u001b[39m\u001b[34m(doc, accent_function, lower)\u001b[39m\n\u001b[32m 43\u001b[39m \u001b[38;5;250m\u001b[39m\u001b[33;03m\"\"\"Chain together an optional series of text preprocessing steps to\u001b[39;00m\n\u001b[32m 44\u001b[39m \u001b[33;03mapply to a document.\u001b[39;00m\n\u001b[32m 45\u001b[39m \n\u001b[32m (...)\u001b[39m\u001b[32m 59\u001b[39m \u001b[33;03m preprocessed string\u001b[39;00m\n\u001b[32m 60\u001b[39m \u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m 61\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m lower:\n\u001b[32m---> \u001b[39m\u001b[32m62\u001b[39m doc = \u001b[43mdoc\u001b[49m\u001b[43m.\u001b[49m\u001b[43mlower\u001b[49m()\n\u001b[32m 63\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m accent_function \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m 64\u001b[39m doc = accent_function(doc)\n", + "\u001b[31mAttributeError\u001b[39m: 'ellipsis' object has no attribute 'lower'" + ] } ], "source": [ @@ -1094,25 +1110,20 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 30, "id": "87119057-c78c-4eb2-a9d6-3e9f44e4c22b", "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "array([[0, 0, 0, ..., 0, 0, 0],\n", - " [0, 0, 0, ..., 0, 0, 0],\n", - " [0, 0, 0, ..., 0, 0, 0],\n", - " ...,\n", - " [0, 0, 0, ..., 0, 0, 0],\n", - " [0, 0, 0, ..., 0, 0, 0],\n", - " [0, 0, 0, ..., 0, 0, 0]])" - ] - }, - 
"execution_count": 24, - "metadata": {}, - "output_type": "execute_result" + "ename": "NameError", + "evalue": "name 'counts' is not defined", + "output_type": "error", + "traceback": [ + "\u001b[31m---------------------------------------------------------------------------\u001b[39m", + "\u001b[31mNameError\u001b[39m Traceback (most recent call last)", + "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[30]\u001b[39m\u001b[32m, line 2\u001b[39m\n\u001b[32m 1\u001b[39m \u001b[38;5;66;03m# Do not run if you have limited memory - this includes DataHub and Binder\u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m2\u001b[39m np.array(\u001b[43mcounts\u001b[49m.todense())\n", + "\u001b[31mNameError\u001b[39m: name 'counts' is not defined" + ] } ], "source": [ @@ -1160,7 +1171,7 @@ "id": "2dd257d5-4244-436c-afe7-5688232caf8f", "metadata": {}, "source": [ - "If we leave the `CountVectorizer` to the default setting, the vocabulary size of the tweet data is 8751. " + "Si dejamos el `CountVectorizer` con la configuración predeterminada, el tamaño del vocabulario de los datos de los tweets es 8751. " ] }, { @@ -1371,9 +1382,9 @@ "id": "095d34e2-52f8-4419-b4c7-ed20dbd5df89", "metadata": {}, "source": [ - "Most of the tokens have zero occurences at least in the first five tweets. \n", + "La mayoría de los tokens tienen cero ocurrencias, al menos en los primeros cinco tweets. \n", "\n", - "Let's take a closer look at the DTM!" + "¡Echemos un vistazo más de cerca a la DTM! " ] }, { @@ -1445,9 +1456,9 @@ "id": "5d230f79-e752-4e32-93db-4f013287f8e2", "metadata": {}, "source": [ - "It is not surprising to see \"user\" and \"digit\" to be among the most frequent tokens as we replaced each idiosyncratic one with these placeholders. The rest of the most frequent tokens are mostly stop words.\n", + "No es sorprendente ver que \"user\" y \"digit\" estén entre los tokens más frecuentes, ya que reemplazamos cada uno de los idiosincráticos con estos marcadores de posición. 
El resto de los tokens más frecuentes son principalmente palabras vacías (stop words).\n", "\n", - "Perhaps a more interesting pattern is to look for which token appears most in any given tweet:" + "Tal vez un patrón más interesante sea buscar qué token aparece más en cualquier tweet dado:" ] }, { @@ -1575,9 +1586,9 @@ "id": "7cdac4ef-6b9d-4aad-9b24-c70f6c2eb8f0", "metadata": {}, "source": [ - "It looks like among all tweets, at most a token appears six times, and it is either the word \"It\" or the word \"worst.\" \n", + "Parece que, entre todos los tweets, como máximo un token aparece seis veces, y es ya sea la palabra \"It\" o la palabra \"worst.\"\n", "\n", - "Let's go back to our tweets dataframe and locate the 918th tweet." + "Volvamos a nuestro dataframe de tweets y ubiquemos el tweet número 918." ] }, { @@ -1607,17 +1618,17 @@ "id": "3dba8e37-4880-4565-b6fc-7e7c96958f0f", "metadata": {}, "source": [ - "## Customize the `CountVectorizer`\n", + "## Personalizar el `CountVectorizer`\n", "\n", - "So far we've always used the default parameter setting to create our DTMs, but in many cases we may want to customize the `CountVectorizer` object. The purpose of doing so is to further filter out unnecessary tokens. In the example below, we tweak the following parameters:\n", + "Hasta ahora, siempre hemos utilizado la configuración predeterminada de parámetros para crear nuestras DTMs, pero en muchos casos, es posible que queramos personalizar el objeto `CountVectorizer`. El propósito de hacerlo es filtrar más a fondo los tokens innecesarios. 
En el ejemplo siguiente, ajustamos los siguientes parámetros:\n", "\n", - "- `stop_words = 'english'`: ignore English stop words \n", - "- `min_df = 2`: ignore words that don't occur at least twice\n", - "- `max_df = 0.95`: ignore words if they appear in more than 95\\% of the documents\n", + "- `stop_words = 'english'`: ignorar las palabras vacías en inglés\n", + "- `min_df = 2`: ignorar palabras que no ocurren al menos dos veces\n", + "- `max_df = 0.95`: ignorar palabras que aparecen en más del 95\\% de los documentos\n", "\n", - "🔔 **Question**: Let's pause for a minute to discuss whether it sounds reasonable to set these parameters! What do you think?\n", + "🔔 **Pregunta**: ¡Paremos un minuto para discutir si tiene sentido establecer estos parámetros! ¿Qué opinas?\n", "\n", - "Oftentimes, we are not interested in words whose frequencies are either too low or too high, so we use `min_df` and `max_df` to filter them out. Alternatively, we can define our vocabulary size as $N$ by setting `max_features`. In other words, we tell `CountVectorizer` to only consider the top $N$ most frequent tokens when constructing the DTM." + "A menudo, no estamos interesados en palabras cuya frecuencia es demasiado baja o demasiado alta, por lo que usamos `min_df` y `max_df` para filtrarlas. Alternativamente, podemos definir el tamaño de nuestro vocabulario como $N$ configurando `max_features`. En otras palabras, le decimos a `CountVectorizer` que solo considere los $N$ tokens más frecuentes al construir la DTM." ] }, { @@ -1657,7 +1668,7 @@ "id": "6d2e66bc-2eaa-4642-8848-74459948084b", "metadata": {}, "source": [ - "Our second DTM has a substantially smaller vocabulary compared to the first one." + "Nuestra segunda DTM tiene un vocabulario considerablemente más pequeño en comparación con la primera." 
] }, { @@ -1888,7 +1899,7 @@ "id": "998fe2c3-ec90-4027-8c7f-417327a33a27", "metadata": {}, "source": [ - "The most frequent token list now includes words that make more sense to us, such as \"cancelled\" and \"service.\" " + "La lista de tokens más frecuentes ahora incluye palabras que tienen más sentido para nosotros, como \"cancelled\" y \"service.\"" ] }, { @@ -3350,7 +3361,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "Python 3", "language": "python", "name": "python3" }, @@ -3364,7 +3375,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.4" + "version": "3.12.1" } }, "nbformat": 4, From f6abfbdec25a7b48ed60bb2f1826f6aa290128ce Mon Sep 17 00:00:00 2001 From: cchicaizap Date: Wed, 26 Mar 2025 02:17:14 +0000 Subject: [PATCH 4/8] preprocesamiento --- lessons/01_preprocessing.ipynb | 241 ++++++++++++++++++++++++++++++--- 1 file changed, 222 insertions(+), 19 deletions(-) diff --git a/lessons/01_preprocessing.ipynb b/lessons/01_preprocessing.ipynb index bbacb2b..dcfd4e5 100644 --- a/lessons/01_preprocessing.ipynb +++ b/lessons/01_preprocessing.ipynb @@ -32,26 +32,223 @@ "🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!
\n", "\n", "### Sections\n", - "1. [Preprocessing](#section1)\n", - "2. [Tokenization](#section2)\n", + "1. [Preprocesamiento](#section1)\n", "\n", - "In this three-part workshop series, we'll learn the building blocks for performing text analysis in Python. These techniques lie in the domain of Natural Language Processing (NLP). NLP is a field that deals with identifying and extracting patterns of language, primarily in written texts. Throughout the workshop series, we'll interact with various packages for performing text analysis: starting from simple string methods to specific NLP packages, such as `nltk`, `spaCy`, and more recent ones on Large Language Models (`BERT`).\n", + "En estas tres partes del trabajo, vamos a aprender conceptos básicos para realizar análisis de texto en Python. Estas técnicas pertenecen al dominio del procesamiento de lenguaje natural (NLP). NLP es un campo enfocado en identificar y extraer patrones del lenguaje, principalmente en textos escritos. Durante el trabajo, interactuaremos con diversos paquetes para realizar análisis de texto, desde métodos simples de strings hasta paquetes específicos de NLP, como `nltk`, `spaCy` y otros más recientes basados en modelos de lenguaje de gran escala (`BERT`).\n", "\n", - "Now, let's have these packages properly installed before diving into the materials." 
+ "Ahora bien, antes de iniciar, se debe instalar los siguientes paquetes:" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "id": "d442e4c7-e926-493d-a64e-516616ad915a", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collecting NLTK\n", + " Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)\n", + "Collecting click (from NLTK)\n", + " Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB)\n", + "Collecting joblib (from NLTK)\n", + " Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)\n", + "Collecting regex>=2021.8.3 (from NLTK)\n", + " Downloading regex-2024.11.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)\n", + "Collecting tqdm (from NLTK)\n", + " Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)\n", + "Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.5/1.5 MB\u001b[0m \u001b[31m13.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading regex-2024.11.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (796 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m796.9/796.9 kB\u001b[0m \u001b[31m21.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading click-8.1.8-py3-none-any.whl (98 kB)\n", + "Downloading joblib-1.4.2-py3-none-any.whl (301 kB)\n", + "Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)\n", + "Installing collected packages: tqdm, regex, joblib, click, NLTK\n", + "Successfully installed NLTK-3.9.1 click-8.1.8 joblib-1.4.2 regex-2024.11.6 tqdm-4.67.1\n", + "Note: you may need to restart the kernel to use updated packages.\n", + "Collecting transformers\n", + " Downloading transformers-4.50.1-py3-none-any.whl.metadata (39 kB)\n", + "Collecting filelock (from transformers)\n", + " Downloading 
filelock-3.18.0-py3-none-any.whl.metadata (2.9 kB)\n", + "Collecting huggingface-hub<1.0,>=0.26.0 (from transformers)\n", + " Downloading huggingface_hub-0.29.3-py3-none-any.whl.metadata (13 kB)\n", + "Collecting numpy>=1.17 (from transformers)\n", + " Downloading numpy-2.2.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)\n", + "Requirement already satisfied: packaging>=20.0 in /workspaces/Python-Text-Analysis_grupo_4/.venv/lib/python3.12/site-packages (from transformers) (24.2)\n", + "Collecting pyyaml>=5.1 (from transformers)\n", + " Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB)\n", + "Requirement already satisfied: regex!=2019.12.17 in /workspaces/Python-Text-Analysis_grupo_4/.venv/lib/python3.12/site-packages (from transformers) (2024.11.6)\n", + "Collecting requests (from transformers)\n", + " Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)\n", + "Collecting tokenizers<0.22,>=0.21 (from transformers)\n", + " Downloading tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)\n", + "Collecting safetensors>=0.4.3 (from transformers)\n", + " Downloading safetensors-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)\n", + "Requirement already satisfied: tqdm>=4.27 in /workspaces/Python-Text-Analysis_grupo_4/.venv/lib/python3.12/site-packages (from transformers) (4.67.1)\n", + "Collecting fsspec>=2023.5.0 (from huggingface-hub<1.0,>=0.26.0->transformers)\n", + " Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)\n", + "Collecting typing-extensions>=3.7.4.3 (from huggingface-hub<1.0,>=0.26.0->transformers)\n", + " Downloading typing_extensions-4.12.2-py3-none-any.whl.metadata (3.0 kB)\n", + "Collecting charset-normalizer<4,>=2 (from requests->transformers)\n", + " Downloading charset_normalizer-3.4.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (35 kB)\n", + 
"Collecting idna<4,>=2.5 (from requests->transformers)\n", + " Downloading idna-3.10-py3-none-any.whl.metadata (10 kB)\n", + "Collecting urllib3<3,>=1.21.1 (from requests->transformers)\n", + " Downloading urllib3-2.3.0-py3-none-any.whl.metadata (6.5 kB)\n", + "Collecting certifi>=2017.4.17 (from requests->transformers)\n", + " Using cached certifi-2025.1.31-py3-none-any.whl.metadata (2.5 kB)\n", + "Downloading transformers-4.50.1-py3-none-any.whl (10.2 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m10.2/10.2 MB\u001b[0m \u001b[31m56.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading huggingface_hub-0.29.3-py3-none-any.whl (468 kB)\n", + "Downloading numpy-2.2.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.1 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m16.1/16.1 MB\u001b[0m \u001b[31m58.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m\n", + "\u001b[?25hDownloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (767 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m767.5/767.5 kB\u001b[0m \u001b[31m32.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading safetensors-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (471 kB)\n", + "Downloading tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.0/3.0 MB\u001b[0m \u001b[31m49.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading filelock-3.18.0-py3-none-any.whl (16 kB)\n", + "Downloading requests-2.32.3-py3-none-any.whl (64 kB)\n", + "Using cached certifi-2025.1.31-py3-none-any.whl (166 kB)\n", + "Downloading charset_normalizer-3.4.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (145 kB)\n", + "Downloading 
fsspec-2025.3.0-py3-none-any.whl (193 kB)\n", + "Downloading idna-3.10-py3-none-any.whl (70 kB)\n", + "Downloading typing_extensions-4.12.2-py3-none-any.whl (37 kB)\n", + "Downloading urllib3-2.3.0-py3-none-any.whl (128 kB)\n", + "Installing collected packages: urllib3, typing-extensions, safetensors, pyyaml, numpy, idna, fsspec, filelock, charset-normalizer, certifi, requests, huggingface-hub, tokenizers, transformers\n", + "Successfully installed certifi-2025.1.31 charset-normalizer-3.4.1 filelock-3.18.0 fsspec-2025.3.0 huggingface-hub-0.29.3 idna-3.10 numpy-2.2.4 pyyaml-6.0.2 requests-2.32.3 safetensors-0.5.3 tokenizers-0.21.1 transformers-4.50.1 typing-extensions-4.12.2 urllib3-2.3.0\n", + "Note: you may need to restart the kernel to use updated packages.\n", + "Collecting spaCy\n", + " Downloading spacy-3.8.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)\n", + "Collecting spacy-legacy<3.1.0,>=3.0.11 (from spaCy)\n", + " Downloading spacy_legacy-3.0.12-py2.py3-none-any.whl.metadata (2.8 kB)\n", + "Collecting spacy-loggers<2.0.0,>=1.0.0 (from spaCy)\n", + " Downloading spacy_loggers-1.0.5-py3-none-any.whl.metadata (23 kB)\n", + "Collecting murmurhash<1.1.0,>=0.28.0 (from spaCy)\n", + " Downloading murmurhash-1.0.12-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB)\n", + "Collecting cymem<2.1.0,>=2.0.2 (from spaCy)\n", + " Downloading cymem-2.0.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.5 kB)\n", + "Collecting preshed<3.1.0,>=3.0.2 (from spaCy)\n", + " Downloading preshed-3.0.9-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.2 kB)\n", + "Collecting thinc<8.4.0,>=8.3.4 (from spaCy)\n", + " Downloading thinc-8.3.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)\n", + "Collecting wasabi<1.2.0,>=0.9.1 (from spaCy)\n", + " Downloading 
wasabi-1.1.3-py3-none-any.whl.metadata (28 kB)\n", + "Collecting srsly<3.0.0,>=2.4.3 (from spaCy)\n", + " Downloading srsly-2.5.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)\n", + "Collecting catalogue<2.1.0,>=2.0.6 (from spaCy)\n", + " Downloading catalogue-2.0.10-py3-none-any.whl.metadata (14 kB)\n", + "Collecting weasel<0.5.0,>=0.1.0 (from spaCy)\n", + " Downloading weasel-0.4.1-py3-none-any.whl.metadata (4.6 kB)\n", + "Collecting typer<1.0.0,>=0.3.0 (from spaCy)\n", + " Downloading typer-0.15.2-py3-none-any.whl.metadata (15 kB)\n", + "Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /workspaces/Python-Text-Analysis_grupo_4/.venv/lib/python3.12/site-packages (from spaCy) (4.67.1)\n", + "Requirement already satisfied: numpy>=1.19.0 in /workspaces/Python-Text-Analysis_grupo_4/.venv/lib/python3.12/site-packages (from spaCy) (2.2.4)\n", + "Requirement already satisfied: requests<3.0.0,>=2.13.0 in /workspaces/Python-Text-Analysis_grupo_4/.venv/lib/python3.12/site-packages (from spaCy) (2.32.3)\n", + "Collecting pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4 (from spaCy)\n", + " Downloading pydantic-2.10.6-py3-none-any.whl.metadata (30 kB)\n", + "Collecting jinja2 (from spaCy)\n", + " Downloading jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB)\n", + "Collecting setuptools (from spaCy)\n", + " Downloading setuptools-78.1.0-py3-none-any.whl.metadata (6.6 kB)\n", + "Requirement already satisfied: packaging>=20.0 in /workspaces/Python-Text-Analysis_grupo_4/.venv/lib/python3.12/site-packages (from spaCy) (24.2)\n", + "Collecting langcodes<4.0.0,>=3.2.0 (from spaCy)\n", + " Downloading langcodes-3.5.0-py3-none-any.whl.metadata (29 kB)\n", + "Collecting language-data>=1.2 (from langcodes<4.0.0,>=3.2.0->spaCy)\n", + " Downloading language_data-1.3.0-py3-none-any.whl.metadata (4.3 kB)\n", + "Collecting annotated-types>=0.6.0 (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spaCy)\n", + " Downloading annotated_types-0.7.0-py3-none-any.whl.metadata (15 
kB)\n", + "Collecting pydantic-core==2.27.2 (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spaCy)\n", + " Downloading pydantic_core-2.27.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)\n", + "Requirement already satisfied: typing-extensions>=4.12.2 in /workspaces/Python-Text-Analysis_grupo_4/.venv/lib/python3.12/site-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spaCy) (4.12.2)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /workspaces/Python-Text-Analysis_grupo_4/.venv/lib/python3.12/site-packages (from requests<3.0.0,>=2.13.0->spaCy) (3.4.1)\n", + "Requirement already satisfied: idna<4,>=2.5 in /workspaces/Python-Text-Analysis_grupo_4/.venv/lib/python3.12/site-packages (from requests<3.0.0,>=2.13.0->spaCy) (3.10)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /workspaces/Python-Text-Analysis_grupo_4/.venv/lib/python3.12/site-packages (from requests<3.0.0,>=2.13.0->spaCy) (2.3.0)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /workspaces/Python-Text-Analysis_grupo_4/.venv/lib/python3.12/site-packages (from requests<3.0.0,>=2.13.0->spaCy) (2025.1.31)\n", + "Collecting blis<1.3.0,>=1.2.0 (from thinc<8.4.0,>=8.3.4->spaCy)\n", + " Downloading blis-1.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)\n", + "Collecting confection<1.0.0,>=0.0.1 (from thinc<8.4.0,>=8.3.4->spaCy)\n", + " Downloading confection-0.1.5-py3-none-any.whl.metadata (19 kB)\n", + "Requirement already satisfied: click>=8.0.0 in /workspaces/Python-Text-Analysis_grupo_4/.venv/lib/python3.12/site-packages (from typer<1.0.0,>=0.3.0->spaCy) (8.1.8)\n", + "Collecting shellingham>=1.3.0 (from typer<1.0.0,>=0.3.0->spaCy)\n", + " Downloading shellingham-1.5.4-py2.py3-none-any.whl.metadata (3.5 kB)\n", + "Collecting rich>=10.11.0 (from typer<1.0.0,>=0.3.0->spaCy)\n", + " Downloading rich-13.9.4-py3-none-any.whl.metadata (18 kB)\n", + "Collecting cloudpathlib<1.0.0,>=0.7.0 (from 
weasel<0.5.0,>=0.1.0->spaCy)\n", + " Downloading cloudpathlib-0.21.0-py3-none-any.whl.metadata (14 kB)\n", + "Collecting smart-open<8.0.0,>=5.2.1 (from weasel<0.5.0,>=0.1.0->spaCy)\n", + " Downloading smart_open-7.1.0-py3-none-any.whl.metadata (24 kB)\n", + "Collecting MarkupSafe>=2.0 (from jinja2->spaCy)\n", + " Downloading MarkupSafe-3.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.0 kB)\n", + "Collecting marisa-trie>=1.1.0 (from language-data>=1.2->langcodes<4.0.0,>=3.2.0->spaCy)\n", + " Downloading marisa_trie-1.2.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.0 kB)\n", + "Collecting markdown-it-py>=2.2.0 (from rich>=10.11.0->typer<1.0.0,>=0.3.0->spaCy)\n", + " Downloading markdown_it_py-3.0.0-py3-none-any.whl.metadata (6.9 kB)\n", + "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /workspaces/Python-Text-Analysis_grupo_4/.venv/lib/python3.12/site-packages (from rich>=10.11.0->typer<1.0.0,>=0.3.0->spaCy) (2.19.1)\n", + "Collecting wrapt (from smart-open<8.0.0,>=5.2.1->weasel<0.5.0,>=0.1.0->spaCy)\n", + " Downloading wrapt-1.17.2-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.4 kB)\n", + "Collecting mdurl~=0.1 (from markdown-it-py>=2.2.0->rich>=10.11.0->typer<1.0.0,>=0.3.0->spaCy)\n", + " Downloading mdurl-0.1.2-py3-none-any.whl.metadata (1.6 kB)\n", + "Downloading spacy-3.8.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (31.8 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m31.8/31.8 MB\u001b[0m \u001b[31m48.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n", + "\u001b[?25hDownloading catalogue-2.0.10-py3-none-any.whl (17 kB)\n", + "Downloading cymem-2.0.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (227 kB)\n", + "Downloading langcodes-3.5.0-py3-none-any.whl (182 kB)\n", + "Downloading 
murmurhash-1.0.12-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (138 kB)\n", + "Downloading preshed-3.0.9-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (156 kB)\n", + "Downloading pydantic-2.10.6-py3-none-any.whl (431 kB)\n", + "Downloading pydantic_core-2.27.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.0/2.0 MB\u001b[0m \u001b[31m49.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading spacy_legacy-3.0.12-py2.py3-none-any.whl (29 kB)\n", + "Downloading spacy_loggers-1.0.5-py3-none-any.whl (22 kB)\n", + "Downloading srsly-2.5.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.1/1.1 MB\u001b[0m \u001b[31m42.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading thinc-8.3.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.7 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.7/3.7 MB\u001b[0m \u001b[31m34.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading typer-0.15.2-py3-none-any.whl (45 kB)\n", + "Downloading wasabi-1.1.3-py3-none-any.whl (27 kB)\n", + "Downloading weasel-0.4.1-py3-none-any.whl (50 kB)\n", + "Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)\n", + "Downloading setuptools-78.1.0-py3-none-any.whl (1.3 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m44.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading annotated_types-0.7.0-py3-none-any.whl (13 kB)\n", + "Downloading blis-1.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.6 MB)\n", + "\u001b[2K 
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m11.6/11.6 MB\u001b[0m \u001b[31m47.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m\n", + "\u001b[?25hDownloading cloudpathlib-0.21.0-py3-none-any.whl (52 kB)\n", + "Downloading confection-0.1.5-py3-none-any.whl (35 kB)\n", + "Downloading language_data-1.3.0-py3-none-any.whl (5.4 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.4/5.4 MB\u001b[0m \u001b[31m55.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading MarkupSafe-3.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23 kB)\n", + "Downloading rich-13.9.4-py3-none-any.whl (242 kB)\n", + "Downloading shellingham-1.5.4-py2.py3-none-any.whl (9.8 kB)\n", + "Downloading smart_open-7.1.0-py3-none-any.whl (61 kB)\n", + "Downloading marisa_trie-1.2.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.4/1.4 MB\u001b[0m \u001b[31m40.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading markdown_it_py-3.0.0-py3-none-any.whl (87 kB)\n", + "Downloading wrapt-1.17.2-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (89 kB)\n", + "Downloading mdurl-0.1.2-py3-none-any.whl (10.0 kB)\n", + "Installing collected packages: cymem, wrapt, wasabi, spacy-loggers, spacy-legacy, shellingham, setuptools, pydantic-core, murmurhash, mdurl, MarkupSafe, cloudpathlib, catalogue, blis, annotated-types, srsly, smart-open, pydantic, preshed, markdown-it-py, marisa-trie, jinja2, rich, language-data, confection, typer, thinc, langcodes, weasel, spaCy\n", + "Successfully installed MarkupSafe-3.0.2 annotated-types-0.7.0 blis-1.2.0 catalogue-2.0.10 cloudpathlib-0.21.0 confection-0.1.5 cymem-2.0.11 jinja2-3.1.6 langcodes-3.5.0 language-data-1.3.0 marisa-trie-1.2.1 markdown-it-py-3.0.0 mdurl-0.1.2 murmurhash-1.0.12 
preshed-3.0.9 pydantic-2.10.6 pydantic-core-2.27.2 rich-13.9.4 setuptools-78.1.0 shellingham-1.5.4 smart-open-7.1.0 spaCy-3.8.4 spacy-legacy-3.0.12 spacy-loggers-1.0.5 srsly-2.5.1 thinc-8.3.4 typer-0.15.2 wasabi-1.1.3 weasel-0.4.1 wrapt-1.17.2\n", + "Note: you may need to restart the kernel to use updated packages.\n", + "Collecting en-core-web-sm==3.8.0\n", + " Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.8/12.8 MB\u001b[0m \u001b[31m45.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m\n", + "\u001b[?25hInstalling collected packages: en-core-web-sm\n", + "Successfully installed en-core-web-sm-3.8.0\n", + "\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n", + "You can now load the package via spacy.load('en_core_web_sm')\n" + ] + } + ], "source": [ "# Uncomment the following lines to install packages/model\n", - "# %pip install NLTK\n", - "# %pip install transformers\n", - "# %pip install spaCy\n", - "# !python -m spacy download en_core_web_sm" + "%pip install NLTK\n", + "%pip install transformers\n", + "%pip install spaCy\n", + "!python -m spacy download en_core_web_sm" ] }, { @@ -61,16 +258,22 @@ "source": [ "\n", "\n", - "# Preprocessing\n", + "# Preprocesamiento\n", + "\n", + "En la primera parte de este trabajo, se abordará el primer paso para el análisis de texto. Nuestra meta sera convertir los datos desordenados en un formato consistente. Este proceso se conoce como preprocesameinto/ limpieza de texto/ normalización del texto.\n", + "\n", + "Al final del preprocesamiento, los datos seguirán estando en un formato legible. 
In the second and third parts, we will begin converting the text data into a numerical representation, a format better suited to computational processing.\n", + "\n", + "🔔 **Question**: Take a minute to reflect on your past experiences working with text data:\n", + "- What is the format of the text data you have worked with (plain text, CSV, XML)?\n", "\n", - "In Part 1 of this workshop, we'll address the first step of text analysis. Our goal is to convert the raw, messy text data into a consistent format. This process is often called **preprocessing**, **text cleaning**, or **text normalization**.\n", + "We have worked with text data in CSV, XML, and TXT formats, for data cleaning, analysis, and neural network training.\n", + "- Where did it come from (structured corpus, web scraping, surveys)?\n", "\n", - "You'll notice that at the end of preprocessing, our data is still in a format that we can read and understand. In Parts 2 and 3, we will begin our foray into converting the text data into a numerical representation—a format that can be more readily handled by computers. \n", + "The data was obtained from Kaggle, since it hosts a large collection of datasets of all kinds.\n", + "- Was the data messy or inconsistent?\n", "\n", - "🔔 **Question**: Let's pause for a minute to reflect on **your** previous experiences working on text data. \n", - "- What is the format of the text data you have interacted with (plain text, CSV, or XML)?\n", - "- Where does it come from (structured corpus, scraped from the web, survey data)?\n", - "- Is it messy (i.e., is the data formatted consistently)?" + "In some cases the data was messy; in other cases we discarded the inconsistent data because we needed to move quickly with the project." 
] }, { @@ -2158,7 +2361,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": ".venv", "language": "python", "name": "python3" }, @@ -2172,7 +2375,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.4" + "version": "3.12.1" } }, "nbformat": 4, From 8c2fd51d476594b46076544607b9c393475c013a Mon Sep 17 00:00:00 2001 From: cchicaizap Date: Wed, 26 Mar 2025 02:33:09 +0000 Subject: [PATCH 5/8] Importar datos de texto --- lessons/01_preprocessing.ipynb | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/lessons/01_preprocessing.ipynb b/lessons/01_preprocessing.ipynb index dcfd4e5..23ad48f 100644 --- a/lessons/01_preprocessing.ipynb +++ b/lessons/01_preprocessing.ipynb @@ -281,21 +281,21 @@ "id": "4b35911a-3b3f-4a48-a7d1-9882aab04851", "metadata": {}, "source": [ - "## Common Processes\n", + "## Common Processes\n", "\n", - "Preprocessing is not something we can accomplish with a single line of code. We often start by familiarizing ourselves with the data, and along the way, we gain a clearer understanding of the granularity of preprocessing we want to apply.\n", + "Preprocessing cannot be accomplished with a single line of code. We often familiarize ourselves with the data first, to better understand the level of granularity the preprocessing requires.\n", "\n", - "Typically, we begin by applying a set of commonly used processes to clean the data. These operations don't substantially alter the form or meaning of the data; they serve as a standardized procedure to reshape the data into a consistent format.\n", + "Typically, we start by applying a set of commonly used processes to clean the data. 
These operations do not substantially alter the form or meaning of the data; they only serve as a standardized procedure for reorganizing the data into a consistent format.\n", "\n", - "The following processes, for examples, are commonly applied to preprocess English texts of various genres. These operations can be done using built-in Python functions, such as `string` methods, and Regular Expressions. \n", - "- Lowercase the text\n", - "- Remove punctuation marks\n", - "- Remove extra whitespace characters\n", - "- Remove stop words\n", + "The following processes, for example, are commonly applied to preprocess English texts of various genres. These operations can be done using built-in Python functions, such as `string` methods, and regular expressions.\n", + "- Lowercase the text\n", + "- Remove punctuation marks\n", + "- Remove extra whitespace characters\n", + "- Remove stop words\n", "\n", - "After the initial processing, we may choose to perform task-specific processes, the specifics of which often depend on the downstream task we want to perform and the nature of the text data (i.e., its stylistic and linguistic features). \n", + "After the initial preprocessing, we can choose task-specific processes; their details depend on the downstream task we want to perform and the type of text data (that is, its stylistic and linguistic features).\n", "\n", - "Before we jump into these operations, let's take a look at our data!" + "Before we dive into these operations, let's take a look at our data!" ] }, { @@ -303,11 +303,11 @@ "id": "ec5d7350-9a1e-4db9-b828-a87fe1676d8d", "metadata": {}, "source": [ - "### Import the Text Data\n", + "### Import the Text Data\n", "\n", - "The text data we'll be working with is a CSV file. It contains tweets about U.S. airlines, scrapped from Feb 2015. 
\n", + "We will work with a CSV file. It contains tweets about U.S. airlines, collected in February 2015.\n", "\n", - "Let's read the file `airline_tweets.csv` into dataframe with `pandas`." + "Let's read the file `airline_tweets.csv` into a `pandas` dataframe." ] }, { From a64d098f7715e03791dd03f2faf4c02513c131d7 Mon Sep 17 00:00:00 2001 From: cchicaizap Date: Wed, 26 Mar 2025 02:49:17 +0000 Subject: [PATCH 6/8] =?UTF-8?q?Convertir=20a=20min=C3=BAsculas?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- lessons/01_preprocessing.ipynb | 73 +++++++++++++++++++++++++--------- 1 file changed, 54 insertions(+), 19 deletions(-) diff --git a/lessons/01_preprocessing.ipynb b/lessons/01_preprocessing.ipynb index 23ad48f..54c3836 100644 --- a/lessons/01_preprocessing.ipynb +++ b/lessons/01_preprocessing.ipynb @@ -312,7 +312,40 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 3, + "id": "6bda2022", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collecting pandas\n", + " Downloading pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)\n", + "Requirement already satisfied: numpy>=1.26.0 in /workspaces/Python-Text-Analysis_grupo_4/.venv/lib/python3.12/site-packages (from pandas) (2.2.4)\n", + "Requirement already satisfied: python-dateutil>=2.8.2 in /workspaces/Python-Text-Analysis_grupo_4/.venv/lib/python3.12/site-packages (from pandas) (2.9.0.post0)\n", + "Collecting pytz>=2020.1 (from pandas)\n", + " Downloading pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)\n", + "Collecting tzdata>=2022.7 (from pandas)\n", + " Downloading tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)\n", + "Requirement already satisfied: six>=1.5 in /workspaces/Python-Text-Analysis_grupo_4/.venv/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\n", + "Downloading 
pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.7/12.7 MB\u001b[0m \u001b[31m44.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m\n", + "\u001b[?25hDownloading pytz-2025.2-py2.py3-none-any.whl (509 kB)\n", + "Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)\n", + "Installing collected packages: pytz, tzdata, pandas\n", + "Successfully installed pandas-2.2.3 pytz-2025.2 tzdata-2025.2\n", + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install pandas" + ] + }, + { + "cell_type": "code", + "execution_count": 4, "id": "3d1ff64b-53ad-4eca-b846-3fda20085c43", "metadata": {}, "outputs": [], @@ -329,7 +362,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 5, "id": "e397ac6a-c2ba-4cce-8700-b36b38026c9d", "metadata": {}, "outputs": [ @@ -503,7 +536,7 @@ "4 2015-02-24 11:14:45 -0800 NaN Pacific Time (US & Canada) " ] }, - "execution_count": 2, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } @@ -518,13 +551,13 @@ "id": "ae3b339f-45cf-465d-931c-05f9096fd510", "metadata": {}, "source": [ - "The dataframe has one row per tweet. The text of tweet is shown in the `text` column.\n", - "- `text` (`str`): the text of the tweet.\n", + "The dataframe has one row per tweet. The text of the tweet is shown in the `text` column.\n", + "- `text` (`str`): the text of the tweet.\n", "\n", - "Other metadata we are interested in include: \n", - "- `airline_sentiment` (`str`): the sentiment of the tweet, labeled as \"neutral,\" \"positive,\" or \"negative.\"\n", - "- `airline` (`str`): the airline that is tweeted about.\n", - "- `retweet count` (`int`): how many times the tweet was retweeted." 
+ "Other relevant information we are interested in includes:\n", + "- `airline_sentiment` (`str`): the sentiment of the tweet, labeled as \"neutral\", \"positive\", or \"negative\".\n", + "- `airline` (`str`): the airline being tweeted about.\n", + "- `retweet count` (`int`): the number of times the tweet was retweeted." ] }, { @@ -532,12 +565,12 @@ "id": "302c695b-4bd1-4151-9cb9-ef5253eb16df", "metadata": {}, "source": [ - "Let's take a look at some of the tweets:" + "Let's take a look at some of the tweets:" ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 7, "id": "b690daab-7be5-4b8f-8af0-a91fdec4ec4f", "metadata": {}, "outputs": [ @@ -562,7 +595,9 @@ "id": "8adc05fa-ad30-4402-ab56-086bcb09a166", "metadata": {}, "source": [ - "🔔 **Question**: What have you noticed? What are the stylistic features of tweets?" + "🔔 **Question**: What have you noticed? What are the stylistic features of the tweets?\n", + "\n", + "The tweets are informal and direct about the airline's services, which makes it possible to identify the users' sentiment." ] }, { @@ -570,20 +605,20 @@ "id": "c3460393-00a6-461c-b02a-9e98f9b5d1af", "metadata": {}, "source": [ - "### Lowercasing\n", + "### Lowercasing\n", "\n", - "While we acknowledge that a word's casing is informative, we often don't work in contexts where we can properly utilize this information.\n", + "While we acknowledge that a word's casing is informative, we often don't work in contexts where we can make proper use of this information.\n", "\n", - "More often, the subsequent analysis we perform is **case-insensitive**. For instance, in frequency analysis, we want to account for various forms of the same word. Lowercasing the text data aids in this process and simplifies our analysis.\n", + "More often, the subsequent analysis we perform is case-insensitive. 
For instance, in frequency analysis, we usually want to account for various forms of the same word. Lowercasing the text data aids in this process and simplifies our analysis.\n", "\n", - "We can easily achieve lowercasing with the string method [`.lower()`](https://docs.python.org/3/library/stdtypes.html#str.lower); see [documentation](https://docs.python.org/3/library/stdtypes.html#string-methods) for more useful functions.\n", + "We can easily achieve lowercasing with the string method [`.lower()`](https://docs.python.org/3/library/stdtypes.html#str.lower); see the [documentation](https://docs.python.org/3/library/stdtypes.html#string-methods) for more useful functions.\n", "\n", - "Let's apply it to the following example:" + "Let's apply it to the following example:" ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 8, "id": "58a95d90-3ef1-4bff-9cfe-d447ed99f252", "metadata": {}, "outputs": [ @@ -603,7 +638,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 9, "id": "c66d91c0-6eed-4591-95fc-cd2eae2e0d41", "metadata": {}, "outputs": [ From 04792fa6e27fb5e48caa749d45e2c2eefd732200 Mon Sep 17 00:00:00 2001 From: juvizueteva Date: Wed, 26 Mar 2025 03:02:20 +0000 Subject: [PATCH 7/8] challenges realizados --- lessons/02_bag_of_words.ipynb | 442 ++++++++++++++++++---------------- 1 file changed, 235 insertions(+), 207 deletions(-) diff --git a/lessons/02_bag_of_words.ipynb b/lessons/02_bag_of_words.ipynb index 25868f6..67ebae3 100644 --- a/lessons/02_bag_of_words.ipynb +++ b/lessons/02_bag_of_words.ipynb @@ -48,7 +48,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 68, "id": "9e4a3a0d-66f4-44e5-8dd6-5f441146014d", "metadata": { "scrolled": true, @@ -62,7 +62,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 69, "id": "21ed437f-9767-43b7-abc5-159aa4339a31", "metadata": {}, "outputs": [], @@ -75,7 +75,7 @@ }, { 
"cell_type": "code", - "execution_count": 9, + "execution_count": 70, "id": "f3862ffd-918f-4184-8c90-8a39a8a2a069", "metadata": {}, "outputs": [], @@ -104,7 +104,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 71, "id": "4190e351-97b7-4c5b-866e-07aa6cbd42c2", "metadata": {}, "outputs": [], @@ -116,7 +116,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 72, "id": "79acbaf2-6625-4abb-b50f-97ea54ba0d11", "metadata": {}, "outputs": [ @@ -290,7 +290,7 @@ "4 2015-02-24 11:14:45 -0800 NaN Pacific Time (US & Canada) " ] }, - "execution_count": 20, + "execution_count": 72, "metadata": {}, "output_type": "execute_result" } @@ -316,7 +316,7 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 73, "id": "a1faaf90-8c01-4d25-9468-90c01823f0d5", "metadata": {}, "outputs": [], @@ -334,7 +334,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 74, "id": "438830e6-1064-47fe-b578-a1ca693a0ed0", "metadata": {}, "outputs": [ @@ -369,7 +369,7 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 75, "id": "01955158-6954-447a-acb6-2989d02a49c3", "metadata": {}, "outputs": [ @@ -404,7 +404,7 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 76, "id": "428ddde7-af73-4eb6-92c9-041a1791ca59", "metadata": {}, "outputs": [ @@ -417,7 +417,7 @@ "Name: retweet_count, dtype: float64" ] }, - "execution_count": 24, + "execution_count": 76, "metadata": {}, "output_type": "execute_result" } @@ -439,7 +439,7 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 77, "id": "12aa9f2d-d655-494a-bb72-08ad973518f3", "metadata": {}, "outputs": [ @@ -519,7 +519,7 @@ "Virgin America 0.543544 0.456456" ] }, - "execution_count": 25, + "execution_count": 77, "metadata": {}, "output_type": "execute_result" } @@ -581,30 +581,20 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 78, "id": 
"21738b02-9ab9-4a61-b41f-ff75888aa747", "metadata": { "tags": [] }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/workspaces/Python-Text-Analysis_grupo_4/lessons/utils.py:4: SyntaxWarning: invalid escape sequence '\\d'\n", - " digit_pattern = '\\d+'\n", - "/workspaces/Python-Text-Analysis_grupo_4/lessons/utils.py:14: SyntaxWarning: invalid escape sequence '\\d'\n", - " digit_pattern = '\\d+'\n" - ] - } - ], + "outputs": [], "source": [ - "from utils import placeholder" + "from utils import placeholder\n", + "import re" ] }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 79, "id": "03569f0d-34ba-492d-aa1d-1dce9d34f792", "metadata": {}, "outputs": [], @@ -615,21 +605,21 @@ "def preprocess(text):\n", " '''Create a preprocess pipeline that cleans the tweet data.'''\n", " \n", - " # Step 1: Lowercase\n", - " text = ...\n", - "\n", - " # Step 2: Replace patterns with placeholders\n", - " text = ...\n", - "\n", + " # Step 1: Convert text to lowercase\n", + " text = text.lower()\n", + " \n", + " # Step 2: Replace patterns with placeholders (URLs, digits, hashtags, user handles)\n", + " text = placeholder(text)\n", + " \n", " # Step 3: Remove extra whitespace characters\n", - " text = ...\n", - "\n", + " text = re.sub(blankspace_pattern, blankspace_repl, text)\n", + " \n", " return text" ] }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 80, "id": "8990cefd-5d04-46ba-ada2-29978c28cfe8", "metadata": {}, "outputs": [ @@ -639,7 +629,7 @@ "text": [ "lol @justinbeiber and @BillGates are like soo 2000 #yesterday #amiright saw it on https://twitter.com #yolo\n", "==================================================\n", - "Ellipsis\n" + "lol USER and USER are like soo DIGIT HASHTAG HASHTAG saw it on URL HASHTAG \n" ] } ], @@ -656,7 +646,7 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 81, "id": "a5f7bb6a-f064-48cc-b650-12c4ef2fbb88", "metadata": { "scrolled": true @@ -665,15 
+655,15 @@ { "data": { "text/plain": [ - "0 Ellipsis\n", - "1 Ellipsis\n", - "2 Ellipsis\n", - "3 Ellipsis\n", - "4 Ellipsis\n", + "0 USER plus you've added commercials to the exp...\n", + "1 USER it's really aggressive to blast obnoxiou...\n", + "2 USER and it's a really big bad thing about it\n", + "3 USER seriously would pay $ DIGIT a flight for...\n", + "4 USER yes, nearly every time i fly vx this “ea...\n", "Name: text_processed, dtype: object" ] }, - "execution_count": 28, + "execution_count": 81, "metadata": {}, "output_type": "execute_result" } @@ -741,7 +731,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 82, "id": "cd2adf56-ba93-459d-8cfa-16ce8dc9284b", "metadata": {}, "outputs": [], @@ -771,7 +761,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 83, "id": "4da2bd3d-0460-4b5f-9b9e-02940db0d7ca", "metadata": {}, "outputs": [], @@ -795,7 +785,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 84, "id": "9de3fe6a-9abf-4e11-aad1-e54c891567bb", "metadata": {}, "outputs": [], @@ -816,7 +806,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 85, "id": "da1bbad4-bb1a-4b92-9096-6e17558b4a42", "metadata": {}, "outputs": [], @@ -837,7 +827,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 86, "id": "cb044001-8eb2-4489-b025-2d8e2d4bfee2", "metadata": {}, "outputs": [ @@ -848,7 +838,7 @@ "\twith 9 stored elements and shape (4, 8)>" ] }, - "execution_count": 5, + "execution_count": 86, "metadata": {}, "output_type": "execute_result" } @@ -869,7 +859,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 87, "id": "bb03a238-87d8-40c9-b20e-66e7c9b6576b", "metadata": {}, "outputs": [ @@ -882,7 +872,7 @@ " [0, 1, 0, 0, 0, 0, 0, 1]])" ] }, - "execution_count": 6, + "execution_count": 87, "metadata": {}, "output_type": "execute_result" } @@ -902,7 +892,7 @@ }, { "cell_type": "code", - "execution_count": 7, + 
"execution_count": 88, "id": "714de5d3-e37d-4a19-9ade-3c6629e38d4e", "metadata": {}, "outputs": [ @@ -913,7 +903,7 @@ " 'time'], dtype=object)" ] }, - "execution_count": 7, + "execution_count": 88, "metadata": {}, "output_type": "execute_result" } @@ -925,7 +915,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 89, "id": "6a7729a2-ca2e-4de7-8795-74dfedb7a4d5", "metadata": {}, "outputs": [], @@ -945,7 +935,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 90, "id": "e41dd243-cd2e-43c3-80f8-5eaab6e64210", "metadata": {}, "outputs": [ @@ -1037,7 +1027,7 @@ "3 0 1 0 0 0 0 0 1" ] }, - "execution_count": 11, + "execution_count": 90, "metadata": {}, "output_type": "execute_result" } @@ -1064,7 +1054,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 91, "id": "783e44a4-4a22-4290-b222-282b02c080dc", "metadata": {}, "outputs": [], @@ -1079,27 +1069,22 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 92, "id": "f85e76ea-bc54-4775-bcda-432a03d2c96f", "metadata": { "scrolled": true }, "outputs": [ { - "ename": "AttributeError", - "evalue": "'ellipsis' object has no attribute 'lower'", - "output_type": "error", - "traceback": [ - "\u001b[31m---------------------------------------------------------------------------\u001b[39m", - "\u001b[31mAttributeError\u001b[39m Traceback (most recent call last)", - "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[29]\u001b[39m\u001b[32m, line 2\u001b[39m\n\u001b[32m 1\u001b[39m \u001b[38;5;66;03m# Fit and transform to create DTM\u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m2\u001b[39m counts = \u001b[43mvectorizer\u001b[49m\u001b[43m.\u001b[49m\u001b[43mfit_transform\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtweets\u001b[49m\u001b[43m[\u001b[49m\u001b[33;43m'\u001b[39;49m\u001b[33;43mtext_processed\u001b[39;49m\u001b[33;43m'\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 3\u001b[39m counts\n", - 
"\u001b[36mFile \u001b[39m\u001b[32m~/.local/lib/python3.12/site-packages/sklearn/base.py:1389\u001b[39m, in \u001b[36m_fit_context..decorator..wrapper\u001b[39m\u001b[34m(estimator, *args, **kwargs)\u001b[39m\n\u001b[32m 1382\u001b[39m estimator._validate_params()\n\u001b[32m 1384\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m config_context(\n\u001b[32m 1385\u001b[39m skip_parameter_validation=(\n\u001b[32m 1386\u001b[39m prefer_skip_nested_validation \u001b[38;5;129;01mor\u001b[39;00m global_skip_validation\n\u001b[32m 1387\u001b[39m )\n\u001b[32m 1388\u001b[39m ):\n\u001b[32m-> \u001b[39m\u001b[32m1389\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfit_method\u001b[49m\u001b[43m(\u001b[49m\u001b[43mestimator\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n", - "\u001b[36mFile \u001b[39m\u001b[32m~/.local/lib/python3.12/site-packages/sklearn/feature_extraction/text.py:1376\u001b[39m, in \u001b[36mCountVectorizer.fit_transform\u001b[39m\u001b[34m(self, raw_documents, y)\u001b[39m\n\u001b[32m 1368\u001b[39m warnings.warn(\n\u001b[32m 1369\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mUpper case characters found in\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 1370\u001b[39m \u001b[33m\"\u001b[39m\u001b[33m vocabulary while \u001b[39m\u001b[33m'\u001b[39m\u001b[33mlowercase\u001b[39m\u001b[33m'\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 1371\u001b[39m \u001b[33m\"\u001b[39m\u001b[33m is True. 
These entries will not\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 1372\u001b[39m \u001b[33m\"\u001b[39m\u001b[33m be matched with any documents\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 1373\u001b[39m )\n\u001b[32m 1374\u001b[39m \u001b[38;5;28;01mbreak\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m1376\u001b[39m vocabulary, X = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_count_vocab\u001b[49m\u001b[43m(\u001b[49m\u001b[43mraw_documents\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mfixed_vocabulary_\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1378\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m.binary:\n\u001b[32m 1379\u001b[39m X.data.fill(\u001b[32m1\u001b[39m)\n", - "\u001b[36mFile \u001b[39m\u001b[32m~/.local/lib/python3.12/site-packages/sklearn/feature_extraction/text.py:1263\u001b[39m, in \u001b[36mCountVectorizer._count_vocab\u001b[39m\u001b[34m(self, raw_documents, fixed_vocab)\u001b[39m\n\u001b[32m 1261\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m doc \u001b[38;5;129;01min\u001b[39;00m raw_documents:\n\u001b[32m 1262\u001b[39m feature_counter = {}\n\u001b[32m-> \u001b[39m\u001b[32m1263\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m feature \u001b[38;5;129;01min\u001b[39;00m \u001b[43manalyze\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdoc\u001b[49m\u001b[43m)\u001b[49m:\n\u001b[32m 1264\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m 1265\u001b[39m feature_idx = vocabulary[feature]\n", - "\u001b[36mFile \u001b[39m\u001b[32m~/.local/lib/python3.12/site-packages/sklearn/feature_extraction/text.py:104\u001b[39m, in \u001b[36m_analyze\u001b[39m\u001b[34m(doc, analyzer, tokenizer, ngrams, preprocessor, decoder, stop_words)\u001b[39m\n\u001b[32m 102\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m 103\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m preprocessor \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m 
\u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m104\u001b[39m doc = \u001b[43mpreprocessor\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdoc\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 105\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m tokenizer \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m 106\u001b[39m doc = tokenizer(doc)\n", - "\u001b[36mFile \u001b[39m\u001b[32m~/.local/lib/python3.12/site-packages/sklearn/feature_extraction/text.py:62\u001b[39m, in \u001b[36m_preprocess\u001b[39m\u001b[34m(doc, accent_function, lower)\u001b[39m\n\u001b[32m 43\u001b[39m \u001b[38;5;250m\u001b[39m\u001b[33;03m\"\"\"Chain together an optional series of text preprocessing steps to\u001b[39;00m\n\u001b[32m 44\u001b[39m \u001b[33;03mapply to a document.\u001b[39;00m\n\u001b[32m 45\u001b[39m \n\u001b[32m (...)\u001b[39m\u001b[32m 59\u001b[39m \u001b[33;03m preprocessed string\u001b[39;00m\n\u001b[32m 60\u001b[39m \u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m 61\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m lower:\n\u001b[32m---> \u001b[39m\u001b[32m62\u001b[39m doc = \u001b[43mdoc\u001b[49m\u001b[43m.\u001b[49m\u001b[43mlower\u001b[49m()\n\u001b[32m 63\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m accent_function \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m 64\u001b[39m doc = accent_function(doc)\n", - "\u001b[31mAttributeError\u001b[39m: 'ellipsis' object has no attribute 'lower'" - ] + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 92, + "metadata": {}, + "output_type": "execute_result" } ], "source": [ @@ -1110,20 +1095,25 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 93, "id": "87119057-c78c-4eb2-a9d6-3e9f44e4c22b", "metadata": {}, "outputs": [ { - "ename": "NameError", - "evalue": "name 'counts' is not defined", - "output_type": "error", - "traceback": [ - 
"\u001b[31m---------------------------------------------------------------------------\u001b[39m", - "\u001b[31mNameError\u001b[39m Traceback (most recent call last)", - "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[30]\u001b[39m\u001b[32m, line 2\u001b[39m\n\u001b[32m 1\u001b[39m \u001b[38;5;66;03m# Do not run if you have limited memory - this includes DataHub and Binder\u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m2\u001b[39m np.array(\u001b[43mcounts\u001b[49m.todense())\n", - "\u001b[31mNameError\u001b[39m: name 'counts' is not defined" - ] + "data": { + "text/plain": [ + "array([[0, 0, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " ...,\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0]], shape=(11541, 8751))" + ] + }, + "execution_count": 93, + "metadata": {}, + "output_type": "execute_result" } ], "source": [ @@ -1133,7 +1123,7 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 94, "id": "99322b85-1a15-46a5-bb80-bb5eaa6eeb7b", "metadata": {}, "outputs": [], @@ -1144,7 +1134,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 95, "id": "43620587-3795-4434-8f1f-145c81b93706", "metadata": {}, "outputs": [ @@ -1176,7 +1166,7 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 96, "id": "bb3604ec-d909-4238-9a3f-67e7d4ae2ac5", "metadata": {}, "outputs": [ @@ -1368,7 +1358,7 @@ "[5 rows x 8751 columns]" ] }, - "execution_count": 27, + "execution_count": 96, "metadata": {}, "output_type": "execute_result" } @@ -1389,7 +1379,7 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 97, "id": "f432154a-eae0-4723-a797-55f3cfdd71c4", "metadata": {}, "outputs": [ @@ -1409,7 +1399,7 @@ "dtype: int64" ] }, - "execution_count": 28, + "execution_count": 97, "metadata": {}, "output_type": "execute_result" } @@ -1421,27 +1411,27 @@ }, { "cell_type": "code", - "execution_count": 
29, + "execution_count": 98, "id": "26c7f1c9-dd66-49f2-b337-01253da551d2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "_exact_ 1\n", - "mightmismybrosgraduation 1\n", - "midterm 1\n", - "midnite 1\n", - "midland 1\n", - "michelle 1\n", - "michele 1\n", - "michael 1\n", - "mhtt 1\n", - "mgmt 1\n", + "zones 1\n", + "accelerate 1\n", + "acc 1\n", + "acarl 1\n", + "yogurt 1\n", + "yoga 1\n", + "yikes 1\n", + "absurdity 1\n", + "absorber 1\n", + "absorb 1\n", "dtype: int64" ] }, - "execution_count": 29, + "execution_count": 98, "metadata": {}, "output_type": "execute_result" } @@ -1463,7 +1453,7 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 99, "id": "efb8f4d8-4c88-4155-a6c5-c72a5b4e8bb8", "metadata": {}, "outputs": [ @@ -1504,42 +1494,42 @@ " 6\n", " \n", " \n", - " 10572\n", + " 11007\n", " to\n", " 5\n", " \n", " \n", - " 8148\n", - " the\n", + " 5513\n", + " to\n", " 5\n", " \n", " \n", - " 10742\n", + " 7750\n", " to\n", " 5\n", " \n", " \n", - " 152\n", - " to\n", + " 10923\n", + " the\n", " 5\n", " \n", " \n", - " 5005\n", + " 4089\n", " to\n", " 5\n", " \n", " \n", - " 10923\n", - " the\n", + " 8134\n", + " to\n", " 5\n", " \n", " \n", - " 7750\n", - " to\n", + " 8148\n", + " the\n", " 5\n", " \n", " \n", - " 355\n", + " 557\n", " to\n", " 5\n", " \n", @@ -1551,17 +1541,17 @@ " token number\n", "3127 lt 6\n", "918 worst 6\n", - "10572 to 5\n", - "8148 the 5\n", - "10742 to 5\n", - "152 to 5\n", - "5005 to 5\n", - "10923 the 5\n", + "11007 to 5\n", + "5513 to 5\n", "7750 to 5\n", - "355 to 5" + "10923 the 5\n", + "4089 to 5\n", + "8134 to 5\n", + "8148 the 5\n", + "557 to 5" ] }, - "execution_count": 30, + "execution_count": 99, "metadata": {}, "output_type": "execute_result" } @@ -1593,7 +1583,7 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 100, "id": "5e7cacd8-1fb3-4f0d-a744-4ee0994a089f", "metadata": {}, "outputs": [ @@ -1603,7 +1593,7 @@ "\"@united is the worst. 
Worst reservation policies. Worst costumer service. Worst worst worst. Congrats, @Delta you're not that bad!\"" ] }, - "execution_count": 31, + "execution_count": 100, "metadata": {}, "output_type": "execute_result" } @@ -1633,7 +1623,7 @@ }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 101, "id": "37a0a93e-9dd8-43dc-a82c-06a24bf02bc9", "metadata": {}, "outputs": [], @@ -1648,7 +1638,7 @@ }, { "cell_type": "code", - "execution_count": 33, + "execution_count": 102, "id": "b53e5ecf-7be3-4915-9d11-fd3edb913400", "metadata": {}, "outputs": [], @@ -1673,7 +1663,7 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 103, "id": "570fb598-fa81-4111-9e36-7172d8034713", "metadata": {}, "outputs": [ @@ -1693,7 +1683,7 @@ }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 104, "id": "d8deabb2-20eb-4047-b592-48cb1564fd2a", "metadata": {}, "outputs": [ @@ -1885,7 +1875,7 @@ "[5 rows x 4471 columns]" ] }, - "execution_count": 35, + "execution_count": 104, "metadata": {}, "output_type": "execute_result" } @@ -1904,7 +1894,7 @@ }, { "cell_type": "code", - "execution_count": 36, + "execution_count": 105, "id": "ffa7bf4e-640b-49bc-b64b-721140f67f76", "metadata": {}, "outputs": [ @@ -1924,7 +1914,7 @@ "dtype: int64" ] }, - "execution_count": 36, + "execution_count": 105, "metadata": {}, "output_type": "execute_result" } @@ -1956,19 +1946,23 @@ }, { "cell_type": "code", - "execution_count": 37, + "execution_count": 106, "id": "da610560-62c3-48ab-a1b2-25e0b589bc61", "metadata": {}, "outputs": [], "source": [ "# Import spaCy\n", "import spacy\n", + "import pandas as pd\n", + "from sklearn.feature_extraction.text import CountVectorizer\n", + "\n", + "# Load the spaCy model\n", "nlp = spacy.load('en_core_web_sm')" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 107, "id": "98ead266-30f3-48ad-bc51-c1685487f000", "metadata": { "scrolled": true @@ -1978,17 +1972,17 @@ "# Create a function to 
lemmatize text\n", "def lemmatize_text(text):\n", " '''Lemmatize the text input with spaCy annotations.'''\n", - "\n", + " \n", " # Step 1: Initialize an empty list to hold lemmas\n", - " lemma = ...\n", - "\n", + " lemma = []\n", + " \n", " # Step 2: Apply the nlp pipeline to input text\n", - " doc = ...\n", - "\n", + " doc = nlp(text)\n", + " \n", " # Step 3: Iterate over tokens in the text to get the token lemma\n", " for token in doc:\n", - " lemma.append(...)\n", - "\n", + " lemma.append(token.lemma_)\n", + " \n", " # Step 4: Join lemmas together into a single string\n", " text_lemma = ' '.join(lemma)\n", " \n", @@ -2005,7 +1999,7 @@ }, { "cell_type": "code", - "execution_count": 39, + "execution_count": 108, "id": "742e82bb-5c42-4fa8-9101-5a0ea908db25", "metadata": {}, "outputs": [ @@ -2013,9 +2007,9 @@ "name": "stdout", "output_type": "stream", "text": [ - "USER wow this just blew my mind\n", + " USER wow this just blew my mind\n", "==================================================\n", - "USER wow this just blow my mind\n" + " USER wow this just blow my mind\n" ] } ], @@ -2036,7 +2030,7 @@ }, { "cell_type": "code", - "execution_count": 40, + "execution_count": 109, "id": "1ac128d2-1be5-4ef5-bb50-5b8d44ef8ee9", "metadata": {}, "outputs": [], @@ -2055,7 +2049,7 @@ }, { "cell_type": "code", - "execution_count": 41, + "execution_count": 110, "id": "5f49d790-3c9d-4dc1-a5c9-72c306630412", "metadata": {}, "outputs": [ @@ -2226,7 +2220,7 @@ " \n", " \n", "\n", - "
5 rows × 3553 columns\n", + "5 rows × 3571 columns
\n", "" ], "text/plain": [ @@ -2244,10 +2238,10 @@ "3 0 0 0 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 0 0 0 0 \n", "\n", - "[5 rows x 3553 columns]" + "[5 rows x 3571 columns]" ] }, - "execution_count": 41, + "execution_count": 110, "metadata": {}, "output_type": "execute_result" } @@ -2273,7 +2267,7 @@ }, { "cell_type": "code", - "execution_count": 42, + "execution_count": 111, "id": "9859eb04-dbd2-4fa0-9798-65ed7496c297", "metadata": {}, "outputs": [ @@ -2283,7 +2277,7 @@ "text": [ "(11541, 8751)\n", "(11541, 4471)\n", - "(11541, 3553)\n" + "(11541, 3571)\n" ] } ], @@ -2304,7 +2298,7 @@ }, { "cell_type": "code", - "execution_count": 43, + "execution_count": 112, "id": "5745ca29-97ed-4fe1-81db-7e402c8da674", "metadata": {}, "outputs": [ @@ -2312,19 +2306,19 @@ "data": { "text/plain": [ "digit 6927\n", - "flight 4043\n", + "flight 3952\n", "hashtag 2633\n", - "thank 1455\n", + "thank 1454\n", "hour 1134\n", - "cancel 948\n", - "delay 937\n", - "service 937\n", + "cancel 951\n", + "service 939\n", + "delay 934\n", "customer 902\n", - "time 856\n", + "time 860\n", "dtype: int64" ] }, - "execution_count": 43, + "execution_count": 112, "metadata": {}, "output_type": "execute_result" } @@ -2336,7 +2330,7 @@ }, { "cell_type": "code", - "execution_count": 44, + "execution_count": 113, "id": "16c63e6a-50c3-448a-9a56-a1d193cd6680", "metadata": {}, "outputs": [ @@ -2356,7 +2350,7 @@ "dtype: int64" ] }, - "execution_count": 44, + "execution_count": 113, "metadata": {}, "output_type": "execute_result" } @@ -2388,7 +2382,7 @@ }, { "cell_type": "code", - "execution_count": 45, + "execution_count": 114, "id": "f5e32d8a-c42d-475f-aab4-21eca8b1aee8", "metadata": {}, "outputs": [], @@ -2398,7 +2392,7 @@ }, { "cell_type": "code", - "execution_count": 46, + "execution_count": 115, "id": "d23916c1-5693-456c-b71d-6d9d78d1e2e4", "metadata": {}, "outputs": [], @@ -2413,18 +2407,18 @@ }, { "cell_type": "code", - "execution_count": 47, + "execution_count": 116, "id": 
"7af5b342-ab18-4766-9561-e38e50cd1e9b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "<11541x3553 sparse matrix of type ''\n", - "\twith 88287 stored elements in Compressed Sparse Row format>" + "" ] }, - "execution_count": 47, + "execution_count": 116, "metadata": {}, "output_type": "execute_result" } @@ -2437,7 +2431,7 @@ }, { "cell_type": "code", - "execution_count": 48, + "execution_count": 117, "id": "55e509c8-5402-4be0-9143-0e448fff7066", "metadata": {}, "outputs": [ @@ -2608,7 +2602,7 @@ " \n", " \n", "\n", - "
5 rows × 3553 columns\n", + "5 rows × 3571 columns
\n", "" ], "text/plain": [ @@ -2626,10 +2620,10 @@ "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", - "[5 rows x 3553 columns]" + "[5 rows x 3571 columns]" ] }, - "execution_count": 48, + "execution_count": 117, "metadata": {}, "output_type": "execute_result" } @@ -2668,7 +2662,7 @@ }, { "cell_type": "code", - "execution_count": 49, + "execution_count": 118, "id": "995b511a-d448-4cfb-a6a0-22a465efd8a8", "metadata": {}, "outputs": [ @@ -2686,10 +2680,10 @@ "zone 3177\n", "zoom 3920\n", "zurich 10622\n", - "Length: 3553, dtype: int64" + "Length: 3571, dtype: int64" ] }, - "execution_count": 49, + "execution_count": 118, "metadata": {}, "output_type": "execute_result" } @@ -2709,17 +2703,17 @@ }, { "cell_type": "code", - "execution_count": 50, + "execution_count": 119, "id": "09b222fb-ad8c-4767-a974-dd261370a06e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "918" + "np.int64(918)" ] }, - "execution_count": 50, + "execution_count": 119, "metadata": {}, "output_type": "execute_result" } @@ -2738,17 +2732,17 @@ }, { "cell_type": "code", - "execution_count": 51, + "execution_count": 120, "id": "079ee0e0-476f-4236-ba8a-615ba7a0efe8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "\"USER is the worst. worst reservation policies. worst costumer service. worst worst worst. congrats, USER you're not that bad!\"" + "\" USER is the worst. worst reservation policies. worst costumer service. worst worst worst. 
congrats, USER you're not that bad!\"" ] }, - "execution_count": 51, + "execution_count": 120, "metadata": {}, "output_type": "execute_result" } @@ -2767,17 +2761,17 @@ }, { "cell_type": "code", - "execution_count": 52, + "execution_count": 121, "id": "f809df1a-1178-4272-a415-42edb20173b2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "5945" + "np.int64(5945)" ] }, - "execution_count": 52, + "execution_count": 121, "metadata": {}, "output_type": "execute_result" } @@ -2788,17 +2782,17 @@ }, { "cell_type": "code", - "execution_count": 53, + "execution_count": 122, "id": "8093b6a7-54ca-468a-9376-b3c0be0b6f9b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "'USER cancelled flighted 😢'" + "' USER cancelled flighted 😢'" ] }, - "execution_count": 53, + "execution_count": 122, "metadata": {}, "output_type": "execute_result" } @@ -2831,34 +2825,45 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 123, "id": "2bfbf838-9ff6-48b8-ad5d-5e75304fe060", "metadata": {}, "outputs": [], "source": [ "# Complete the boolean masks \n", - "positive_index = tweets[...].index\n", - "negative_index = tweets[...].index" + "positive_index = tweets[tweets['airline_sentiment'] == 'positive'].index\n", + "negative_index = tweets[tweets['airline_sentiment'] == 'negative'].index" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 124, "id": "8c67ea1f-de9e-49a9-94f2-a3351446e364", "metadata": {}, "outputs": [], "source": [ "# Complete the following two lines\n", - "pos = tfidf.loc[...].mean().sort_values(...).head(...)\n", - "neg = tfidf.loc[...].mean().sort_values(...).head(...)" + "pos = tfidf.loc[positive_index].mean().sort_values(ascending=False).head(10)\n", + "neg = tfidf.loc[negative_index].mean().sort_values(ascending=False).head(10)" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 125, "id": "f1e29043-8c78-4e41-81d2-b4552030b457", "metadata": {}, - "outputs": [], + "outputs": [ + { + 
"data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], "source": [ "pos.plot(kind='barh', \n", " xlim=(0, 0.18),\n", @@ -2868,10 +2873,21 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 126, "id": "e8b25940-2372-4755-818e-f75e4d23daf9", "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAnQAAAGzCAYAAACmbIpeAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjEsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvc2/+5QAAAAlwSFlzAAAPYQAAD2EBqD+naQAAVvtJREFUeJzt3XlYE1f7N/Bv2MKagAKCCqgICopLcaki4lps1bpWbK2Cdak+WqV15bEqWBE31FarVttHrdVq3Vtt64qKqGiruG+44oori1ZAct4/fMnPSIAACXHo93NduTSTM2fuezIzuTmZmciEEAJEREREJFkmxg6AiIiIiEqHBR0RERGRxLGgIyIiIpI4FnREREREEseCjoiIiEjiWNARERERSRwLOiIiIiKJY0FHREREJHEs6IiIiIgkjgUdSVpkZCRkMlmx2j548ECvMezduxcymQzr16/Xa7+lWU5YWBiqVatWouWEhYXB1ta2RPOSYRw9ehTNmzeHjY0NZDIZkpKStLYrzv4gk8kQGRlZouXoW962vXfv3jJZXnG9ePECY8eOhZubG0xMTNC1a1djh/RGq1atGsLCwowdxr+O5As6mUym06MsDhSLFi3CBx98AHd3d8hkskI36CdPnmDw4MFwcnKCjY0NWrdujWPHjum0nIULF2L58uX6CbocmjZtGjZv3qz3flevXo158+bpvV8Cnj17hsjIyDf2A70sFLTd5uTk4IMPPsCjR48wd+5crFy5Eh4eHnpfflktR4r+97//YdasWejZsydWrFiBzz//3NghGd3BgwcRGRmJJ0+eGDsUNUMd+/Xl7NmziIyMxLVr1wyzACFxK1eu1Hi0b99eAMg3/e7duwaPxcPDQ1SoUEF06NBBmJmZidDQUK3tcnNzRfPmzYWNjY2IjIwUCxYsEL6+vsLOzk5cvHixyOXUqVNHBAUF6Td4icrJyRH//POPxjQbGxut637y5MkCgLh//36JltWxY0fh4eGRb3pcXJwAINatW1eifnVVnOVkZ2eL58+fl2g5oaGhwsbGpkTzltT9+/cFADF58uQyXe6bpKDt9ty5cwKAWLp0aZF9aNsfCvL6+i7OcvQtb9uOi4sr82XrIiQkRFSpUsXYYbxRZs2aJQCIq1ev5nvt+fPnIjs7u8xjKmgfelOsW7fOoNu5mWHKxLLz8ccfazw/fPgwdu7cmW96Wdi3b596dK6wr6zWr1+PgwcPYt26dejZsycAoFevXvD29sbkyZOxevXqsgpZ7cWLF1CpVLCwsCjzZZeGmZkZzMwkvxnrnbm5ubFDID1JTU0FANjb2xfZtjT7Q3GW82+Tmpqq1/WiUqmQnZ0NS0tLvfX5JpHL5cYO4d/JIGWiEQ0bNky8nlZmZqb44osvRNWqVYWFhYXw9vYWs2bNEiqVSqMdADFs2DDx008/CW9vbyGXy8Vbb70l9u3bV+w4CvtL4YMPPhCVKlUSubm5GtMHDx4srK2tCx1Z8fDwEAA0Hq+O1j1+/FiMHDlSnaunp6eYPn26xrKuXr0qAIhZs2aJuXPniho1aggTExNx/Phx9SjWhQsXRJ8+fYRCoRCOjo7iyy+/FCqVSty4cUO8//77ws7OTlSqVEnMnj07X4zffPON8PX1FVZ
UJDdj9Pkmx5GWlgYAqFChgt76LIohcxs2bBg6duyYb/stK4bK7ddff0WjRo3wwQcfwNnZGQ0bNsTSpUv1EbLODJVb8+bNsXv3bly8eBEAcOLECRw4cADvvvtuqWPWBxZ0hXjw4AFyc3NRqVIljemVKlXC3bt3tc5z9+7dQtvn/VucPg3BELm97vnz5xg3bhw+/PBDKBQK/QSuA0PlNmPGDJiZmWHEiBH6D1pHhsgtNTUVmZmZmD59Ojp06IAdO3agW7du6N69O/bt22eYRLQw1Ps2f/58+Pr6omrVqrCwsECHDh3w7bffomXLlvpPogAlyc0Yfb6pcahUKoSHhyMgIAB169bVS5+6MFRua9aswbFjxxATE1PaEEvMULlduXIFixYtgpeXF7Zv346hQ4dixIgRWLFiRWlD1pmhchs/fjx69+6N2rVrw9zcHA0bNkR4eDj69OlT2pD1wszYAVD5lJOTg169ekEIgUWLFhk7nFL7+++/8fXXX+PYsWOQyWTGDkevVCoVAKBLly74/PPPAQANGjTAwYMHsXjxYgQFBRkzvFKbP38+Dh8+jF9//RUeHh7Yv38/hg0bhsqVKxttdISKZ9iwYTh9+jQOHDhg7FBKLSUlBSNHjsTOnTthaWlp7HD0TqVSoVGjRpg2bRoAoGHDhjh9+jQWL16M0NBQI0dXOr/88gtWrVqF1atXo06dOkhKSkJ4eDgqV678RuTGEbpCODo6wtTUFPfu3dOYfu/ePbi4uGidx8XFpdD2ef8Wp09DMERuefKKuevXr2Pnzp1lOjoHGCa3+Ph4pKamwt3dHWZmZjAzM8P169cxatQoVKtWzSB5aGOI3BwdHWFmZgZfX1+NNj4+PmV6lashcvvnn3/w3//+F3PmzEHnzp1Rr149DB8+HCEhIZg9e7ZhEtGiJLkZo883MY7hw4dj69atiIuLQ9WqVUvdX3EYIre///4bqampeOutt9THkn379uGbb76BmZkZcnNz9RF6kQz1vrm6ukryWKKLMWPGqEfp/Pz80LdvX3z++edGHWl9FQu6QlhYWMDf3x+7d+9WT1OpVNi9ezeaNWumdZ5mzZpptAeAnTt3qttXr14dLi4uGm3S09ORmJhYYJ+GYIjcgP8r5i5duoRdu3ahYsWKhkmgEIbIrW/fvjh58iSSkpLUj8qVK2PMmDHYvn274ZJ5jSFys7CwQOPGjfPdEuLixYvw8PDQcwYFM0RuOTk5yMnJgYmJ5qHO1NRUPTJZFkqSmzH6fJPiEEJg+PDh2LRpE/bs2YPq1avrI9xiMURubdu2xalTpzSOJY0aNUKfPn2QlJQEU1NTfYVfKEO9bwEBAZI8luji2bNnRj+WFMrIF2W88dasWSPkcrlYvny5OHv2rBg8eLCwt7cXd+/eFUII0bdvXzF+/Hh1+4SEBGFmZiZmz54tzp07JyZPnqz1tiX29vZiy5Yt4uTJk6JLly5Gu22JPnPLzs4W77//vqhatapISkoSd+7cUT+ysrIknZs2xrrK1RC5bdy4UZibm4slS5aIS5cuifnz5wtTU1MRHx8v+dyCgoJEnTp1RFxcnLhy5YpYtmyZsLS0FAsXLnyjc8vKyhLHjx8Xx48fF66urmL06NHi+PHj4tKlSzr3KeXchg4dKpRKpdi7d6/GseTZs2eSz+11xrrK1RC5HTlyRJiZmYno6Ghx6dIlsWrVKmFtbS1++uknyecWGhoqqlSpor5tycaNG4Wjo6MYO3ZsmeZWEBZ0Opg/f75wd3cXFhYWokmTJuLw4cPq14KCgkRoaKhG+19++UV4e3sLCwsLUadOHbFt2zaN11UqlZg4caKoVKmSkMvlom3btuLChQtlkUo++szt6tWrAoDWR1xcXBll9H/0/b69zlgFnRCGye2HH34QNWvWFJaWlqJ+/fpi8+bNhk5DK33ndufOHREWFiYqV64sLC0tRa1atURsbKxQqVRlkY6G4uRW0P4UFBSkc59lSd+5FXQsWbZsWdk
l9f8Z4n17lbEKOiEMk9tvv/0m6tatK+Ryuahdu7ZYsmRJGWWjSd+5paeni5EjRwp3d3dhaWkpatSoISZMmFDmAxYFkQnxptzimIiIiIhKgufQEREREUkcCzoiIiIiiWNBR0RERCRxLOiIiIiIJI4FHREREZHEsaAjIiIikjgWdEREREQSx4KOiIiISOJY0BERERFJHAs6IiIiIoljQUdEREQkcf8P29+QllfzL2kAAAAASUVORK5CYII=", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], "source": [ "neg.plot(kind='barh', \n", " xlim=(0, 0.18),\n", @@ -2915,7 +2931,7 @@ }, { "cell_type": "code", - "execution_count": 55, + "execution_count": 127, "id": "33413d63-87eb-489f-b374-3cfeaa51cf3c", "metadata": {}, "outputs": [], @@ -2934,7 +2950,7 @@ }, { "cell_type": "code", - "execution_count": 56, + "execution_count": 128, "id": "64cec8b9-14d9-4897-9c02-cc89fcf7b3c6", "metadata": {}, "outputs": [], @@ -2955,10 +2971,22 @@ }, { "cell_type": "code", - "execution_count": 57, + "execution_count": null, "id": "d46de0b2-af00-4a1d-b4cd-31b96ce545d1", "metadata": {}, - "outputs": [], + "outputs": [ + { + "ename": "", + "evalue": "", + "output_type": "error", + "traceback": [ + "\u001b[1;31mEl kernel se bloqueó al ejecutar código en la celda actual o en una celda anterior. \n", + "\u001b[1;31mRevise el código de las celdas para identificar una posible causa del error. \n", + "\u001b[1;31mHaga clic aquí para obtener más información. \n", + "\u001b[1;31mVea Jupyter log para obtener más detalles." 
+ ] + } + ], "source": [ "def fit_logistic_regression(X, y):\n", " '''Fits a logistic regression model to provided data.'''\n", @@ -2982,7 +3010,7 @@ }, { "cell_type": "code", - "execution_count": 58, + "execution_count": null, "id": "773963bd-6603-4fad-884b-09ce60afab18", "metadata": {}, "outputs": [], @@ -2993,7 +3021,7 @@ }, { "cell_type": "code", - "execution_count": 59, + "execution_count": null, "id": "e10d06c1-d884-45d4-a03d-dd5d40bf70aa", "metadata": {}, "outputs": [ @@ -3032,7 +3060,7 @@ }, { "cell_type": "code", - "execution_count": 60, + "execution_count": null, "id": "6dcb6ef1-13b3-437e-813c-7118911847a4", "metadata": {}, "outputs": [], @@ -3051,7 +3079,7 @@ }, { "cell_type": "code", - "execution_count": 61, + "execution_count": null, "id": "3e63814e-9c0d-4f7a-a5e0-72cca2758d71", "metadata": {}, "outputs": [ @@ -3162,7 +3190,7 @@ }, { "cell_type": "code", - "execution_count": 62, + "execution_count": null, "id": "0d596bf7-753c-40cd-ac52-4a37163650ae", "metadata": {}, "outputs": [ @@ -3281,7 +3309,7 @@ }, { "cell_type": "code", - "execution_count": 63, + "execution_count": null, "id": "17b1223b-e5c1-4992-bb7e-0a99651c3729", "metadata": {}, "outputs": [ @@ -3308,7 +3336,7 @@ }, { "cell_type": "code", - "execution_count": 64, + "execution_count": null, "id": "159e00c6-8a9f-484f-aea2-853fd5512083", "metadata": {}, "outputs": [ From bada6454e203b5c8109d53d9abce7cd947036796 Mon Sep 17 00:00:00 2001 From: Emuide Date: Wed, 26 Mar 2025 03:34:37 +0000 Subject: [PATCH 8/8] Traduccion de la parte Words with Highest Mean TF-IDF scores --- lessons/02_bag_of_words.ipynb | 226 ++++++++++++++++++++++++---------- 1 file changed, 159 insertions(+), 67 deletions(-) diff --git a/lessons/02_bag_of_words.ipynb b/lessons/02_bag_of_words.ipynb index cbc9046..b2e73b5 100644 --- a/lessons/02_bag_of_words.ipynb +++ b/lessons/02_bag_of_words.ipynb @@ -375,7 +375,7 @@ "outputs": [ { "data": { - "image/png": "", + "image/png": "", "text/plain": [ "
" ] @@ -586,14 +586,25 @@ "metadata": { "tags": [] }, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/workspaces/Python-Text-Analysis_grupo_4/lessons/utils.py:4: SyntaxWarning: invalid escape sequence '\\d'\n", + " digit_pattern = '\\d+'\n", + "/workspaces/Python-Text-Analysis_grupo_4/lessons/utils.py:14: SyntaxWarning: invalid escape sequence '\\d'\n", + " digit_pattern = '\\d+'\n" + ] + } + ], "source": [ "from utils import placeholder" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 10, "id": "03569f0d-34ba-492d-aa1d-1dce9d34f792", "metadata": {}, "outputs": [], @@ -628,7 +639,7 @@ "text": [ "lol @justinbeiber and @BillGates are like soo 2000 #yesterday #amiright saw it on https://twitter.com #yolo\n", "==================================================\n", - "lol USER and USER are like soo DIGIT HASHTAG HASHTAG saw it on URL HASHTAG\n" + "Ellipsis\n" ] } ], @@ -654,11 +665,11 @@ { "data": { "text/plain": [ - "0 USER plus you've added commercials to the expe...\n", - "1 USER it's really aggressive to blast obnoxious...\n", - "2 USER and it's a really big bad thing about it\n", - "3 USER seriously would pay $ DIGIT a flight for ...\n", - "4 USER yes, nearly every time i fly vx this “ear...\n", + "0 Ellipsis\n", + "1 Ellipsis\n", + "2 Ellipsis\n", + "3 Ellipsis\n", + "4 Ellipsis\n", "Name: text_processed, dtype: object" ] }, @@ -833,8 +844,8 @@ { "data": { "text/plain": [ - "<4x8 sparse matrix of type ''\n", - "\twith 9 stored elements in Compressed Sparse Row format>" + "" ] }, "execution_count": 17, @@ -1075,15 +1086,20 @@ }, "outputs": [ { - "data": { - "text/plain": [ - "<11541x8751 sparse matrix of type ''\n", - "\twith 191139 stored elements in Compressed Sparse Row format>" - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" + "ename": "AttributeError", + "evalue": "'ellipsis' object has no attribute 'lower'", + "output_type": "error", + 
"traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[0;32mIn[23], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# Fit and transform to create DTM\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m counts \u001b[38;5;241m=\u001b[39m \u001b[43mvectorizer\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfit_transform\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtweets\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mtext_processed\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 3\u001b[0m counts\n", + "File \u001b[0;32m~/.local/lib/python3.12/site-packages/sklearn/base.py:1389\u001b[0m, in \u001b[0;36m_fit_context..decorator..wrapper\u001b[0;34m(estimator, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1382\u001b[0m estimator\u001b[38;5;241m.\u001b[39m_validate_params()\n\u001b[1;32m 1384\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m config_context(\n\u001b[1;32m 1385\u001b[0m skip_parameter_validation\u001b[38;5;241m=\u001b[39m(\n\u001b[1;32m 1386\u001b[0m prefer_skip_nested_validation \u001b[38;5;129;01mor\u001b[39;00m global_skip_validation\n\u001b[1;32m 1387\u001b[0m )\n\u001b[1;32m 1388\u001b[0m ):\n\u001b[0;32m-> 1389\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfit_method\u001b[49m\u001b[43m(\u001b[49m\u001b[43mestimator\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n", + "File \u001b[0;32m~/.local/lib/python3.12/site-packages/sklearn/feature_extraction/text.py:1376\u001b[0m, in \u001b[0;36mCountVectorizer.fit_transform\u001b[0;34m(self, raw_documents, y)\u001b[0m\n\u001b[1;32m 1368\u001b[0m 
warnings\u001b[38;5;241m.\u001b[39mwarn(\n\u001b[1;32m 1369\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mUpper case characters found in\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1370\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m vocabulary while \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mlowercase\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1371\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m is True. These entries will not\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1372\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m be matched with any documents\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1373\u001b[0m )\n\u001b[1;32m 1374\u001b[0m \u001b[38;5;28;01mbreak\u001b[39;00m\n\u001b[0;32m-> 1376\u001b[0m vocabulary, X \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_count_vocab\u001b[49m\u001b[43m(\u001b[49m\u001b[43mraw_documents\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfixed_vocabulary_\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1378\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mbinary:\n\u001b[1;32m 1379\u001b[0m X\u001b[38;5;241m.\u001b[39mdata\u001b[38;5;241m.\u001b[39mfill(\u001b[38;5;241m1\u001b[39m)\n", + "File \u001b[0;32m~/.local/lib/python3.12/site-packages/sklearn/feature_extraction/text.py:1263\u001b[0m, in \u001b[0;36mCountVectorizer._count_vocab\u001b[0;34m(self, raw_documents, fixed_vocab)\u001b[0m\n\u001b[1;32m 1261\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m doc \u001b[38;5;129;01min\u001b[39;00m raw_documents:\n\u001b[1;32m 1262\u001b[0m feature_counter \u001b[38;5;241m=\u001b[39m {}\n\u001b[0;32m-> 1263\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m feature \u001b[38;5;129;01min\u001b[39;00m 
\u001b[43manalyze\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdoc\u001b[49m\u001b[43m)\u001b[49m:\n\u001b[1;32m 1264\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 1265\u001b[0m feature_idx \u001b[38;5;241m=\u001b[39m vocabulary[feature]\n", + "File \u001b[0;32m~/.local/lib/python3.12/site-packages/sklearn/feature_extraction/text.py:104\u001b[0m, in \u001b[0;36m_analyze\u001b[0;34m(doc, analyzer, tokenizer, ngrams, preprocessor, decoder, stop_words)\u001b[0m\n\u001b[1;32m 102\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 103\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m preprocessor \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[0;32m--> 104\u001b[0m doc \u001b[38;5;241m=\u001b[39m \u001b[43mpreprocessor\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdoc\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 105\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m tokenizer \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 106\u001b[0m doc \u001b[38;5;241m=\u001b[39m tokenizer(doc)\n", + "File \u001b[0;32m~/.local/lib/python3.12/site-packages/sklearn/feature_extraction/text.py:62\u001b[0m, in \u001b[0;36m_preprocess\u001b[0;34m(doc, accent_function, lower)\u001b[0m\n\u001b[1;32m 43\u001b[0m \u001b[38;5;250m\u001b[39m\u001b[38;5;124;03m\"\"\"Chain together an optional series of text preprocessing steps to\u001b[39;00m\n\u001b[1;32m 44\u001b[0m \u001b[38;5;124;03mapply to a document.\u001b[39;00m\n\u001b[1;32m 45\u001b[0m \n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 59\u001b[0m \u001b[38;5;124;03m preprocessed string\u001b[39;00m\n\u001b[1;32m 60\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 61\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m lower:\n\u001b[0;32m---> 62\u001b[0m doc \u001b[38;5;241m=\u001b[39m \u001b[43mdoc\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mlower\u001b[49m()\n\u001b[1;32m 63\u001b[0m 
\u001b[38;5;28;01mif\u001b[39;00m accent_function \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 64\u001b[0m doc \u001b[38;5;241m=\u001b[39m accent_function(doc)\n", + "\u001b[0;31mAttributeError\u001b[0m: 'ellipsis' object has no attribute 'lower'" + ] } ], "source": [ @@ -1099,20 +1115,15 @@ "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "array([[0, 0, 0, ..., 0, 0, 0],\n", - " [0, 0, 0, ..., 0, 0, 0],\n", - " [0, 0, 0, ..., 0, 0, 0],\n", - " ...,\n", - " [0, 0, 0, ..., 0, 0, 0],\n", - " [0, 0, 0, ..., 0, 0, 0],\n", - " [0, 0, 0, ..., 0, 0, 0]])" - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" + "ename": "NameError", + "evalue": "name 'counts' is not defined", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[0;32mIn[24], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# Do not run if you have limited memory - this includes DataHub and Binder\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m np\u001b[38;5;241m.\u001b[39marray(\u001b[43mcounts\u001b[49m\u001b[38;5;241m.\u001b[39mtodense())\n", + "\u001b[0;31mNameError\u001b[0m: name 'counts' is not defined" + ] } ], "source": [ @@ -1125,7 +1136,21 @@ "execution_count": 25, "id": "99322b85-1a15-46a5-bb80-bb5eaa6eeb7b", "metadata": {}, - "outputs": [], + "outputs": [ + { + "ename": "NotFittedError", + "evalue": "Vocabulary not fitted or provided", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNotFittedError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[0;32mIn[25], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# Extract tokens\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m 
tokens \u001b[38;5;241m=\u001b[39m \u001b[43mvectorizer\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_feature_names_out\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n", + "File \u001b[0;32m~/.local/lib/python3.12/site-packages/sklearn/feature_extraction/text.py:1472\u001b[0m, in \u001b[0;36mCountVectorizer.get_feature_names_out\u001b[0;34m(self, input_features)\u001b[0m\n\u001b[1;32m 1459\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mget_feature_names_out\u001b[39m(\u001b[38;5;28mself\u001b[39m, input_features\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mNone\u001b[39;00m):\n\u001b[1;32m 1460\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"Get output feature names for transformation.\u001b[39;00m\n\u001b[1;32m 1461\u001b[0m \n\u001b[1;32m 1462\u001b[0m \u001b[38;5;124;03m Parameters\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 1470\u001b[0m \u001b[38;5;124;03m Transformed feature names.\u001b[39;00m\n\u001b[1;32m 1471\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m-> 1472\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_check_vocabulary\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1473\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m np\u001b[38;5;241m.\u001b[39masarray(\n\u001b[1;32m 1474\u001b[0m [t \u001b[38;5;28;01mfor\u001b[39;00m t, i \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28msorted\u001b[39m(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mvocabulary_\u001b[38;5;241m.\u001b[39mitems(), key\u001b[38;5;241m=\u001b[39mitemgetter(\u001b[38;5;241m1\u001b[39m))],\n\u001b[1;32m 1475\u001b[0m dtype\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mobject\u001b[39m,\n\u001b[1;32m 1476\u001b[0m )\n", + "File \u001b[0;32m~/.local/lib/python3.12/site-packages/sklearn/feature_extraction/text.py:501\u001b[0m, in \u001b[0;36m_VectorizerMixin._check_vocabulary\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 499\u001b[0m 
\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_validate_vocabulary()\n\u001b[1;32m 500\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mfixed_vocabulary_:\n\u001b[0;32m--> 501\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m NotFittedError(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mVocabulary not fitted or provided\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 503\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mvocabulary_) \u001b[38;5;241m==\u001b[39m \u001b[38;5;241m0\u001b[39m:\n\u001b[1;32m 504\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mVocabulary is empty\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n", + "\u001b[0;31mNotFittedError\u001b[0m: Vocabulary not fitted or provided" + ] + } + ], "source": [ "# Extract tokens\n", "tokens = vectorizer.get_feature_names_out()" @@ -1893,29 +1918,20 @@ }, { "cell_type": "code", - "execution_count": 36, + "execution_count": 27, "id": "ffa7bf4e-640b-49bc-b64b-721140f67f76", "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "digit 6927\n", - "flight 3320\n", - "hashtag 2633\n", - "cancelled 956\n", - "thanks 921\n", - "service 910\n", - "just 801\n", - "customer 726\n", - "time 695\n", - "help 687\n", - "dtype: int64" - ] - }, - "execution_count": 36, - "metadata": {}, - "output_type": "execute_result" + "ename": "NameError", + "evalue": "name 'second_dtm' is not defined", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[0;32mIn[27], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m 
\u001b[43msecond_dtm\u001b[49m\u001b[38;5;241m.\u001b[39msum()\u001b[38;5;241m.\u001b[39msort_values(ascending\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mFalse\u001b[39;00m)\u001b[38;5;241m.\u001b[39mhead(\u001b[38;5;241m10\u001b[39m)\n", + "\u001b[0;31mNameError\u001b[0m: name 'second_dtm' is not defined" + ] } ], "source": [ @@ -1945,10 +1961,22 @@ }, { "cell_type": "code", - "execution_count": 37, + "execution_count": 26, "id": "da610560-62c3-48ab-a1b2-25e0b589bc61", "metadata": {}, - "outputs": [], + "outputs": [ + { + "ename": "ModuleNotFoundError", + "evalue": "No module named 'spacy'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[0;32mIn[26], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# Import spaCy\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mspacy\u001b[39;00m\n\u001b[1;32m 3\u001b[0m nlp \u001b[38;5;241m=\u001b[39m spacy\u001b[38;5;241m.\u001b[39mload(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124men_core_web_sm\u001b[39m\u001b[38;5;124m'\u001b[39m)\n", + "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'spacy'" + ] + } + ], "source": [ "# Import spaCy\n", "import spacy\n", @@ -2815,7 +2843,30 @@ " - Sort the mean values in the descending order using `.sort_values()`\n", " - Get the top 10 terms using `.head()`\n", "\n", - "Next, run `pos.plot` and `neg.plot` to plot the words with the highest mean tf-idf scores for each subset. " + "Next, run `pos.plot` and `neg.plot` to plot the words with the highest mean tf-idf scores for each subset. \n", + "\n", + "\n", + "## Desafío 3: Palabras con los Puntajes Medios de TF-IDF más Altos\n", + "\n", + "Hemos obtenido valores de TF-IDF para cada término en cada documento. Pero ¿qué nos dicen estos valores sobre los sentimientos de los tweets? 
¿Hay algunas palabras que sean particularmente informativas para los tweets positivos/negativos?\n", + "\n", + "Para explorar esto, reunamos los índices de todos los tweets positivos/negativos y calculemos los puntajes medios de TF-IDF de las palabras que aparecen en cada categoría.\n", + "\n", + "Hemos proporcionado el siguiente código de inicio para guiarte:\n", + "\n", + "- Filtra el dataframe `tweets` según la etiqueta `airline_sentiment` y recupera el índice de cada subconjunto (`.index`). Asigna el índice a `positive_index` o `negative_index`.\n", + "\n", + "- Para cada subconjunto:\n", + "\n", + " - Obtén la representación de tf-idf\n", + "\n", + " - Calcula los valores medios de tf-idf en el subconjunto usando `.mean()`\n", + "\n", + " - Ordena los valores medios en orden descendente usando `.sort_values()`\n", + "\n", + " - Obtén los 10 términos principales usando `.head()`\n", + "\n", + "A continuación, ejecuta `pos.plot` y `neg.plot` para graficar las palabras con los puntajes medios más altos de tf-idf para cada subconjunto." ] }, { @@ -2899,7 +2950,28 @@ "\n", "The remaining portion of the data, known as the test set, is used to test whether the learned coefficients could be generalized to unseen data. \n", "\n", - "Now that we already have the tf-idf dataframe, the feature set is ready. Let's dive into model specification!" + "Now that we already have the tf-idf dataframe, the feature set is ready. Let's dive into model specification!\n", + "\n", + "\n", + "## **Demostración**: Clasificación de Sentimientos Usando la Representación TF-IDF\n", + "\n", + "Ahora que tenemos una representación TF-IDF del texto, ¡estamos listos para realizar análisis de sentimientos!\n", + "\n", + "En esta demostración, utilizaremos un modelo de regresión logística para llevar a cabo la tarea de clasificación. 
Aquí explicaremos brevemente cómo funciona la regresión logística como uno de los métodos de aprendizaje supervisado en Machine Learning, pero si quieres aprender más al respecto, siéntete libre de explorar nuestro taller sobre [Fundamentos de Machine Learning en Python](https://github.com/dlab-berkeley/Python-Machine-Learning).\n", + "\n", + "La regresión logística es un modelo lineal que utilizamos para predecir la etiqueta de un tweet, basado en un conjunto de características ($x_1, x_2, x_3, ..., x_T$), como se muestra a continuación:\n", + "\n", + "$$\n", + "L = \\beta_1 x_1 + \\beta_2 x_2 + \\cdots + \\beta_T x_T\n", + "$$\n", + "\n", + "La lista de características que pasaremos al modelo es el vocabulario de la Matriz de Términos-Documentos (DTM). También alimentamos al modelo con una parte de los datos, conocida como conjunto de entrenamiento, junto con otras especificaciones del modelo, para aprender los coeficientes ($\\beta_1, \\beta_2, \\beta_3, ..., \\beta_T$) de cada característica. Los coeficientes nos indican si una característica contribuye positiva o negativamente a la predicción realizada.\n", + "\n", + "El valor predicho corresponde a la suma de todas las características (multiplicadas por sus coeficientes) y el resultado se pasa a una [función sigmoide](https://en.wikipedia.org/wiki/Sigmoid_function) para convertirlo en un espacio de probabilidad, lo que nos dice si la etiqueta predicha es positiva (cuando $p>0.5$) o negativa (cuando $p<0.5$).\n", + "\n", + "La parte restante de los datos, conocida como conjunto de prueba, se usa para verificar si los coeficientes aprendidos pueden generalizarse a datos no vistos.\n", + "\n", + "Ahora que ya tenemos el dataframe TF-IDF, el conjunto de características está listo. ¡Vamos a especificar el modelo!" 
] }, { @@ -2918,7 +2990,9 @@ "id": "ee87ff74-3fbb-472a-b795-6f4d18fab215", "metadata": {}, "source": [ - "We'll use the `train_test_split` function from `sklearn` to separate our data into two sets:" + "We'll use the `train_test_split` function from `sklearn` to separate our data into two sets:\n", + "\n", + "Usaremos la función `train_test_split` de `sklearn` para separar nuestros datos en dos conjuntos." ] }, { @@ -2939,7 +3013,10 @@ "id": "066771d8-2f31-4646-9a1b-6d2b1b9b208c", "metadata": {}, "source": [ - "The `fit_logistic_regression` function is written below to streamline the training process." + "The `fit_logistic_regression` function is written below to streamline the training process.\n", + "\n", + "\n", + "La función `fit_logistic_regression` está escrita a continuación para agilizar el proceso de entrenamiento." ] }, { @@ -2966,7 +3043,9 @@ "id": "124aa7ea-1bc1-43e2-beeb-0ba2da9b2df9", "metadata": {}, "source": [ - "We'll fit the model and compute the training and test accuracy." + "We'll fit the model and compute the training and test accuracy.\n", + "\n", + "Ajustaremos el modelo y calcularemos la exactitud en el entrenamiento y la prueba." ] }, { @@ -3006,7 +3085,9 @@ "id": "d4e186c5-1719-4deb-bdb4-614a9980f058", "metadata": {}, "source": [ - "The model achieved ~94% accuracy on the training set and ~89% on the test set—that's pretty good! The model generalizes reasonably well to the test data." + "The model achieved ~94% accuracy on the training set and ~89% on the test set—that's pretty good! The model generalizes reasonably well to the test data.\n", + "\n", + "El modelo alcanzó aproximadamente un 94% de exactitud en el conjunto de entrenamiento y un 89% en el conjunto de prueba, ¡lo cual es bastante bueno! El modelo se generaliza razonablemente bien a los datos de prueba." ] }, { @@ -3016,7 +3097,11 @@ "source": [ "Next, let's also take a look at the fitted coefficients to see if what we see makes sense. 
\n", "\n", - "We can access them using `coef_`, and we can match each coefficient to the tokens from the vectorizer:" + "We can access them using `coef_`, and we can match each coefficient to the tokens from the vectorizer:\n", + "\n", + "A continuación, echemos un vistazo a los coeficientes ajustados para ver si tienen sentido.\n", + "\n", + "Podemos acceder a ellos usando `coef_`, y podemos hacer coincidir cada coeficiente con los tokens del vectorizador." ] }, { @@ -3265,7 +3350,9 @@ "id": "7b3b7893-caa0-4281-98f0-92c9e7b31953", "metadata": {}, "source": [ - "Let's plot the top 10 tokens with the highest/lowest coefficients. " + "Let's plot the top 10 tokens with the highest/lowest coefficients. \n", + "\n", + "Vamos a graficar los 10 tokens con los coeficientes más altos y más bajos." ] }, { @@ -3328,7 +3415,12 @@ "source": [ "Words like \"ruin,\" \"rude,\" and \"hour\" are strong indicators of negative sentiment, while \"thank,\" \"awesome,\" and \"wonderful\" are associated with positive sentiment. \n", "\n", - "We will wrap up Part 2 with these plots. These coefficient terms and the words with the highest TF-IDF values provide different perspectives on the sentiment of tweets. If you'd like, take some time to compare the two sets of plots and see which one provides a better account of the sentiments conveyed in tweets." + "We will wrap up Part 2 with these plots. These coefficient terms and the words with the highest TF-IDF values provide different perspectives on the sentiment of tweets. If you'd like, take some time to compare the two sets of plots and see which one provides a better account of the sentiments conveyed in tweets.\n", + "\n", + "\n", + "Palabras como 'ruin', 'rude' y 'hour' son fuertes indicadores de un sentimiento negativo, mientras que 'thank', 'awesome' y 'wonderful' están asociadas con un sentimiento positivo.\n", + "\n", + "Con estos gráficos concluiremos la Parte 2. 
Estos términos de coeficientes y las palabras con los valores más altos de TF-IDF brindan diferentes perspectivas sobre el sentimiento de los tweets. Si lo deseas, tómate un tiempo para comparar los dos conjuntos de gráficos y ver cuál proporciona una mejor interpretación de los sentimientos expresados en los tweets." ] }, { @@ -3350,7 +3442,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "Python 3", "language": "python", "name": "python3" }, @@ -3364,7 +3456,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.4" + "version": "3.12.1" } }, "nbformat": 4,