diff --git a/source/reading.md b/source/reading.md
index 4febd2cd..4182df15 100644
--- a/source/reading.md
+++ b/source/reading.md
@@ -16,7 +16,7 @@ kernelspec:

# Reading in data locally and from the web

-## Overview
+## Overview

```{index} see: loading; reading
```
@@ -46,10 +46,10 @@ By the end of the chapter, readers will be able to do the following:
- **U**niform **R**esource **L**ocator (URL)
- Read data into Python using an absolute path, relative path and a URL.
- Compare and contrast the following functions:
-  - `read_csv`
+  - `read_csv`
  - `read_excel`
- Match the following `pandas` `read_csv` function arguments to their descriptions:
-  - `filepath_or_buffer`
+  - `filepath_or_buffer`
  - `sep`
  - `names`
  - `skiprows`
@@ -76,7 +76,7 @@ This chapter will discuss the different functions we can use to import data
into Python, but before we can talk about *how* we read the data into Python
with these functions, we first need to talk about *where* the data lives. When
you load a data set into Python, you first need to tell Python where those files
live. The file
-could live on your computer (*local*) or somewhere on the internet (*remote*).
+could live on your computer (*local*) or somewhere on the internet (*remote*).

The place where the file lives on your computer is called the "path". You can
think of the path as directions to the file. There are two kinds of paths:
@@ -90,7 +90,7 @@ in respect to the computer's filesystem base (or root) folder.

Suppose our computer's filesystem looks like the picture in
{numref}`Filesystem`, and we are working in a
-file titled `worksheet_02.ipynb`. If we want to
+file titled `worksheet_02.ipynb`. If we want to
read the `.csv` file named `happiness_report.csv` into Python, we could do this
using either a relative or an absolute path. We show both choices
below.
@@ -124,24 +124,24 @@ happy_data = pd.read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv"

+++

-So which one should you use? Generally speaking, to ensure your code can be run
-on a different computer, you should use relative paths. An added bonus is that
-it's also less typing! Generally, you should use relative paths because the file's
-absolute path (the names of
-folders between the computer's root `/` and the file) isn't usually the same
-across different computers. For example, suppose Fatima and Jayden are working on a
-project together on the `happiness_report.csv` data. Fatima's file is stored at
+So which one should you use? Generally speaking, you should use relative paths:
+they help ensure that your code will run on a different computer, and as an
+added bonus, they are less typing! The reason is that a file's
+absolute path (the names of
+folders between the computer's root `/` and the file) isn't usually the same
+across different computers. For example, suppose Fatima and Jayden are working on a
+project together on the `happiness_report.csv` data. Fatima's file is stored at

```
/home/Fatima/project/data/happiness_report.csv
```

-while Jayden's is stored at
+while Jayden's is stored at

```
/home/Jayden/project/data/happiness_report.csv
```
-
+
Even though Fatima and Jayden stored their files in the same place on their
computers (in their home folders), the absolute paths are different due to
their different usernames. If Jayden has code that loads the
@@ -154,10 +154,10 @@ relative paths will work on both!
```

Your file could be stored locally, as we discussed, or it could also be
-somewhere on the internet (remotely). For this purpose we use a
+somewhere on the internet (remotely). For this purpose we use a
*Uniform Resource Locator (URL)*, i.e., a web address that looks something like
https://google.com/. URLs indicate the location of a resource on the internet and
-helps us retrieve that resource.
+help us retrieve that resource.

## Reading tabular data from a plain text file into Python

@@ -168,26 +168,26 @@ helps us retrieve that resource.
```

Now that we have learned about *where* data could be, we will learn about *how*
-to import data into Python using various functions. Specifically, we will learn how
+to import data into Python using various functions. Specifically, we will learn how
to *read* tabular data from a plain text file (a document containing only text)
*into* Python and *write* tabular data to a file *out of* Python.
The function we use to do this depends on the file's format. For example, in the
last chapter, we learned about using the `read_csv` function from `pandas` when
reading `.csv` (**c**omma-**s**eparated **v**alues) files. In that case, the
*separator* that divided our columns was a
-comma (`,`). We only learned the case where the data matched the expected defaults
-of the `read_csv` function
-(column names are present, and commas are used as the separator between columns).
-In this section, we will learn how to read
+comma (`,`). We only learned the case where the data matched the expected defaults
+of the `read_csv` function
+(column names are present, and commas are used as the separator between columns).
+In this section, we will learn how to read
files that do not satisfy the default expectations of `read_csv`.

```{index} Canadian languages; canlang data
```

-Before we jump into the cases where the data aren't in the expected default format
+Before we jump into the cases where the data aren't in the expected default format
for `pandas` and `read_csv`, let's revisit the more straightforward case where the
defaults hold, and the only argument we need to give to the function
-is the path to the file, `data/can_lang.csv`. The `can_lang` data set contains
-language data from the 2016 Canadian census.
+is the path to the file, `data/can_lang.csv`. The `can_lang` data set contains
+language data from the 2016 Canadian census.
We put `data/` before the file's
name when we are loading the data set because this data set is located in a
sub-folder, named `data`, relative to where we are running our Python code.

@@ -209,18 +209,19 @@ Non-Official & Non-Aboriginal languages,Amharic,22465,12785,200,33670

```{index} pandas
```

-And here is a review of how we can use `read_csv` to load it into Python. First we
+And here is a review of how we can use `read_csv` to load it into Python. First we
load the `pandas` package to gain access to useful
-functions for reading the data.
+functions for reading the data.

```{code-cell} ipython3
-import pandas as pd
+import pandas as pd
```

Next we use `read_csv` to load the data into Python, and in that call we specify
the relative path to the file.

```{code-cell} ipython3
+:tags: ["output_scroll"]
canlang_data = pd.read_csv("data/can_lang.csv")
canlang_data
```

@@ -269,19 +270,20 @@ ParserError: Error tokenizing data.
C error: Expected 1 fields in line 4, saw 6 ```{index} read function; skiprows argument ``` -To successfully read data like this into Python, the `skiprows` -argument can be useful to tell Python +To successfully read data like this into Python, the `skiprows` +argument can be useful to tell Python how many rows to skip before it should start reading in the data. In the example above, we would set this value to 3 to read and load the data correctly. ```{code-cell} ipython3 +:tags: ["output_scroll"] canlang_data = pd.read_csv("data/can_lang_meta-data.csv", skiprows=3) canlang_data ``` How did we know to skip three rows? We looked at the data! The first three rows -of the data had information we didn't need to import: +of the data had information we didn't need to import: ```code Data source: https://ttimbers.github.io/canlang/ @@ -289,13 +291,13 @@ Data originally published in: Statistics Canada Census of Population 2016. Reproduced and distributed on an as-is basis with their permission. ``` -The column names began at row 4, so we skipped the first three rows. +The column names began at row 4, so we skipped the first three rows. ### Using the `sep` argument for different separators Another common way data is stored is with tabs as the separator. Notice the data file, `can_lang.tsv`, has tabs in between the columns instead of -commas. +commas. ```code category language mother_tongue most_at_home most_at_work lang_known @@ -318,26 +320,27 @@ Non-Official & Non-Aboriginal languages Amharic 22465 12785 200 33670 ```{index} tsv, read function; read_tsv ``` -To read in `.tsv` (**t**ab **s**eparated **v**alues) files, we can set the `sep` argument +To read in `.tsv` (**t**ab **s**eparated **v**alues) files, we can set the `sep` argument in the `read_csv` function to the *tab character* `\t`. ```{index} escape character ``` -> **Note:** `\t` is an example of an *escaped character*, +> **Note:** `\t` is an example of an *escaped character*, > which always starts with a backslash (`\`). -> Escaped characters are used to represent non-printing characters +> Escaped characters are used to represent non-printing characters > (like the tab) or characters with special meanings (such as quotation marks). ```{code-cell} ipython3 +:tags: ["output_scroll"] canlang_data = pd.read_csv("data/can_lang.tsv", sep="\t") canlang_data ``` Let's compare the data frame here to the resulting data frame in Section {ref}`readcsv` after using `read_csv`. Notice anything? They look the same; they have -the same number of columns and rows, and have the same column names! +the same number of columns and rows, and have the same column names! So even though we needed to use different arguments depending on the file format, our resulting data frame (`canlang_data`) in both cases was the same. @@ -365,7 +368,7 @@ Non-Official & Non-Aboriginal languages Amharic 22465 12785 200 33670 ``` Data frames in Python need to have column names. Thus if you read in data that -don't have column names, Python will assign names automatically. In this example, +don't have column names, Python will assign names automatically. In this example, Python assigns each column a name of `0, 1, 2, 3, 4, 5`. To read this data into Python, we specify the first argument as the path to the file (as done with `read_csv`), and then provide @@ -374,9 +377,10 @@ and finally set `header = None` to tell `pandas` that the data file does not contain its own column names. 
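As an aside, it is worth seeing what would happen if we forgot the `header=None`
argument. The following sketch is hypothetical (we do not run it in this chapter):
with a file like `can_lang_no_cols.tsv`, `pandas` would silently promote the first
row of data to column names, so the resulting data frame would be missing its
first observation.

```python
# Hypothetical: without header=None, pandas infers column names from the
# first line of the file, so the row
# "Aboriginal languages   Aboriginal languages, n.o.s.   590 ..."
# would wrongly become the header instead of an observation.
wrong_data = pd.read_csv("data/can_lang_no_cols.tsv", sep="\t")
```

With `header=None` included, the file loads correctly: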
```{code-cell} ipython3 +:tags: ["output_scroll"] canlang_data = pd.read_csv( - "data/can_lang_no_cols.tsv", - sep = "\t", + "data/can_lang_no_cols.tsv", + sep = "\t", header = None ) canlang_data @@ -387,10 +391,10 @@ canlang_data It is best to rename your columns manually in this scenario. The current column names (`0, 1`, etc.) are problematic for two reasons: first, because they not very descriptive names, which will make your analysis -confusing; and second, because your column names should generally be *strings*, but are currently *integers*. +confusing; and second, because your column names should generally be *strings*, but are currently *integers*. To rename your columns, you can use the `rename` function -from the [pandas package](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html#). -The argument of the `rename` function is `columns`, which takes a mapping between the old column names and the new column names. +from the [pandas package](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html#). +The argument of the `rename` function is `columns`, which takes a mapping between the old column names and the new column names. In this case, we want to rename the old columns (`0, 1, ..., 5`) in the `canlang_data` data frame to more descriptive names. To specify the mapping, we create a *dictionary*: a Python object that represents @@ -400,6 +404,7 @@ Below, we create a dictionary called `col_map` that maps the old column names in names, and then pass it to the `rename` function. ```{code-cell} ipython3 +:tags: ["output_scroll"] col_map = { 0 : "category", 1 : "language", @@ -415,10 +420,11 @@ canlang_data_renamed ```{index} read function; names argument ``` -The column names can also be assigned to the data frame immediately upon reading it from the file by passing a -list of column names to the `names` argument in `read_csv`. +The column names can also be assigned to the data frame immediately upon reading it from the file by passing a +list of column names to the `names` argument in `read_csv`. ```{code-cell} ipython3 +:tags: ["output_scroll"] canlang_data = pd.read_csv( "data/can_lang_no_cols.tsv", sep="\t", @@ -448,6 +454,7 @@ path on our local computer. All other arguments that we use are the same as when using these functions with a local file on our computer. ```{code-cell} ipython3 +:tags: ["output_scroll"] url = "https://raw.githubusercontent.com/UBC-DSCI/introduction-to-datascience-python/reading/source/data/can_lang.csv" pd.read_csv(url) canlang_data = pd.read_csv(url) @@ -497,8 +504,8 @@ t 8f??3wn ?Pd(??J-?E???7?'t(?-GZ?????y???c~N?g[^_r?4 yG?O ?K??G? - - + + ]TUEe??O??c[???????6q??s??d?m???\???H?^????3} ?rZY? ?:L60?^?????XTP+?|? X?a??4VT?,D?Jq ``` @@ -509,11 +516,12 @@ X?a??4VT?,D?Jq This type of file representation allows Excel files to store additional things that you cannot store in a `.csv` file, such as fonts, text formatting, graphics, multiple sheets and more. And despite looking odd in a plain text -editor, we can read Excel spreadsheets into Python using the `pandas` package's `read_excel` -function developed specifically for this +editor, we can read Excel spreadsheets into Python using the `pandas` package's `read_excel` +function developed specifically for this purpose. 
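Before we do, one quick sanity check can drive home that `.xlsx` is not plain
text. The sketch below is hypothetical (we do not run it here): pointing
`read_csv` at the Excel file would typically fail with a decoding error, or
produce garbled rows, because `read_csv` expects text rather than the binary
`.xlsx` format.

```python
# Hypothetical: read_csv cannot parse the binary .xlsx format, so this
# would typically raise an error (e.g., a UnicodeDecodeError) rather than
# return a sensible data frame.
pd.read_csv("data/can_lang.xlsx")
```

The `read_excel` function, on the other hand, handles the format for us: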
```{code-cell} ipython3 +:tags: ["output_scroll"] canlang_data = pd.read_excel("data/can_lang.xlsx") canlang_data ``` @@ -522,13 +530,13 @@ If the `.xlsx` file has multiple sheets, you have to use the `sheet_name` argume to specify the sheet number or name. This functionality is useful when a single sheet contains multiple tables (a sad thing that happens to many Excel spreadsheets since this makes reading in data more difficult). You can also specify cell ranges using the -`usecols` argument (e.g., `usecols="A:D"` for including columns from `A` to `D`). +`usecols` argument (e.g., `usecols="A:D"` for including columns from `A` to `D`). As with plain text files, you should always explore the data file before importing it into Python. Exploring the data beforehand helps you decide which arguments you need to load the data into Python successfully. If you do not have the Excel program on your computer, you can use other programs to preview the -file. Examples include Google Sheets and Libre Office. +file. Examples include Google Sheets and Libre Office. In {numref}`read_func` we summarize the `read_csv` and `read_excel` functions we covered in this chapter. We also include the arguments for data separated by @@ -547,20 +555,20 @@ European countries). * - Comma (`,`) separated files - `read_csv` - just the file path -* - Tab (`\t`) separated files +* - Tab (`\t`) separated files - `read_csv` - `sep="\t"` * - Missing header - `read_csv` - `header=None` * - European-style numbers, semicolon (`;`) separators - - `read_csv` + - `read_csv` - `sep=";"`, `thousands="."`, `decimal=","` * - Excel files (`.xlsx`) - `read_excel` - `sheet_name`, `usecols` - - + + ``` ## Reading data from a database @@ -576,7 +584,7 @@ different relational database management systems each have their own advantages and limitations. Almost all employ SQL (*structured query language*) to obtain data from the database. But you don't need to know SQL to analyze data from a database; several packages have been written that allow you to connect to -relational databases and use the Python programming language +relational databases and use the Python programming language to obtain data. In this book, we will give examples of how to do this using Python with SQLite and PostgreSQL databases. @@ -588,8 +596,8 @@ using Python with SQLite and PostgreSQL databases. SQLite is probably the simplest relational database system that one can use in combination with Python. SQLite databases are self-contained and usually stored and accessed locally on one computer. Data is usually stored in -a file with a `.db` extension (or sometimes a `.sqlite` extension). -Similar to Excel files, these are not plain text files and cannot be read in a plain text editor. +a file with a `.db` extension (or sometimes a `.sqlite` extension). +Similar to Excel files, these are not plain text files and cannot be read in a plain text editor. ```{index} database; connect, ibis, ibis; ibis ``` @@ -598,18 +606,18 @@ Similar to Excel files, these are not plain text files and cannot be read in a p ``` The first thing you need to do to read data into Python from a database is to -connect to the database. For an SQLite database, we will do that using +connect to the database. For an SQLite database, we will do that using the `connect` function from the `sqlite` backend in the `ibis` package. This command does not read in the data, but simply tells Python where the database is and opens up a communication channel that Python can use to send SQL commands to the database. 
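To build some intuition for what "sending SQL commands" means, here is a minimal
sketch of doing it by hand with Python's built-in `sqlite3` module. This is only
an aside (and assumes the `data/can_lang.db` file and its `can_lang` table, both
introduced just below); in this book, `ibis` will write and send SQL like this
for us.

```python
import sqlite3

# Open a communication channel to the database file.
connection = sqlite3.connect("data/can_lang.db")

# Send one SQL command and fetch the result: a list containing a single
# tuple that holds the number of rows in the can_lang table.
print(connection.execute("SELECT COUNT(*) FROM can_lang").fetchall())

connection.close()
```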
-> **Note:** There is another database package in python called `sqlalchemy`. +> **Note:** There is another database package in python called `sqlalchemy`. > That package is a bit more mature than `ibis`, -> so if you want to dig deeper into working with databases in Python, that is a good next -> package to learn about. We will work with `ibis` in this book, as it -> provides a more modern and friendlier syntax that is more like `pandas` for data analysis code. +> so if you want to dig deeper into working with databases in Python, that is a good next +> package to learn about. We will work with `ibis` in this book, as it +> provides a more modern and friendlier syntax that is more like `pandas` for data analysis code. ```{code-cell} ipython3 import ibis @@ -621,7 +629,7 @@ conn = ibis.sqlite.connect("data/can_lang.db") ``` Often relational databases have many tables; thus, in order to retrieve -data from a database, you need to know the name of the table +data from a database, you need to know the name of the table in which the data is stored. You can get the names of all the tables in the database using the `list_tables` function: @@ -636,22 +644,22 @@ tables The `list_tables` function returned only one name---`"can_lang"`---which tells us that there is only one table in this database. To reference a table in the -database (so that we can perform operations like selecting columns and filtering rows), we +database (so that we can perform operations like selecting columns and filtering rows), we use the `table` function from the `conn` object. The object returned by the `table` function allows us to work with data stored in databases as if they were just regular `pandas` data frames; but secretly, behind -the scenes, `ibis` will turn your commands into SQL queries! +the scenes, `ibis` will turn your commands into SQL queries! ```{code-cell} ipython3 canlang_table = conn.table("can_lang") -canlang_table +canlang_table ``` ```{index} database; count, ibis; count ``` Although it looks like we might have obtained the whole data frame from the database, we didn't! -It's a *reference*; the data is still stored only in the SQLite database. The `canlang_table` object +It's a *reference*; the data is still stored only in the SQLite database. The `canlang_table` object is an `AlchemyTable` (`ibis` is using `sqlalchemy` under the hood!), which, when printed, tells you which columns are available in the table. But unlike a usual `pandas` data frame, we do not immediately know how many rows are in the table. In order to find out how many @@ -665,7 +673,7 @@ canlang_table.count() ```{index} execute, ibis; execute ``` -Wait a second...this isn't the number of rows in the database. In fact, we haven't actually sent our +Wait a second...this isn't the number of rows in the database. In fact, we haven't actually sent our SQL query to the database yet! We need to explicitly tell `ibis` when we want to send the query. The reason for this is that databases are often more efficient at working with (i.e., selecting, filtering, joining, etc.) large data sets than Python. And typically, the database will not even @@ -693,23 +701,24 @@ str(canlang_table.count().compile()) The output above shows the SQL code that is sent to the database. When we write `canlang_table.count().execute()` in Python, in the background, the `execute` function is translating the Python code into SQL, sending that SQL to the database, and then translating the -response for us. 
So `ibis` does all the hard work of translating from Python to SQL and back for us; -we can just stick with Python! +response for us. So `ibis` does all the hard work of translating from Python to SQL and back for us; +we can just stick with Python! The `ibis` package provides lots of `pandas`-like tools for working with database tables. -For example, we can look at the first few rows of the table by using the `head` function---and +For example, we can look at the first few rows of the table by using the `head` function---and we won't forget to `execute` to see the result! ```{index} database; head, ibis; ``` ```{code-cell} ipython3 +:tags: ["output_scroll"] canlang_table.head(10).execute() ``` You can see that `ibis` actually returned a `pandas` data frame to us after we executed the query, which is very convenient for working with the data after getting it from the database. -So now that we have the `canlang_table` table reference for the 2016 Canadian Census data in hand, we +So now that we have the `canlang_table` table reference for the 2016 Canadian Census data in hand, we can mostly continue onward as if it were a regular data frame. For example, let's do the same exercise from Chapter 1: we will obtain only those rows corresponding to Aboriginal languages, and keep only the `language` and `mother_tongue` columns. @@ -723,7 +732,7 @@ to obtain only certain rows. Below we filter the data to include only Aboriginal canlang_table_filtered = canlang_table[canlang_table["category"] == "Aboriginal languages"] canlang_table_filtered ``` -Above you can see that we have not yet executed this command; `canlang_table_filtered` is just showing +Above you can see that we have not yet executed this command; `canlang_table_filtered` is just showing the first part of our query (the part that starts with `Selection[r0]` above). We didn't call `execute` because we are not ready to bring the data into Python yet. We can still use the database to do some work to obtain *only* the small amount of data we want to work with locally @@ -746,7 +755,7 @@ aboriginal_lang_data `ibis` provides many more functions (not just the `[]` operation) that you can use to manipulate the data within the database before calling -`execute` to obtain the data in Python. But `ibis` does not provide *every* function +`execute` to obtain the data in Python. But `ibis` does not provide *every* function that we need for analysis; we do eventually need to call `execute`. For example, `ibis` does not provide the `tail` function to look at the last rows in a database, even though `pandas` does. @@ -755,6 +764,7 @@ rows in a database, even though `pandas` does. ``` ```{code-cell} ipython3 +:tags: ["output_scroll"] canlang_table_selected.tail(6) ``` @@ -768,14 +778,14 @@ But be very careful using `execute`: databases are often *very* big, and reading an entire table into Python might take a long time to run or even possibly crash your machine. So make sure you select and filter the database table to reduce the data to a reasonable size before using `execute` to read it into Python! - -### Reading data from a PostgreSQL database + +### Reading data from a PostgreSQL database ```{index} database; PostgreSQL ``` PostgreSQL (also called Postgres) is a very popular -and open-source option for relational database software. +and open-source option for relational database software. Unlike SQLite, PostgreSQL uses a client–server database engine, as it was designed to be used and accessed on a network. 
This means that you have to provide more information @@ -790,13 +800,13 @@ need to include when you call the `connect` function is listed below: Below we demonstrate how to connect to a version of the `can_mov_db` database, which contains information about Canadian movies. -Note that the `host` (`fakeserver.stat.ubc.ca`), `user` (`user0001`), and -`password` (`abc123`) below are *not real*; you will not actually +Note that the `host` (`fakeserver.stat.ubc.ca`), `user` (`user0001`), and +`password` (`abc123`) below are *not real*; you will not actually be able to connect to a database using this information. ```python conn = ibis.postgres.connect( - database = "can_mov_db", + database = "can_mov_db", host = "fakeserver.stat.ubc.ca", port = 5432, user = "user0001", @@ -819,7 +829,7 @@ conn.list_tables() We see that there are 10 tables in this database. Let's first look at the `"ratings"` table to find the lowest rating that exists in the `can_mov_db` -database. +database. ```python ratings_table = conn.table("ratings") @@ -887,18 +897,18 @@ then use `ibis` to translate `pandas`-like commands (the `[]` operation, `head`, etc.) into SQL queries that the database understands, and then finally `execute` them. And not all `pandas` commands can currently be translated via `ibis` into database queries. So you might be wondering: why should we use -databases at all? +databases at all? Databases are beneficial in a large-scale setting: - They enable storing large data sets across multiple computers with backups. - They provide mechanisms for ensuring data integrity and validating input. - They provide security and data access control. -- They allow multiple users to access data simultaneously +- They allow multiple users to access data simultaneously and remotely without conflicts and errors. - For example, there are billions of Google searches conducted daily in 2021 {cite:p}`googlesearches`. - Can you imagine if Google stored all of the data - from those searches in a single `.csv` file!? Chaos would ensue! + For example, there are billions of Google searches conducted daily in 2021 {cite:p}`googlesearches`. + Can you imagine if Google stored all of the data + from those searches in a single `.csv` file!? Chaos would ensue! ## Writing data from Python to a `.csv` file @@ -910,7 +920,7 @@ that has changed (through selecting columns, filtering rows, etc.) to a file to share it with others or use it for another step in the analysis. The most straightforward way to do this is to use the `to_csv` function from the `pandas` package. The default -arguments are to use a comma (`,`) as the separator, and to include column names +arguments are to use a comma (`,`) as the separator, and to include column names in the first row. We also specify `index = False` to tell `pandas` not to print row numbers in the `.csv` file. Below we demonstrate creating a new version of the Canadian languages data set without the "Official languages" category according to the @@ -921,18 +931,18 @@ no_official_lang_data = canlang_data[canlang_data["category"] != "Official langu no_official_lang_data.to_csv("data/no_official_languages.csv", index=False) ``` -% ## Obtaining data from the web -% +% ## Obtaining data from the web +% % > **Note:** This section is not required reading for the remainder of the textbook. It % > is included for those readers interested in learning a little bit more about % > how to obtain different types of data from the web. 
-% +% % ```{index} see: application programming interface; API % ``` -% +% % ```{index} API % ``` -% +% % Data doesn't just magically appear on your computer; you need to get it from % somewhere. Earlier in the chapter we showed you how to access data stored in a % plain text, spreadsheet-like format (e.g., comma- or tab-separated) from a web @@ -946,16 +956,16 @@ no_official_lang_data.to_csv("data/no_official_languages.csv", index=False) % data they have access to, and *how much* data they can access. Typically, the % website owner will give you a *token* (a secret string of characters somewhat % like a password) that you have to provide when accessing the API. -% +% % ```{index} web scraping, CSS, HTML % ``` -% +% % ```{index} see: hypertext markup language; HTML % ``` -% +% % ```{index} see: cascading style sheet; CSS % ``` -% +% % Another interesting thought: websites themselves *are* data! When you type a % URL into your browser window, your browser asks the *web server* (another % computer on the internet whose job it is to respond to requests for the @@ -963,117 +973,117 @@ no_official_lang_data.to_csv("data/no_official_languages.csv", index=False) % data into something you can see. If the website shows you some information that % you're interested in, you could *create* a data set for yourself by copying and % pasting that information into a file. This process of taking information -% directly from what a website displays is called +% directly from what a website displays is called % *web scraping* (or sometimes *screen scraping*). Now, of course, copying and pasting % information manually is a painstaking and error-prone process, especially when % there is a lot of information to gather. So instead of asking your browser to % translate the information that the web server provides into something you can % see, you can collect that data programmatically—in the form of -% **h**yper**t**ext **m**arkup **l**anguage -% (HTML) -% and **c**ascading **s**tyle **s**heet (CSS) code—and process it +% **h**yper**t**ext **m**arkup **l**anguage +% (HTML) +% and **c**ascading **s**tyle **s**heet (CSS) code—and process it % to extract useful information. HTML provides the % basic structure of a site and tells the webpage how to display the content % (e.g., titles, paragraphs, bullet lists etc.), whereas CSS helps style the -% content and tells the webpage how the HTML elements should -% be presented (e.g., colors, layouts, fonts etc.). -% +% content and tells the webpage how the HTML elements should +% be presented (e.g., colors, layouts, fonts etc.). +% % This subsection will show you the basics of both web scraping % with the [`BeautifulSoup` Python package](https://beautiful-soup-4.readthedocs.io/en/latest/) {cite:p}`beautifulsoup` % and accessing the Twitter API % using the [`tweepy` Python package](https://github.com/tweepy/tweepy) {cite:p}`tweepy`. -% +% % +++ -% +% % ### Web scraping -% +% % #### HTML and CSS selectors -% +% % ```{index} web scraping, HTML; selector, CSS; selector, Craiglist % ``` -% +% % When you enter a URL into your browser, your browser connects to the % web server at that URL and asks for the *source code* for the website. -% This is the data that the browser translates +% This is the data that the browser translates % into something you can see; so if we % are going to create our own data by scraping a website, we have to first understand % what that data looks like! 
For example, let's say we are interested % in knowing the average rental price (per square foot) of the most recently -% available one-bedroom apartments in Vancouver +% available one-bedroom apartments in Vancouver % on [Craiglist](https://vancouver.craigslist.org). When we visit the Vancouver Craigslist -% website and search for one-bedroom apartments, +% website and search for one-bedroom apartments, % we should see something similar to {numref}`fig:craigslist-human`. -% +% % +++ -% +% % ```{figure} img/craigslist_human.png % :name: fig:craigslist-human -% +% % Craigslist webpage of advertisements for one-bedroom apartments. % ``` -% +% % +++ -% +% % Based on what our browser shows us, it's pretty easy to find the size and price % for each apartment listed. But we would like to be able to obtain that information % using Python, without any manual human effort or copying and pasting. We do this by % examining the *source code* that the web server actually sent our browser to -% display for us. We show a snippet of it below; the -% entire source +% display for us. We show a snippet of it below; the +% entire source % is [included with the code for this book](https://github.com/UBC-DSCI/introduction-to-datascience-python/blob/main/source/img/website_source.txt): -% +% % ```html %
% %