Commit 48c8eb0

authored
Update README.md
1 parent 7db7a26 commit 48c8eb0

1 file changed

+6
-6
lines changed

README.md

Lines changed: 6 additions & 6 deletions
@@ -1,9 +1,9 @@
11
# Video recommender
22

33
## What is it
4-
This project was built from the project oriented by the Data Science course [Como criar uma solução completa de Data Science | Mario Filho](http://mariofilho.com/curso/).
4+
This project was built from the one guided by the Data Science course [Como criar uma solução completa de Data Science | Mario Filho](http://mariofilho.com/curso/).
55

6-
The objective of this project is to build a Youtube video recommender, following a certain criterion which will be used to list a series of videos that, according to a machine learning score, we consider recommended.
6+
The objective of this project is to create a YouTube video recommender: given a defined criterion, it lists the videos that a machine learning score marks as recommended.
77

88
The data science process of this project is divided into 4 macro steps:
99
- Data Scraping
@@ -27,7 +27,7 @@ The development of the project used the following main tools:
2727
- The project is currently hosted on cloud: Heroku
2828

2929
## Data Scraping
30-
As its name says, data science. So the first step is to collect the necessary and useful data for our proposal. What we do here is Web Scraping, a common strategy which gets the entire HTML page one by one and create a csv file with the useful features got by the Youtube HTML tags. This because we don't have a database ready to work, so web scraping was the solution found. For this project, the web scraping will be on youtube search page with keywords:
30+
As the name says, data science starts with data, so the first step is to collect the necessary and useful data for our purpose. What we do here is web scraping, a common strategy that fetches the full HTML pages one by one and creates a CSV file with the useful features extracted from the YouTube HTML tags. We do this because we don't have a ready-made database, so web scraping was the solution found. For this project, the scraping targets the YouTube search page with the keywords:
3131
- Machine Learning
3232
- Data Science
3333
- Kaggle
@@ -55,16 +55,16 @@ To be more specific, 1600 videos were collected and I chose 500 videos to label.
5555
- Videos with more than 1500 views
5656
- No live streams
5757
- No conferences, except TED talks
58-
- Videos with not very low video and audio resolution
58+
- Videos with a reasonable image and audio resolution
5959

6060
The definition of the criteria depends a lot on the needs and the objective of the project. In addition, well-defined labeling makes it easier to validate the metrics.
6161

62-
After the first data labeling, we use the active learning strategy, which consists of using machine learning to search for a specific group of data that we call hard decisions so that I can manually classify and optimize the learning. In this project it uses the **Random Forest** to find the videos with probability between 0.45 and 0.55. And then I did the manual labeling for the second time.
62+
After the first round of data labeling, we use the active learning strategy, which consists in using machine learning to search for a specific group of data that we call hard decisions, so that I can label them manually and improve the learning. This project uses a **Random Forest** to find the videos with a predicted probability between 0.45 and 0.55, which I then labeled manually in a second round.
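
The selection of hard decisions can be sketched as a simple filter over predicted probabilities. This is a minimal illustration, not the project's actual code: the function name `select_hard_decisions`, the inclusive bounds, and the sample probabilities are assumptions; in the project the probabilities would come from the trained Random Forest.

```python
def select_hard_decisions(probabilities, low=0.45, high=0.55):
    """Return the indices whose predicted probability lies in [low, high].

    Treating both bounds as inclusive is an assumption; the README only
    says "between 0.45 and 0.55".
    """
    return [i for i, p in enumerate(probabilities) if low <= p <= high]


# Hypothetical predicted probabilities for five unlabeled videos.
probs = [0.10, 0.47, 0.52, 0.90, 0.55]

# Indices of the videos that would be sent for manual labeling.
hard = select_hard_decisions(probs)
```

The videos at the returned indices are the ones the model is least sure about, so labeling them by hand tends to add more information than labeling videos the model already scores confidently.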
6363

6464
Now we just need to split the data for testing and validation and convert the titles to [TF-IDF](https://pt.wikipedia.org/wiki/Tf%E2%80%93idf).
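
To make the title conversion concrete, here is a small pure-Python sketch of one textbook TF-IDF variant (term frequency times the log of inverse document frequency). The function name `tfidf`, the whitespace tokenization, and the sample titles are assumptions for illustration; library implementations usually apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter


def tfidf(titles):
    """Compute a basic TF-IDF weight for every term of every title."""
    docs = [title.lower().split() for title in titles]
    n_docs = len(docs)
    # Document frequency: in how many titles each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical video titles.
vectors = tfidf(["machine learning tutorial", "machine learning news"])
```

In this variant, a term that appears in every title gets a weight of zero, while a term that is rare across titles gets a higher weight.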
6565

6666
## Modeling
67-
We already realized that this problem is about classification, so the options we have are, among them: **Decision Tree**, **Random Forest**, **LightGBM**, **Logistic Regression** and **SVM**. After I tested each one, I decided to combine **Random Forest**, **LightGBM** and **Logistic Regression** with simple arithmetic mean. For **LightGBM** in particular, Bayesian Optimization was used to improve the parameters. And now we have trained models, and with that we can already predict the videos.
67+
We already know that this is a classification problem, so the options include **Decision Tree**, **Random Forest**, **LightGBM**, **Logistic Regression** and **SVM**. After testing each one, I decided to combine **Random Forest**, **LightGBM** and **Logistic Regression** with a simple arithmetic mean. For **LightGBM** in particular, Bayesian Optimization was used to tune the parameters. Now we have trained models, and with them we can already score the videos.
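
The simple arithmetic mean used to combine the models can be sketched as follows. This is an illustration under assumptions: the function name `average_scores` and the sample scores are hypothetical, and each list stands in for the per-video probabilities produced by one of the three trained models.

```python
def average_scores(*model_scores):
    """Average per-video scores from several models, position by position."""
    if not model_scores:
        raise ValueError("at least one list of scores is required")
    n_videos = len(model_scores[0])
    if any(len(scores) != n_videos for scores in model_scores):
        raise ValueError("all models must score the same number of videos")
    return [sum(column) / len(column) for column in zip(*model_scores)]


# Hypothetical scores for two videos from the three combined models.
random_forest = [0.2, 0.8]
lightgbm = [0.4, 0.6]
logistic_regression = [0.6, 1.0]

ensemble = average_scores(random_forest, lightgbm, logistic_regression)
```

Every model gets the same weight here, which matches a plain arithmetic mean; a weighted average would be a different design choice.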
6868

6969
For this project we also built a small web app that displays the list of videos with the best scores. The logic is simple: we scrape the web periodically and then apply the trained models to get the scores. The displayed list is sorted by score.
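
The ranking step of the web app can be sketched like this. The dictionary keys `title` and `score`, the function name `rank_videos`, and the sample data are assumptions for illustration; the scores would come from applying the trained models to freshly scraped videos.

```python
def rank_videos(videos, top_n=None):
    """Return the videos sorted by score, highest first."""
    ranked = sorted(videos, key=lambda video: video["score"], reverse=True)
    return ranked if top_n is None else ranked[:top_n]


# Hypothetical scored videos.
videos = [
    {"title": "Intro to Kaggle", "score": 0.41},
    {"title": "Machine Learning basics", "score": 0.87},
    {"title": "Data Science roadmap", "score": 0.63},
]

top_two = rank_videos(videos, top_n=2)
```

`sorted` returns a new list, so the scraped data stays in its original order while the page displays the ranked copy.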
7070
