# Video recommender
## What is it
This project was built following the Data Science course [Como criar uma solução completa de Data Science | Mario Filho](http://mariofilho.com/curso/).
The objective of this project is to build a YouTube video recommender: given a search criterion, it lists a series of videos that, according to a machine learning score, we consider worth recommending.
The data science process of this project is divided into 4 macro steps:
- Data Scraping
The development of the project used the following main tools:
- The project is currently hosted on the cloud: Heroku
## Data Scraping
As the name says, this is a data science project, so the first step is to collect the data that is necessary and useful for our purpose. What we do here is web scraping, a common strategy: we fetch each HTML page and build a CSV file from the useful features extracted from the YouTube HTML tags. We do this because we don't have a ready-to-use database, so web scraping was the solution found. For this project, we scrape the YouTube search page with the keywords:
- Machine Learning
- Data Science
- Kaggle
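The extraction step can be sketched with the standard library alone: parse video features out of raw HTML and write them as CSV rows. The tag and attribute names below are hypothetical placeholders — real YouTube markup differs and changes often, so the actual project has to adapt its selectors.

```python
import csv
import io
from html.parser import HTMLParser

class VideoLinkParser(HTMLParser):
    """Collects (title, href) pairs from hypothetical <a class="video"> tags."""
    def __init__(self):
        super().__init__()
        self.videos = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "video":
            self._current = {"title": "", "href": attrs.get("href", "")}

    def handle_data(self, data):
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.videos.append(self._current)
            self._current = None

def videos_to_csv(html: str) -> str:
    """Turn one scraped search-results page into CSV rows of features."""
    parser = VideoLinkParser()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["title", "href"])
    writer.writeheader()
    writer.writerows(parser.videos)
    return out.getvalue()

page = '<a class="video" href="/watch?v=abc">Intro to Kaggle</a>'
print(videos_to_csv(page))
```

In the real pipeline this would run once per downloaded search page, appending rows to a single CSV.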
To be more specific, 1600 videos were collected and I chose 500 videos to label.
- Videos with more than 1500 views
- No live streams
- No conference talks, except TED Talks
- Videos with reasonable image and audio resolution
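The criteria above can be expressed as a simple filter function. The field names (`views`, `is_live`, `category`) are hypothetical stand-ins for the real dataset's columns, and the resolution check is omitted since it needs media metadata.

```python
def passes_criteria(video: dict) -> bool:
    """Apply the labeling criteria: enough views, no live streams,
    no conference talks except TED Talks."""
    if video["views"] <= 1500:
        return False
    if video["is_live"]:
        return False
    if video["category"] == "conference" and "TED" not in video["title"]:
        return False
    return True

candidates = [
    {"title": "TED talk on AI", "views": 9000, "is_live": False, "category": "conference"},
    {"title": "Live coding session", "views": 5000, "is_live": True, "category": "tutorial"},
    {"title": "Kaggle tips", "views": 200, "is_live": False, "category": "tutorial"},
]
kept = [v["title"] for v in candidates if passes_criteria(v)]
print(kept)  # ['TED talk on AI']
```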
The definition of the criteria depends a lot on the needs and objectives of the project. In addition, well-defined labeling makes metric validation easier.
After the first round of labeling, we use an active learning strategy, which consists in using machine learning to find a specific group of examples that we call hard decisions, so that I can label them manually and optimize the learning. In this project a **Random Forest** finds the videos with predicted probability between 0.45 and 0.55, and I then did a second round of manual labeling on them.
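The selection step of this active learning loop is just a band filter over the model's predicted probabilities. The probabilities below are illustrative stand-ins for real Random Forest output.

```python
def hard_decisions(probs, low=0.45, high=0.55):
    """Return the indices of the unlabeled examples whose predicted
    probability falls in the uncertain band, for manual labeling."""
    return [i for i, p in enumerate(probs) if low <= p <= high]

pool_probs = [0.90, 0.48, 0.12, 0.52, 0.55, 0.30]
to_label = hard_decisions(pool_probs)
print(to_label)  # [1, 3, 4] -- the videos to label manually
```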
Now we just split the data into test and validation sets and convert the titles to [Tf-IDF](https://pt.wikipedia.org/wiki/Tf%E2%80%93idf).
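As a sketch of the title vectorization, here is a minimal Tf-IDF computation using the textbook definition `tf(t, d) * log(N / df(t))`. A real project would typically use scikit-learn's `TfidfVectorizer`, which also smooths and normalizes; this is just to show the idea.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute a {term: weight} vector per document, with
    weight = term frequency * log(N / document frequency)."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

titles = ["machine learning tutorial", "kaggle machine learning", "data science live"]
vecs = tfidf(titles)
# "machine" appears in 2 of 3 titles, so it gets a lower weight
# than the rarer term "kaggle"
```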
## Modeling
We already know that this is a classification problem, so the candidate models include **Decision Tree**, **Random Forest**, **LightGBM**, **Logistic Regression** and **SVM**. After testing each one, I decided to combine **Random Forest**, **LightGBM** and **Logistic Regression** with a simple arithmetic mean. For **LightGBM** in particular, Bayesian optimization was used to tune the parameters. Now we have trained models, and with them we can already score the videos.
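The ensemble step is just an element-wise arithmetic mean of the three models' predicted probabilities. The per-model scores below are illustrative placeholders for real model outputs.

```python
def ensemble_mean(*model_scores):
    """Average per-video probabilities across models, position by position."""
    return [sum(scores) / len(scores) for scores in zip(*model_scores)]

rf_probs = [0.80, 0.20, 0.55]      # Random Forest
lgbm_probs = [0.70, 0.30, 0.65]    # LightGBM
logreg_probs = [0.90, 0.10, 0.60]  # Logistic Regression
final = ensemble_mean(rf_probs, lgbm_probs, logreg_probs)
print(final)  # approximately [0.8, 0.2, 0.6]
```

A weighted mean is a natural next step if one model is clearly stronger, but the simple mean keeps the combination easy to reason about.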
Of course, the project also includes a little web app to display the list of videos with the best scores. The logic is simple: we run the web scraping periodically, apply the trained models to get the scores, and display the list sorted by score.
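The display logic can be sketched as: score the freshly scraped videos and sort them highest first. Here `score` is a hypothetical stand-in for the trained ensemble.

```python
def display_list(videos, score):
    """Rank scraped videos by model score, best first."""
    return sorted(videos, key=score, reverse=True)

videos = [{"title": "A", "views": 100}, {"title": "B", "views": 900}]
ranked = display_list(videos, score=lambda v: v["views"] / 1000)
print([v["title"] for v in ranked])  # ['B', 'A']
```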