# Implementation of deep clustering models for metabolomics data

[](https://github.com/carlescn/MSc_bioinformatics_thesis/blob/main/LICENSE)
[](https://www.latex-project.org/)
[](https://www.python.org/)
[](https://github.com/tensorflow/tensorflow)
[](https://github.com/keras-team/keras)

This repository contains
my implementation of several deep clustering models,
written for my MSc thesis.
It also contains the code written
to train and evaluate the models on multiple datasets.
The original thesis report can be read
[here](https://github.com/carlescn/MSc_bioinformatics_thesis/raw/main/thesis_report/CriadoNina_Carles_TFM.pdf),
but the document is in Catalan
(I plan on translating it into English,
but I have not set a deadline).

The original objective of the thesis was
to implement a VAE-based deep clustering model
and apply it to metabolomics data,
then compare the results with
more established techniques.
I expected that the clusters found
would lend themselves to some biological interpretation.

The VAE-based model did not perform well,
which prompted me to try other models,
also based on the autoencoder architecture.
The deep learning models I implemented are:

- AE (Autoencoder) [https://arxiv.org/abs/2201.03898](https://arxiv.org/abs/2201.03898)
- DEC (Deep Embedded Clustering) [https://arxiv.org/abs/1511.06335v2](https://arxiv.org/abs/1511.06335v2)
- VAE (Variational Autoencoder) [https://arxiv.org/abs/1312.6114v10](https://arxiv.org/abs/1312.6114v10)
- VaDE (Variational Deep Embedding) [https://arxiv.org/abs/1611.05148v3](https://arxiv.org/abs/1611.05148v3)

All the models were implemented in [Python](https://www.python.org/),
using [Keras](https://github.com/keras-team/keras)
with [Tensorflow](https://github.com/tensorflow/tensorflow).
For the training process,
I leveraged the virtual machines provided by
[Paperspace Gradient](https://www.paperspace.com/gradient) (paid subscription).
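
All of these models share an autoencoder backbone.
As an illustration, a plain fully-connected autoencoder in Keras
can be sketched along these lines
(a minimal, hypothetical example:
the layer sizes and training settings are assumptions,
not the exact architecture from `models.py`):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(input_dim, latent_dim=10):
    """Minimal fully-connected autoencoder (illustrative only)."""
    inputs = keras.Input(shape=(input_dim,))
    # Encoder: compress the input into a low-dimensional embedding
    x = layers.Dense(256, activation="relu")(inputs)
    z = layers.Dense(latent_dim, name="embedding")(x)
    # Decoder: reconstruct the input from the embedding
    x = layers.Dense(256, activation="relu")(z)
    outputs = layers.Dense(input_dim, activation="sigmoid")(x)

    autoencoder = keras.Model(inputs, outputs, name="autoencoder")
    encoder = keras.Model(inputs, z, name="encoder")
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

# Train on a data matrix X (rows = samples), then cluster the embeddings
X = np.random.rand(100, 50).astype("float32")  # placeholder data
autoencoder, encoder = build_autoencoder(input_dim=50)
autoencoder.fit(X, X, epochs=1, batch_size=32, verbose=0)
embeddings = encoder.predict(X, verbose=0)  # one latent vector per sample
```

The deep clustering models above differ mainly
in what they add on top of this backbone
(a probabilistic latent space for VAE/VaDE,
a clustering-oriented loss for DEC).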

## File structure

- [models.py](https://github.com/carlescn/MSc_bioinformatics_thesis/blob/main/models.py):
  Python module that contains
  my implementation of all the deep clustering models.
- [draw_embeddings.py](https://github.com/carlescn/MSc_bioinformatics_thesis/blob/main/draw_embeddings.py):
  Python module that contains some functions
  to draw graphical representations of the embeddings
  and cluster assignments.
- [clustering_metrics.py](https://github.com/carlescn/MSc_bioinformatics_thesis/blob/main/clustering_metrics.py):
  Python module that contains some functions
  to evaluate the clustering performance of the models.
- [thesis_report](https://github.com/carlescn/MSc_bioinformatics_thesis/tree/main/thesis_report):
  Folder that contains my full thesis report,
  both the PDF file and the LaTeX source.
- [MNIST](https://github.com/carlescn/MSc_bioinformatics_thesis/tree/main/MNIST):
  Folder that contains the Jupyter Notebooks I wrote
  to train and evaluate the models on the MNIST data set.
  It also contains the metrics and cluster assignments
  as CSV files.
- [ExposomeChallenge](https://github.com/carlescn/MSc_bioinformatics_thesis/tree/main/ExposomeChallenge):
  Same as above, for the Exposome Data Challenge Event data set.
- [PrivateDataset](https://github.com/carlescn/MSc_bioinformatics_thesis/tree/main/PrivateDataset):
  Same as above, for the DCH-NG data set.
- [_learning_keras](https://github.com/carlescn/MSc_bioinformatics_thesis/tree/main/_learning_keras):
  Folder that contains some Jupyter Notebooks I wrote
  while training myself on the use of Keras and Tensorflow.

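The kind of evaluation done in `clustering_metrics.py`
can be sketched with scikit-learn,
which provides internal and external validation metrics
like the ones used in the thesis
(a hypothetical example on toy data,
not the repository's actual code):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_rand_score,
    normalized_mutual_info_score,
    silhouette_score,
)

# Toy data with a known ground truth, standing in for learned embeddings
X, y_true = make_blobs(n_samples=300, centers=4, random_state=0)

# Cluster with a classic baseline method
y_pred = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# External validation metrics (require ground-truth labels)
ari = adjusted_rand_score(y_true, y_pred)
nmi = normalized_mutual_info_score(y_true, y_pred)
# Internal validation metric (requires only the data and the assignments)
sil = silhouette_score(X, y_pred)

print(f"ARI={ari:.2f}  NMI={nmi:.2f}  silhouette={sil:.2f}")
```

The same metrics can be computed on the embeddings and cluster
assignments produced by the deep clustering models,
which makes the comparison with the classic baselines direct.
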
## Abstract

I implemented several deep clustering models
based on the autoencoder architecture
with the aim of evaluating their performance on metabolomics datasets.
Using the MNIST dataset and two metabolomics datasets,
I evaluated the performance of several variations of the VAE, DEC and VaDE architectures,
using internal and external validation metrics
to measure clustering quality.
I compared the results with more established methods
such as K-means, GMM and agglomerative clustering.
I found that the VAE architecture is not conducive to good clustering quality.
The clusters obtained with the DEC, VaDE and established techniques
show a high level of overlap with each other,
but yield low performance according to the validation metrics.
The DEC model excels over the rest on the internal validation metric,
but is very sensitive to the initialization parameters.
The VaDE model achieves results similar to the rest of the techniques,
and has the added value of generative capacity,
which could be used for artificial data augmentation.
The multivariate distribution of the covariates
(as well as that of the most variable metabolites)
differs across the clusters obtained,
although the results are not clear-cut.
This suggests a possible biological interpretation of the clusters,
but a deeper study will be necessary to draw conclusions.