
Commit 468f56a

Update README.md
1 parent 6bc7421

1 file changed (+80, -60 lines)


README.md

# Implementation of deep clustering models for metabolomics data

[![GPLv3 license](https://img.shields.io/badge/License-GPLv3.0-blue.svg?logo=)](https://github.com/carlescn/MSc_bioinformatics_thesis/blob/main/LICENSE)
[![made-with-latex](https://img.shields.io/badge/Made%20with-Latex-1f425f.svg?logo=latex)](https://www.latex-project.org/)
[![made-with-python 3.9](https://img.shields.io/badge/Made%20with-Python%203.9-1f425f.svg?logo=python)](https://www.python.org/)
[![tensorflow 2.9.1](https://img.shields.io/badge/Tensorflow-2.9.1-darkgreen.svg?logo=tensorflow)](https://github.com/tensorflow/tensorflow)
[![keras 2.9.0](https://img.shields.io/badge/Keras-2.9.0-darkgreen.svg?logo=keras)](https://github.com/keras-team/keras)

This repository contains
my implementation of some deep clustering models
I wrote for my MSc thesis.
It also contains the code written
to train and evaluate the models on multiple datasets.
The original thesis report can be read
[here](https://github.com/carlescn/MSc_bioinformatics_thesis/raw/main/thesis_report/CriadoNina_Carles_TFM.pdf),
but the document is in Catalan
(I plan on translating it into English,
but I have not set a deadline).

The original objective of the thesis was
to implement a VAE based deep clustering model
and apply it to metabolomics data,
then compare the results with
more established techniques.
I expected that the resulting clusters
would lend themselves to some biological interpretation.

The VAE based model did not perform well,
which prompted me to try other models,
also based on the autoencoder architecture.
The deep learning models I implemented are:

- AE (Autoencoder) [https://arxiv.org/abs/2201.03898](https://arxiv.org/abs/2201.03898)
- DEC (Deep Embedded Clustering) [https://arxiv.org/abs/1511.06335v2](https://arxiv.org/abs/1511.06335v2)
- VAE (Variational Autoencoder) [https://arxiv.org/abs/1312.6114v10](https://arxiv.org/abs/1312.6114v10)
- VaDE (Variational Deep Embedding) [https://arxiv.org/abs/1611.05148v3](https://arxiv.org/abs/1611.05148v3)
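For a sense of what these models optimize, here is a minimal NumPy sketch of DEC's soft cluster assignment (the Student's t kernel from the DEC paper) and its sharpened target distribution. This is an illustration written for this README, not the code from `models.py`:

```python
import numpy as np

def dec_soft_assign(z, centroids, alpha=1.0):
    """Student's t soft assignment from DEC (Xie et al., 2016).

    q[i, j] is the probability of assigning embedded point z[i]
    to cluster centroid j.
    """
    # Squared Euclidean distance between every point and every centroid.
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)  # rows sum to 1

def dec_target_distribution(q):
    """Sharpened target distribution p used in DEC's KL(p || q) loss."""
    w = q ** 2 / q.sum(axis=0)  # square assignments, normalize per cluster
    return w / w.sum(axis=1, keepdims=True)
```

Training alternates between computing `p` from the current `q` and minimizing the KL divergence between them, which pushes each point toward its most confident cluster.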

All the models were implemented
using [Keras](https://github.com/keras-team/keras)
with [Tensorflow](https://github.com/tensorflow/tensorflow),
using [Python](https://www.python.org/).
For the training process,
I leveraged the virtual machines provided by
[Paperspace Gradient](https://www.paperspace.com/gradient) (paid subscription).

## File structure

- [models.py](https://github.com/carlescn/MSc_bioinformatics_thesis/blob/main/models.py):
  Python module that contains
  my implementation of all the deep clustering models.
- [draw_embeddings.py](https://github.com/carlescn/MSc_bioinformatics_thesis/blob/main/draw_embeddings.py):
  Python module that contains some functions
  to draw graphical representations of the embeddings
  and cluster assignments.
- [clustering_metrics.py](https://github.com/carlescn/MSc_bioinformatics_thesis/blob/main/clustering_metrics.py):
  Python module that contains some functions
  to evaluate the performance of the models.
- [thesis_report](https://github.com/carlescn/MSc_bioinformatics_thesis/tree/main/thesis_report):
  Folder that contains my full thesis report,
  both the PDF file and the LaTeX source.
- [MNIST](https://github.com/carlescn/MSc_bioinformatics_thesis/tree/main/MNIST):
  Folder that contains the Jupyter Notebooks I wrote
  to train and evaluate the models on the MNIST data set.
  It also contains the metrics and cluster assignments
  in CSV files.
- [ExposomeChallenge](https://github.com/carlescn/MSc_bioinformatics_thesis/tree/main/ExposomeChallenge):
  Same as above, for the Exposome Data Challenge Event data set.
- [PrivateDataset](https://github.com/carlescn/MSc_bioinformatics_thesis/tree/main/PrivateDataset):
  Same as above, for the DCH-NG data set.
- [_learning_keras](https://github.com/carlescn/MSc_bioinformatics_thesis/tree/main/_learning_keras):
  Folder that contains some Jupyter Notebooks I wrote
  while training myself on the use of Keras and Tensorflow.
73+
74+
## Abstract
75+
76+
I implemented several deep clustering models
77+
based on the Autoencoder architecture
78+
with the aim of evaluating their performance in metabolomics datasets.
79+
Using the MNIST dataset and two metabolomic datasets,
80+
I evaluated the performance of several variations of the VAE, DEC and VaDE architectures
81+
using internal and external validation metrics
82+
to measure clustering quality.
83+
I compared the results with more established methods
84+
such as K-means, GMM and agglomerative clustering.
85+
I found found that the VAE architecture is not conducive to good clustering quality.
86+
The clusters obtained with the DEC, Vade and consolidated techniques
87+
show a high level of overlap with each other,
88+
but yield low performances according to the validation metrics.
89+
The DEC model excels over the rest in the internal validation metric,
90+
but is very sensitive to the initialization parameters.
91+
The VaDE model achieves similar results to the rest of the techniques,
92+
and has the added value of having generative capacity,
93+
which could be used in artificial data augmentation techniques.
94+
The multivariate distribution of the covariates
95+
(as well as that of the most variable metabolites)
96+
shows a differential distribution by the clusters obtained,
97+
although the results are not clear.
98+
This suggests a possible biological interpretation of the clusters,
99+
but it will be necessary to study it in more depth to draw conclusions.
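As a toy illustration of the external validation mentioned above (a generic sketch, not the code from `clustering_metrics.py`), the Rand index measures agreement between a clustering and reference labels by counting the sample pairs that both labelings treat the same way:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two labelings agree:
    both place the pair in the same cluster, or both split it."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Invariant to label permutation: these two clusterings are identical.
rand_index([0, 0, 1, 1], [1, 1, 0, 0])  # → 1.0
```

Because cluster IDs are arbitrary, pair-counting metrics like this (and information-based ones like mutual information) are the standard way to compare model clusters against ground-truth classes.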
