
Classification of metagenomic sequences using machine learning (SVM)

Development of a machine learning model using a Support Vector Machine (SVM) to classify metagenomic DNA sequences from different types of microorganisms

Thank you very much for accessing my portfolio and for your interest in this project!

This portfolio is still under construction, and soon I will provide a more detailed description of the analysis and pipeline development.

For now, you can access the thesis containing this work by following the link below:

Download "Aprendizado de máquina para análises taxonômicas de dados de metagenômica" (in Portuguese)

If you have any questions, please do not hesitate to contact me at tahilaandrighetti@gmail.com.

ABSTRACT

Recognition of the importance of microbiota composition has increased steadily since the advent of metagenomics. This approach allows genetic material from a microbial community to be sequenced and analyzed without the need for microbial culture. Since 99% of microorganisms are not culturable, metagenomics is the standard methodology for investigating microbiome composition and dynamics. However, the raw output of metagenome sequencing is a mixture of DNA fragments originating from various microorganisms. Moreover, the lack of reference genomes in databases makes taxonomic identification of unknown organisms in these samples challenging. In this work, we evaluated the predictive power of the Support Vector Machine (SVM) learning method for taxonomic classification of unknown metagenomic DNA reads.

To simulate the identification of unknown microorganisms, we trained the SVM on Gammaproteobacteria sequences, excluding those of Escherichia coli. We then used the trained model to classify E. coli sequences and checked whether they were correctly assigned to the Gammaproteobacteria group.
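The hold-out design described above can be sketched as follows. This is a minimal illustration, not the pipeline from the thesis: the `(species, sequence)` record format, the function name `split_by_held_out_species`, and the toy sequences are all assumptions made for the example.

```python
def split_by_held_out_species(records, held_out):
    """Split (species, sequence) records into training and test sets.

    Every record whose species equals `held_out` goes to the test set,
    so the model never sees that organism during training.
    """
    train, test = [], []
    for species, seq in records:
        if species == held_out:
            test.append((species, seq))
        else:
            train.append((species, seq))
    return train, test


# Toy records (hypothetical sequences, for illustration only).
records = [
    ("Escherichia coli", "ATGAAACGCATTAGC"),
    ("Salmonella enterica", "ATGCGTACCGGTTAA"),
    ("Vibrio cholerae", "ATGTTTGACCAGTAA"),
    ("Escherichia coli", "ATGGCGATTCTGTAA"),
]

train, test = split_by_held_out_species(records, "Escherichia coli")
```

Here `train` holds only the non-E. coli records and `test` holds only E. coli, which mimics classifying an organism that is absent from the reference data.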

The tests were performed with test sequences of 100, 400 and 1000 bp to evaluate the influence of sequence length on the prediction. The simulations used the following DNA measurements as SVM input: GC content; di-, tri- and tetraplet entropy; di-, tri- and tetranucleotide frequencies (2-, 3- and 4-mers); dinucleotide abundance; and tetranucleotide-derived z-score correlations (TETRA).
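Several of the measurements listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the definitions used in the thesis: the k-plet entropy is taken to be the Shannon entropy (in bits) of the k-mer frequency distribution, the dinucleotide abundance is taken to be the simple odds ratio f(XY) / (f(X) f(Y)) without strand symmetrization, and fragments are cut as consecutive non-overlapping windows. TETRA is not covered here, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2

ALPHABET = "ACGT"


def fragments(seq, size):
    """Cut seq into consecutive, non-overlapping fragments of exactly `size` bases."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    valid = [b for b in seq.upper() if b in ALPHABET]
    if not valid:
        return 0.0
    return sum(b in "GC" for b in valid) / len(valid)


def kmer_frequencies(seq, k):
    """Relative frequency of every possible k-mer over A, C, G, T.

    Overlapping windows are counted; windows containing other symbols
    (such as N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def kmer_entropy(seq, k):
    """Shannon entropy in bits of the k-mer frequency distribution (assumed definition)."""
    freqs = kmer_frequencies(seq, k)
    return -sum(f * log2(f) for f in freqs.values() if f > 0)


def dinucleotide_abundance(seq):
    """Odds ratio f(XY) / (f(X) * f(Y)) for each dinucleotide (assumed definition)."""
    mono = kmer_frequencies(seq, 1)
    di = kmer_frequencies(seq, 2)
    result = {}
    for xy, f in di.items():
        expected = mono[xy[0]] * mono[xy[1]]
        result[xy] = f / expected if expected > 0 else 0.0
    return result


def feature_vector(seq):
    """Concatenate the measurements into one fixed-length numeric vector."""
    features = [gc_content(seq)]
    features.extend(kmer_entropy(seq, k) for k in (2, 3, 4))
    for k in (2, 3, 4):
        freqs = kmer_frequencies(seq, k)
        features.extend(freqs[m] for m in sorted(freqs))
    rho = dinucleotide_abundance(seq)
    features.extend(rho[m] for m in sorted(rho))
    return features
```

With these definitions every fragment maps to a vector of the same length (1 + 3 + 16 + 64 + 256 + 16 values), regardless of whether it is 100, 400 or 1000 bp long, which is what a classifier such as an SVM needs as input.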

We tested sets of measurements composed of all parameters except one, to compare the relative impact of each measure. We found that the sets that excluded TETRA showed lower predictive power for most of the sizes tested, especially for 100 bp. The other sets showed AUC values higher than 0.7 for the prediction of unknown sequences.
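The leave-one-measurement-out comparison and the AUC metric can be illustrated with a short sketch. The group names in `FEATURE_GROUPS` and the way the measurements are grouped are assumptions for this example, and the labels and scores in the usage lines are made up. The AUC is computed from its rank interpretation: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half.

```python
# Hypothetical grouping of the measurements; the thesis may group them differently.
FEATURE_GROUPS = [
    "gc_content",
    "kplet_entropy",
    "kmer_frequencies",
    "dinucleotide_abundance",
    "tetra",
]


def leave_one_group_out(groups):
    """Yield (excluded, remaining) pairs, dropping one group at a time."""
    for excluded in groups:
        yield excluded, [g for g in groups if g != excluded]


def auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and classifier scores, for illustration only.
perfect = auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
partial = auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
ablations = list(leave_one_group_out(FEATURE_GROUPS))
```

In an ablation like the one described, a model would be trained once per `(excluded, remaining)` pair, and a drop in AUC when a group is excluded suggests that the group carries information the others do not.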

Using sequence features is an interesting approach for characterizing sequences from organisms that have not been fully sequenced and for characterizing the taxonomic composition of different environments.