Our research

Multifarious aspects of the chaos game representation and its applications in biological sequence analysis

The development and applications of the chaos game representation (CGR) have mainly concerned linear nucleotide sequences, although there have also been some attempts to create a representation of proteins. The latter needs to be more sophisticated, because arbitrary coordinates for amino acids do not reflect their properties, which is crucial during the encoding process. In this paper, we summarise the variants of CGR and their limitations. We began by studying the PROSITE motifs and showed the immense number of amino acid properties employed by different proteins. To this end, we used Principal Component Analysis (PCA) and studied the relation between explained variance and the number of features that describe them. Even after many reductions, about 50 features turned out to be non-redundant. For this reason, we introduced the concept of embedding from natural language processing, which enables adjusting features for a given list of sequences. We presented a simple neural network architecture with one hidden layer containing a single neuron and showed that it provides satisfactory results in phylogenetic tree construction for the ND5 and SPARC proteins. To build the trees, we transformed the CGR representations of all considered sequences using the Discrete Fourier Transform (DFT) and applied the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) algorithm. Moreover, we indicated some similarities between CGR and Recurrent Neural Networks (RNN). Finally, we attempted to include information about RNA secondary structure and defined some measures to validate biological significance. We studied their properties and showed their usefulness on the ALMV-3 example.
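For readers unfamiliar with the basic construction, the following is a minimal sketch of the classic CGR for a nucleotide sequence, not the authors' implementation. It assumes the commonly used corner layout (A, C, G, T at the corners of the unit square) and a start at the centre; the function name `cgr_points` and the handling of ambiguous symbols are illustrative choices. The protein variants, embeddings, DFT and UPGMA steps described above are not covered.

```python
# Assumed corner layout for the unit square (a common convention for nucleotide CGR;
# the abstract does not state which layout the authors use).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence):
    """Return the list of CGR points (x, y) for a nucleotide sequence."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in sequence.upper():
        if base == "U":  # treat RNA uracil like thymine
            base = "T"
        if base not in CORNERS:
            continue  # illustrative choice: skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the corner of this base.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_points("ACGT"):
        print(point)
```

Each point encodes the whole prefix read so far: the quadrant of a point identifies the most recent base, the sub-quadrant the one before it, and so on.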

Kania A, Sarapata K. Multifarious aspects of the chaos game representation and its applications in biological sequence analysis. Comput Biol Med. 2022 Dec;151(Pt A):106243. doi: 10.1016/j.compbiomed.2022.106243. Epub 2022 Oct 25. PMID: 36335814.