Keywords: Clustering, self-organizing maps, microarray
2.1 Introduction
Machine learning can be described as enabling machines (computers) to learn from their environment through experience: the machine is given tasks whose performance can be measured with suitable metrics, and it improves on those tasks as it gains experience. This broad field is subdivided into a few areas, as described below.
Supervised learning: a learning system in which both the input data and the corresponding output are provided, the output being dependent on the input. Having learned from the data provided, this approach predicts labels for newly given data.
Reinforcement learning: a goal-oriented approach that operates in an interactive environment and works through a feedback system of rewards and punishments based on the interaction with the data and its outcomes.
Unsupervised learning: an approach that explores the hidden patterns in the input provided, because the output is unknown. The algorithms are applied directly to the dataset, and the structure they uncover is the resulting outcome [1].
Because biological data is vast, owing to complex protein structures and genome sequences, understanding and deciphering the function of cells is difficult. To study the basic biological processes, machine learning approaches pave the way for developing tools, software and algorithms that make this work easier. This chapter introduces unsupervised learning approaches, algorithms and their use in bioinformatics, an interdisciplinary field of science that brings together biology, statistics and computer science in order to analyse and assess huge amounts of biological data [2].
In the unsupervised learning approach, the machine learns from the dataset given as input and labels or groups the data accordingly [1]. This can also be referred to as self-organization, where the algorithm structures the data based on the input provided, with minimal human intervention. The approach draws out the hidden patterns that exist in the data and also reveals the relationships among those patterns.
Unsupervised learning commonly relies on a few families of techniques [3]:
Clustering
Association
Anomaly detection
Latent variable models
Dimensionality reduction.
Figure 2.1 Machine learning in bioinformatics.
Of the above approaches, this chapter explores the algorithmic techniques that are widely applied in bioinformatics.
Unsupervised learning in bioinformatics: machine learning in bioinformatics spans six areas [6], as shown in Figure 2.1.
Genomics and proteomics: the complete set of genes in a cell of an organism is called the genome. Genes are segments of DNA that are transcribed into RNA (mRNA, messenger RNA), which is in turn translated into proteins [7]. Every cell of an organism depends on proteins, and the set of proteins it contains (the proteome) is dynamic, because different tissues produce different sets of proteins. This dynamic nature is determined by gene expression. Unlike proteomes, genomes are essentially constant. The set of proteins present in a cell provides insights into the structure and function of that cell [8]. Gene expression data is difficult to handle manually because of its size. Machine learning approaches such as clustering algorithms are therefore applied to varied gene expression data in order to group tissues with similar functions and structures and to uncover hidden information.
2.2 Clustering in Unsupervised Learning
Clustering is a technique that groups an unlabeled dataset according to the similarity and characteristics of the data, producing a structured output [4]. Popular clustering algorithms [5] include:
Partition-based (k-means)
Hierarchical (AGNES: agglomerative nesting)
Density-based (DBSCAN: density-based spatial clustering of applications with noise)
Model-based (SOM: self-organizing maps)
Grid-based (STING: statistical information grid)
Soft clustering (FCM: fuzzy c-means).
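As a rough illustration of the partitioning idea behind k-means, the following is a minimal Python sketch of Lloyd's algorithm. It is not taken from the cited works: the function name kmeans, the choice of the first k points as initial centroids (practical implementations usually seed randomly or with k-means++), and the toy data are all assumptions made for the example.

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; the first k points are used as initial centroids
    (an assumption made only to keep this sketch deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical toy data: two well-separated groups in two dimensions.
toy_points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
labels, centroids = kmeans(toy_points, k=2)
print(labels)
```

With this toy data the points alternate between the two groups, so the returned labels alternate as well. The other families in the list differ mainly in how they define a cluster (a merge tree, a dense region, a map node, a grid cell, or a membership degree) rather than in this assign-and-update loop.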
2.3 Clustering in Bioinformatics—Genetic Data
Clustering algorithms in bioinformatics are mostly used to interpret the salient information in gene expression data, which helps in understanding the biological processes of an organism. Gene expression varies across clinical conditions, tissues and organisms. The observations drawn under these conditions improve the study and analysis of gene function, which in turn supports drug discovery based on the diseased area and on how gene behavior changes with disease [28]. Because the number of genes is very large, the data is hard to interpret directly. Clustering techniques reveal the hidden patterns, giving a better understanding of how genes function and of the cellular and biological processes of a cell [9]. Clustering can be either hard or overlapping. In hard clustering each gene is assigned to exactly one cluster, whereas in overlapping clustering a gene can belong to several clusters, with a degree of membership in each [10].
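The contrast between hard and overlapping clustering can be sketched in Python as follows. This is an illustrative assumption rather than the method of the cited works: the hard assignment picks the single nearest cluster centre, while the overlapping version uses the membership formula from fuzzy c-means with fuzzifier m = 2. The function names, the centres and the gene profile are hypothetical.

```python
import math

def hard_assignment(profile, centroids):
    """Hard clustering: index of the single nearest cluster centre."""
    return min(range(len(centroids)), key=lambda c: math.dist(profile, centroids[c]))

def soft_memberships(profile, centroids, m=2.0):
    """Overlapping clustering: fuzzy c-means style membership degrees,
    one per cluster, that sum to 1."""
    dists = [math.dist(profile, c) for c in centroids]
    # A profile lying exactly on a centre belongs fully to that cluster.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_c / d_j) ** power for d_j in dists) for d_c in dists]

centroids = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical cluster centres
gene = (1.0, 0.0)                     # hypothetical expression profile

hard = hard_assignment(gene, centroids)
soft = soft_memberships(gene, centroids)
print(hard, soft)
```

Here the gene lies at distance 1 from the first centre and 3 from the second, so the hard assignment selects the first cluster, while the soft memberships split roughly 0.9 to 0.1 between the two.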
2.3.1 Microarray Analysis
Traditional genomic studies centered on single genes; in contrast, microarray analysis is a technology in which the expression levels of thousands of genes are measured at once on a microscope slide carrying the chips that collect the data, which are referred to as gene chips or DNA chips [12].
In microarray analysis, mRNA molecules are gathered from a reference sample (e.g. a sample from a diseased patient) and a test sample (molecules from any individual). The two samples are combined on the chip using probes; genes whose expression is found to be similar are placed in one cluster, and those found to be dissimilar are placed in another. The clusters are then labeled according to these similarities [12].
During microarray analysis, the data gathered from microarray images is used to construct a matrix (see Table 2.1) in which rows hold genes and columns hold different conditions, viz. different tissues, clinical conditions and biological processes. Each cell of the matrix holds the expression level of one gene in one sample, so every value is uniquely indexed by its gene (row) and its sample (column).
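A gene expression matrix of this shape can be sketched in plain Python as a list of rows. The gene names, sample names and expression values below are hypothetical placeholders, not data from this chapter; the helper functions are likewise assumptions introduced for the example.

```python
import math

genes = ["gene_1", "gene_2", "gene_3"]  # hypothetical row labels
samples = ["sample_1", "sample_2"]      # hypothetical column labels

# Hypothetical expression levels: expression[i][j] is gene i in sample j.
expression = [
    [2.5, 0.4],
    [0.3, 3.1],
    [2.4, 0.5],
]

def level(gene, sample):
    """Expression level of one gene in one sample (a single matrix cell)."""
    return expression[genes.index(gene)][samples.index(sample)]

def gene_profile(gene):
    """One row: the expression of a gene across all samples."""
    return expression[genes.index(gene)]

def sample_column(sample):
    """One column: the expression of all genes in a sample."""
    j = samples.index(sample)
    return [row[j] for row in expression]

# Rows are what a clustering algorithm compares: genes with similar
# profiles lie close together under a distance such as the Euclidean one.
near = math.dist(gene_profile("gene_1"), gene_profile("gene_3"))
far = math.dist(gene_profile("gene_1"), gene_profile("gene_2"))
print(level("gene_2", "sample_2"), near < far)
```

Storing genes as rows means each row is directly the feature vector passed to a clustering algorithm; clustering samples instead would use the columns.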
Table 2.1 Gene expression data matrix representation.
Sample 1 | Sample 2 |
|