Robust Fuzzy Cluster Ensemble on Cancer Gene Expression Data
In the past few decades, there has been tremendous growth in the scale and complexity of biological data generated by emerging high-throughput biotechnologies, including gene expression data generated by microarray technology. A single high-throughput gene expression data set may contain expression measurements for thousands or millions of genes, which gives us the opportunity to explore the cell on a genome-wide scale. Finding patterns in genomic data is a very important task in bioinformatics research and biomedical applications, and many clustering algorithms have been applied to gene expression data for this purpose. Nonetheless, clustering gene expression data still poses a number of challenges because of the specific characteristics of such data and the special requirements of the biology domain. Noise and high dimensionality are among the top challenges. In this dissertation, we propose a novel fuzzy cluster ensemble methodology that is effective and efficient in addressing both. Its base clusterings are produced by an improved fuzzy clustering approach run with different initializations, in order to reduce the impact of noise and to improve accuracy and stability in general. The improved fuzzy clustering approach uses new weighted fuzzy techniques in computing cluster centers and assigning feature vectors, to avoid or alleviate the effects of noise. We conducted extensive experiments with our methodology on both real cancer gene expression data sets and synthetic noisy data sets created by introducing different percentages of artificial noise into real cancer gene expression data sets. We chose an external clustering validity measure to evaluate domain meaningfulness. For the experiments on real cancer gene expression data sets, the results were evaluated by comparison with numerous benchmark clustering and cluster ensemble algorithms.
We also conducted parameter analysis over a range of parameter settings, complexity analysis of time and space cost, and noise robustness analysis on the synthetic noisy data sets. The results on real cancer gene expression data sets have proved to be biologically and medically meaningful. Our methodology is the top performer on three of the eight data sets, more than any other method evaluated, and it performs well on most of the other data sets. Additionally, our methodology has proved to be stable under varying parameter settings. The complexity analysis shows that it is computationally efficient and scalable to high-dimensional data sets. In the noise robustness experiments, the methodology proved to be robust even on highly noisy data.
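The general idea of a fuzzy cluster ensemble built from base clusterings with different initializations can be sketched as follows. This is a minimal illustration and not the methodology of the dissertation: it uses standard fuzzy c-means rather than the improved weighted variant described above, and the consensus step (aligning each run's clusters to the first run and averaging the memberships) is an assumption chosen for brevity. The function names `fuzzy_c_means`, `align_to_reference`, and `fuzzy_ensemble`, as well as the parameter defaults, are hypothetical.

```python
import itertools
import numpy as np


def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Standard fuzzy c-means (not the dissertation's weighted variant).

    Returns (centers, U), where U has shape (n_samples, c) and rows sum to 1.
    """
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)  # random initial memberships
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every sample to every center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)  # guard against division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


def align_to_reference(U_ref, U):
    """Reorder the columns of U to best match U_ref.

    Tries every permutation, so this is only practical for a small
    number of clusters.
    """
    c = U.shape[1]
    overlap = U_ref.T @ U  # overlap[i, j]: agreement of ref cluster i with cluster j
    best = max(
        itertools.permutations(range(c)),
        key=lambda p: sum(overlap[i, p[i]] for i in range(c)),
    )
    return U[:, list(best)]


def fuzzy_ensemble(X, c, n_runs=10, m=2.0):
    """Average aligned memberships over runs with different initializations."""
    runs = [fuzzy_c_means(X, c, m=m, seed=s)[1] for s in range(n_runs)]
    ref = runs[0]
    aligned = [ref] + [align_to_reference(ref, U) for U in runs[1:]]
    U_avg = np.mean(aligned, axis=0)
    return U_avg, U_avg.argmax(axis=1)  # consensus memberships, hard labels
```

Calling `fuzzy_ensemble(X, c)` on an expression matrix `X` with samples as rows returns the averaged membership matrix and a hard label per sample. Averaging over several initializations is what damps the sensitivity of any single run to its starting point.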