Cancer Detection Using Biclustering

Published Date: 02 Nov 2017

Uma Sahu, Antony John, Ancy Alphonso, Amit Kamath

Computer Department

Don Bosco Institute of Technology (DBIT)

Kurla, Mumbai

[email protected]

Amiya Tripathy

Computer Department

Indian Institute of Technology Bombay

Mumbai

[email protected]

Abstractâ€” The extraction and identification of gene groups with similar expression pattern, plays an important role in the analysis of genes. Clustering and biclustering methods are the primary techniques involved in analyzing gene expresion data which include grouping of genes, classification of genes and classification of a sample. Apart from classical clustering methods, biclustering is being preferred to analyse biological datasets, due to its ability to group both genes across conditions simultaneously. In this paper the spot is discovered by us in a breast where a tumor will be detected. Furthermore, the probability of the tumor to identify its type is also taken by us i.e. benign, suspicious or malignant.

Keywords- Biclustering, Image Processing, Data mining

I. Introduction

In 2007, cancer is responsible for near about 7.6 million peopleâ€™s live in world [11]. This figure goes on increasing day by day, year by year. When cancer starts its growth process it starts from one abnormal cell. The abnormal cell divides into 2 abnormal cells, then 4 cells, and so forth throughout cancer stages. This process of cell division performs at various rates. An aggressive type of cancer may double its size in 4 weeks while a slower growing cancer may take up to 7 months. Over a period of up to 5 years, a cancer may duplicate itself up to 20 times [11]. Although there are various traditional methods for detecting cancer, an automated system was developed known as CAD (Computer Aided Detection) [12].

Cancer is abnormal growth of cells which never die. Regular cells in the body follow a systematic way of growth, separation, and destruction. Programmed cell death is termed as apoptosis and when this process resolves, cancer cells are formed. Unlike regular cells, cancer cells do not experience programmatic death and instead continue to grow and divide which results in a mass of abnormal cells that grows out of control. There are over 100 different types of cancer and each is classified by the type of cell that is initially affected. Cancer harms the body when damaged cells divide uncontrollably to form masses of tissue which are also called as tumors. Tumors that stay in one spot and demonstrate limited growth are generally considered to be benign. When a tumor successfully spreads to other parts of the body and grows, invading and destroying other healthy tissues, it is said to have metastasized. This process is called metastasis and its result is a serious condition that is very crucial for treatment [11].

Breast cancer is a malignant tumor that starts in the cells of the breast. A malignant tumor is a group of cancer cells that can grow into (invade) surrounding tissues or spread (metastasize) to distant areas of the body. The disease occurs almost entirely in women, but men can get it, too. Despite an increased global effort to end breast cancer, it continues to be the most common cancer and the second leading cause of cancer deaths in women in the United States.Â In 2011, an estimated 230,480 new cases of breast cancer are expected among women in the United States. The number of victims of this disease can reach 40,000 or more each year. Thus it is very important to have more research on breast cancer and various methods to detect it [13].

Bbiclustering was first used by Cheng and Church [5], [6] in gene expression data analysis. It belongs to a distinct class of clustering algorithms that perform simultaneous row-column clustering. Biclustering algorithms have also been proposed and used in some application fields such as co-clustering, bi-dimensional clustering, two-mode clustering and subspace clustering [3]. Biclustering is an important technique in two way data analysis. Biclustering is an extremely useful data mining tool used for identifying patterns, where different genes are correlated based on the subset of conditions in the gene expression dataset. This methodology is effectively applied to extract finer details about the behaviour of genes under certain experimental samples [9]. Thus biclustering can be very well used for detecting cancer.

Mammography still remains the first step for breast cancer screening and investigation though it is less accurate in patients with dense breast tissue, implants or other factors that result in complex breast tissue. There are various testing options beyond mammography for breast cancer diagnosis like Computed Tomography (CT), Magnetic Resonance Imaging (MRI), and Extreme Drug Resistance (EDR). There are also cancer detection methods available using tumors and risk markers.

Many biclustering algorithms have been used so far [2], [5]. Some of them are explained as follows:

Spectral Biclustering

Spectral biclustering approaches use techniques from linear algebra to identify bicluster structures in the input data. In this model, it is assumed that the expression matrix has a hidden checkerboard-like structure that we try to identify using eigenvector computations. The spectral algorithm was applied to human cancer data and its results were used for classification of tumour type and identification of marker genes.

The SAMBA Algorithm

The SAMBA algorithm [5],[6] (Statistical-Algorithmic Method for Bicluster Analysis) uses probabilistic modelling of the data and graph theoretic techniques to identify subsets of genes that jointly respond across a subset of conditions, where a gene is termed responding in some condition if its expression level changes significantly at that condition w.r.t. its normal level

Cheng and Churchâ€™s Algorithm

The algorithm constructs one bicluster at a time using a statistical criterion â€“ a low mean squared residue (the variance of the set of all elements in the bicluster, plus the mean row variance and the mean column variance). Once a bicluster is created, its entries are replaced by random numbers, and the procedure is repeated iteratively specific effects. After removing row, column and submatrix averages, the residual level should be as small as possible. To discover more than one bicluster, Cheng and Church [5], [6] suggested repeated application of the biclustering algorithm on modified matrices.

In this study, a mammographic image is given as an input to the system. By using image processing the algorithm filters out the unwanted part from the mammographic image. It considers only the white part from the entire mammographic image and discards the black portion from the mammographic image, thus emphasizing the most significant portion of the image which would be useful further processing. A specific threshold value is then decided above which there are the chances for occurrence of cancer.

In order to study the various ways for detection of cancer the most important thing is getting the mammographic images of the cancerous patients. The most renowned hospital for cancer is TATA Memorial Hospital. We got some mammographic images of cancerous patients from TATA Memorial Hospital [4]. With the help of some of the doctors of that hospital it was very easy to collect all the required information about the breast cancer such as the various methods available to detect it, what new can be introduced in the field.

II. MATERIALS AND METHODS

Biclustering algorithms simultaneously cluster both rows and columns. These types of algorithms are applied to gene expression data analysis to find a subset of genes that exhibit similar expression pattern under a subset of conditions [3]. We used the concept of Gene Expression Data Matrix to detect cancer.It will be working with an nxm data matrix, where each element aij will be a given real value. In the case of gene expression matrices, aij represents the expression level of gene i under condition j. Table 1 illustrate the arrangement of a gene expression matrix [3], [7].

TABLE 1. GENE EXPRESSION DATA MATRIX

Condition 1

â€¦

Condition j

â€¦

Condition c

a11

â€¦

a1j

â€¦

a1c

Gene â€¦

â€¦

ai1

â€¦

aij

â€¦

aic

Gene â€¦

â€¦

ar1

â€¦

arj

â€¦

arc

Let us consider the general case of a data matrix P, with set of rows A and set of columns B i.e, the data matrix is P=(A,B) where the element aij corresponds to a value representing the relation between row i and column j. Such a matrix P, with n rows and m columns, is defined by its set of rows, A= {a1,â€¦,ar} and its set of columns, B= {b1,â€¦,bc}. We will use (A, B) to denote the matrix P. Consider the data matrix P as mentioned above. We define a group of genes as a subset of genes that shows similar behavior across the set of all conditions. Similarly, a group of conditions is a subset of conditions that exhibit similar behavior across the set of all genes. A bicluster is a subset of genes that exhibit similar behavior across a subset of conditions, and vice versa [3].

For detecting cancer the most important step is data collection, which includes getting the mammographic images of the tissues of cancerous patients. Certain mammographic images of cancerous patients were acquired from TATA Memorial Hospital [4]. Biologically cancer is detected using a technique known as microarray. A microarray is typically a glass slide on to which DNA molecules are fixed in an orderly manner at specific locations called spots. A microarray may contain thousands of spots and each spot may contain a few million copies of identical DNA molecules that uniquely correspond to a gene. The DNA in a spot may either be genomic DNA or short stretch of oligo-nucleotide strands that correspond to a gene. It is used for image processing, transformation and normalization. [1], [6]

An attempt has been made to develop a computerized system for detection of cancer which makes use of image processing technique. Image processing involves the following steps:

Identification of the spots and distinguishing them from spurious signals.

Determination of the spot area to be surveyed, determination of the local region to estimate background hybridization.

Reporting summary statistics and assigning spot intensity after subtracting for background intensity.

After image processing it would transform this image to an n X n matrix. This matrix helps in easy grouping of genes, classification of a new gene and classification of a new sample. Given the data matrix A, as defined above, it define a cluster of rows as a subset of rows that shows similar behavior across the set of all columns. Similarly, a cluster of columns is a subset of columns that exhibit similar behavior across the set of all rows. Now after converting the image to matrix format it considers a threshold value for determining whether it is a cancerous tissues image or not. A schematic diagram of our implementation is as follows:

Figure 1. Schematic representation of cancer detection application

A mammographic image is given as input to the system, which then is converted to matrix. Now by considering some specific threshold value, the final result can be determined. If the result is above the threshold value, then there are chances for that person to have cancer. If it is below the threshold value, then no cancer is present (figure 1).

For executing this, software known as Visual Studio has been used; this made this process very simple. We need to first store all the mammographic images in one folder in the computers drive. Using Visual Studio it is then very simple to browse the image as shown in figure (figure 2)

After performing the above procedure the result can be obtained. It could detect the presence of cancer in the selected image. In Visual Studio, when we click on â€˜STARTâ€™ button, the execution of program starts. It then eliminates the black spots and takes into consideration only the white spots. The place where the density of white spots is more is considered as the place where cancerous tumor may be present. Such place gets highlighted within a red circle. Since itâ€™s a computerized system one cannot totally rely on the result. Thus, the thought for flashing the probability of the presence of cancerous tumor came into picture.

For knowing the probability of the presence of cancer, the following mathematical formula can be used:

Probability = WhitePixelCount * 100

TotalPixelCount

Figure 2: Screenshot after browsing the image

III. RESULTS AND DISCUSSIONS

After studying various mammographic images, many cancerous and non-cancerous tumors were detected by carrying out the above procedure. When the result exceeds the threshold value, the area where the density of white spots is more gets highlighted with a red circle. The absence of this red circle proves that the tumor is non-cancerous.

Figure 3. Screenshot of 0% probability

A non-cancerous tumor is being detected in the above figure (figure 3). Since there are no red colored circles, it is can be concluded that there is no cancer.

Figure 4. Screenshot of having cancer with 6% probability

The presence of white spots is too low (0%-15%) (Figure 4). Thus it is very clear that the probability of that tumor having cancer would also be very low. Thus only a very small part from the entire tissue is highlighted.

Figure 5. Screenshot of having cancer with 53% probability

The presence of white spots is 53% (figure 5). Thus it is clear that the probability of having cancer would be more i.e. more than 50% is the probability.

Figure 6. Screenshot of having cancer with 87% probability

There are many (above 85%) white spots in this mammographic image (figure 6). Thus it is clear that the probability of having cancer would be greater i.e. more than 80% is the probability.

Conclusion

Thus in this paper have a comprehensive survey on Biclustering, the various algorithms used is been presented. Biclustering is a relatively young area & it has a great potential to make significant contributions to biology and to other fields. Many other applications in biological data analysis, gene network identification, data mining, and collaborative filtering remain to be explored. We have also seen how microarray helps in biclustering through image processing, transformation and data normalization. There are various applications for biclustering. It analyzes data from different individuals suffering from different types of cancer. It contains data collected from several individuals with a particular cancer or healthy people. As a next step, we can do test on various kinds of cancerous tissue and can try finding the present stage of the cancerous patient.

Acknowledgment

This work is carried out in Don Bosco Institute of Technology, Project Lab. The authors would like to thank Tata Memorial Hospital for their support and expertise.

Our Service Portfolio

Want To Place An Order Quickly?

Then shoot us a message on Whatsapp, WeChat or Gmail. We are available 24/7 to assist you.

Do not panic, you are at the right place

Visit Our essay writting help page to get all the details and guidence on availing our assiatance service.

Get 20% Discount, Now
£19 £14/ Per Page
14 days delivery time

Our writting assistance service is undoubtedly one of the most affordable writting assistance services and we have highly qualified professionls to help you with your work. So what are you waiting for, click below to order now.

Get An Instant Quote

ORDER TODAY!

Our experts are ready to assist you, call us to get a free quote or order now to get succeed in your academics writing.

Get a Free Quote Order Now