Application Of Genetic Algorithm For Discovery

02 Nov 2017

Ming Yang1, Josiah Poon2*, Shaomo Wang1, Lijing Jiao1, Simon Poon2, Lizhi Cui2, Peiqi Chen1, Daniel Man-Yuen Sze3, Ling Xu1*

1Longhua Hospital Affiliated to Shanghai University of TCM, Shanghai 200032, China

2School of Information Technology, University of Sydney, Sydney, NSW 2006, Australia

3Department of Health Technology and Informatics, The Hong Kong Polytechnic University, HKSAR, China

Abstract: Research on Core and Effective Formulae (CEF) not only summarizes Traditional Chinese Medicine (TCM) treatment experience but also helps reveal the underlying knowledge in the formulation of a TCM prescription. In this paper, CEF discovery from tumor clinical data is discussed. The concepts of confidence, support and effectiveness of a CEF are defined. A Genetic Algorithm (GA) is applied to find CEF in a lung cancer dataset with 595 records from 161 patients. The GA yielded 9 CEF with positive fitness values, composed of 15 distinct herbs. All the CEF had relatively high average confidence and support. An herb-herb network was constructed, and it showed that all the herbs in the CEF are core herbs. The dataset was divided into a CEF group and a non-CEF group; the effective proportion of the former was significantly greater than that of the latter. A formula consisting of these 15 core herbs was also evaluated. Though this formula was highly effective, its support was too small for it to be considered important. A Synergy Index (SI) was defined to evaluate the interaction between two herbs. Four pairs of herbs had significantly high SI values, indicating synergy between the herbs. All the results agreed with TCM theory, which demonstrates the feasibility of our approach.

1. Introduction

Traditional Chinese medicine (TCM) has been developed and practised in China for thousands of years, and herbal prescription has played a key role in its therapeutics. A large number of herbal prescriptions have been recorded over the years, in which valuable TCM knowledge is hidden. It is urgent and critical to analyse these data so that TCM models can be developed in the modernisation of this ancient knowledge. Although TCM is still in practice and more countries consider it an alternative treatment method, the principle of formulating a TCM prescription remains unknown. However, it is a daunting task to analyse such a large dataset manually. Knowledge discovery in databases (KDD) methods have been suggested as viable approaches.

KDD allows TCM researchers to find interesting patterns efficiently, and these patterns may direct further laboratory work that leads to discovery. Many successful projects have been reported. For example, Wang et al. illustrated the use of structural equation modeling (SEM) to explore the diagnosis of sub-optimal health status (SHS) and provided evidence for the standardization of TCM patterns. A multi-label learning model was introduced for TCM syndrome identification. Complex networks were built for clinical data mining in TCM.

Generally, KDD research in TCM can be divided into two main categories. The first attempts to extend our understanding using existing TCM knowledge, while the second attempts to identify core knowledge from existing TCM data, so that each piece of extracted knowledge can be further validated using scientific evidence. This paper belongs to the latter category and, in particular, focuses on the study of TCM formulae from clinical data.

The efficacy of a formula can be interpreted as a collaboration of its member herbs. It is common to find that most prescriptions are built on some relatively small fixed composition(s), which can be called core formulae (CF). Herbs are usually added to and/or removed from a CF in order to personalise the treatment. For example, although there are 113 prescriptions in one of the greatest TCM classics, "Shang Han Lun", only 8 CFs exist, such as GuizhiTang, which forms the basis of GuizhiJiaGui Tang, GuizhiXinjia Tang, Gegeng Tang and DangguiSini Tang.

Research on CFs not only summarises Traditional Chinese Medicine (TCM) treatment experience but also helps reveal the underlying knowledge in the formulation of a TCM prescription. Several computational models have been proposed in the past decade to mine TCM formulae, such as factor analysis, information-theory-based association rule algorithms or clustering methods, machine learning models, latent tree (LT) models and network analysis. These methods can reveal the core herbs and herb-collaboration patterns in TCM prescriptions or uncover the relationship between herbs and symptoms, but they seldom consider the related clinical effect. In clinical practice, a number of herbs are combined to form a formula and different formulae are prescribed to different patients, but not all the formulae are effective. It is vital to determine whether a herb combination is effective or not in order to arrive at the valuable formulae. These core and effective formulae (CEF) are of great interest to TCM practitioners as well as to pharmaceutical companies that manufacture medicine using Chinese herbs.

Integrated tumor treatment using Chinese and western medicine is becoming standardized in China and has become an important method of prevention and treatment. Many clinical studies have concluded that TCM is effective and potentially meets the demands of integrated treatment with multi-target therapeutics. Although the current approach to evaluating cancer treatment still uses tumor response and survival as the main indices, TCM treats the patient as a whole rather than just the tumor, which means that the overall effect should be evaluated instead. Many researchers have suggested the use of Quality of Life (QOL) as a proxy to evaluate the efficacy of TCM treatment. More specifically, treatment efficacy is assessed via the reduction in symptom severity. For that reason, herb combination patterns that significantly improve symptoms can be regarded as core and effective formulae in TCM tumor clinics.

Therefore, a major goal of this paper is to discuss approaches and strategies for the discovery of core and effective formulae (CEF) in tumor clinical data. A genetic algorithm (GA), a search heuristic that mimics the process of natural evolution, was applied. GA generates solutions to optimization problems using techniques inspired by biological evolution, such as inheritance, mutation, selection, and crossover. This is similar to the process of TCM development: prescriptions were created as different herbal combinations for various symptoms, and only the effective prescriptions made their way into texts and records. Practitioners then used these effective prescriptions, adapting them to create more effective ones. Our previous work has shown that, given a proper fitness function and search space, GA is suitable for complex combinatorial optimization in TCM.

This paper is organized as follows. The Materials and Dataset section describes the data acquisition process and the data. The Method section contains the definitions related to the assessment of CEF and a description of the genetic algorithm, including the definition of the fitness function. In the Results section, a complex network is presented to address the analysis of core herbs in combinations of prescriptions, and the analysis of herb-herb interactions is performed.

2. Materials and Dataset

2.1. Data Source

The dataset used in this paper came from the in-patient lung cancer (LC) records of Shanghai Longhua Hospital of TCM. A total of 161 patients at different stages of LC (both early and metastatic disease) who received only TCM therapy were included. Their prescriptions and symptoms were recorded from Feb 2010 to Aug 2012. Fifteen LC symptoms were recorded:

cough

expectoration

shortness of breath

chest tightness

chest pain

fatigue

loss of appetite

bloody sputum

dry mouth and throat

fever

spontaneous and night sweating

insomnia

diarrhea

nocturia

five upset hot

A 4-point response scale (0 – not at all, 1 – a little, 2 – quite a bit, 3 – very much) was used to indicate the severity of the symptoms. Since the efficacy of a prescription becomes known only when the patient is seen again at the next consultation, only patients with multiple records (visits) were chosen in order to evaluate the efficacy of prescriptions and to find the TCM treatment principles.

2.2. Data Preprocessing and Description

There were 595 transaction records for the 161 patients; each patient had between 1 and 9 visits, and the average number of transaction records per patient was close to 4. Each record contains the symptom information and the corresponding prescription. The interval between two visits was one or two weeks, during which the patient took the same prescription. After excluding the patients who had only one visit, 586 transaction records for 152 patients remained, in which a total of 230 different herbs were used. In the next stage, the symptom score in each record was calculated as follows:

where scorei represents the score for symptom i, and m represents the number of symptoms. The Symptom Change Value (SCV) was calculated using the following formula:

Figure 1. An illustrative example for data format and SCV calculation

An illustrative example of the data format and SCV calculation is shown in Figure 1, where there are 10 transaction records for 4 patients, with 1 to 4 visits each. The first patient is excluded because of his single visit. Since the symptom score for evaluating a prescription is recorded at the next (following) visit, SCV1, which evaluates prescription "P2", is calculated from "SS2" and "SS3". Thus every prescription that belongs to neither a single visit nor a last visit has a corresponding SCV.

After removing missing values, 419 SCVs for 150 patients were obtained. According to TCM theory, a prescription is effective when its SCV is greater than or equal to 30%; in that case the outcome is positive and is coded as 1, otherwise the outcome is coded as 0.
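The scoring steps above can be sketched as follows. The two formulas are not preserved in this text, so the sketch assumes that the symptom score is the plain sum of the symptom ratings in a record and that SCV is the relative reduction of that score at the following visit. Both forms, and all function names, are assumptions made for illustration only; the 30% threshold is taken from the text.

```python
from typing import List, Optional

EFFECTIVE_THRESHOLD = 0.30  # SCV >= 30% counts as effective (from the text)


def symptom_score(ratings: List[int]) -> int:
    """Assumed form: sum of the 0-3 ratings over all recorded symptoms."""
    return sum(ratings)


def scv(score_now: int, score_next: int) -> Optional[float]:
    """Assumed form: relative reduction of the symptom score at the next visit.

    Returns None when the current score is 0, because the ratio is undefined.
    """
    if score_now == 0:
        return None
    return (score_now - score_next) / score_now


def outcomes(visit_scores: List[int]) -> List[Optional[int]]:
    """Binary outcome for every prescription except the one of the last visit."""
    result: List[Optional[int]] = []
    for now, nxt in zip(visit_scores, visit_scores[1:]):
        value = scv(now, nxt)
        result.append(None if value is None else int(value >= EFFECTIVE_THRESHOLD))
    return result


# Hypothetical patient with four visits and symptom scores 10, 6, 6, 5.
print(outcomes([10, 6, 6, 5]))
```

A patient with a single visit yields an empty list, which mirrors the exclusion of single-visit patients described above.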

At the end of this step, 93 of the 419 records had a positive outcome, making it an imbalanced dataset with 22.2% effective records. Summary statistics for the number of herbs are shown in Table 1. The 50 most frequently used herbs, by record and by patient, are shown in Table 2.

Table 1. Statistical information on the number of herbs

Statistic | Per record | Per patient | Average number per patient per record
minimum | 9 | 9 | 9
average | 23 | 29 | 22
maximum | 36 | 73 | 33

Table 2. Top 50 frequently used herbs

Rank | Herb (record based) | Frequency (record based) | Herb (patient based) | Frequency (patient based)
1 | Chinese Sage Herb | 395 | Chinese Sage Herb | 147
2 | Doederlein's Spikemoss Herb | 393 | Doederlein's Spikemoss Herb | 146
3 | Akebia Fruit | 359 | Akebia Fruit | 133
4 | Herba Oldenlandiae | 332 | Herba Oldenlandiae | 129
5 | Atractylis Ovata | 321 | Atractylis Ovata | 127
6 | Rice-grain Sprout | 268 | Astragalus Root | 116
7 | Malt | 268 | Pachyma Cocos | 107
8 | Pachyma Cocos | 266 | Chicken's Gizzard-membrane | 107
9 | Astragalus Root | 263 | Rice-grain Sprout | 106
10 | Chicken's Gizzard-membrane | 252 | Malt | 106
11 | Common Selfheal Spike | 239 | Common Selfheal Spike | 104
12 | Rhizoma Batatatis | 235 | Rhizoma Batatatis | 97
13 | Coix Seed | 223 | Coix Seed | 96
14 | Tangerine Peel | 195 | Coastal Glehnia Root | 86
15 | Coastal Glehnia Root | 195 | Oysters | 85
16 | Oysters | 183 | Tangerine Peel | 80
17 | Rhizoma Amorphophalli | 172 | Pericarpium Trichosanthis | 79
18 | Pericarpium Trichosanthis | 162 | Asparagus Cochinchinensis | 72
19 | Asparagus Cochinchinensis | 158 | Rhizoma Amorphophalli | 68
20 | Edible Tulip | 139 | Edible Tulip | 64
21 | Crataegus Pinnatifida | 133 | Crataegus Pinnatifida | 62
22 | Ophiopogonis Japonicus | 122 | Tatarian Aster Root and Rhizome | 58
23 | Chinese Date | 120 | Short-horned Epimedium Herb | 58
24 | Glycyrrhiza Uralensis | 119 | Pilose Asiabell Root | 58
25 | Pilose Asiabell Root | 119 | Ophiopogonis Japonicus | 57
26 | Tatarian Aster Root and Rhizome | 112 | Glycyrrhiza Uralensis | 55
27 | Chinese Taxillus Herb | 111 | Chinese Taxillus Herb | 53
28 | Short-horned Epimedium Herb | 109 | Chinese Date | 53
29 | Pinellia Tuber | 99 | Pinellia Tuber | 51
30 | Baikal Skullcap Root | 98 | Heartleaf Houttuynia Herb | 48
31 | Suberect Spatholobus Stem | 97 | Suberect Spatholobus Stem | 46
32 | Heartleaf Houttuynia Herb | 95 | Glossy Privet Fruit | 46
33 | Glossy Privet Fruit | 91 | Baikal Skullcap Root | 45
34 | Noble Dendrobium Stem Herb | 89 | Balloonflower Root | 43
35 | Chekiang Fritillary Bulb | 82 | Chekiang Fritillary Bulb | 42
36 | Paris Polyphylla Smith | 79 | Eucommia Bark | 40
37 | Almond | 78 | Paris Polyphylla Smith | 40
38 | Eucommia Bark | 76 | Barbury Wolfberry Fruit | 40
39 | Balloonflower Root | 75 | Almond | 40
40 | Barbury Wolfberry Fruit | 73 | Noble Dendrobium Stem Herb | 39
41 | Pyrrosia Leaf | 65 | Cherokee Rose Fruit | 35
42 | Cherokee Rose Fruit | 63 | Chinese Dodder Seed | 32
43 | Reed Rhizome | 57 | Pyrrosia Leaf | 31
44 | Toad Skin | 57 | Reed Rhizome | 30
45 | Radix Semiaquilegiae | 55 | Toad Skin | 30
46 | Finger Citron Fruit | 55 | Common Macrocarpium Fruit | 29
47 | Chinese Dodder Seed | 54 | Immature Bitter Orange | 28
48 | Common Macrocarpium Fruit | 53 | Radix Semiaquilegiae | 25
49 | Dragon's Bones | 51 | Radish Seed | 24
50 | Immature Bitter Orange | 51 | Dragon's Bones | 24

3. Method

The aim of this paper is to find core and effective formulae (CEF). The measure of effectiveness of a formula helps determine the efficacy of the herbal interactions in TCM, while the core-ness of the prescriptions helps us summarize the TCM treatment principle. CEF are identified in a high-dimensional search space of symptoms and herbs; hence, the discovery of CEF can be described as a complicated combinatorial optimization problem. The analytic process of this paper can be described as follows:

(1) Recognizing and defining the problem

(2) Constructing and solving a model for the problem

(3) Validating the obtained solutions

The following sections discuss the different process steps in detail.

3.1 Recognizing and defining the problem

The discovery of CEF concerns the choice of the best combination of herbs to achieve effectiveness. The typical data format is shown in Figure 2. A combinatorial optimization problem $H = (Q, f)$ can be defined by:

a set of herbs $X = \{Herb_1, Herb_2, \dots, Herb_m\}$;

an outcome variable $SCV = \{y_1, y_2, \dots, y_n\}$;

herb domains $D_1, \dots, D_m$, with $D_i = B$, where $B$ indicates the set of binary values $\{0, 1\}$;

an objective function $f$ to be maximized, where $f: D_1 \times \dots \times D_m \rightarrow \mathbb{R}$.

The set of possible feasible combinations is

$$Q = \{\, q = \{(Herb_1, v_1), \dots, (Herb_m, v_m)\} \mid v_i \in D_i \,\}$$

Figure 2. Data format for combinatorial optimization problem of discovery of CEF

$Q$ is a search space which contains all herbs in the data, as each combination of herbs can be seen as a candidate solution. To solve the combinatorial optimization problem of CEF discovery, we have to find a solution with the maximum objective function value, that is, $q^* \in Q$ such that $f(q^*) \ge f(q)$ for all $q \in Q$. Before using such a formulation, we have to select the evaluation criteria for CEF. According to the meaning of CEF, three heuristics are defined here:

Average confidence of a Core Effective Formula (CEF)

Let the prescriptions in the dataset be $x_1, x_2, \dots, x_n$, where $n$ is the number of prescriptions, and let $x^*$ be a given CEF. The average confidence of $x^*$ is

$$\bar{c}(x^*) = \frac{1}{n} \sum_{i=1}^{n} \frac{\mathrm{number}(x^* \cap x_i)}{\mathrm{number}(x^*)}$$

where number() is a counting function, $\cap$ is the intersection operator, and $\mathrm{number}(x^* \cap x_i)/\mathrm{number}(x^*)$ is the confidence of $x_i$, i.e. the percentage of herbs common to the CEF and $x_i$ with respect to the number of herbs in the CEF. The value of $\bar{c}$ is between 0 and 1 inclusive. The larger the value of $\bar{c}$, the more representative the given CEF is. When $\bar{c}$ is 1, every prescription carries all the herbs in the given CEF.

Support under a confidence $c$, $S_c$

For different confidences $c$, we define support as follows:

$$S_c = \frac{1}{n}\, \mathrm{number}\!\left(\left\{ x_i : \frac{\mathrm{number}(x^* \cap x_i)}{\mathrm{number}(x^*)} \ge c \right\}\right)$$

The higher the value of $S_c$, the more often the CEF occurs in the dataset. For example, a CEF with $S_{0.8} = 0.25$ means that 25% of the prescriptions in the dataset contain at least 80% of the herbs of the given CEF.
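The two core-ness measures above can be sketched in a few lines, treating each prescription and each CEF as a set of herb names. The function names and the toy herbs ("A" to "F", "X" to "Z") are illustrative assumptions, not part of the original study.

```python
from typing import FrozenSet, Sequence

Prescription = FrozenSet[str]


def confidence(cef: Prescription, prescription: Prescription) -> float:
    """Share of the CEF's herbs that also appear in one prescription."""
    return len(cef & prescription) / len(cef)


def average_confidence(cef: Prescription, prescriptions: Sequence[Prescription]) -> float:
    """Mean confidence of the CEF over all prescriptions."""
    return sum(confidence(cef, p) for p in prescriptions) / len(prescriptions)


def support(cef: Prescription, prescriptions: Sequence[Prescription],
            min_confidence: float) -> float:
    """Fraction of prescriptions whose confidence reaches min_confidence."""
    hits = sum(1 for p in prescriptions if confidence(cef, p) >= min_confidence)
    return hits / len(prescriptions)


# Toy data: a 5-herb CEF and four prescriptions.
cef = frozenset("ABCDE")
prescriptions = [
    frozenset("ABCDEF"),   # contains all 5 herbs -> confidence 1.0
    frozenset("ABCD"),     # 4 of 5 herbs         -> confidence 0.8
    frozenset("ABX"),      # 2 of 5 herbs         -> confidence 0.4
    frozenset("YZ"),       # no herbs in common   -> confidence 0.0
]
print(average_confidence(cef, prescriptions))
print(support(cef, prescriptions, 0.8), support(cef, prescriptions, 1.0))
```

In this toy case half of the prescriptions reach a confidence of 0.8 and a quarter contain the whole CEF, which illustrates why support can only decrease as the required confidence rises.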

Effectiveness Value (EV)

The Effectiveness Value (EV) is the difference in SCV between two groups. Let the prescriptions in the dataset be $x_i$, $i = 1, 2, \dots, n$, with effectiveness (outcome variable) $y_i$, $i = 1, 2, \dots, n$. If $x^*$ is a CEF and $\mathrm{number}(x^* \cap x_i)/\mathrm{number}(x^*) \ge c$ for a preset confidence $c$, the confidence of $x_i$ with respect to the CEF is greater than or equal to $c$; when $c$ is large, the proportion of herbs from $x^*$ used in $x_i$ is higher than the average confidence. In other words, $x_i$ is an application of $x^*$. Let $x_1^*$ denote the group of all such $x_i$, and $x_0^*$ the group of the remaining prescriptions. The Effectiveness Value, EV, is defined as

$$EV = \frac{\bar{y}_1 - \bar{y}_0}{s}$$

If $y_i$ is continuous, $\bar{y}_1$ and $\bar{y}_0$ represent the average effectiveness of the groups $x_1^*$ and $x_0^*$ respectively. If $y_i$ is binary, $\bar{y}_1$ and $\bar{y}_0$ represent the effective proportions of the groups $x_1^*$ and $x_0^*$ respectively. $s$ represents the joint standard deviation. The larger the EV, the better the effectiveness. Furthermore, we should consider the minimum number of herbs contained in $x^*$.
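A minimal sketch of the EV computation follows. It assumes that the "joint standard deviation" is the usual pooled sample standard deviation of the two groups; the text does not spell this out, so that choice, the function name and the toy data are assumptions.

```python
import math
from statistics import mean, variance
from typing import FrozenSet, List, Optional, Sequence


def effectiveness_value(
    cef: FrozenSet[str],
    prescriptions: Sequence[FrozenSet[str]],
    outcomes: Sequence[float],
    min_confidence: float = 1.0,
) -> Optional[float]:
    """(mean of CEF group - mean of other group) / pooled standard deviation.

    Returns None when a group has fewer than two records or the pooled
    variance is zero, because the ratio is then undefined.
    """
    group1: List[float] = []  # prescriptions that apply the CEF
    group0: List[float] = []  # all remaining prescriptions
    for herbs, y in zip(prescriptions, outcomes):
        conf = len(cef & herbs) / len(cef)
        (group1 if conf >= min_confidence else group0).append(y)
    if len(group1) < 2 or len(group0) < 2:
        return None
    # Assumed definition of the joint standard deviation: pooled sample variance.
    pooled_var = (
        (len(group1) - 1) * variance(group1) + (len(group0) - 1) * variance(group0)
    ) / (len(group1) + len(group0) - 2)
    if pooled_var == 0:
        return None
    return (mean(group1) - mean(group0)) / math.sqrt(pooled_var)


# Toy data: three prescriptions contain the whole CEF {A, B}, four do not.
toy_cef = frozenset({"A", "B"})
toy_prescriptions = [
    frozenset({"A", "B", "C"}), frozenset({"A", "B"}), frozenset({"A", "B", "D"}),
    frozenset({"A"}), frozenset({"C"}), frozenset({"B", "D"}), frozenset({"D"}),
]
toy_outcomes = [1, 1, 0, 0, 0, 1, 0]
print(effectiveness_value(toy_cef, toy_prescriptions, toy_outcomes))
```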

3.2 Constructing and solving a model for the problem

Constructing and solving a model for combinatorial optimization is a difficult task; in general, we start with a realistic, feasible solution and then optimize it iteratively. As a computational model of evolutionary processes, GA can solve non-parametric combinatorial optimization problems and, in contrast to most other algorithms, which find one solution at a time, it can find multiple Pareto-optimal solutions in parallel. This is compatible with TCM treatment, in which multiple formulae are applicable to a set of symptoms, i.e. it exhibits equifinality. The concept of equifinality refers to many alternative ways of attaining the same objective. Using the definitions in 3.1, the sequence of GA steps for the discovery of CEF is shown in Figure 3:

Figure 3. Flow chart of GA and explanations of the sequence of GA steps for the discovery of CEF

Step 1: Encoding and initial population.

The herb combination to be optimized is represented by a chromosome in which each herb is encoded as a binary gene according to the original herb space. Since there were 230 distinct herbs, the chromosome was a string of 230 binary characters, "0" or "1", describing a prescription. A population, consisting of a given number of chromosomes, was initially created by randomly assigning "1" to each gene with probability Pi. A "1" in a gene meant that the corresponding herb was used in the prescription; a "0" meant that it was not.
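The encoding step can be sketched as below. The chromosome length (230) and the selection probability Nset/230 come from the paper; the function names, the herb list passed to `decode`, and the use of a seeded `random.Random` are illustrative assumptions.

```python
import random
from typing import List, Sequence, Set

N_HERBS = 230  # number of distinct herbs in the dataset


def random_chromosome(p_select: float, rng: random.Random) -> List[int]:
    """One binary gene per herb; a gene is 1 with probability p_select."""
    return [1 if rng.random() < p_select else 0 for _ in range(N_HERBS)]


def initial_population(size: int, n_set: int, rng: random.Random) -> List[List[int]]:
    """Population in which each chromosome selects about n_set herbs on average."""
    p_select = n_set / N_HERBS
    return [random_chromosome(p_select, rng) for _ in range(size)]


def decode(chromosome: Sequence[int], herb_names: Sequence[str]) -> Set[str]:
    """Translate a chromosome back into the set of herbs it prescribes."""
    return {name for name, gene in zip(herb_names, chromosome) if gene == 1}


rng = random.Random(0)
population = initial_population(size=1000, n_set=9, rng=rng)
print(sum(map(sum, population)) / len(population))  # average herbs per chromosome
```

Because each gene is drawn independently, the number of herbs per chromosome varies around Nset rather than being fixed at it.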

Step 2: The design of the fitness function (objective function).

A crucial point in using GA is the design of the fitness function, which determines what the GA should optimize. The goal of this study is to find CEF, i.e. small subsets of herbs that are frequently used and most significant for effectiveness. Fitness was measured by the two criteria of CEF: core-ness, represented by the average confidence, the support S under a given confidence, and N (the minimum number of herbs contained in a CEF), and effectiveness, represented by EV. N is typically decided by TCM theory, while the choice of S depends on how representative and how frequently used the CEF are required to be. An important characteristic of GA is the way it deals with infeasible solutions (unsatisfactory CEF), since recombining solutions can produce infeasible offspring. The most general and simple way is to reject infeasible solutions. However, penalizing infeasible solutions in the fitness function, which measures the quality of a solution, is more appropriate, and this is the approach taken in our research. Hence, the fitness function, f, is defined as follows:

where Sset and Nset are the pre-defined values of the support under a certain confidence and of the minimum number of herbs contained in x* (mentioned above). R is a penalty constant, which is used to penalize infeasible solutions. The evaluation of fitness thus started with a randomly generated prescription composed of all the herbs whose genes were coded as "1". The prescription's core-ness and effectiveness were then evaluated by the average confidence, S, N, and EV. Finally, the fitness was measured by the fitness function f. Under f, prescriptions whose core-ness meets the requirements and whose EV is high have a higher probability of surviving.

Step 3: Design of the GA operators.

After evaluating the fitness of the population, chromosomes were selected by tournament selection, which involves running several "tournaments" among a few chromosomes chosen at random from the population. The winner of each tournament (the one with the best fitness) was more likely to be selected. Child chromosomes were then created from parent chromosomes by a multi-point crossover operator. After that, the chromosomes were mutated with a three-way swap of three randomly chosen genes, which could introduce new chromosomes into the search space and sometimes lead to better results. Crossover alone tends to converge to a local optimum, whereas mutation allows new and better optima to be found.
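The three operators can be sketched as follows. This is a simplified illustration: the tournament here always returns the fittest contender (the paper only says the winner is "more likely" to be selected), the number of crossover points is a free parameter, and the three-way swap is implemented as a rotation of the values at three random positions. All names are assumptions.

```python
import random
from typing import List, Sequence, Tuple

Chromosome = List[int]


def tournament_select(population: Sequence[Chromosome], fitnesses: Sequence[float],
                      size: int, rng: random.Random) -> Chromosome:
    """Pick `size` random contenders and return the fittest one."""
    contenders = rng.sample(range(len(population)), size)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]


def multi_point_crossover(a: Chromosome, b: Chromosome, n_points: int,
                          rng: random.Random) -> Tuple[Chromosome, Chromosome]:
    """Exchange alternating segments of two parents at n_points random cuts."""
    cuts = sorted(rng.sample(range(1, len(a)), n_points))
    child1: Chromosome = []
    child2: Chromosome = []
    swap, start = False, 0
    for cut in cuts + [len(a)]:
        seg_a, seg_b = a[start:cut], b[start:cut]
        if swap:
            seg_a, seg_b = seg_b, seg_a
        child1 += seg_a
        child2 += seg_b
        swap, start = not swap, cut
    return child1, child2


def three_way_swap(chromosome: Chromosome, rng: random.Random) -> Chromosome:
    """Rotate the values of three randomly chosen genes."""
    i, j, k = rng.sample(range(len(chromosome)), 3)
    mutated = list(chromosome)
    mutated[i], mutated[j], mutated[k] = chromosome[k], chromosome[i], chromosome[j]
    return mutated


rng = random.Random(1)
c1, c2 = multi_point_crossover([0] * 10, [1] * 10, n_points=2, rng=rng)
print(c1, c2)
print(three_way_swap([1, 0, 0, 1, 0, 1, 0, 0], rng))
```

Note that a swap-style mutation keeps the number of selected herbs in a chromosome unchanged; only crossover can change it.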

Step 4: Termination condition.

GA is an iterative search method that approaches the optimal region but may not reach the optimal solution, so a termination condition is needed. Here, we terminated the GA process after a predefined number of generations. The chromosomes of the last generation with the highest value of f were considered to be the CEF candidates.

3.3 Validating the obtained solutions

After finding optimal or near-optimal solutions (prescriptions), we had to evaluate them. Based on the meaning of CEF, solutions were evaluated on both core-ness and effectiveness. In this study, confidence and support serve as proxies for the core-ness of a formula: the average confidence and the support S of a solution, as defined in 3.1, were used to evaluate its core-ness. Generally speaking, when S is calculated, the confidence should be at least 0.7 according to TCM expert practitioners; otherwise the representativeness of the CEF may be undermined. The greater these two values are, the better the core-ness of a solution is; that is, its constituent herbs are more widely used in the prescriptions and the formula is more frequently used. As for the evaluation of effectiveness, the dataset can be divided into two groups, namely the CEF group and the non-CEF group. By the definition in 3.1, the CEF group contains all the records whose prescriptions carry at least the preset proportion (confidence) of the herbs in the specific solution, while the non-CEF group contains the remaining records. A Z-test for the difference between the two effective proportions (EP) was then carried out at the 5% significance level (p<0.05).

4. Results

4.1. GA results

The basic parameters of the GA for the experiments are shown in Table 3. In order to obtain CEF with the highest confidence, the confidence was set to 1. The other parameters were set as follows: R (the penalty constant) was set to 200, Nset ranged from 8 to 11, the initial number of selected herbs was set to Nset, and S1 started at 2% and was increased in steps of 1% until it reached 10%.

Table 3. Parameters for GA

Parameter | Value
Population size | 1000
Initial herb selection probability (Pi) | Nset/230
Crossover probability | 0.7
Tournament selection size | 15
Generation | 200

The results in the following discussion were averaged over three executions using the same parameters. The heat map in Figure 4 shows the sensitivity of the fitness values to different settings of Sset and Nset. Fitness values are mostly positive when Nset is 8 or 9 and S1 is in the range of 2% to 6%. After removing duplicates, there were 9 CEF, i.e. solutions with positive fitness values. These 9 CEF are composed of 15 distinct herbs (Table 4 and Table 5). The maximum number of herbs in a CEF is 11, which is much smaller than the average number of herbs per record in the dataset (23, Table 1). Several herbs are common to at least 6 CEF, namely AF, AO, AR, DS, PC, PT, RB and TP. These common herbs are related to nourishing Yin, regulating Qi, and strengthening the spleen function, which is generally consistent with the TCM principle in LC treatment.

Figure 4. Fitness value by GA with different N and S1 combination

Table 4. Herb abbreviations

Abbreviation | Herb
AR | Astragalus Root
AF | Akebia Fruit
AO | Atractylis Ovata
CD | Chinese Date
CS | Chinese Sage Herb
CSE | Coix Seed
DS | Doederlein's Spikemoss Herb
HO | Herba Oldenlandiae
MA | Malt
PC | Pachyma Cocos
PL | Pyrrosia Leaf
PT | Pinellia Tuber
RB | Rhizoma Batatatis
RS | Rice-grain Sprout
TP | Tangerine Peel

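Table 5. CEF obtained by GA (X: herb included in the composition; -: herb not included)

No. | Number of herbs | AR | AF | AO | CD | CS | CSE | DS | HO | MA | PC | PL | PT | RB | RS | TP
1 | 10 | - | X | X | - | - | X | - | - | X | X | X | X | X | X | X
2 | 9 | X | X | X | - | X | - | X | X | X | - | - | X | X | - | -
3 | 8 | X | X | X | - | - | - | - | - | - | X | - | X | X | X | X
4 | 9 | X | X | X | - | - | - | X | X | - | X | - | X | X | - | X
5 | 8 | - | X | X | - | X | - | X | - | - | X | - | X | X | - | X
6 | 10 | - | - | X | X | X | X | X | X | - | X | - | X | X | - | X
7 | 9 | X | X | X | - | X | - | X | - | X | - | - | X | X | - | X
8 | 11 | X | X | X | - | X | - | X | X | - | X | - | X | X | X | X
9 | 10 | X | X | X | - | - | - | X | - | X | X | - | X | X | X | X
Number of CEF containing the herb | | 6 | 8 | 9 | 1 | 5 | 2 | 7 | 4 | 4 | 7 | 1 | 9 | 9 | 4 | 8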

4.2. Evaluation

4.2.1. Core-ness

A herb-herb network was constructed using a co-occurrence frequency-based method. The degree of a node (herb) was defined as the number of other nodes (herbs) that it connects to; it is a simple but important property of any complex network. A node plays a more significant role if it has a higher degree. The importance of a herb was studied according to its degree and its frequency in the dataset. These values were sorted in descending order and are shown in Table 6. Among the 230 herbs in the dataset, the 15 herbs that make up the CEF are all ranked in the top 50 in terms of degree and of frequency based on both records and patients. In other words, this is a good indication that the 15 herbs in the CEF are core herbs.
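The degree computation on a co-occurrence network can be sketched as below. The paper does not state how many co-occurrences are required before two herbs are linked, so the `min_cooccurrence` threshold (defaulting to 1) is an assumption, as are the function name and the toy herbs.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def herb_degrees(prescriptions: Iterable[Set[str]],
                 min_cooccurrence: int = 1) -> Dict[str, int]:
    """Degree of each herb in the co-occurrence network.

    Two herbs are linked when they appear together in at least
    `min_cooccurrence` prescriptions (assumed linking rule).
    """
    pair_counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for herbs in prescriptions:
        for a, b in combinations(sorted(herbs), 2):
            pair_counts[(a, b)] += 1

    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for (a, b), count in pair_counts.items():
        if count >= min_cooccurrence:
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {herb: len(linked) for herb, linked in neighbours.items()}


toy = [{"A", "B", "C"}, {"A", "B"}, {"A", "D"}]
print(herb_degrees(toy))                      # every co-occurrence creates a link
print(herb_degrees(toy, min_cooccurrence=2))  # only the pair seen twice remains
```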

Table 6. Core herb identification

Herb | Degree | Degree rank | Frequency (record based) | Frequency rank (record based) | Frequency (patient based) | Frequency rank (patient based)
DS | 225 | 1 | 393 | 2 | 146 | 2
CS | 225 | 2 | 395 | 1 | 147 | 1
AF | 223 | 3 | 359 | 3 | 133 | 3
AO | 220 | 4 | 321 | 5 | 127 | 5
HO | 219 | 5 | 332 | 4 | 129 | 4
PC | 207 | 6 | 266 | 8 | 107 | 7
AR | 198 | 9 | 263 | 9 | 116 | 6
CSE | 197 | 10 | 223 | 13 | 96 | 13
RB | 194 | 11 | 235 | 12 | 97 | 12
RS | 191 | 12 | 268 | 6 | 106 | 9
MA | 191 | 13 | 268 | 7 | 106 | 10
TP | 184 | 16 | 195 | 14 | 80 | 16
CD | 158 | 30 | 120 | 23 | 53 | 28
PT | 152 | 34 | 99 | 29 | 51 | 29
PL | 127 | 47 | 65 | 41 | 31 | 43

The average confidence of each CEF and its support under different confidences (S) were calculated in order to evaluate the core-ness of the CEF. In order to evaluate the correlation within individual patients, patient based support (PBS) was also calculated for each CEF (Table 7). The values of the average confidence and of S are all relatively high. In particular, the average confidence of the second CEF (CEF2) exceeds 0.7, which means that, on average, the prescriptions in the dataset contain more than 70% of the herbs of CEF2. The S0.7 values of both CEF8 and CEF9 exceed 0.5, which means that more than 50% of the prescriptions contain 70% or more of the herbs of these two CEF. As for PBS, a CEF is of little value if its PBS is small, i.e. if its use is concentrated in a minority of patients. The results show that every PBS is larger than the corresponding S1 (record based support), which indicates that the use of no CEF is concentrated at the patient level.

Table 7. Confidence and support of CEF

No. | Average confidence | S0.7 | S0.8 | S0.9 | S1 | PBS
1 | 0.549 | 0.396 | 0.248 | 0.084 | 0.021 | 0.040
2 | 0.707 | 0.499 | 0.239 | 0.041 | 0.041 | 0.067
3 | 0.598 | 0.375 | 0.198 | 0.043 | 0.043 | 0.073
4 | 0.653 | 0.356 | 0.191 | 0.050 | 0.050 | 0.073
5 | 0.675 | 0.489 | 0.294 | 0.084 | 0.084 | 0.120
6 | 0.616 | 0.461 | 0.265 | 0.122 | 0.036 | 0.060
7 | 0.670 | 0.406 | 0.196 | 0.041 | 0.041 | 0.067
8 | 0.678 | 0.501 | 0.310 | 0.167 | 0.041 | 0.067
9 | 0.637 | 0.511 | 0.332 | 0.181 | 0.041 | 0.067

4.2.2. Effectiveness

To test the effectiveness of the CEF, the dataset was divided into two groups, namely the CEF group and the non-CEF group. In this study, the confidence was set to 1. In other words, all the prescriptions in the CEF group carried all the herbs of the specific CEF, while the prescriptions in the non-CEF group did not have the full set of herbs from the CEF. The Z-test for the difference between two effective proportions (EP) was performed for each CEF. Table 8 shows that the EP of every CEF group is significantly better than that of the corresponding non-CEF group.

Table 8. Z-test for the difference of EP for CEF

No. | EP of Non-CEF Group | EP of CEF Group | p value
1 | 0.210 | 0.778 | 0.000
2 | 0.206 | 0.588 | 0.002
3 | 0.204 | 0.611 | 0.000
4 | 0.206 | 0.524 | 0.004
5 | 0.203 | 0.429 | 0.009
6 | 0.210 | 0.533 | 0.013
7 | 0.206 | 0.588 | 0.002
8 | 0.206 | 0.588 | 0.002
9 | 0.206 | 0.588 | 0.002

Sampling is a simple and well-known method for parameter studies and robustness evaluations. To test the robustness of the effectiveness results, a leave-one-(patient)-out analysis was performed in this study. After removing one patient from the original data, the effectiveness of each CEF was remeasured on the remaining patients. This was repeated so that each patient in the data was removed once. The EP and the p value of the Z-test were calculated for each CEF each time. The results are shown in Table 9. The p value was transformed into -log(p) (base 10), so that a -log(p) larger than 1.301 indicates a p value smaller than 0.05. Table 9 shows that the EP of both groups changes little between the original and the perturbed data and that all -log(p) values exceed 1.33, which shows good robustness of the effectiveness evaluation under a small perturbation in the sample (patient level) space.
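The leave-one-patient-out procedure can be sketched generically: every patient is removed in turn and an evaluation function is re-run on the remaining records. The record layout, the function names and the toy data are assumptions; in the study the evaluation would return the EP of both groups together with the Z-test p value.

```python
import math
from typing import Callable, Dict, List, Tuple, TypeVar

Record = Tuple[int, bool, int]  # (patient id, in CEF group?, binary outcome)
T = TypeVar("T")


def effective_proportions(records: List[Record]) -> Tuple[float, float]:
    """EP of the CEF group and of the non-CEF group (both must be non-empty)."""
    cef = [y for _, in_cef, y in records if in_cef]
    non = [y for _, in_cef, y in records if not in_cef]
    return sum(cef) / len(cef), sum(non) / len(non)


def leave_one_patient_out(records: List[Record],
                          evaluate: Callable[[List[Record]], T]) -> Dict[int, T]:
    """Re-run `evaluate` once per patient, with that patient's records removed."""
    patients = sorted({pid for pid, _, _ in records})
    return {pid: evaluate([r for r in records if r[0] != pid]) for pid in patients}


def neg_log10(p_value: float) -> float:
    """Transform a p value so that larger means more significant."""
    return -math.log10(p_value)


toy_records: List[Record] = [
    (1, True, 1), (1, True, 0), (2, True, 1), (2, False, 0),
    (3, False, 0), (3, False, 1), (4, False, 0),
]
for pid, eps in leave_one_patient_out(toy_records, effective_proportions).items():
    print(pid, eps)
print(neg_log10(0.05))  # the significance threshold on the -log(p) scale
```

Removing whole patients rather than single records keeps all visits of a patient together, so the perturbation respects the correlation between a patient's records.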

Table 9. Leave one (patient) out analysis to test the robustness of effectiveness (Total 150 times)

No. | EP of Non-CEF Group (mean) | EP of Non-CEF Group (range) | EP of CEF Group (mean) | EP of CEF Group (range) | -Log(p) (mean) | -Log(p) (range)
1 | 0.209 | [0.204,0.213] | 0.778 | [0.750,0.857] | 4.215 | [3.308,5.867]
2 | 0.206 | [0.201,0.210] | 0.589 | [0.563,0.643] | 2.764 | [2.314,3.194]
3 | 0.204 | [0.199,0.208] | 0.612 | [0.588,0.667] | 3.273 | [2.791,3.779]
4 | 0.206 | [0.201,0.209] | 0.524 | [0.500,0.579] | 2.360 | [1.990,2.936]
5 | 0.203 | [0.197,0.206] | 0.429 | [0.412,0.455] | 2.036 | [1.757,2.337]
6 | 0.210 | [0.205,0.214] | 0.533 | [0.500,0.583] | 1.859 | [1.335,2.160]
7 | 0.206 | [0.201,0.210] | 0.589 | [0.563,0.643] | 2.764 | [2.314,3.194]
8 | 0.206 | [0.201,0.210] | 0.589 | [0.563,0.643] | 2.764 | [2.314,3.194]
9 | 0.206 | [0.201,0.210] | 0.589 | [0.563,0.643] | 2.764 | [2.314,3.194]

4.3. Assumption analysis

The GA process generated 9 CEF and 15 core herbs. Since the number of distinct herbs across all the CEF was relatively small, we were interested in whether a formula consisting of all 15 core herbs exists and, if so, how effective it is. Such a combination of herbs was found in the dataset, and its core-ness and effectiveness were evaluated (Table 10). Although its average confidence and EP values were relatively high and may be acceptable, its S1 value was only 0.009, which means that only 4 records covered this combination. This support is too small for the combination to be considered a core formula, but a clinical trial is still worthwhile in the future because of its high effectiveness.

Table 10. Core-ness and effectiveness evaluation of 15 core herbs combination



Confidence   S1      EP of Non-CEF Group   EP of CEF Group   p value
0.605        0.009   0.217                 0.750             0.014

4.4. Analysis of herb-herb interactions in CEF

A herb combination is chosen to promote desirable herb-herb interactions; the efficacy of a TCM formula comes from the synergistic effects of its constituent herb pairs. Practitioners are therefore interested in identifying potentially interacting herbs in a prescription. Based on previous work on the analysis of herb-herb interactions in CEF, the synergy index (SI) was calculated for each herb pair in the CEF:

$$SI = \frac{E_{11}}{\max(E_{01}, E_{10})}$$

where E11 denotes the EP value when the two herbs co-occur, E01 and E10 are the EP values of each herb used without the other, and max(E01, E10) is the larger of these two values. An SI equal to 1 indicates no real advantage in putting the two herbs together. An SI greater than 1 shows potential synergy, and the larger the SI, the stronger the indicated synergistic interaction between the two herbs in the pair. Figure 5 shows the distribution of SI over all core herb pairs. Although most SI values are close to 1, the distribution is skewed towards the positive side (greater than 1), which indicates the existence of some potential synergies. All SI values are greater than 0.9, which implies no obvious antagonistic effect among the core herb pairs. A permutation test was performed to assess the significance of SI by permuting the outcome variable 2000 times. As a result, 4 core herb pairs with significantly synergistic effects were obtained (Table 11). Table 11 shows that most of these pairs are related to the functions of Regulating Qi to Promote Diuresis and Eliminating Dampness to Eliminate Phlegm according to TCM theory.

Figure 5. Distribution of SI

Table 11. Analysis of herb-herb interactions in CEF

No.   Herb Pair   SI      p value
1     PL, PT      1.673   0.004
2     CD, PT      1.419   0.012
3     CSE, PL     1.363   0.028
4     PT, TP      1.077   0.025
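The SI calculation and its permutation test can be sketched as follows. The SI is taken here as E11 divided by max(E01, E10), as described in Section 4.4. The data layout, the herb labels "A" and "B", the toy outcomes, and the one-sided p value with a +1 correction are assumptions made for illustration, not details confirmed by the paper.

```python
import random

def effective_proportion(outcomes):
    """EP: share of records with an effective outcome (1 = effective)."""
    return sum(outcomes) / len(outcomes)

def synergy_index(herb_sets, outcomes, herb_a, herb_b):
    """SI = E11 / max(E01, E10); returns None when it is undefined."""
    both, only_a, only_b = [], [], []
    for herbs, outcome in zip(herb_sets, outcomes):
        has_a, has_b = herb_a in herbs, herb_b in herbs
        if has_a and has_b:
            both.append(outcome)
        elif has_a:
            only_a.append(outcome)
        elif has_b:
            only_b.append(outcome)
    if not (both and only_a and only_b):
        return None
    denom = max(effective_proportion(only_a), effective_proportion(only_b))
    if denom == 0:
        return None
    return effective_proportion(both) / denom

def permutation_p_value(herb_sets, outcomes, herb_a, herb_b, n_perm=2000, seed=0):
    """Permute the outcome variable and count how often SI reaches the observed value."""
    observed = synergy_index(herb_sets, outcomes, herb_a, herb_b)
    if observed is None:
        return None, None
    rng = random.Random(seed)
    shuffled = list(outcomes)
    valid = exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        si = synergy_index(herb_sets, shuffled, herb_a, herb_b)
        if si is None:
            continue  # denominator was zero in this permutation
        valid += 1
        exceed += si >= observed
    return observed, (exceed + 1) / (valid + 1)

# Toy data for illustration only: 4 records per usage pattern.
herb_sets = [{"A", "B"}] * 4 + [{"A"}] * 4 + [{"B"}] * 4
outcomes = [1, 1, 1, 0,  1, 0, 0, 0,  1, 1, 0, 0]
si, p_value = permutation_p_value(herb_sets, outcomes, "A", "B")
print(si, p_value)
```

In the toy data, E11 = 0.75, E10 = 0.25 and E01 = 0.5, so the SI is 0.75 / 0.5 = 1.5.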

5. Discussion

Prescribing for a diagnosis is a complicated and flexible procedure that integrates knowledge of TCM theory. TCM practitioners place heavy emphasis on individuality when prescribing formulae in clinical practice. This is very different from modern Western medical therapies, which usually comply with common, operational clinical guidelines. Revealing the regularity in prescriptions is an important step towards revealing the underpinning TCM theory, and discovering such regularity has generated much research interest. Although computational models have been applied to reveal core herbs and herb-collaboration patterns, little effort has been expended on studying their effectiveness. Discovering hidden patterns that are both core and effective herbal formulae is therefore important research.

Mathematically, the discovery of CEF can be described as a complicated combinatorial optimization problem, concerned with efficiently combining herbs to meet given requirements. The purpose of this study is to set the stage and outline the properties of optimization problems that are relevant to the discovery of CEF in TCM. We described how to define a problem model that can be solved by the GA method. In brief, the analytic process consisted of recognizing and defining the problem, constructing and solving the model, and evaluating the solutions. Furthermore, we looked at important properties of CEF that can be used as validation criteria. For CEF, there are two key questions to be answered: how to evaluate the core-ness of a TCM formula, and how to assess its clinical effectiveness.

In this study, confidence and support serve as proxies for the core-ness of a formula: the greater these two values are, the more widely the constituent herbs are used in prescriptions and the more frequently the formula is used. The confidence and the support (S) of a given CEF are defined accordingly. It is quite common for a TCM practitioner to pick a subset of formulae, which are CEF, as templates and personalize them for a patient. Starting from the selected template(s), the practitioner can add, remove or replace herbs. The confidence value explains well this flexible usage of the CEF and the personalised adaptation in practice.

Regarding the assessment of clinical effectiveness, the primary outcome measurement in our study quantified changes in symptoms during cancer treatment. In an internal panel meeting of TCM cancer experts, the most common LC symptoms were identified, and they were consistent with the literature. Our results showed that the total proportion of symptom improvement was only 22.19%, which indicates that LC treatment remains a great challenge for TCM practitioners. Moreover, frequently used herb combinations (CEF candidates) are of no use if they do not have high efficacy.

GA is able to solve combinatorial optimization problems, as has been widely reported in the literature. A basic GA has the following implementation steps. First, the feature values are encoded into chromosomes to form the initial population. Second, the fitness of every chromosome is calculated using the defined fitness function. Third, according to the fitness values, genetic operators are applied to select chromosomes to form a new population. This process repeats until a certain condition is satisfied. In our previous work, GA successfully helped us find meaningful relationships between herbs and symptoms once a proper fitness function had been designed. It is our belief that the usefulness of GA for other combinatorial optimization problems in TCM cannot be fairly assessed on the basis of its performance on the discovery of herb-symptom relationships alone.
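The basic GA steps described above can be sketched as follows. This is a generic illustration, not the Sheffield MATLAB toolbox used in the study: the binary encoding (one bit per herb), tournament selection, single-point crossover, bit-flip mutation, elitism, all parameter values and the toy fitness function are assumptions chosen for brevity.

```python
import random

def run_ga(fitness, n_genes, pop_size=30, generations=40,
           p_cross=0.7, p_mut=0.02, seed=0):
    """Minimal binary GA; returns the best chromosome and the best fitness per generation."""
    rng = random.Random(seed)
    # Step 1: encode candidates as bit strings to form the initial population.
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Step 2: evaluate the fitness of every chromosome.
        elite = max(pop, key=fitness)
        history.append(fitness(elite))
        # Step 3: apply genetic operators to build the new population.
        new_pop = [elite[:]]  # elitism: carry the best chromosome over unchanged
        while len(new_pop) < pop_size:
            parent_a = max(rng.sample(pop, 2), key=fitness)  # tournament of size 2
            parent_b = max(rng.sample(pop, 2), key=fitness)
            child = parent_a[:]
            if rng.random() < p_cross:
                cut = rng.randint(1, n_genes - 1)  # single-point crossover
                child = parent_a[:cut] + parent_b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]  # mutation
            new_pop.append(child)
        pop = new_pop
    # Stopping condition here is simply a fixed number of generations.
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history

# Toy fitness for illustration only: number of selected genes (herbs).
best, history = run_ga(sum, n_genes=15)
print(best, history[-1])
```

Because the best chromosome is copied unchanged into each new population, the best fitness recorded in `history` never decreases from one generation to the next.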

In this study, we gave an outline description of how a genetic algorithm works. A crucial point in using a GA is the design of the fitness function, which determines what the GA optimizes. We designed the fitness function on the basis of two evaluation criteria for CEF: core-ness, which is represented by the confidence and support defined in the present paper, and effectiveness, which is evaluated by the statistical difference in effective proportion between the CEF group and the non-CEF group. The proposed fitness function is flexible and suitable for both binary and continuous outcomes. Applying a penalty constant R in the fitness function is the strategy for removing unsatisfactory CEF. This constant can be set to a value greater than the maximum value of EV so that only CEF meeting the requirements are identified: a CEF is dropped if its fitness value is negative (not meeting expectations) and kept otherwise.
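The role of the penalty constant R can be illustrated with a small sketch. This is one plausible reading of the description above, not the fitness function defined in the paper: the function name, the threshold parameters `min_confidence` and `min_support`, and the subtraction form are all hypothetical.

```python
def penalized_fitness(ev, confidence, support,
                      min_confidence, min_support, penalty_r):
    """Hypothetical penalty scheme: subtract R when core-ness requirements fail.

    ev is the effectiveness value of the candidate formula. If penalty_r is
    larger than the maximum attainable ev, every failing candidate receives
    a negative fitness and can be dropped.
    """
    meets_requirements = confidence >= min_confidence and support >= min_support
    return ev if meets_requirements else ev - penalty_r

# Illustrative values only (not taken from the study).
kept = penalized_fitness(2.0, 0.8, 0.05, 0.6, 0.02, penalty_r=10.0)
dropped = penalized_fitness(2.0, 0.8, 0.01, 0.6, 0.02, penalty_r=10.0)
print(kept, dropped)
```

With R above the largest possible EV, the sign of the fitness alone separates candidates that meet the requirements from those that do not.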

Parameter tuning is always a challenging task for GA. The GA toolbox for Matlab developed by the University of Sheffield was used in these experiments, and the results were acceptable with the default GA parameters. The additional key parameters in our approach are N and S. In this study, α was set to 1 and S1 started from 2%. When S1 was high or N became large, few CEF had positive fitness values. A small but reasonable number of CEF were reported once proper values were set for S and N.

In particular, for multiple-record data, which can also be regarded as longitudinal data, there are three types of correlation effect: (1) correlation between variables (herbs), (2) correlation within an individual (patient), and (3) correlation between individuals (patients).

In research on TCM formulae, the first type can be seen as the herb-herb relationship. Such relationships are meaningful patterns of herb combination and have prompted many researchers to develop methods to uncover the underlying rules. For this purpose, association rule algorithms based on support and confidence are generally used. Motivated by the idea of association algorithms, we presented support- and confidence-based criteria (the confidence and Sα) to evaluate the core-ness of a herb combination. When α is equal to 1, Sα is effectively the concept of support as commonly used in association mining algorithms such as the Apriori algorithm.

The second correlation is hard to handle and may undermine the evaluation of a herb combination. For example, when a CEF is used for only one patient who visits frequently, its support may be relatively high because of the large number of visits, yet such a CEF is meaningless. This disadvantage can be reduced by choosing a large sample size, and individual (patient) based support analysis can help identify correlation within a patient. In this paper, we reported support based on the patient and carried out a robustness analysis of the CEF's effectiveness by the leave-one-(patient)-out method. The results showed no concentrated use of any CEF at the patient level, and the good robustness implied that the effectiveness evaluation was stable under a small perturbation in the sample (patient-level) space. This meant that within-patient correlation did not undermine our evaluation of the effectiveness of the CEF in this study and that our sample size was appropriate for discovering reliable solutions.
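The difference between record-level and patient-level support can be made concrete with a short sketch. Record-level support is the classic association-mining support. The patient-level definition used here (the share of patients with at least one record covering the herb set) is an assumption for illustration, as are the data layout and the toy records.

```python
def record_support(records, herbs):
    """Share of records (visits) whose prescription contains every herb in the set."""
    herbs = frozenset(herbs)
    hits = sum(1 for _, prescribed in records if herbs <= set(prescribed))
    return hits / len(records)

def patient_support(records, herbs):
    """Share of patients with at least one record covering the herb set (assumed definition)."""
    herbs = frozenset(herbs)
    patients = {patient for patient, _ in records}
    covered = {patient for patient, prescribed in records if herbs <= set(prescribed)}
    return len(covered) / len(patients)

# Toy data for illustration only: one frequent visitor always receives {A, B}.
records = [
    ("p1", {"A", "B", "C"}), ("p1", {"A", "B"}),
    ("p1", {"A", "B", "D"}), ("p1", {"A", "B"}),
    ("p2", {"A", "C"}), ("p3", {"B", "D"}), ("p4", {"C", "D"}),
]
print(record_support(records, {"A", "B"}))   # 4 of 7 records
print(patient_support(records, {"A", "B"}))  # 1 of 4 patients
```

Here the combination looks common at the record level but is used for a single patient, which is the situation the paragraph above warns about.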

The last correlation is related to individual factors such as age, gender, pathology, family history, pulmonary function and TCM syndrome. To reveal the relationship between patient patterns and CEF, another mathematical pattern recognition model needs to be established, which will be part of our future work.

A total of 9 CEF with good core-ness and high effectiveness were reported. When the EP value of each core herb used on its own was calculated, the maximum value was 31.3%, which was significantly lower than that of any combination (CEF). These results highlight the advantages and rationality of the combined use of herbs in TCM.

In TCM theory, deficiency is an important cause and pathogenesis in the occurrence and development of tumors. Lack of vital Qi and deficiency of both Qi and Spleen can lead to a series of pathological changes, such as Qi stagnation, blood stasis, dampness and phlegm, and eventually lead to a tumor. For that reason, the TCM therapeutic principles for LC mostly relate to Benefiting Qi, Nourishing Yin, Invigorating Spleen, Eliminating Dampness and Eliminating Phlegm, depending on the syndromes and symptoms, and these principles are reflected in the functions of the 9 CEF.

One interesting observation is the similarity among the CEF, which can help in understanding the underlying TCM therapeutic principles for LC. Since it is fairly common for doctors in the same hospital to use similar sets of herbs for the same disease (LC), it is necessary and beneficial to compare these CEF with those obtained from an LC dataset from another hospital. It is also worthwhile to observe which CEF are discovered if a larger dataset with higher supports is used.

The herb-herb interactions in CEF were also studied and reported. Four herb pairs had high and significant SI values, indicating that they were synergistic. Some of them are present in classic TCM formulae. For example, PT and TP are in BanxiaChenpi Tang, where they contribute to relieving cough and reducing sputum.

Therefore, all the results conformed to TCM theory, which indicated the feasibility and validity of the proposal. However, dosages, which are a key aspect of CEF, were not considered in this work and should be addressed in future work. Because a GA can represent its chromosomes as real numbers, a reformulation of the fitness function can accommodate this change. A mathematical model of dose-effect needs to be defined. This may complicate the definition of the fitness function, but the valuable results will make the effort worthwhile.

6. Conclusions

After the confidence, support and effectiveness value of a CEF were defined, GA was used to discover the CEF from a TCM cancer clinical dataset. The results indicated that GA is suitable for the discovery of CEF that can be interpreted according to TCM principles. This is just an attempt at, and an exploration of, data mining to discover CEF from TCM clinical data. More work is still required to explore the strengths, limitations and appropriateness of the measures and whether they are relevant to other types of disease.

Availability

The GA source code in MATLAB is freely available under the GNU General Public License Version 2.0 at our project website: https://code.google.com/p/longhua-ga-cef/

Acknowledgments

The authors are grateful to the anonymous reviewers and the editors for their helpful comments and suggestions, which substantially improved the quality of this paper. This study was supported by the National Natural Science Foundation of China (81173226), Clinical Study of Traditional Chinese Medicine of Malignant Tumor Base, State Administration of Traditional Chinese Medicine Chinese medicine cancer diseases key disciplines, Longhua Medical Project, and the China Scholarship Council (student No. 201206740061). There is no conflict of interest involved in this paper.


