Materials and Methods

1. Benchmark Dataset

The benchmark dataset $\mathbb{S}_{\text{Bench}}$ used in this study was taken from Verma et al. [2]. The dataset can be formulated as

$$\mathbb{S}_{\text{Bench}} = \mathbb{S}^{+} \cup \mathbb{S}^{-} \tag{1}$$

where $\mathbb{S}^{+}$ contains 252 secretory proteins of the malaria parasite, $\mathbb{S}^{-}$ contains 252 non-secretory proteins of the malaria parasite, and the symbol $\cup$ represents the union in set theory. The same benchmark dataset was also used by Zuo and Li [4]. For the reader's convenience, the sequences of the 252 secretory proteins in $\mathbb{S}^{+}$ and those in $\mathbb{S}^{-}$ are given in Supporting Information S1.

2. A Novel PseAAC Feature Vector by Incorporating Sequence Evolution Information via the Grey System Theory

To develop a powerful predictor for a protein system, one of the keys is to formulate the protein samples with an effective mathematical expression that can truly reflect their intrinsic [...]

[...]ial virulent proteins [38], predicting metalloproteinase family [39], predicting protein folding rate [40], predicting GABA(A) receptor proteins [41], predicting protein supersecondary structure [42], identifying protein quaternary structural attribute [43], predicting cyclin proteins [44], classifying amino acids [45], predicting enzyme family class [46], identifying the risk type of human papillomaviruses [47], and discriminating outer membrane proteins [48], among many others (see the long list of references cited in [49]). Because PseAAC has been so widely used, a software package called PseAAC-Builder [49] was recently proposed for generating various special modes of PseAAC, in addition to the web-server PseAAC [50] established in 2008. According to a recent review [34], the general form of PseAAC for a protein P can be formulated as

$$\mathbf{P} = \begin{bmatrix} \psi_1 & \psi_2 & \cdots & \psi_u & \cdots & \psi_\Omega \end{bmatrix}^{\mathbf{T}} \tag{2}$$

where T is the transpose operator, while the subscript $\Omega$ is an integer, and its value as well as the components $\psi_1$, $\psi_2$, … depend on how the desired information is extracted from the amino acid sequence of P. The form of Eq. 2 can cover almost all the various modes of PseAAC. In particular, it can be used to reflect core features hidden in complicated protein sequences, such as the functional domain (FunD) information [51,52,53] (cf. Eqs. 9–10 of [34]), gene ontology (GO) information [54,55] (cf. Eqs. 11–12 of [34]), and sequence evolution information [3] (cf. Eqs. 13–14 of [34]). In this study, we use a novel approach to define the $\Omega$ elements in Eq. 2.

As is well known, biology is a natural science with a historic dimension. All biological species have developed from a very limited number of ancestral species. The same is true for protein sequences [56]. Their evolution involves changes of single residues, insertions and deletions of several residues [57], gene doubling, and gene fusion. As these changes accumulate over a long period of time, many similarities between the initial and resultant amino acid sequences are gradually eliminated, but the corresponding proteins may still share many common attributes, such as having basically the same biological function and residing at the same subcellular location. To incorporate this kind of sequence evolution information into the PseAAC of Eq. 2, let us use the information of the PSSM (Position-Specific Scoring Matrix) [3], as described below.

According to [3], the sequence evolution information of protein P with L amino acid residues can be expressed by an L × 20 matrix, as given by

$$\mathbf{P}^{(0)}_{\mathrm{PSSM}} = \begin{bmatrix} m^{(0)}_{1,1} & m^{(0)}_{1,2} & \cdots & m^{(0)}_{1,20} \\ m^{(0)}_{2,1} & m^{(0)}_{2,2} & \cdots & m^{(0)}_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ m^{(0)}_{L,1} & m^{(0)}_{L,2} & \cdots & m^{(0)}_{L,20} \end{bmatrix}$$
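To make the link between a PSSM and the fixed-length vector of Eq. 2 concrete, the following Python sketch collapses a PSSM with one row per residue and 20 columns into a 20-component vector by averaging each column. This is only an illustration under stated assumptions: column averaging is a simple stand-in and is not the grey-system-based definition developed in this study, and the function name `pssm_column_means` and the toy matrices are hypothetical.

```python
from typing import List, Sequence

NUM_AMINO_ACID_TYPES = 20  # one PSSM column per native amino acid type


def pssm_column_means(pssm: Sequence[Sequence[float]]) -> List[float]:
    """Average each of the 20 columns of an L x 20 PSSM over its L rows.

    Illustrative stand-in only: it shows how proteins of different length L
    can be mapped to vectors of one fixed length, as Eq. 2 requires.
    """
    if len(pssm) == 0:
        raise ValueError("PSSM must contain at least one residue row")
    for row in pssm:
        if len(row) != NUM_AMINO_ACID_TYPES:
            raise ValueError("every PSSM row must hold 20 scores")
    length = len(pssm)
    return [sum(row[j] for row in pssm) / length
            for j in range(NUM_AMINO_ACID_TYPES)]


# Hypothetical toy matrices standing in for real PSSM scores (assumption:
# real scores would come from a profile search, which is not shown here).
toy_short = [[(i + j) % 5 - 2 for j in range(NUM_AMINO_ACID_TYPES)]
             for i in range(3)]
toy_long = [[(i * j) % 7 - 3 for j in range(NUM_AMINO_ACID_TYPES)]
            for i in range(8)]

for name, pssm in (("short", toy_short), ("long", toy_long)):
    vector = pssm_column_means(pssm)
    print(name, "L =", len(pssm), "-> vector length", len(vector))
```

Both toy proteins, despite their different lengths, yield vectors with the same number of components, which is the property a feature vector of the form in Eq. 2 needs before it can be passed to a predictor.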