# Biologically Meaningful Pathways in Compounds-Forming Systems

Given a compounds-forming system, i.e., a system consisting of some compounds and their relationships, can it form a biologically meaningful pathway? This is a fundamental problem in systems biology. Nowadays, a large amount of information on different organisms is available at both the genetic and metabolic levels, and many specific databases, such as KEGG/LIGAND, ENZYME, BRENDA, EcoCyc and MetaCyc, have been established to store this information. Based on these data, it is feasible to address such an essential problem. A metabolic pathway is one kind of compounds-forming system, and we analyzed metabolic pathways in yeast by extracting different (biological and graphic) features from each of 13,736 compounds-forming systems, of which 136 are positive pathways, i.e., known metabolic pathways from KEGG, while 13,600 are negative, i.e., not formed as a biologically meaningful pathway according to current data. Each of these compounds-forming systems was represented by 144 features, of which 88 are graph features and 56 are biological features. “Minimum Redundancy Maximum Relevance” and “Incremental Feature Selection” were used to analyze these features, and 16 optimal features were selected as being able to predict a query compounds-forming system most successfully. Jackknife cross-validation showed that the overall success rate of identifying the positive pathways was 74.26%. It is anticipated that this novel approach and encouraging result may shed light on this important topic.

Keywords: Compounds-forming system, Metabolic pathway, Minimum redundancy maximum relevance, Nearest neighbor algorithm, Jackknife cross-validation, Compound similarity, Chemical functional group

## Introduction

A compounds-forming system, i.e., a system consisting of some compounds and their relationships, is an important research area in systems biology. It is still a great challenge to predict whether or not a given compounds-forming system can form a biologically meaningful pathway.

During the past decade, bioinformatics has seen an explosion in the amount of biological data and has undergone rapid development. A great deal of information on different organisms has been accumulated at both the genetic and metabolic levels. Therefore, some specific databases, such as KEGG/LIGAND [1, 2], ENZYME [3], BRENDA [4], EcoCyc and MetaCyc [5, 6], have been established to store this information. KEGG (Kyoto Encyclopedia of Genes and Genomes) [1, 2, 7] is a widely used database for systematic analysis of gene functions in terms of the interactions between genes and molecules, and it consists of graphical diagrams of biochemical pathways. Metabolic pathways, one kind of compounds-forming system and one of the most important components of KEGG, have been developed and accumulated quickly, which makes it possible to analyze metabolic pathways systematically [8]. Analysis of metabolic pathways is very useful for understanding the relationship between genotype and phenotype. It involves the following important problems: interpretation of biological evolution processes, recognition of metabolites common to a set of functionally related metabolic pathways, and determination of alternate metabolic pathways. Furthermore, it can help in metabolism modeling and function prediction. The metabolic pathway network is very important for studying pharmacological targets, biological evolution and other biotechnological applications [9, 10].

For most pathways in KEGG, it is hardly possible to obtain their graph features by manual query. The current study developed a new approach to address this problem, which may shed light on the in-depth study of various pathway network systems.

## MATERIALS AND METHODS

## Materials

The data on metabolic pathways were collected from the publicly available database KEGG/LIGAND (ftp://ftp.genome.jp/pub/kegg/pathway/map/). Since there are compounds without information on compound similarity or biological properties, we removed these compounds from each pathway. In addition, pathways involving fewer than three compounds were excluded. As a result, 136 metabolic pathways, or compounds-forming systems, in yeast were obtained, and they are termed “positive pathways”. The 136 positive pathways as well as the compound codes contained in each of them are given in Online Supporting Information A1.

The negative pathways were generated in the following two ways. First, randomly select compounds as the vertices of a graph, then create some arcs between these compounds at random. Note that the number of arcs in each negative pathway was assigned according to the size distribution of the arcs in the positive pathways. Second, replace about half of the compounds in each positive pathway by other compounds, while the arcs between the compounds, including both the original and the replaced ones, remain unchanged. Since positive pathways are very rare compared with the vast majority of negative pathways, the number of negative pathways thus generated was 100 times that of the positive ones in the current study. The 13,600 negative pathways are given in Online Supporting Information A2.
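The two generation schemes can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the compound codes and the fixed arc count are hypothetical, and in the study the arc count would be drawn from the size distribution observed in the positive pathways.

```python
import random

def random_negative(compound_pool, n_vertices, n_arcs, rng):
    """Way 1: random compounds as vertices, random arcs between them."""
    vertices = rng.sample(compound_pool, n_vertices)
    # Every ordered pair of distinct vertices is a candidate arc.
    candidates = [(u, v) for u in vertices for v in vertices if u != v]
    arcs = rng.sample(candidates, min(n_arcs, len(candidates)))
    return vertices, arcs

def substituted_negative(vertices, arcs, compound_pool, rng):
    """Way 2: swap about half of the compounds, keep the arc structure."""
    to_replace = rng.sample(vertices, len(vertices) // 2)
    outside = [c for c in compound_pool if c not in vertices]
    substitutes = rng.sample(outside, len(to_replace))
    mapping = dict(zip(to_replace, substitutes))
    new_vertices = [mapping.get(v, v) for v in vertices]
    new_arcs = [(mapping.get(u, u), mapping.get(v, v)) for u, v in arcs]
    return new_vertices, new_arcs

rng = random.Random(0)
pool = [f"C{i:05d}" for i in range(1, 51)]  # hypothetical compound codes
positive_vertices = ["C00001", "C00002", "C00003", "C00004"]
positive_arcs = [("C00001", "C00002"), ("C00002", "C00003"), ("C00003", "C00004")]

neg1 = random_negative(pool, n_vertices=4, n_arcs=3, rng=rng)
neg2 = substituted_negative(positive_vertices, positive_arcs, pool, rng)
```

The second scheme keeps the topology of a real pathway and changes only the chemistry, so a classifier cannot separate such negatives from positives on purely structural graph features.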

## Features

As graphic approaches can give useful intuitive insights, they are widely used to study biological systems, such as drug metabolism systems [11], protein folding kinetics [12], inhibition of HIV-1 reverse transcriptase [13-15], and other biological and medical problems [16-19].

Both graph features and biological properties were used to encode each pathway in the current study. Since reactions between two compounds in a metabolic pathway are directional, i.e., one compound can be transformed into another with the participation of a certain enzyme while the reverse direction does not always hold, each metabolic pathway can be seen as a directed graph in which the vertices represent compounds and the arcs represent reactions. In this study, 88 graph features were extracted from each directed graph that represents a pathway, and 56 features describing biological properties were derived from chemical functional groups. Therefore, there are 88 + 56 = 144 features in total. In this way, each of the 13,736 pathways can be defined in a 144-D (dimensional) space; see Online Supporting Information B1 and Online Supporting Information B2 for the codes of the 136 positive pathways and the 13,600 negative pathways, respectively.

Many graph features were derived in [20-22], where they were extracted from an undirected graph and were successfully used to identify protein complexes [23, 24]. Since each pathway can be deemed a directed graph, we made some modifications to these features. The arcs are weighted by the compound similarity of the corresponding two compounds, which is explained in detail in Section “Compound Similarity”. The 144 features are divided into the following groups.

Graph size and graph density: Let $G = (V, E)$ be a pathway graph, with $|V|$ vertices and $|E|$ arcs. The graph size is the number of compounds in the pathway, $|V|$. Suppose $|E|_{\max}$ is the theoretical maximum number of possible arcs in $G$. The graph density is defined as $|E| / |E|_{\max}$. [20]

Degree statistics: The in-degree (out-degree) is defined as the number of in-neighbors (out-neighbors) of a vertex. The mean in-degree, variance of in-degrees, median in-degree, maximum in-degree, mean out-degree, variance of out-degrees, median out-degree and maximum out-degree are taken as features. [21]
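A minimal sketch of these two feature groups is given below. It assumes that the maximum arc count is $|V|(|V|-1)$ (no self-loops or parallel arcs), that the population variance is used, and that the four degree statistics are the mean, variance, median and maximum; none of these choices is confirmed by the text, and the function and feature names are hypothetical.

```python
from statistics import mean, median, pvariance

def size_density_degree_features(vertices, arcs):
    n = len(vertices)
    max_arcs = n * (n - 1)  # assumption: no self-loops, no parallel arcs
    density = len(arcs) / max_arcs if max_arcs else 0.0

    in_deg = {v: 0 for v in vertices}
    out_deg = {v: 0 for v in vertices}
    for u, v in arcs:  # arc u -> v
        out_deg[u] += 1
        in_deg[v] += 1

    feats = {"size": n, "density": density}
    for name, deg in (("in", in_deg), ("out", out_deg)):
        values = list(deg.values())
        feats[f"{name}_mean"] = mean(values)
        feats[f"{name}_var"] = pvariance(values)  # assumption: population variance
        feats[f"{name}_median"] = median(values)
        feats[f"{name}_max"] = max(values)
    return feats

feats = size_density_degree_features(
    ["A", "B", "C"], [("A", "B"), ("A", "C"), ("B", "C")]
)
```

This yields 2 + 8 = 10 of the 88 graph features for one pathway.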

Edge weight statistics: Let $G = (V, E)$ be a weighted pathway graph where each arc is weighted by a weight $w$ in the range $[0, 1]$. Since it is possible that $w = 0$ for some arcs, we extracted features from two cases: (a) all arcs in the graph are considered, including those with zero weights, and the mean and variance of these weights are taken as features; (b) only arcs with non-zero weights are considered, and the mean and variance of the non-zero weights are taken as features. [20]
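The two cases can be sketched in a few lines. The feature names are modelled on the “weight_edge_mean (with_missing_edges)” label that appears later in the text, but the exact names, the use of population variance and the value returned for an empty list are assumptions.

```python
from statistics import mean, pvariance

def edge_weight_features(weights):
    """weights: one similarity weight per arc, possibly zero."""
    def stats(values):
        # Assumption: an empty list contributes (0, 0).
        return (mean(values), pvariance(values)) if values else (0.0, 0.0)

    nonzero = [w for w in weights if w > 0]
    mean_all, var_all = stats(weights)    # case (a): zero weights included
    mean_nz, var_nz = stats(nonzero)      # case (b): non-zero weights only
    return {
        "weight_edge_mean_with_missing": mean_all,
        "weight_edge_var_with_missing": var_all,
        "weight_edge_mean_nonzero": mean_nz,
        "weight_edge_var_nonzero": var_nz,
    }

weight_feats = edge_weight_features([0.0, 0.4, 0.8, 0.0])
```

Case (a) penalizes pathways with many zero-weighted arcs, while case (b) measures only how strongly the weighted arcs are supported.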

Topological change: Let $G = (V, E)$ be a weighted pathway graph. This group of features is obtained by measuring the topological changes when different cutoffs of the weights are applied to the graph. The weight cutoffs $w_1, \dots, w_8$ are 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7 and 0.8. Let $G_i = (V, E_i)$ be the graph in which only the arcs with weights higher than $w_i$ remain, i.e., $E_i = \{ e \in E : w(e) > w_i \}$. Topology changes are measured as $(|E_i| - |E_{i+1}|) / |E_i|$ for $i = 1, \dots, 7$ (0 if $|E_i| = 0$).
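The cutoff procedure can be sketched as below. The sketch assumes that the change between two consecutive cutoffs is the relative drop in arc count, $(|E_i| - |E_{i+1}|)/|E_i|$, and that it is set to 0 when no arc survives the lower cutoff. This reading matches the seven features implied by eight cutoffs and feature names such as “topological_change_0.7_0.8”, but the exact formula should be checked against the original publication.

```python
CUTOFFS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

def topological_change_features(weights, cutoffs=CUTOFFS):
    """weights: one similarity weight per arc of the pathway graph."""
    # Number of arcs that survive each cutoff (weight strictly above it).
    counts = [sum(1 for w in weights if w > c) for c in cutoffs]
    changes = []
    for e_i, e_next in zip(counts, counts[1:]):
        # Assumed formula: relative loss of arcs between consecutive cutoffs.
        changes.append((e_i - e_next) / e_i if e_i else 0.0)
    return changes

changes = topological_change_features([0.05, 0.15, 0.35, 0.65, 0.75, 0.95])
```

A pathway whose arcs all carry high similarity weights loses few arcs as the cutoff rises, so its changes stay close to zero until the highest cutoffs.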

Degree correlation: Let $G = (V, E)$ be a pathway graph. For each vertex $v$, denote its in-neighbors as $N_{in}(v)$ and its out-neighbors as $N_{out}(v)$, and consider the two corresponding induced subgraphs of $G$. Two degree-correlation quantities are defined for each vertex, one on each induced subgraph, each with its own special case. The mean, variance and maximum of each quantity are taken as features. [22]

Clustering: Let $G = (V, E)$ be a pathway graph. For each vertex $v$, denote its in-neighbors as $N_{in}(v)$ and its out-neighbors as $N_{out}(v)$, and consider the two corresponding induced subgraphs of $G$. Two clustering quantities are defined for each vertex, one on each induced subgraph, each with its own special case. The mean, variance and maximum of each quantity are taken as features. [21]

Topological: Let $G = (V, E)$ be a pathway graph. For each pair of vertices $v_i$ and $v_j$, four counts are taken: the number of vertices that are (a) in-neighbors of both $v_i$ and $v_j$, (b) in-neighbors of $v_i$ and out-neighbors of $v_j$, (c) out-neighbors of $v_i$ and in-neighbors of $v_j$, and (d) out-neighbors of both $v_i$ and $v_j$. For each vertex, the numbers of its in-neighbors and out-neighbors are also recorded. Four quantities are defined from these counts, each with its own special case, and for each vertex each quantity is averaged over its vertex pairs. The topological features are the mean, variance and maximum of these per-vertex averages, taken for each of the four quantities. [22]

Singular values: Let $G$ be a pathway graph and $A$ be its adjacency matrix. The three largest singular values of $A$ are taken as features. [20]
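As an illustration, the three largest singular values of a small directed graph can be computed with NumPy. The sketch assumes an unweighted 0/1 adjacency matrix and pads with zeros if the graph has fewer than three vertices; the text does not say whether the weighted or unweighted matrix was used, and the function name is hypothetical.

```python
import numpy as np

def top_singular_values(vertices, arcs, k=3):
    index = {v: i for i, v in enumerate(vertices)}
    A = np.zeros((len(vertices), len(vertices)))
    for u, v in arcs:
        A[index[u], index[v]] = 1.0  # assumption: unweighted adjacency matrix
    # numpy returns singular values sorted in descending order.
    s = np.linalg.svd(A, compute_uv=False)
    top = np.zeros(k)
    top[: min(k, len(s))] = s[:k]
    return top

# A simple chain A -> B -> C -> D.
top3 = top_singular_values(
    ["A", "B", "C", "D"], [("A", "B"), ("B", "C"), ("C", "D")]
)
```

For the chain above, each arc occupies its own row and column of the adjacency matrix, so the three non-zero singular values are all equal to 1.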

Local density change: Let $G = (V, E)$ be a pathway graph. For each vertex $v$, let $N_{in}(v)$ and $N_{out}(v)$ be the in-neighbors and out-neighbors of $v$. We only show how to derive features from the out-neighbors of each vertex under different cutoffs, which are 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.9. Construct a weighted undirected complete graph $K(v)$ with vertex set $N_{out}(v)$, where the weight of each pair of vertices is the compound similarity of the corresponding compounds (see Section “Compound Similarity”). Suppose the cutoff is $c$. Extract the spanning subgraph $K_c(v)$ of $K(v)$ that contains the edges whose weights are greater than $c$. A local density value is computed from $K_c(v)$ for each vertex, with its own special case, and the mean and maximum of these values are taken as features under cutoff $c$.

The above features describe the graph representation, while the following features describe biological properties. Encouraged by the successes of using chemical functional groups in tackling some biological problems [25, 26], we took them as the biological properties in the current study. Suppose a pathway consists of $n$ compounds; the mean and maximum values of each biological property over the $n$ compounds are taken as features.

Chemical functional groups: Organic compounds consist of some fixed structures, called functional groups, which are groups of atoms combined in different ways. Compounds with the same functional group generally react in a similar way. Currently, there are more than 100 functional groups. In the current study, we selected 28 common ones. These groups are: “alcohol”, “aldehyde”, “amide”, “amine”, “hydroxamic acid”, “phosphorus”, “carboxylate”, “methyl”, “ester”, “ether”, “imine”, “ketone”, “nitro”, “halogen”, “thiol”, “sulfonic acid”, “sulfone”, “sulfonamide”, “sulfoxide”, “sulfide”, “ar_5c_ring”, “ar_6c_ring”, “non_ar_5c_ring”, “non_ar_6c_ring”, “hetero ar_6_ring”, “hetero non_ar_5_ring”, “hetero non_ar_6_ring”, and “hetero ar_5_ring”. In this study, the chemical functional groups were calculated by the software “fc_analyzer”, which can be downloaded at http://pcal.biosino.org/fc_analyzer.html. [27]
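The construction for a single cutoff can be sketched as follows. The local density is assumed here to be the fraction of neighbor pairs whose similarity exceeds the cutoff, with 0 for vertices that have fewer than two neighbors; the text does not preserve the exact formula, so both choices, as well as the function names, are assumptions.

```python
from itertools import combinations
from statistics import mean

def local_density_features(vertices, arcs, similarity, cutoff, direction="out"):
    """Return (mean, max) of the per-vertex local density at one cutoff."""
    neighbors = {v: set() for v in vertices}
    for u, v in arcs:  # arc u -> v
        if direction == "out":
            neighbors[u].add(v)
        else:
            neighbors[v].add(u)

    densities = []
    for v in vertices:
        # All edges of the complete graph on the neighbors of v.
        pairs = list(combinations(sorted(neighbors[v]), 2))
        if not pairs:
            densities.append(0.0)  # assumption: no pairs means density 0
            continue
        kept = sum(1 for a, b in pairs if similarity(a, b) > cutoff)
        densities.append(kept / len(pairs))
    return mean(densities), max(densities)

# Hypothetical similarity scores between three products of compound S.
scores = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.3,
    frozenset(("B", "C")): 0.5,
}
sim = lambda a, b: scores[frozenset((a, b))]
arcs = [("S", "A"), ("S", "B"), ("S", "C")]
d_mean, d_max = local_density_features(["S", "A", "B", "C"], arcs, sim, cutoff=0.4)
```

Repeating this for the ten cutoffs and for both directions gives 10 × 2 × 2 = 40 features, which matches the size of this group implied by the total of 88 graph features.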
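Turning per-compound functional group values into pathway-level features can be sketched as below. The per-compound values are assumed to be counts of each group in the compound; the real values would come from fc_analyzer, whose output format is not described here, and the compound data in the example are hypothetical.

```python
def functional_group_features(compound_counts, groups):
    """compound_counts: one dict per compound, mapping group name -> value."""
    feats = {}
    for g in groups:
        values = [counts.get(g, 0) for counts in compound_counts]
        feats[f"{g}_mean"] = sum(values) / len(values)
        feats[f"{g}_max"] = max(values)
    return feats

# Hypothetical pathway of three compounds and two of the 28 groups.
pathway = [
    {"methyl": 2, "carboxylate": 1},
    {"methyl": 0, "carboxylate": 1},
    {"methyl": 1},
]
group_feats = functional_group_features(pathway, ["methyl", "carboxylate"])
```

With all 28 groups this produces 28 × 2 = 56 features, with names of the same form as “methyl_max” and “carboxylate_mean” discussed in the results.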

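In conclusion, the total number of features is

$$\underbrace{2 + 8 + 4 + 7 + 6 + 6 + 12 + 3 + 40}_{\text{graph features}} + \underbrace{56}_{\text{biological features}} = 88 + 56 = 144 \tag{1}$$

For the detailed distribution of the 144 features, see Table 1.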

## Compound Similarity

Using graph representations to measure the similarity of two compounds was first proposed by Hattori et al. [28], and it has been used to tackle several problems in biological systems, such as the prediction of interactions between small molecules and enzymes [29], the prediction of networks of substrate-enzyme-product triads [30], and drug-target interaction prediction [31]. Since each chemical structure can be represented by a two-dimensional (2D) graph in which vertices denote atoms and edges denote the bonds between them, the similarity of two compounds, according to their method, can be estimated from the size of the maximum common subgraph between the two corresponding graphs using a graph alignment algorithm. The similarity score between two compounds can be calculated by this method on an online website at http://www.genome.jp/ligand-bin/search_compound.

## Minimum Redundancy Maximum Relevance

Feature selection can reduce feature dimensions and improve computational efficiency. In the current study, Minimum Redundancy Maximum Relevance (mRMR), first proposed by Peng et al. [32], is used because it aims to balance minimum redundancy and maximum relevance and has been widely used to tackle various biological problems [23, 25, 33-36]. Maximum relevance guarantees that the features contributing most to the classification will be selected, while minimum redundancy guarantees that features whose prediction ability has already been covered by the selected features will be excluded. mRMR adds the features to the feature list one at a time. In each round, the feature with maximum relevance and minimum redundancy is selected. As a result, a feature list in selection order is obtained. Both redundancy and relevance can be computed through mutual information (MI), which is defined as follows:

$$I(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx \, dy \tag{2}$$

where $x$ and $y$ are two random variables, $p(x, y)$ is the joint probability distribution of $x$ and $y$, and $p(x)$ and $p(y)$ are the marginal probabilities of $x$ and $y$, respectively.
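For discrete or discretized features, the integral in the mutual information definition becomes a sum over observed value pairs, $I(x, y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}$. The sketch below estimates the probabilities from sample frequencies; it is an illustration under that assumption and not the estimator used inside the mRMR program.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

mi_identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Two identical balanced binary variables share $\log 2$ nats of information, while the second pair is independent in the sample, so its mutual information is zero.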

Let $\Omega$ denote the whole feature set. Suppose $\Omega_s$ is the selected feature set with $m$ features, while $\Omega_t$ is the to-be-selected feature set with $n$ features. The relevance $D$ between a feature $f$ in $\Omega_t$ and the target variable $c$ can be computed as $D = I(f, c)$, and the redundancy $R$ between a feature $f$ in $\Omega_t$ and the selected feature set $\Omega_s$ can be computed as

$$R = \frac{1}{m} \sum_{f_i \in \Omega_s} I(f, f_i) \quad (R = 0, \text{ if } m = 0) \tag{3}$$

For each feature $f_j$ in $\Omega_t$, calculate the following expression:

$$I(f_j, c) - \frac{1}{m} \sum_{f_i \in \Omega_s} I(f_j, f_i) \tag{4}$$

To maximize relevance and minimize redundancy, select the feature $f^*$ such that

$$f^* = \arg\max_{f_j \in \Omega_t} \left[ I(f_j, c) - \frac{1}{m} \sum_{f_i \in \Omega_s} I(f_j, f_i) \right] \tag{5}$$

Then remove $f^*$ from $\Omega_t$ and add it to $\Omega_s$. For the remaining features, each time the most relevant and least redundant feature is selected from $\Omega_t$ and added to $\Omega_s$, until all features are in $\Omega_s$. Thus, for a feature pool with $N$ features, the mRMR program will execute $N$ rounds and provide an ordered feature list:

$$S = \{ f'_1, f'_2, \dots, f'_h, \dots, f'_N \} \tag{6}$$

where the index $h$ denotes the round at which the feature is selected.
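The greedy selection loop can be sketched as follows. This is a simplified illustration for discrete features, not the mRMR program itself: the function names are hypothetical, mutual information is estimated from sample frequencies, and the redundancy of the first selected feature is taken to be zero.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

def mrmr_order(features, target):
    """features: dict name -> list of discrete values. Returns names in selection order."""
    relevance = {f: mutual_information(v, target) for f, v in features.items()}
    selected, remaining = [], list(features)
    while remaining:
        def score(f):
            if not selected:  # assumption: no redundancy term in the first round
                return relevance[f]
            redundancy = sum(
                mutual_information(features[f], features[s]) for s in selected
            ) / len(selected)
            return relevance[f] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
target = [x + 2 * y for x, y in zip(a, b)]  # class depends on both a and b
order = mrmr_order({"a": a, "a_copy": list(a), "b": b}, target)
```

In this toy example the exact copy of `a` is as relevant as `a` itself, but it is ranked last because everything it says about the target is already covered once `a` has been selected.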

## Nearest Neighbor Algorithm

In this study, the Nearest Neighbor Algorithm (NNA) [37, 38] was adopted to predict the class of a pathway (positive or negative). The “nearness” is defined by the following distance, which equals one minus the cosine of the angle between two feature vectors:

$$D(x, y) = 1 - \frac{x \cdot y}{\|x\| \, \|y\|} \tag{7}$$

where $x \cdot y$ is the dot product of the two vectors $x$ and $y$, and $\|x\|$ and $\|y\|$ are the moduli of $x$ and $y$, respectively. The smaller $D(x, y)$ is, the nearer the two vectors are [39].

In NNA, suppose there are $n$ training pathways, each of which is either positive or negative, and a new pathway needs to be classified as positive or negative. The distances between each of the training pathways and the new pathway are calculated, and the nearest neighbor of the new pathway is found. If the nearest neighbor is positive or negative, then the new pathway is assigned to be positive or negative, respectively.
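A minimal sketch of this classifier is shown below, using the distance described above (one minus the dot product divided by the product of the two moduli). The function names and the handling of zero vectors are assumptions.

```python
from math import sqrt

def distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    # Assumption: a zero vector is treated as maximally dissimilar.
    return 1.0 - dot / norm if norm else 1.0

def nearest_neighbor_predict(train_vectors, train_labels, query):
    """Assign the label of the training vector nearest to the query."""
    best = min(
        range(len(train_vectors)),
        key=lambda i: distance(train_vectors[i], query),
    )
    return train_labels[best]

train_vectors = [[1.0, 0.0], [0.0, 1.0]]
train_labels = ["positive", "negative"]
label = nearest_neighbor_predict(train_vectors, train_labels, [0.9, 0.1])
```

Because this distance depends only on the angle between two vectors, rescaling a feature vector by a positive constant does not change its nearest neighbor.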

## Jackknife Cross-validation

The prediction model was tested by the jackknife test [40]. There are three cross-validation methods in statistical prediction: the jackknife test, the K-fold cross-validation test, and the independent dataset test [40]. Of these, the jackknife test is deemed more objective and effective than the other two [39, 41], so it has been widely used to evaluate the accuracy of various prediction models [23, 25, 42-47]. In such a test, each sample in the dataset is singled out in turn as the testing data and the remaining samples are used to train the prediction model. Therefore, every sample is tested exactly once.

## Incremental Feature Selection (IFS)

From mRMR, an ordered feature list $S = \{ f'_1, f'_2, \dots, f'_N \}$ was obtained. Define the $i$-th feature set as $S_i = \{ f'_1, f'_2, \dots, f'_i \}$ $(1 \le i \le N)$, i.e., $S_i$ contains the first $i$ features of $S$. For every $i$, we performed NNA with the features in $S_i$, and the accuracy of correctly predicting the positive pathways, evaluated by jackknife cross-validation, was obtained. As a result, we can plot a curve, named the IFS curve, with the identification accuracy on its y-axis and the index $i$ of $S_i$ on its x-axis.
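The jackknife (leave-one-out) procedure can be sketched as below. The predictor is passed in as a function; the scalar one-nearest-neighbor helper and the toy data are hypothetical and serve only to make the example runnable. The sketch reports overall accuracy, whereas the study reports the accuracy on positive pathways separately.

```python
def nearest_label(train_x, train_y, query):
    # Hypothetical placeholder predictor: 1-NN on a single scalar feature.
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[best]

def jackknife_accuracy(xs, ys, predict):
    hits = 0
    for i in range(len(xs)):
        # Hold out sample i and train on all remaining samples.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += predict(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)

xs = [0.0, 0.1, 1.0, 1.1, 0.3]
ys = ["negative", "negative", "positive", "positive", "positive"]
accuracy = jackknife_accuracy(xs, ys, nearest_label)
```

Here four of the five samples are recovered correctly; the positive sample at 0.3 lies closest to a negative one and is misclassified when held out.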
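The IFS loop itself is short once an evaluation routine is available. In the sketch below, `evaluate` is a hypothetical callback that would run the nearest neighbor classifier under jackknife cross-validation on the given feature subset and return the accuracy on positive pathways; here it is replaced by a lookup of made-up accuracies so that the example runs on its own.

```python
def incremental_feature_selection(ordered_features, evaluate):
    """Evaluate the first i features for every i and keep the best prefix."""
    curve = []
    for i in range(1, len(ordered_features) + 1):
        subset = ordered_features[:i]
        curve.append((i, evaluate(subset)))
    # max() returns the first maximum, i.e. the smallest best subset.
    best_i, best_accuracy = max(curve, key=lambda point: point[1])
    return curve, ordered_features[:best_i], best_accuracy

# Made-up accuracies indexed by subset size, for illustration only.
toy_accuracy = {1: 0.50, 2: 0.70, 3: 0.65, 4: 0.70}
ordered = ["f1", "f2", "f3", "f4"]  # hypothetical mRMR order
curve, best_subset, best_accuracy = incremental_feature_selection(
    ordered, lambda subset: toy_accuracy[len(subset)]
)
```

The list `curve` holds the points of the IFS curve, and the peak of that curve identifies the optimal feature subset.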

## RESULTS AND DISCUSSION

## Results of mRMR

The mRMR program can be downloaded from http://research.janelia.org/peng/proj/mRMR/. It was run with default parameters. The output of the mRMR program contains two feature lists: (I) the MaxRel features list; (II) the mRMR features list (see Online Supporting Information C).

For the MaxRel features list, the most relevant 10% of the features (14 in total) were investigated, and their distribution is shown in Fig. 1. Among these 14 features, 12 (85.7%) were extracted from the corresponding graph of a pathway, indicating that among the adopted features, graph features contribute most to the forming of metabolic pathways. Of the 14 features, 9 (64.29%) come from the 9th feature group, “local density change”, which quantifies the similarity between compounds that can be transformed into or out of a particular compound. This indicates that compounds linked by an arc, regardless of its direction, are often very similar. From this it is easy to infer that compounds that can be transformed into the same compound are often very similar in structure.

## Results of IFS

Shown in Fig. 2 is the IFS curve. The highest accuracy of IFS for positive pathways is 74.26%, using 16 features (see Online Supporting Information C). Furthermore, for the readers' interest, the accuracy of identifying the negative pathways and the overall accuracy using these 16 optimized features are 99.64% and 99.39%, respectively. The detailed IFS data can be found in Online Supporting Information D.

Shown in Fig. 3 is the distribution of the 16 optimized features. It is straightforward to see that 10 (62.5%) features come from the pathway graph, among which 6 (37.5%) come from the 9th feature group, “local density change”, reaching the same conclusion as in Section “Results of mRMR”. In addition, 6 (37.5%) features come from the chemical functional groups, indicating that chemical functional groups also contribute to the forming of metabolic pathways.
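The three reported accuracies are mutually consistent, which can be checked with a few lines of arithmetic. The counts of correctly predicted pathways below are back-calculated from the reported percentages and the dataset sizes; they are not stated in the text and should be read as an inference.

```python
n_positive, n_negative = 136, 13600

# Back-calculated counts (inferred from the reported percentages).
true_positive = round(0.7426 * n_positive)
true_negative = round(0.9964 * n_negative)

positive_accuracy = true_positive / n_positive
negative_accuracy = true_negative / n_negative
overall_accuracy = (true_positive + true_negative) / (n_positive + n_negative)

summary = (
    f"{positive_accuracy:.2%}",
    f"{negative_accuracy:.2%}",
    f"{overall_accuracy:.2%}",
)
```

Because negative pathways outnumber positive ones by 100 to 1, the overall accuracy is dominated by the negative class, which is why the accuracy on positive pathways is the more informative figure.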

## Analysis of the Important Features

In this study, we present a novel metabolic pathway network analysis method based on hybrid properties, namely graph properties and biological properties. The individual feature that contributes most is “weight_edge_mean (with_missing_edges)”, which is the mean of the weighted edges in a metabolic pathway, including the zero weights. If fewer zero-weighted edges or more heavily weighted edges are present in the graph, the feature tends to take a greater value. Fewer zero-weighted edges mean that the graph is more densely connected, and more heavily weighted edges mean that the compounds in the metabolic pathway are linked more strongly and with higher confidence. The “out_local_density” and “in_local_density” features represent the downstream and upstream chemical similarity, respectively. Features such as “topological_change_0.7_0.8” and “topological_change_0.6_0.7” reflect the chemical structure changes in the metabolic pathways. Several algorithms for the general topological properties of metabolic networks are well characterized [48, 49]. Several methods based on minimization of metabolic adjustment have shown that knockout metabolic fluxes undergo a minimal redistribution with respect to the flux configuration of the wild type [50, 51]. The importance of these features shows that a correlation exists between the structural similarity and the pathway connectivity of chemical compounds. The tendency for structurally similar compounds to be closely positioned on a pathway can be confirmed by the distribution of compound similarity scores along the KEGG pathways [28].

Some other contributing features, such as “non_ar_6c_ring_mean”, “ar_6c_ring_max”, “methyl_max”, “sulfonamide_max”, “carboxylate_mean” and “halogen_max”, are highly related to the specific chemical structures of the metabolic pathways. The largest clusters of similar compounds were related to carbohydrates [10], and features derived from chemical functional groups reveal the intensive carbohydrate metabolism. In fact, enzymatic reactions corresponding to the connection between two sub-pathways are lyases acting on carbons, such as a decarboxylase for reducing or increasing the number of carbon atoms [52].

## Conclusion

In this study, we analyzed 144 features extracted from each of the positive and negative pathways. Of the 144 features, 88 were graph features, as each pathway can be deemed a directed graph, and 56 were derived from the chemical functional groups of the compounds. The “Minimum Redundancy Maximum Relevance” and “Incremental Feature Selection” techniques were employed to analyze these features. The nearest neighbor algorithm and the jackknife test were used to evaluate the accuracy of our model in predicting the positive pathways. As a result, 16 of the adopted features were found to be the important features for the classification. This contribution might be of use in stimulating in-depth studies on such an important and challenging topic, and might help improve the understanding of metabolic pathways.