Sequence and Pathway analysis


PLK1 and TRIM22 are promising druggable targets for treating colorectal neoplasms that control the activity of the TP53, AR and ESR2 transcription factors on promoters of genes carrying sequence variations in colon tissue

Demo User
geneXplain GmbH
info@genexplain.com

Data received on 13/08/2019 ; Run on 30/08/2019 ; Report generated on 30/08/2019


Abstract

In the present study we applied the software package "Genome Enhancer" to a data set that contains genomics data obtained from colon tissue. The study was done in the context of colorectal neoplasms. The goal of this pipeline is to identify potential drug targets in the molecular network that governs the studied pathological process. In the first step, the pipeline discovers transcription factors (TFs) that regulate gene activity in the pathological state. The activities of these TFs are controlled by so-called master regulators, which are identified in the second step of the analysis. After a subsequent druggability check, the most promising master regulators are chosen as potential drug targets for the analyzed pathology. Finally, the pipeline produces (a) a list of known drugs and (b) a list of novel biologically active chemical compounds with the potential to interact with the selected drug targets.

From the data set analyzed in this study, we found the following TFs to be potentially involved in the regulation of the genes carrying sequence variations: TP53, AR and ESR2. The subsequent network analysis suggested PLK1, KAT2B, PTPRE, TRIM22 and DUSP2 as the most promising and druggable molecular targets. Finally, the following drugs were identified as the most promising treatment candidates: Alendronate, Coenzyme A, 4-(4-METHYLPIPERAZIN-1-YL)-N-[5-(2-THIENYLACETYL)-1,5-DIHYDROPYRROLO[3,4-C]PYRAZOL-3-YL]BENZAMIDE, 1-3 Sugar Ring of Pentamannosyl 6-Phosphate, Adenosine-5'-Monophosphate Glucopyranosyl-Monophosphate Ester and 2-Phenyl-Ethanol.

1. Introduction

Recording "-omics" data to measure gene activities, protein expression or metabolic events is becoming a standard approach to characterize the pathological state of an affected organism or tissue. Increasingly, several of these methods are applied in a combined approach, leading to large "multiomics" datasets. Still, the challenge remains how to reveal the underlying molecular mechanisms that render a given pathological state different from the norm. The disease-causing mechanism can be described by a re-wiring of the cellular regulatory network, for instance as a result of genetic or epigenetic alterations influencing the activity of relevant genes. Reconstruction of the disease-specific regulatory networks can help identify potential master regulators of the respective pathological process. Knowledge about these master regulators can point to ways to block a pathological regulatory cascade. Suppression of certain molecular targets as components of these cascades may stop the pathological process and cure the disease.

Conventional approaches to statistical "-omics" data analysis provide only very limited information about the causes of the observed phenomena and therefore contribute little to the understanding of the pathological molecular mechanism. In contrast, the "upstream analysis" method [1-5] applied here has been devised to provide a causal interpretation of the data obtained for a pathological state. This approach comprises two major steps: (1) analysing promoters and enhancers of genes carrying sequence variations for the transcription factors (TFs) involved in their regulation and, thus, important for the process under study; (2) reconstructing the signaling pathways that activate these TFs and identifying master regulators at the top of such pathways. For the first step, the database TRANSFAC® [6] is employed together with the TF binding site identification algorithms Match [7] and CMA [8]. The second step involves the signal transduction database TRANSPATH® [9] and special graph search algorithms [10] implemented in the software "Genome Enhancer".

The "upstream analysis" approach has now been extended by a third step that reveals known drugs suitable to inhibit (or activate) the identified molecular targets in the context of the disease under study. This step is performed using information from the HumanPSD™ database [11]. In addition, new potential small molecular ligands are subsequently predicted for the revealed targets. A general druggability check is performed using a precomputed database of biological activities of chemical compounds from a library of about 13000 of the most pharmaceutically active compounds. The spectra of biological activities are computed using the program PASS on the basis of a (Q)SAR approach [12-14].

2. Data

For this study the following experimental data was used:

Table 1. Experimental datasets used in the study

File name Data type
CRC_variants Genomics

Figure 1. Annotation diagram of experimental data used in this study. With the colored boxes we show those sub-categories of the data that are compared in our analysis.

3. Results

We have analysed the following condition: Experiment: short-term survival.

3.1. Identification of target genes

In the first step of the analysis target genes were identified from the uploaded experimental data. The most frequently mutated genes were used as target genes.

Table 2. The ten most frequently mutated genes in Experiment: short-term survival.

ID Gene symbol Number of variations
ENSG00000132570 PCBD2 172
ENSG00000242086 LINC00969 147
ENSG00000248923 MTND5P11 126
ENSG00000234745 HLA-B 122
ENSG00000154237 LRRK1 117
ENSG00000259755 RP11-505E24.2 111
ENSG00000230021 RP5-857K21.4 104
ENSG00000067057 PFKP 92
ENSG00000247627 MTND4P12 91
ENSG00000281344 HELLPAR 88
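The selection of target genes by mutation frequency can be illustrated with a minimal Python sketch. It assumes that every variant has already been annotated with a gene ID; the function name and the toy input are hypothetical and not part of Genome Enhancer.

```python
from collections import Counter

def top_mutated_genes(variant_gene_ids, n=10):
    """Return the n genes carrying the most variants as (gene_id, count) pairs."""
    counts = Counter(variant_gene_ids)
    return counts.most_common(n)

# Hypothetical input: one gene ID per annotated variant.
variants = ["geneA", "geneB", "geneA", "geneC", "geneA", "geneB"]
print(top_mutated_genes(variants, n=2))
```

Ties are resolved by first occurrence in the input, which is a property of `Counter.most_common` rather than of the pipeline.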

3.2. Functional classification of target genes

A functional analysis of genes carrying sequence variations was done by mapping the genes to several known ontologies, such as Gene Ontology (GO), disease ontology (based on HumanPSD™ database) and the ontology of signal transduction and metabolic pathways from the TRANSPATH® database. Statistical significance was computed using a binomial test.

Figures 2-4 show the most significant categories.
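The binomial test mentioned above can be sketched as follows. The exact formulation used by the pipeline is not given in this report; the sketch assumes a one-sided (upper-tail) test in which each gene of the input set falls into a category with the background probability of that category. The function name and the example numbers are hypothetical.

```python
from math import comb

def binomial_enrichment_pvalue(hits, set_size, background_fraction):
    """Upper-tail p-value P(X >= hits) for X ~ Binomial(set_size, background_fraction)."""
    return sum(
        comb(set_size, k)
        * background_fraction ** k
        * (1 - background_fraction) ** (set_size - k)
        for k in range(hits, set_size + 1)
    )

# Hypothetical example: 8 of 100 input genes fall into a category
# that covers 2% of the background genes.
p_value = binomial_enrichment_pvalue(hits=8, set_size=100, background_fraction=0.02)
print(f"{p_value:.2e}")
```

A small p-value indicates that the category contains more input genes than expected by chance under the assumed background rate.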

The most frequently mutated genes in Experiment: short-term survival:

GO (biological process)

Figure 2. Enriched GO (biological process) of the most frequently mutated genes in Experiment: short-term survival.

TRANSPATH® Pathways (2019.2)

Figure 3. Enriched TRANSPATH® Pathways (2019.2) of the most frequently mutated genes in Experiment: short-term survival.

HumanPSD™ disease (2019.2)

Figure 4. Enriched HumanPSD™ disease (2019.2) of the most frequently mutated genes in Experiment: short-term survival. The size of each bar corresponds to the number of biomarkers of the given disease found among the most frequently mutated genes.

3.3. Analysis of enriched transcription factor binding sites and composite modules

In the next step a search for transcription factor binding sites (TFBS) was performed in the regulatory regions of the target genes using the TF binding motif library of the TRANSFAC® database. We searched for so-called composite modules that act as potential condition-specific enhancers of the target genes in their upstream regulatory regions (up to 1000 bp upstream of the transcription start site (TSS)) and identified the transcription factors that regulate the activity of the genes through such enhancers.

Classically, enhancers are defined as regions in the genome that increase transcription of one or several genes when inserted in either orientation at various distances upstream or downstream of the gene [8]. Enhancers typically have a length of several hundreds of nucleotides and are bound by multiple transcription factors in a cooperative manner [9].

We analysed mutations that were revealed in the potential enhancers located upstream, downstream or inside the target genes (see Table 3). We identified 14590 mutations potentially affecting gene regulation. Table 4 lists the position weight matrices (PWMs) whose sites were lost or gained due to these mutations. These PWMs were put in the focus of the CMA algorithm, which constructs the model of the enhancers by specifying combinations of TF motifs (see the Methods section for more details of the algorithm).

Table 3. Mutations revealed in the most frequently mutated genes

ID Gene symbol Number of variations
ENSG00000132570 PCBD2 172
ENSG00000242086 LINC00969 147
ENSG00000248923 MTND5P11 126
ENSG00000234745 HLA-B 122
ENSG00000154237 LRRK1 117
ENSG00000259755 RP11-505E24.2 111
ENSG00000230021 RP5-857K21.4 104
ENSG00000067057 PFKP 92
ENSG00000247627 MTND4P12 91
ENSG00000281344 HELLPAR 88

Table 4. PWMs whose sites were lost or gained due to mutations in the most frequently mutated genes

ID P-value (gains) P-value (losses) yesCount (gains) yesCount (losses)
V$HNF3B_Q6 3.6E-2 5.96E-7 10 334
V$RBPJK_01 3.28E-2 3.96E-6 94 251
V$P53_Q3 2.29E-2 2.34E-4 6 140
V$KAISO_01 2.17E-2 1.79E-8 2196 2454
V$ZIC1_05 8.08E-3 1.11E-4 3 6
V$RHOX11_01 3.75E-3 1.14E-5 106 1135
V$FREAC3_01 3.71E-3 1.1E-4 6 0
V$CEBPA_Q6 3.29E-3 3.98E-5 331 63
V$LRH1_Q5_01 2.47E-3 2.59E-4 20 289
V$CRX_Q4_01 1.99E-4 9 null
V$GFI1_Q6_01 1.28E-4 5.92E-4 158 92
V$NANOG_01 1.15E-4 6.48E-3 2346 5157
V$CDPCR1_01 8.91E-5 7.21E-4 745 3429
V$BBX_03 8.5E-5 4.05E-3 49 2
V$ZFP105_04 7.54E-5 4.27E-2 197 389
V$GLI_Q3 3.69E-5 1.41E-3 1251 590
V$GCM2_01 6.96E-7 1.46E-2 2574 80
V$HMGA2_01 2.29E-7 2.76E-2 283 4
V$MEF2A_Q6 1.06E-9 8.23E-5 219 165
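How a mutation can create or destroy a PWM site, as counted in Table 4, can be illustrated with a small Python sketch. This is not the Match algorithm used with TRANSFAC®: the 4-bp matrix, its weights and the score cutoff are hypothetical, and only the forward strand is scanned.

```python
# Toy 4-bp PWM (hypothetical weights): +1 for the preferred base, -1 otherwise.
PREFERRED = "TGCA"
TOY_PWM = [{base: (1.0 if base == best else -1.0) for base in "ACGT"} for best in PREFERRED]

def best_match_score(sequence, pwm):
    """Highest PWM score over all forward-strand windows of the sequence."""
    width = len(pwm)
    return max(
        sum(pwm[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    )

def classify_site_change(ref_seq, alt_seq, pwm, cutoff):
    """Return 'gain', 'loss' or 'unchanged' for a site relative to a score cutoff."""
    ref_hit = best_match_score(ref_seq, pwm) >= cutoff
    alt_hit = best_match_score(alt_seq, pwm) >= cutoff
    if alt_hit and not ref_hit:
        return "gain"
    if ref_hit and not alt_hit:
        return "loss"
    return "unchanged"

# The C>G substitution at position 5 breaks the TGCA match of the reference.
print(classify_site_change("AATGCAAA", "AATGGAAA", TOY_PWM, cutoff=3.0))  # loss
```

Counting such gains and losses per PWM over all mutations would give numbers of the kind shown in the yesCount columns, under the stated assumptions.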

We applied the Composite Module Analyst (CMA) [8] method to detect such potential enhancers, as targets of multiple TFs bound in a cooperative manner to the regulatory regions of the genes of interest. CMA applies a genetic algorithm to construct a generalized model of the enhancers by specifying combinations of TF motifs (from TRANSFAC®) whose sites are most frequently clustered together in the regulatory regions of the studied genes. CMA identifies the transcription factors that through their cooperation provide a synergistic effect and thus have a great influence on the gene regulation process.

Enhancer model potentially involved in regulation of target genes (the most frequently mutated genes in Experiment: short-term survival).

The model consists of 2 module(s). Below, for each module the following information is shown:
- PWMs producing matches,
- number of individual matches for each PWM,
- score of the best match.

Module 1:
V$GCM_Q2
0.00; N=3
V$GLI_Q6
0.00; N=2
V$MZF1_Q5
0.98; N=3
V$IRF7_Q3_01
0.00; N=3
V$ERBETA_Q5_01
0.95; N=2
Module width: 65

Module 2:
V$STAT4_Q4
0.00; N=1
V$P53_Q3
0.94; N=3
V$AP2ALPHA_Q4
0.97; N=3
V$AR_14_H
0.00; N=3
V$CDPCR1_01
0.00; N=3
V$SLUG_Q6
0.00; N=2
V$ISL1_Q6
0.97; N=2
Module width: 73


Model score (-p*log10(pval)): 13.47
Wilcoxon p-value (pval): 4.11e-29
Penalty (p): 0.475
Average yes-set score: 4.24
Average no-set score: 2.93
AUC: 0.77
Middle-point: 3.71
False-positive: 27.59%
False-negative: 30.69%
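The model score formula given above, -p*log10(pval), can be checked directly against the reported values. The sketch below also shows one plausible way the false-positive and false-negative rates could be derived from a score threshold such as the middle-point; that second function is an assumption about the definition and is not taken from the report, and the toy scores are hypothetical.

```python
import math

def cma_model_score(penalty, wilcoxon_pvalue):
    """Model score as defined in the report: -p * log10(pval)."""
    return -penalty * math.log10(wilcoxon_pvalue)

def error_rates(yes_scores, no_scores, threshold):
    """Assumed definition: Yes sequences scoring below the threshold are false
    negatives, No sequences scoring at or above it are false positives."""
    false_negative = sum(s < threshold for s in yes_scores) / len(yes_scores)
    false_positive = sum(s >= threshold for s in no_scores) / len(no_scores)
    return false_positive, false_negative

# Reported penalty and Wilcoxon p-value of the enhancer model.
print(round(cma_model_score(0.475, 4.11e-29), 2))

# Hypothetical per-sequence scores, for illustration only.
print(error_rates([4.0, 5.0, 3.0], [2.0, 3.0, 4.0], threshold=3.5))
```

With the rounded inputs shown in the report the formula gives about 13.48, which agrees with the reported 13.47 up to rounding of the displayed penalty and p-value.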

On the basis of the enhancer models we identified the following transcription factors potentially regulating the target genes of our interest. We found 13 transcription factors controlling expression of the genes associated with genomic variations (see Table 5).

Table 5. Transcription factors of the predicted enhancer model potentially regulating the genes carrying sequence variations (the most frequently mutated genes in Experiment: short-term survival). Yes-No ratio is the ratio between the frequencies of the sites in Yes sequences versus No sequences. It describes the level of enrichment of binding sites for the indicated TF in the regulatory target regions. Regulatory score is a measure of the involvement of the given TF in controlling the expression of genes that encode the master regulators presented below (through positive feedback loops).

ID Gene symbol Gene description Regulatory score Yes-No ratio
MO000019548 TP53 tumor protein p53 5.53 1.21
MO000021454 AR androgen receptor 4.7 5.39
MO000059335 ESR2 estrogen receptor 2 4.15 1.88
MO000024708 CUX1 cut like homeobox 1 4.09 8.62
MO000028767 SNAI2 snail family transcriptional repressor 2 3.67 1.69
MO000019117 GLI1 GLI family zinc finger 1 3.43 1.28
MO000019621 STAT4 signal transducer and activator of transcription 4 3.43 2.96
MO000007703 IRF7 interferon regulatory factor 7 3.27 7
MO000001275 TFAP2A transcription factor AP-2 alpha 2.98 1.29
MO000026306 GCM1 glial cells missing homolog 1 2.6 3.77

3.4. Finding master regulators in networks

In the second step of the upstream analysis, common regulators of the revealed TFs were identified. We identified 173 signaling proteins whose structure and function are highly damaged by the mutations (see Table 6).

Table 6. Signaling proteins whose structure and function are damaged by the mutations in the most frequently mutated genes

ID Title Mutation count Consequence Codons
MO000138949 Drp1(h) 13 NMD_transcript_variant,stop_gained Gaa/Taa
MO000019673 p85alpha(h) 9 stop_gained Cga/Tga
MO000113258 MYPT1(h) 8 NMD_transcript_variant,frameshift_variant aga/aAga
MO000127741 SMC4L1(h) 8 stop_gained Cga/Tga
MO000214698 MS4A6A(h) 8 NMD_transcript_variant,frameshift_variant -/T,tta/ttTa
MO000035319 kinectin(h) 7 NMD_transcript_variant,frameshift_variant -/A
MO000144675 NULP1(h) 7 NMD_transcript_variant,frameshift_variant -/A
MO000145695 Anamorsin(h) 7 NMD_transcript_variant,frameshift_variant -/A
MO000206935 C11orf74(h) 7 stop_gained Gaa/Taa
MO000068933 HLA-G(h) 6 NMD_transcript_variant,splice_region_variant,stop_lost Tga/Aga

The top 100 mutated proteins for the most frequently mutated genes were used in the master regulator search algorithm as a list of nodes that are removed from the signal transduction network during the search (see the Methods section for more details of the algorithm). The resulting master regulators appear to be key candidates for therapeutic targets, as they have a master effect on the regulation of the intracellular pathways that activate the pathological process of our study. The identified master regulators are shown in Table 7.

Table 7. Master regulators that may govern the regulation of the most frequently mutated genes in Experiment: short-term survival. Total rank is the sum of the ranks of the master molecules sorted by keynode score, CMA score, transcriptomics and proteomics data (if used).

ID Master molecule name Gene symbol Gene description Total rank
MO000022403 plk1(h) PLK1 polo like kinase 1 86
MO000096187 plk1(h) PLK1 polo like kinase 1 97
MO000034388 PDK1(h){pS241} PDPK1 3-phosphoinositide dependent protein kinase 1 110
MO000058803 PDK1-isoform1(h) PDPK1 3-phosphoinositide dependent protein kinase 1 121
MO000021819 PDK1(h) PDPK1 3-phosphoinositide dependent protein kinase 1 126
MO000009253 MAPKAPK2(h) MAPKAPK2 mitogen-activated protein kinase-activated protein kinase 2 139
MO000056491 p/CAF(h) KAT2B lysine acetyltransferase 2B 147
MO000025871 Staf-50(h) TRIM22 tripartite motif containing 22 148
MO000102384 PDK1-isoform2(h) PDPK1 3-phosphoinositide dependent protein kinase 1 149
MO000022406 plk1(h){p} PLK1 polo like kinase 1 164
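The Total rank described in the caption of Table 7 is a sum of per-criterion ranks. A minimal sketch of such a rank-sum aggregation is given below; the criterion names, the scores and the rule that higher scores rank first are assumptions made for illustration, and tie handling is ignored.

```python
def total_ranks(scores_by_criterion):
    """Sum the rank of each molecule over all criteria (rank 1 = best score).

    scores_by_criterion maps a criterion name to a dict of molecule -> score,
    where a higher score is assumed to be better.
    """
    totals = {}
    for scores in scores_by_criterion.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, molecule in enumerate(ordered, start=1):
            totals[molecule] = totals.get(molecule, 0) + rank
    # The smallest total rank comes first.
    return sorted(totals.items(), key=lambda item: item[1])

# Hypothetical scores for three candidate master regulators.
toy_scores = {
    "keynode_score": {"A": 0.9, "B": 0.5, "C": 0.7},
    "cma_score": {"A": 2.0, "B": 3.0, "C": 1.0},
}
print(total_ranks(toy_scores))
```

A molecule that ranks well on every criterion ends up with a low total rank, which is why lower values appear at the top of the table.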

The intracellular regulatory pathways controlled by the above-mentioned master regulators are depicted in Figure 5. This diagram displays the connections between identified transcription factors, which play important roles in the regulation of genes carrying sequence variations, and selected master regulators, which are responsible for the regulation of these TFs.

Figure 5. Diagram of intracellular regulatory signal transduction pathways of the most frequently mutated genes in Experiment: short-term survival. Master regulators are indicated by red rectangles, transcription factors are blue rectangles, and green rectangles are intermediate molecules, which have been added to the network during the search for master regulators from selected TFs. Orange frames highlight molecules presented in original mapping.

4. Identification of potential drugs

In the last step of the analysis we sought to identify known drugs as well as new chemical compounds that are potentially suitable for inhibition (or activation) of the identified molecular targets in the context of the specified human disease.

First, we identified known drugs using information from the HumanPSD™ database [11] about their targets and about clinical trials in which the drugs have been tested for the treatment of various human diseases. Table 8 shows the resulting list of druggable master regulators that represent the predicted drug targets of the studied pathology. Table 9 lists chemical compounds and known drugs (from the HumanPSD™ database) potentially acting on the corresponding master regulators.

Table 8. Known drug targets for known drugs revealed in this study. The column Druggability score contains the number of drugs that are potentially suitable for inhibition (or activation) of the target.

ID Gene symbol Gene description Druggability score Total rank
ENSG00000166851 PLK1 polo like kinase 1 5 164
ENSG00000114166 KAT2B lysine acetyltransferase 2B 3 176
ENSG00000132334 PTPRE protein tyrosine phosphatase, receptor type E 1 183
ENSG00000101966 XIAP X-linked inhibitor of apoptosis 2 250
ENSG00000101182 PSMA7 proteasome subunit alpha 7 3 326
ENSG00000005844 ITGAL integrin subunit alpha L 8 508
ENSG00000115232 ITGA4 integrin subunit alpha 4 8 508
ENSG00000115594 IL1R1 interleukin 1 receptor type 1 3 519
ENSG00000171608 PIK3CD phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit delta 3 520
ENSG00000145391 SETD7 SET domain containing lysine methyltransferase 7 1 558

Table 9. The list of drugs (from HumanPSD™) approved or used in clinical trials for colorectal neoplasms and acting on master regulators revealed in our study. The column Target activity score contains the value of a numeric function that depends on the ranks of all targets that were found for the drug. The column Disease activity score contains the weighted sum of the user-selected diseases for which the drug is known to be applied. The sum of clinical trial phases is used as the weight of a disease. The column Drug rank contains the total rank of the given drug among all drugs found. See the Methods section for details.

ID Name Target names Target activity score NA Phase 1 Phase 2 Phase 3 Phase 4 Disease activity score Drug rank
DB08896 Regorafenib KIT, KDR, ABL1, PDGFRB, FGFR1, RET, PDGFRA... 2 Colorectal Neoplasms, Adenocarcinoma, Carcinoma, Hepatocellular, Cholangiocarcinoma, Gastrointestinal Stromal Tumors, Glioblastoma, Liver Neoplasms... Colorectal Neoplasms, Carcinoma, Hepatocellular, Carcinoma, Small Cell, Esophageal Neoplasms, Gastrointestinal Neoplasms, Gastrointestinal Stromal Tumors, Intestinal Neoplasms... Colorectal Neoplasms, Adenocarcinoma, Bile Duct Neoplasms, Brain Abscess, Breast Neoplasms, Carcinoid Tumor, Carcinoma, Adenoid Cystic... Colorectal Neoplasms, Carcinoma, Hepatocellular, Colonic Neoplasms, Esophageal Neoplasms, Gastrointestinal Stromal Tumors, Neoplasms, Noma... Colorectal Neoplasms, Gastrointestinal Stromal Tumors, Neoplasms, Rectal Neoplasms 11 4
DB00398 Sorafenib KIT, KDR, PDGFRB, FGFR1, BRAF, RAF1, RET 1.49 Colorectal Neoplasms, Adenocarcinoma, Ascites, Brain Abscess, Brain Neoplasms, Breast Neoplasms, Carcinoma, Hepatocellular... Colorectal Neoplasms, Adenocarcinoma, Adenoma, Adenoma, Liver Cell, Astrocytoma, Bile Duct Neoplasms, Biliary Tract Neoplasms... Colorectal Neoplasms, Adenocarcinoma, Adenoma, Adenoma, Liver Cell, Adrenocortical Carcinoma, Bile Duct Neoplasms, Biliary Tract Neoplasms... Adenocarcinoma, Breast Neoplasms, Carcinoma, Carcinoma, Hepatocellular, Carcinoma, Non-Small-Cell Lung, Carcinoma, Renal Cell, Digestive System Diseases... Carcinoma, Hepatocellular, Carcinoma, Renal Cell, Liver Neoplasms, Neoplasms, Noma, Thrombosis 4 25
DB09079 Nintedanib FGFR3, SRC, KDR, LYN, FGFR1 1.08 Colorectal Neoplasms, Carcinoma, Carcinoma, Non-Small-Cell Lung, Endometrial Neoplasms, Fallopian Tube Neoplasms, Idiopathic Pulmonary Fibrosis, Lung Diseases... Adenocarcinoma, Breast Neoplasms, Carcinoma, Hepatocellular, Carcinoma, Non-Small-Cell Lung, Carcinoma, Renal Cell, Carcinoma, Small Cell, Colonic Neoplasms... Colorectal Neoplasms, Adenocarcinoma, Adenocarcinoma, Clear Cell, Adenocarcinoma, Mucinous, Angiomyoma, Appendiceal Neoplasms, Breast Neoplasms... Colorectal Neoplasms, Carcinoma, Non-Small-Cell Lung, Idiopathic Pulmonary Fibrosis, Lung Diseases, Lung Diseases, Interstitial, Mesothelioma, Neoplasms... Idiopathic Pulmonary Fibrosis, Pulmonary Fibrosis 6 29
DB06616 Bosutinib CAMK2G, SRC, ABL1, MAP2K1, LYN 1.55 Breast Neoplasms, Leukemia, Leukemia, Lymphoid, Leukemia, Myelogenous, Chronic, BCR-ABL Positive, Leukemia, Myeloid, Neoplasms, Precursor Cell Lymphoblastic Leukemia-Lymphoma Colorectal Neoplasms, Acute Kidney Injury, Breast Neoplasms, Carcinoma, Non-Small-Cell Lung, Cholangiocarcinoma, Cognitive Dysfunction, Dementia... Colorectal Neoplasms, Brain Abscess, Breast Neoplasms, Cholangiocarcinoma, Cysts, Glioblastoma, Kidney Diseases, Cystic... Leukemia, Leukemia, Myelogenous, Chronic, BCR-ABL Positive, Leukemia, Myeloid Leukemia, Myeloid 3 31
DB01254 Dasatinib KIT, SRC, ABL1, PDGFRB, YES1, FYN, ABL2 1.29 Brain Neoplasms, Carcinoma, Squamous Cell, Carcinoma, Transitional Cell, Gastrointestinal Stromal Tumors, Glioblastoma, Leukemia, Leukemia, Lymphoid... Colorectal Neoplasms, Adenocarcinoma, Adenocarcinoma, Clear Cell, Adenocarcinoma, Mucinous, Brain Abscess, Brain Diseases, Breast Neoplasms... Colorectal Neoplasms, Adenocarcinoma, Adenocarcinoma, Clear Cell, Blast Crisis, Brain Abscess, Brain Diseases, Brain Neoplasms... Leukemia, Leukemia, Lymphoid, Leukemia, Myelogenous, Chronic, BCR-ABL Positive, Leukemia, Myeloid, Leukemia, Myeloid, Accelerated Phase, Leukemia, Myeloid, Acute, Leukemia, Myeloid, Chronic-Phase... Leukemia, Leukemia, Lymphoid, Leukemia, Myelogenous, Chronic, BCR-ABL Positive, Leukemia, Myeloid, Precursor Cell Lymphoblastic Leukemia-Lymphoma 3 36
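The Disease activity score described in the caption of Table 9 can be sketched as a sum of clinical trial phases per selected disease. The weight of 1 for trials without a recorded phase (the NA column) is an assumption; it appears consistent with the scores of the rows shown here when the disease lists are read column by column, but it is not stated in the report. The function name and the input layout are hypothetical.

```python
# Phase weights: the phase number itself; weight 1 for "NA" is an assumption.
PHASE_WEIGHTS = {"NA": 1, "Phase 1": 1, "Phase 2": 2, "Phase 3": 3, "Phase 4": 4}

def disease_activity_score(trials_by_phase, selected_diseases):
    """Sum, over the selected diseases, the weights of all phases in which
    the drug has been tested for that disease."""
    score = 0
    for disease in selected_diseases:
        score += sum(
            weight
            for phase, weight in PHASE_WEIGHTS.items()
            if disease in trials_by_phase.get(phase, [])
        )
    return score

# Hypothetical drug tested for colorectal neoplasms in every column.
all_phases = {phase: ["Colorectal Neoplasms"] for phase in PHASE_WEIGHTS}
print(disease_activity_score(all_phases, ["Colorectal Neoplasms"]))  # 11

# Hypothetical drug tested only in the NA, Phase 1 and Phase 2 columns.
early_phases = {phase: ["Colorectal Neoplasms"] for phase in ("NA", "Phase 1", "Phase 2")}
print(disease_activity_score(early_phases, ["Colorectal Neoplasms"]))  # 4
```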

Table 10. The list of drugs (from HumanPSD™) known to act on master regulators revealed in our study that can be proposed for drug repurposing in the treatment of colorectal neoplasms. The column Target activity score contains the value of a numeric function that depends on the ranks of all targets that were found for the drug. The column Drug rank contains the total rank of the given drug among all drugs found. See the Methods section for details.

ID Name Target names Target activity score NA Phase 1 Phase 2 Phase 3 Phase 4 Drug rank
DB00098 Anti-thymocyte Globulin (Rabbit) ITGB1, ITGAV, ITGAL, ITGB3, CD4 1.12 Arthritis, Osteoarthritis Anemia, Aplastic, Hemoglobinuria, Hemoglobinuria, Paroxysmal, Hodgkin Disease, Leukemia, Leukemia, Lymphoid, Leukemia, Myelogenous, Chronic, BCR-ABL Positive... Sepsis, Shock, Shock, Septic Anemia, Anemia, Aplastic, Leukemia, Liver Diseases 85
DB00630 Alendronate PTPRS, PTPRE 0.94 Adenocarcinoma, Bone Diseases, Metabolic, Carcinoma, Squamous Cell, Cystic Fibrosis, Cysts, Esophageal Neoplasms, Hyperparathyroidism... Bone Diseases, Metabolic, Breast Neoplasms, Hepatitis, Hepatitis B, Necrosis, Neoplasms, Osteoporosis Adenocarcinoma, Arthritis, Arthritis, Rheumatoid, Asthma, Bone Diseases, Metabolic, Chronic Periodontitis, Constriction, Pathologic... Arteritis, Arthritis, Arthritis, Rheumatoid, Asthma, Bone Diseases, Metabolic, Breast Neoplasms, Chronic Periodontitis... Arteriosclerosis, Arthritis, Arthritis, Rheumatoid, Bone Demineralization, Pathologic, Bone Diseases, Bone Diseases, Metabolic, Cystic Fibrosis... 94
DB00046 Insulin Lispro INSR, IGF1R 0.67 Alzheimer Disease, Cognitive Dysfunction, Diabetes Mellitus, Diabetes Mellitus, Type 1, Diabetes Mellitus, Type 2, Hypertrophy, Hypotension... Diabetes Mellitus, Diabetes Mellitus, Type 1, Diabetes Mellitus, Type 2 Diabetes Mellitus, Diabetes Mellitus, Type 1, Diabetes Mellitus, Type 2, Diabetes, Gestational Diabetes Mellitus, Diabetes Mellitus, Type 1, Diabetes Mellitus, Type 2, Hyperglycemia, Kidney Diseases, Renal Insufficiency, Chronic Coronary Artery Disease, Diabetes Mellitus, Diabetes Mellitus, Type 1, Diabetes Mellitus, Type 2, Diabetes, Gestational, Hyperglycemia, Myocardial Infarction... 116
DB00047 Insulin Glargine INSR, IGF1R 0.67 Acidosis, Diabetes Mellitus, Diabetes Mellitus, Type 1, Diabetes Mellitus, Type 2, Diabetic Ketoacidosis, Fatty Liver, Fatty Liver, Alcoholic... Diabetes Mellitus, Diabetes Mellitus, Type 1, Diabetes Mellitus, Type 2, Hyperglycemia Diabetes Mellitus, Diabetes Mellitus, Type 1, Diabetes Mellitus, Type 2 Diabetes Mellitus, Diabetes Mellitus, Type 1, Diabetes Mellitus, Type 2, Hyperglycemia, Kidney Diseases, Leukemia, Lymphoma... Acidosis, Coronary Artery Disease, Diabetes Mellitus, Diabetes Mellitus, Type 1, Diabetes Mellitus, Type 2, Diabetic Ketoacidosis, Fatty Liver... 116
DB00675 Tamoxifen ESR2, PRKCA 0.66 Adenocarcinoma, Breast Carcinoma In Situ, Breast Neoplasms, Carcinoma in Situ, Carcinoma, Ductal, Carcinoma, Ductal, Breast, Carcinoma, Intraductal, Noninfiltrating... Amyotrophic Lateral Sclerosis, Barrett Esophagus, Breast Neoplasms, Gastrointestinal Neoplasms, Hepatitis, Hepatitis C, Hepatitis C, Chronic... Adenocarcinoma, Adrenocortical Carcinoma, Affect, Amyotrophic Lateral Sclerosis, Bipolar Disorder, Breast Carcinoma In Situ, Breast Diseases... Adenocarcinoma, Bipolar Disorder, Breast Carcinoma In Situ, Breast Diseases, Breast Neoplasms, Breast Neoplasms, Male, Carcinoma in Situ... Adenoma, Breast Diseases, Breast Neoplasms, Cysts, Fibroadenoma, Fibrocystic Breast Disease, Infertility... 120

Next, new potential small molecular ligands were predicted for the revealed targets, and a general druggability check was run using a precomputed database of spectra of biological activities of chemical compounds from a library of 13040 of the most pharmaceutically active known compounds. The spectra of biological activities were computed using the program PASS [12-14] on the basis of a (Q)SAR approach. Table 11 shows the resulting list of druggable master regulators, which represent the predicted drug targets of the studied pathology. Table 12 lists chemical compounds and known drugs potentially acting on the corresponding master regulators.

Table 11. Extended list of drug targets revealed in this study (targets that are predicted by the PASS program to be potentially targeted by an extended list of known drugs and pharmaceutically active chemical compounds). The column Druggability score contains a numeric value that indicates how suitable the target is to be inhibited (or activated) by a drug. See the Methods section for details.

ID Name Gene symbol Gene description Druggability score Total rank
ENSG00000132274 TRIM22 TRIM22 tripartite motif containing 22 17.43 148
ENSG00000166851 PLK1 PLK1 polo like kinase 1 1.42 164
ENSG00000158050 DUSP2 DUSP2 dual specificity phosphatase 2 13.33 165
ENSG00000114166 KAT2B KAT2B lysine acetyltransferase 2B 31.96 176
ENSG00000132334 PTPRE PTPRE protein tyrosine phosphatase, receptor type E 0.29 183
ENSG00000186187 ZNRF1 ZNRF1 zinc and ring finger 1 17.43 203
ENSG00000184545 DUSP8 DUSP8 dual specificity phosphatase 8 13.33 237
ENSG00000149480 MTA2 MTA2 metastasis associated 1 family member 2 15.7 247
ENSG00000162521 RBBP4 RBBP4 RB binding protein 4, chromatin remodeling factor 17.43 247
ENSG00000102096 PIM2 PIM2 Pim-2 proto-oncogene, serine/threonine kinase 4.19 264

Table 12. The chemical compounds and known drugs identified by the PASS program as potentially active for the treatment of colorectal neoplasms and acting on master regulators revealed in our study. The column Toxicity score contains the maximal probability of being active over all toxicities corresponding to the given compound. The column Disease activity score contains the maximal probability of being active over all activities corresponding to the selected diseases for the given compound. The column Target activity score contains the value of a numeric function that depends on all activity mechanisms corresponding to the drug. The column Drug rank contains the total rank of the given drug among all drugs found. See the Methods section for details.

Name Target names Target activity score Toxicity score Disease activity score Drug rank
2'-Deoxycytidine HDAC2, HDAC4, PTPN2, CSF2RB, HDAC3, CREBBP, HDAC1... 0.65 0.98 0.81 292
Docetaxel CLK4, TGFB1, TGFBR2, CHEK1, MDM4 0.25 1 0.81 562
Swainsonine MAPK11, AKT1 0.23 0.84 0.85 600
Epothilone B GSK3A, GSK3B, ABL2 4.85E-2 0.96 0.84 978
Mitomycin IL1B, CHEK1 4.26E-2 0.99 0.87 995

Table 13. The chemical compounds and known drugs identified by the PASS program as potentially acting on master regulators revealed in our study. Based on the revealed mechanism of action, these compounds can be proposed for the treatment of colorectal neoplasms in the current pathological case. The column Toxicity score contains the maximal probability of being active over all toxicities corresponding to the given compound. The column Disease activity score contains the maximal probability of being active over all activities corresponding to the selected diseases for the given compound, or 0 if no diseases were selected (in this case the column is hidden). The column Target activity score contains the value of a numeric function that depends on all activity mechanisms corresponding to the drug. The column Drug rank contains the total rank of the given drug among all drugs found. See the Methods section for details.

Name Target names Target activity score Toxicity score Disease activity score Drug rank
Monoisopropylphosphorylserine MTOR, PRKD3, PRKCQ, PRKCE, PRKCD, PRKCI, PRKCA... 4.72 0.99 0.71 49
Deoxyuridine-5'-Diphosphate PRKD3, PRKCQ, PRKCE, PRKCD, PRKCI, PRKCA, PRKD1... 3.37 0.97 0.72 62
D-Mannose 1-Phosphate KIT, ERBB3, EPHB2, FGFR3, NTRK2, KDR, PDGFRB... 2.71 0.98 0.77 64
Pterin Cytosine Dinucleotide KIT, ERBB3, EPHB2, FGFR3, NTRK2, KDR, PDGFRB... 4.59 0.98 0.62 78
2,5-Anhydroglucitol-1,6-Biphosphate KIT, HDAC2, HDAC4, ERBB3, EPHB2, FGFR3, GRIN1... 4.15 0.98 0.63 82

As a result of the drug search we came up with two lists of chemical compounds potentially applicable to the targets of our interest. The first list is based on drugs that are known as ligands for the revealed targets in the context of the diseases in our focus as well as in other disease conditions. The second list of identified compounds is based on the prediction of their potential biological activities, which was done using the program PASS. Such computational predictions should be taken as mere suggestions and should be used with care in further experiments.

5. Conclusion

We applied the software package "Genome Enhancer" to a data set that contains genomics data obtained from colon tissue. The study was done in the context of colorectal neoplasms. The data were pre-processed and statistically analyzed, and genes carrying sequence variations were identified. The enrichment of GO or disease categories among the studied gene sets was also checked.

We propose the following schema of how the selected drugs may interfere with the identified target molecules and pathogenic processes discovered by the study reported here.

6. Methods

Databases used in the study

Transcription factor binding sites in promoters and enhancers of differentially expressed genes were analyzed using known DNA-binding motifs described in the TRANSFAC® library [6], release 2019.2 (geneXplain GmbH, Wolfenbüttel, Germany) (http://genexplain.com/transfac).

The master regulator search uses the TRANSPATH® database (BIOBASE) [9]. A comprehensive signal transduction network of human cells is built by the software on the basis of reactions annotated in TRANSPATH®.

Methods for the analysis of enriched transcription factor binding sites and composite modules

Transcription factor binding sites in promoters and enhancers of differentially expressed genes were analyzed using known DNA-binding motifs. The motifs are specified using position weight matrices (PWMs) that give weights to each nucleotide in each position of the DNA binding motif for a transcription factor or a group of them.
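To illustrate how a PWM assigns weights to nucleotides, the following minimal sketch converts a count matrix into log-odds weights and scans a sequence for the best-scoring window. The count matrix, the pseudocount, the uniform background and the function names are assumptions made for illustration only; the scoring used by the pipeline itself (e.g. in MATCH) differs in detail.

```python
import math

# Hypothetical count matrix for a 4-bp motif with consensus GATA
# (one dict per motif position; values are nucleotide counts).
COUNTS = [
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]


def to_log_odds(counts, pseudocount=1.0, background=0.25):
    """Turn counts into log2-odds weights against a uniform background (assumed)."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def best_hit(sequence, pwm):
    """Return (start, score) of the highest-scoring window; expects only A, C, G, T."""
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


pwm = to_log_odds(COUNTS)
start, score = best_hit("CCGATACC", pwm)
```

In practice a score cut-off is applied to each window, so that only sufficiently strong matches are counted as predicted binding sites.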

We search for transcription factor binding sites (TFBS) that are enriched in the promoters and enhancers under study as compared to a background sequence set such as promoters of genes that were not differentially regulated under the condition of the experiment. We denote study and background sets briefly as Yes and No sets. In the current work we used a workflow considering promoter sequences of a standard length of 1100 bp (-1000 to +100). The error rate in this part of the pipeline is controlled by estimating the adjusted p-value (using the Benjamini-Hochberg procedure) in comparison to the TFBS frequency found in randomly selected regions of the human genome (adj.p-value < 0.01).
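The Benjamini-Hochberg adjustment mentioned above can be sketched in a few lines. This is a generic implementation of the procedure, not the pipeline's own code; the raw p-values in the usage lines are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw enrichment p-values for four motifs.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
# Keep motifs that pass the threshold used in the text (adj. p-value < 0.01).
significant = [i for i, p in enumerate(adj) if p < 0.01]
```

With these example values none of the motifs passes the 0.01 threshold after adjustment, which shows how the correction tightens the selection compared with the raw p-values.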

We applied the CMA algorithm (Composite Module Analyst) to search for composite modules [8] in the promoters and enhancers of the Yes and No sets. We searched for a composite module consisting of a cluster of 10 TFs in a sliding window of 200-300 bp that separates the sequences of the Yes and No sets with statistical significance (minimizing the Wilcoxon p-value).

Methods for finding master regulators in networks

We searched for master regulator molecules in signal transduction pathways upstream of the identified transcription factors. The master regulator search uses a comprehensive signal transduction network of human cells. The main algorithm of the master regulator search has been described earlier [4,5]. The goal of the algorithm is to find nodes in the global signal transduction network that may potentially regulate the activity of the set of transcription factors found at the previous step of the analysis. Such nodes are considered the most promising drug targets, since any influence on such a node may switch the transcriptional programs of hundreds of genes that are regulated by the respective TFs. In our analysis, we ran the algorithm with a maximum radius of 12 steps upstream of each TF in the input set. The error rate of this algorithm is controlled by applying it 10000 times to randomly generated sets of input transcription factors of the same size. A Z-score and an FDR value of the ranks are then calculated for each potential master regulator node on the basis of these random runs (see the detailed description in [9]). We control the error rate with an FDR threshold of 0.05.
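The idea of an upstream search with a limited radius, followed by a comparison against random TF sets, can be sketched as follows. This is a deliberately simplified illustration: the toy network, the node names and the scoring (a plain count of reachable TFs) are assumptions, whereas the published algorithm [4,5] uses a more elaborate score.

```python
import random
import statistics
from collections import deque

# Hypothetical toy network; each edge points from a regulator to its target.
EDGES = [
    ("KinaseA", "KinaseB"), ("KinaseA", "KinaseC"),
    ("KinaseB", "TF1"), ("KinaseB", "TF2"), ("KinaseC", "TF3"),
]


def build_reverse(edges):
    """Map each node to the set of its direct upstream regulators."""
    reverse = {}
    for source, target in edges:
        reverse.setdefault(target, set()).add(source)
    return reverse


def upstream_nodes(reverse, start, max_radius):
    """Breadth-first search upstream of `start`, at most `max_radius` steps."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_radius:
            continue
        for parent in reverse.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, dist + 1))
    seen.discard(start)
    return seen


def reach_counts(reverse, tfs, max_radius):
    """Count, for every upstream node, how many of the given TFs it can reach."""
    counts = {}
    for tf in tfs:
        for node in upstream_nodes(reverse, tf, max_radius):
            counts[node] = counts.get(node, 0) + 1
    return counts


def z_scores(reverse, tfs, all_tfs, max_radius, n_random=1000, seed=0):
    """Z-score of each node's count against random TF sets of the same size."""
    rng = random.Random(seed)
    observed = reach_counts(reverse, tfs, max_radius)
    samples = {node: [] for node in observed}
    for _ in range(n_random):
        counts = reach_counts(reverse, rng.sample(all_tfs, len(tfs)), max_radius)
        for node in samples:
            samples[node].append(counts.get(node, 0))
    result = {}
    for node, values in samples.items():
        sd = statistics.pstdev(values)
        mean = statistics.fmean(values)
        result[node] = (observed[node] - mean) / sd if sd > 0 else 0.0
    return result


reverse = build_reverse(EDGES)
scores = z_scores(reverse, ["TF1", "TF2"], ["TF1", "TF2", "TF3"], max_radius=2)
```

In this toy example KinaseB regulates exactly the two input TFs and therefore stands out against random sets, while KinaseA reaches every TF in the network and gains no specificity.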

Methods for analysis of pharmaceutical compounds

We seek the optimal combination of molecular targets (key elements of the regulatory network of the cell) that potentially interact with pharmaceutical compounds from a library of known drugs and biologically active chemical compounds. Information about known drugs is taken from HumanPSD™, and potential drugs are predicted using the PASS program.

Method for analysis of known pharmaceutical compounds

We selected compounds from the HumanPSD™ database that have at least one target. Next, we sorted the compounds by "Drug rank", which is the sum of three other ranks:

  1. ranking by "Target activity score" (T-scorePSD),
  2. ranking by "Disease activity score" (D-scorePSD),
  3. ranking by clinical trials phase.
To calculate the clinical trials phase for a given compound, we select the maximum phase over all diseases that are known to have clinical trials with this compound. "Target activity score" (T-scorePSD) is calculated as follows:

where T is the set of all targets related to the compound intersected with the input list, |T| is the number of elements in T, AT and |AT| are the set of all targets related to the compound and the number of elements in it, w is a weight multiplier, rank(t) is the rank of the given target, and maxRank(T) equals max(rank(t)) over all targets t in T.
We use the following formula to calculate "Disease activity score" (D-scorePSD):

where D is the set of selected diseases, and if D is the empty set, D-scorePSD=0. P is the set of all known phases for each disease, and phase(p,d) equals the phase number if there are known clinical trials for the selected disease in this phase and zero otherwise.
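The aggregation of the three rankings into "Drug rank" can be sketched as follows. The compound records, the tie handling (tied values share the best rank) and the convention that a smaller sum is better are assumptions made for illustration; the score values are taken as already computed.

```python
def rank_descending(values):
    """1-based ranks, rank 1 for the largest value; ties share the best rank."""
    return [1 + sum(other > value for other in values) for value in values]


def max_trial_phase(trial_phases):
    """Maximum phase over all diseases with known trials; 0 if there are none."""
    return max((p for phases in trial_phases.values() for p in phases), default=0)


def drug_rank(compounds):
    """Sum the three ranks per compound and sort so the smallest sum comes first."""
    t_rank = rank_descending([c["t_score"] for c in compounds])
    d_rank = rank_descending([c["d_score"] for c in compounds])
    p_rank = rank_descending([max_trial_phase(c["trial_phases"]) for c in compounds])
    totals = [
        (c["name"], t + d + p)
        for c, t, d, p in zip(compounds, t_rank, d_rank, p_rank)
    ]
    return sorted(totals, key=lambda item: item[1])


# Hypothetical compounds with pre-computed scores and known trial phases.
COMPOUNDS = [
    {"name": "compound_A", "t_score": 4.0, "d_score": 0.5,
     "trial_phases": {"disease_1": [1, 2]}},
    {"name": "compound_B", "t_score": 2.0, "d_score": 0.9,
     "trial_phases": {"disease_1": [3]}},
    {"name": "compound_C", "t_score": 1.0, "d_score": 0.1,
     "trial_phases": {}},
]

ranking = drug_rank(COMPOUNDS)
```

Summing ranks rather than raw scores keeps the three criteria on a common scale, so that no single score dominates merely because of its numeric range.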

Method for prediction of pharmaceutical compounds

In this study, the focus was put on compounds with high pharmacological efficiency and low toxicity. For this purpose, a comprehensive library of chemical compounds and drugs was subjected to a SAR/QSAR analysis. This library contains 13040 compounds along with their pre-calculated potential pharmacological activities, their possible side and toxic effects, and their possible mechanisms of action. All biological activities are expressed as probability values for a substance to exert this activity (Pa).

We selected compounds that satisfied the following conditions:

  1. Toxicity below a chosen toxicity threshold (defined as Pa, the probability of being active as a toxic substance).
  2. For all predicted pharmacological effects that correspond to the set of user-selected disease(s), Pa is greater than a chosen effect threshold.
  3. There are at least 2 targets (corresponding to the predicted activity-mechanisms) with predicted Pa greater than a chosen target threshold.
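The three selection conditions can be expressed as a simple filter. The record layout, the threshold values and the treatment of compounds without any predicted disease-related effect (they are rejected here) are assumptions made for this sketch.

```python
def select_compounds(compounds, tox_threshold, effect_threshold,
                     target_threshold, min_targets=2):
    """Return the names of compounds that satisfy all three conditions."""
    selected = []
    for compound in compounds:
        # Condition 1: the highest predicted toxicity Pa must stay below the threshold.
        toxicity = max(compound["toxicity_pa"].values(), default=0.0)
        if toxicity >= tox_threshold:
            continue
        # Condition 2: every disease-related effect must exceed the effect threshold.
        effects = compound["effect_pa"]
        if not effects or any(pa <= effect_threshold for pa in effects.values()):
            continue
        # Condition 3: at least `min_targets` targets above the target threshold.
        strong = [t for t, pa in compound["target_pa"].items() if pa > target_threshold]
        if len(strong) < min_targets:
            continue
        selected.append(compound["name"])
    return selected


# Hypothetical PASS-like predictions for three compounds.
LIBRARY = [
    {"name": "compound_X", "toxicity_pa": {"hepatotoxic": 0.2},
     "effect_pa": {"antineoplastic": 0.8},
     "target_pa": {"PLK1": 0.7, "KAT2B": 0.6}},
    {"name": "compound_Y", "toxicity_pa": {"hepatotoxic": 0.9},
     "effect_pa": {"antineoplastic": 0.9},
     "target_pa": {"PLK1": 0.9, "KAT2B": 0.9}},
    {"name": "compound_Z", "toxicity_pa": {"hepatotoxic": 0.1},
     "effect_pa": {"antineoplastic": 0.7},
     "target_pa": {"PLK1": 0.8, "KAT2B": 0.3}},
]

hits = select_compounds(LIBRARY, tox_threshold=0.5,
                        effect_threshold=0.5, target_threshold=0.5)
```

Here compound_Y fails the toxicity condition and compound_Z has only one target above the target threshold, so only compound_X is retained.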

The maximum Pa value for all toxicities corresponding to the given compound is selected as the "Toxicity score". The maximum Pa value for all activities corresponding to the selected diseases for the given compound is used as the "Disease activity score". "Target activity score" (T-score) is calculated as follows:

where M(s) is the set of activity-mechanisms for the given structure (which passed the chosen threshold for activity-mechanisms Pa); G(m) is the set of targets (converted to genes) that corresponds to the given activity-mechanism (m) for the given compound; pa(m) is the probability of the activity-mechanism (m) being active; IAP(g) is the invariant accuracy of prediction for a gene from G(m); and optWeight(g) is the additional weight multiplier for the gene. T is the set of all targets related to the compound intersected with the input list, |T| is the number of elements in T, AT and |AT| are the set of all targets related to the compound and the number of elements in it, and w is a weight multiplier.
"Druggability score" (D-score) is calculated as follows:

where S(g) is the set of structures whose target list contains the given target, M(s,g) is the set of activity-mechanisms (for the given structure) that corresponds to the given gene, pa(m) is the probability of the activity-mechanism (m) being active, and IAP(g) is the invariant accuracy of prediction for the given gene.

7. References

  1. Kel A, Voss N, Jauregui R, Kel-Margoulis O, Wingender E. Beyond microarrays: Finding key transcription factors controlling signal transduction pathways. BMC Bioinformatics. 2006;7(S2):S13. doi:10.1186/1471-2105-7-s2-s13

  2. Michael H, Hogan J, Kel A et al. Building a knowledge base for systems pathology. Brief Bioinformatics. 2008;9(6):518-531. doi:10.1093/bib/bbn038

  3. Stegmaier P, Voss N, Meier T, Kel A, Wingender E, Borlak J. Advanced Computational Biology Methods Identify Molecular Switches for Malignancy in an EGF Mouse Model of Liver Cancer. PLoS ONE. 2011;6(3):e17738. doi:10.1371/journal.pone.0017738

  4. Koschmann J, Bhar A, Stegmaier P, Kel A, Wingender E. “Upstream Analysis”: An Integrated Promoter-Pathway Analysis Approach to Causal Interpretation of Microarray Data. Microarrays. 2015;4(2):270-286. doi:10.3390/microarrays4020270

  5. Kel A, Stegmaier P, Valeev T, Koschmann J, Poroikov V, Kel-Margoulis OV, Wingender E. Multi-omics “upstream analysis” of regulatory genomic regions helps identifying targets against methotrexate resistance of colon cancer. EuPA Open Proteom. 2016;13:1-13. doi:10.1016/j.euprot.2016.09.002

  6. Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, Voss N, Stegmaier P, Lewicki-Potapov B, Saxel H, Kel AE, Wingender E. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006;34(90001):D108-D110. doi:10.1093/nar/gkj143

  7. Kel AE, Gössling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E. MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res.  2003;31(13):3576-3579. doi:10.1093/nar/gkg585

  8. Waleev T, Shtokalo D, Konovalova T, Voss N, Cheremushkin E, Stegmaier P, Kel-Margoulis O, Wingender E, Kel A. Composite Module Analyst: identification of transcription factor binding site combinations using genetic algorithm. Nucleic Acids Res. 2006;34(Web Server issue):W541-5.

  9. Krull M, Pistor S, Voss N, Kel A, Reuter I, Kronenberg D, Michael H, Schwarzer K, Potapov A, Choi C, Kel-Margoulis O, Wingender E. TRANSPATH: an information resource for storing and visualizing signaling pathways and their pathological aberrations. Nucleic Acids Res. 2006;34(90001):D546-D551. doi:10.1093/nar/gkj107

  10. Boyarskikh U, Pintus S, Mandrik N, Stelmashenko D, Kiselev I, Evshin I, Sharipov R, Stegmaier P, Kolpakov F, Filipenko M, Kel A. Computational master-regulator search reveals mTOR and PI3K pathways responsible for low sensitivity of NCI-H292 and A427 lung cancer cell lines to cytotoxic action of p53 activator Nutlin-3. BMC Med Genomics. 2018;11(1):12. doi:10.1186/s12920-018-0330-5

  11. Michael H, Hogan J, Kel A, Kel-Margoulis O, Schacherer F, Voss N. Building a knowledge base for systems pathology. Brief Bioinformatics. 2008;9(6):518-531. doi:10.1093/bib/bbn038

  12. Filimonov D, Poroikov V. Probabilistic Approaches in Activity Prediction. In: Varnek A, Tropsha A, eds. Chemoinformatics Approaches to Virtual Screening. Cambridge (UK): RSC Publishing. 2008:182-216.

  13. Filimonov DA, Poroikov VV. Prognosis of specters of biological activity of organic molecules. Russian chemical journal. 2006;50(2):66-75 (in Russian).

  14. Filimonov D, Poroikov V, Borodina Y, Gloriozova T. Chemical Similarity Assessment Through Multilevel Neighborhoods of Atoms: Definition and Comparison with the Other Descriptors. ChemInform. 1999;39(4):666-670. doi:10.1002/chin.199940210

Supplementary material

  1. Supplementary table 1 - Detailed report. Composite modules and master-regulators (the most frequently mutated genes in Experiment: short-term survival).

  2. Supplementary table 2 - Detailed report. Pharmaceutical compounds and drug targets.

Disclaimer

Decisions regarding care and treatment of patients should be fully made by attending doctors. The predicted chemical compounds listed in the report are given only for doctor’s consideration and they cannot be treated as prescribed medication. It is the physician’s responsibility to independently decide whether any, none or all of the predicted compounds can be used solely or in combination for patient treatment purposes, taking into account all applicable information regarding FDA prescribing recommendations for any therapeutic and the patient’s condition, including, but not limited to, the patient’s and family’s medical history, physical examinations, information from various diagnostic tests, and patient preferences in accordance with the current standard of care. Whether or not a particular patient will benefit from a selected therapy is based on many factors and can vary significantly.

The compounds predicted to be active against the identified drug targets in the report are not guaranteed to be active against any particular patient’s condition. GeneXplain GmbH does not give any assurances or guarantees regarding the treatment information and conclusions given in the report. There is no guarantee that any third party will provide a refund for any of the treatment decisions made based on these results. None of the listed compounds was checked by Genome Enhancer for adverse side-effects or even toxic effects.

The analysis report contains information about chemical drug compounds, clinical trials and disease biomarkers retrieved from the HumanPSD™ database of gene-disease assignments maintained and exclusively distributed worldwide by geneXplain GmbH. The information contained in this database is collected from scientific literature and public clinical trials resources. It is updated to the best of geneXplain’s knowledge; however, we do not guarantee the completeness and reliability of this information, leaving the final checkup and consideration of the predicted therapies to the medical doctor.

The scientific analysis underlying the Genome Enhancer report employs a complex analysis pipeline which uses geneXplain’s proprietary Upstream Analysis approach, integrated with TRANSFAC® and TRANSPATH® databases maintained and exclusively distributed worldwide by geneXplain GmbH. The pipeline and the databases are updated to the best of geneXplain’s knowledge and belief, however, geneXplain GmbH shall not give a warranty as to the characteristics or to the content and any of the results produced by Genome Enhancer. Moreover, any warranty concerning the completeness, up-to-dateness, correctness and usability of Genome Enhancer information and results produced by it, shall be excluded.

The results produced by Genome Enhancer, including the analysis report, strongly depend on the quality of the input data used for the analysis. It is the responsibility of Genome Enhancer users to check the input data quality and the parameters used for running the Genome Enhancer pipeline.

Note that the text given in the report is not unique and can be fully or partially repeated in other Genome Enhancer analysis reports, including reports of other users. This should be considered when publishing any results or excerpts from the report. This restriction refers only to the general description of analysis methods used for generating the report. All data and graphics referring to the concrete set of input data, including lists of mutated genes, differentially expressed genes/proteins/metabolites, functional classifications, identified transcription factors and master regulators, constructed molecular networks, lists of chemical compounds and reconstructed model of molecular mechanisms of the studied pathology are unique in respect to the used input data set and Genome Enhancer pipeline parameters used for the current run.