Brief summary of the analysis, performed by Genome Enhancer

The very first step (Step 1) of Genome Enhancer analysis aims to create a set of genes, which would describe the studied pathological process. In order to build this set, different actions are performed with different omics data types, depending on the initial input provided for the analysis.

After the gene set is identified, the Genome Enhancer pipeline proceeds to Step 2 to identify the regulatory regions (promoters and enhancers) of the selected genes.

On Step 3 the Genome Enhancer pipeline scans the regulatory regions, selected on Step 2, for transcription factors binding sites enrichment and identifies the modules of co-acting transcription factors using the TRANSFAC® database and the Match and Composite Module Analyst algorithms.

On Step 4 the Genome Enhancer pipeline performs network analysis and identifies the key nodes, responsible for the regulation of the studied pathological process. This is done by applying the Upstream Analysis approach towards the identified complexes of transcription factors with the use of TRANSPATH® database.

Based on the identified key nodes, the drug candidates search is performed on Step 5 of the Genome Enhancer pipeline with the use of HumanPSD™ database and PASS software.

Analysis overview schema

Analysis schema
Detailed description of the performed analysis

Once the Genome Enhancer analysis was launched, the pipeline will create a workflow in respect to the analysis input. The workflow architecture depends on the number of selected conditions, which should be analyzed (the colored boxes from the data annotation diagram). For performing the analysis, Genome Enhancer will always take the latest versions of the databases, installed on the server.

Step 1. Constructing the gene set, which describes the studied pathological process

On this step the Genome Enhancer pipeline constructs the gene set, which describes the studied pathological process, from the input set of omics data.

Transcriptomics data

The transcriptomics data could have been loaded to Genome Enhancer in one of the following formats:

  • Table data
  • FASTQ files
  • Affimetrix files
  • Agilent files
  • Illumina files

Table data

Genome Enhancer will separate the uploaded and submitted to the analysis transcriptomics table data into two types: (1) those files/columns, which came with numeric values (counts, logFC) and (2) those, which didn't have numeric values and thus should be considered as a plain list of genes.

If transcriptomics data with numerical values was uploaded and submitted to the analysis, Genome Enhancer will check for which conditions (categories for comparison) this data is available.

If transcriptomics data with numerical values was submitted for only 1 condition, two options are possible:

  1. There were 1500 or less genes present in the input data
    Then numerical values will be disregarded by Genome Enhancer and the provided data will be accepted as a plain gene list, which will be further taken to the next steps of analysis as the gene list, which describes the studied pathological process.
  2. There were more than 1500 genes present in the input data
    Then all columns in the table will be averaged and the following genes will be selected for further analysis:
    top - 300 highly expressed genes
    bottom - 300 low expressed genes
    middle - 500 genes with medium expression, which will be considered as a background set (non-changed genes) in further analysis
    In several further steps of analysis the following comparisons of expression of the genes (or other numerical values associated with the genes) will be made: top vs. middle and bottom vs. middle.

If transcriptomics data with numerical values was submitted for 2 conditions, then the values from the 2 conditions will be compared to each other and the upregulated, downregulated and non-changed gene lists will be constructed.

If EdgeR method will be applied in the further analysis (see bellow), no additional normalization of transcriptomics data will be performed, otherwise the table will be normalized by performing Quantile normalization.

If data appeared to come in a non-logarithmic scale and no counts were present, the data will be log2 transformed using Transform Table method.

Then all data from all conditions will be integrated into one table. All empty cells of the resulting table will be filled in with average values throughout the lines using Table imputation.

The upregulated, downregulated and non-changed gene lists will be constructed according to the following rules:

If numerical transcriptomics data were present in two conditions:

  1. If each category contains at least 2 numerical columns, and not all of them are counts (integer numbers) then the Limma algorithm will be performed to compare the numerical values (e.g. gene expression values) between two conditions.
  2. If each of the compared conditions (categories for comparison) has 2 or more columns with counts data, the EdgeR algorithm will be performed to compare the numerical values (e.g. gene expression values) between two conditions.
  3. Otherwise, the Fold Change algorithm will be applied to the input numerical transcriptomics data.

If Limma or EdgeR were applied to the input data, then significant upregulated and significant downregulated genes will be selected for further analysis, as well as non-significant genes (background set). If the number of significantly changed genes will be over 300, then the top 300 significant genes will be taken for further analysis. If the number of significantly changed genes will be between 10 and 300, then all of these up or down regulated genes will be taken for further analysis. If the number of significantly changed genes will be less than 10, and the number of up or down regulated genes will be more than 300, then the top 300 up or down regulated genes will be taken for further analysis. If the number of significantly changed genes will be less than 10, and the number of up or down regulated genes will be less than 300, then all of these up or down regulated genes will be taken for further analysis. If the gene list will contain more than 2500 genes, the middle 500 genes will be considered as a background set. Otherwise, Genome Enhancer will take middle 20% of genes as a background set.

If Fold Change was applied to the input data, then upregulated and downregulated genes will be selected for further analysis, as well as non-changed genes (background set) in the same quantities as for Limma or EdgeR, described above.

In any case, after performing EdgeR, Limma or Fold Change, Genome Enhancer requires at least 10 genes with logarithm of the fold change values below zero and at least 10 genes with the logarithm of the fold change above zero to be present in the resulting data, otherwise further analysis of this data will not be performed.

Having constructed the lists of (significantly) up and down regulated genes, Genome Enhancer will proceed to the next step of the pipeline, where (significant) upregulated genes will be compared to the non-changed genes and (significant) downregulated genes will be compared to the non-changed genes. These two comparisons will form two independent branches of Genome Enhancer workflow, each of which will proceed to the next steps of the pipeline (find regulatory regions analysis (Step 2) and Match and CMA analysis (Step 3)).

If transcriptomics data with numerical values was submitted for 3 conditions (categories for comparison) selected during the analysis launch, two variants are possible:

  • (1) baseline was selected
  • (2) clustering was selected

In the (1) case Genome Enhancer will take all non-baseline conditions and will compare them to the baseline in the same way the analysis for 2 conditions was performed above. The results of all such comparisons will be tables with genes and, if applicable, corresponding numerical expression values. Each comparison will form an independent branch of Genome Enhancer workflow, which will proceed to the next steps of the pipeline (find regulatory regions analysis (Step 2) and Match and CMA analysis (Step 3)).

In the (2) case the cluster analysis will be launched.

First, the Variance filter analysis will be applied, which will go through all columns of the united transcriptomics table and identify genes with high and low variability in expression. The unchanged cluster will be constructed from the genes with low variance in expression and high variance cluster will be built from the genes with high variance in expression. Top 3000 genes will be taken from the cluster with high variance and 500 genes will be taken from the unchanged cluster for further analysis.

Then CRC analysis will be performed on the selected set of 3000 genes with high variability. The CRC analysis will generate as many clusters as it will identify.

On the results of CRC analysis (multiple clusters received) the three best clusters will be selected by launching of the CR cluster selector method, which will select the top 3 clusters of genes, not more than 300 genes and not less than 10 genes in each. Each of the three clusters will be then compared to the unchanged genes (500 genes with minimal variance), which will be treated as a control set. Each of these three comparisons will form an independent branch of Genome Enhancer workflow which will proceed to the next steps of the pipeline (find regulatory regions analysis (Step 2) and Match and CMA analysis (Step 3)).

If non-numeric transcriptomics data are loaded and submitted to the analysis, Genome Enhancer will compare the corresponding lists of genes of each of the conditions (categories for comparison) with the gene list in the baseline category and will select those genes, which do not belong to the genes of the baseline category for further steps of the pipeline (find regulatory regions analysis (Step 2) and Match and CMA analysis (Step 3)).

FASTQ data

If RNA-seq files in fastq format were uploaded and submitted to the analysis, Genome Enhancer will process them using the HISAT algorithm for alignment. After HISAT, HtSeq analysis will be applied to these files and a summary table with identified read counts will be generated if fastq files were present in two conditions (categories for comparison), with not less than 2 fastq files in each of the conditions. Afterwards this table will be processed with EdgeR method. Otherwise, if only one fastq file is used in the compared categories the Cufflinks analysis will be performed.

The resulting table with logarithms of fold changes and the computed p-values will be then treated as transcriptomics table data (described above).

Affymetrix, Illumina and Agilent data

If Affymetrix data was loaded (.CEL files), it will be normalized and converted into Ensembl genes. The initial file name and values will be kept in a newly created table together with the corresponding Ensembl and Affimetrix IDs. This action will be performed with all Affymetrix files that were uploaded and submitted for the analysis inside each of the conditions (categories for comparison). The resulting table will be treated as initial transcriptomics table data which is then analysed with Limma to detect the differentially expressed genes (described above).

Similar actions will be performed with Agilent and Illumina input data (Agilent normalization and Illumina normalization will be done).

Epigenomics data

If epigenomics data was loaded, Genome Enhancer will calculate the ChIP-seq peaks and will unite them in 1 track for each of the conditions (categories for comparison) which were specified during the analysis launch.

The epigenomics data could be loaded to Genome Enhancer in one of the following formats:

  • Fastq files
  • Bam files
  • Tracks

If fastq file was uploaded and submitted for the analysis, Bowtie algorithm will be applied to it and respective bam file will be created. Next such files will be treated as if they were in bam format.

If bam file was loaded, the genome build of the file will be checked. Only hg38 bam files will be accepted by Genome Enhancer. For each of the valid bam files MACS 1.4 analysis will be performed, which will create the ChIP-seq peaks. All peaks inside one condition (category for comparison) will then be united. This will be done for all conditions (categories for comparison).

If track file was loaded, Genome Enhancer will check its genome build and will convert all non-hg38 tracks to hg38 genome build using the Liftover method. All tracks inside one condition (category for comparison) will be united. This will be done for all conditions (categories for comparison).

The performed actions will generate 1 resulting track, which will contain all respective ChIP-seq peaks, for each of the conditions (categories for comparison).

If no transcriptomics data was uploaded and submitted to the analysis and epigenomics data were present and submitted to the analysis, Genome Enhancer will take the unified track with peaks for each of the conditions and convert track to genes with Track to Gene set analysis , searching for peaks 1000bp downstream and 1000bp upstream of the gene. The resulting gene list is sorted by number of peaks intersected with the gene, then the top 300 genes will be considered as YES set and 500 bottom genes will be considered as NO set for each of the conditions. The pipeline will then proceed to the next steps of analysis (steps 2 and 3) with independent comparisons of YES set vs. NO set (top 300 genes vs. bottom 500 genes) for each of the conditions.

Proteomics data

If the gene set, which would describe the studied pathological process, was not constructed yet (from epigenomics or transcriptomics data because of their absence), Genome Enhancer will proceed to available proteomics data in order to generate such gene set on its basis.

Genome Enhancer will search throughout all conditions for proteomics data with numerical values (quantitative proteomics) and will perform the same actions on this data, as were described above for numerical transcriptomics data. All proteins will be converted to corresponding genes using the Convert table method.

The respective actions can be summarized as follows:

  • If numerical proteomics data will be present only in one condition, Genome Enhancer will compare it to the housekeeping genes;
  • If numerical proteomics data will be present in two conditions, they will be compared to each other;
  • If numerical proteomics data will be present in three or more conditions, clustering or pairwise comparison will be performed depending on the clustering/baseline option selected during the analysis launch.

If the list of proteins (non-numeric proteomics) is present in the conditions, it will be converted to genes using the Convert table method, then Genome Enhancer will compare the corresponding lists of genes of each of the conditions (categories for comparison) with the gene list in the baseline category and will select those genes, which do not belong to the genes of the baseline category for further steps of the pipeline.

If the gene set, which would describe the studied pathological process, was already constructed from epigenomics or transcriptomics data, than proteomics data will not be used for generating the initial list of genes.

On the next step of the pipeline the proteomics data will be used as so called “context set ” in the search for regulators in the signal transduction network. In the case if proteomics data are present only in one condition, the whole list of proteins is used as “context set”. If the proteomics with numeric values is present in two or more conditions, Fold change calculation is applied and proteins with the LogFC above zero are taken as the “context set”.

In the case of proteomics data without numerical values (simple list of protein IDs), if there is only one condition the whole protein list is used as the context set, but if there are two or more conditions then the proteins from the baseline condition will be disregarded and the proteins from the other conditions will be taken as the “context set”.

Genomics data

If the gene set, which would describe the studied pathological process, was not constructed yet (from epigenomics, transcriptomics or proteomics data because of their absence), Genome Enhancer will proceed to available genomics data in order to generate such gene set on its basis.

If a vcf files was loaded and submitted to the analysis, Genome Enhancer will check whether it comes in the hg38 genome build and, if not, then the input track will be converted to hg38 with the use of Liftover algorithm.

If an SNP list was loaded and submitted to the analysis, the SNP matching method will be applied and the respective VCF track will be retrieved as a result.

If a fastq file was loaded and submitted to the analysis, the respective vcf file will be generated using the workflow Find genome variants from full genome NGS.

All retrieved vcf tracks within one condition will be then joined.

If there were 2 conditions specified during the analysis launch, vcf track of Baseline will be subtracted from the vcf track of the other condition (studied pathology) using the Filter one track by another function. The Track to gene set method will be then applied to the resulting track and the set of genes from the mutation track will be retrieved.

Next the Mutations to genes with weights method will be applied.

The mutations present in the resulting track will be weighted depending on their localization:

  • mutations, which got into the exon regions of genes, will receive a coefficient 0.7
  • mutations, which got into the promoter regions of genes, will receive a coefficient 1.3
  • all other mutations will receive a coefficient 1

The genes, on which the mutations are localized, will also be weighted using the Calculate weighted Mutation Score analysis. This method will give genes the weights in respect to the number of mutations, which appeared to be localized on corresponding genes. In addition to that, genes belonging to TRANSPATH pathways will receive additional weight (a multiplier coefficient of 1.5 will be applied to all genes that happen to belong to TRANSPATH pathways). Also this analysis will take into account the known gene-disease associations from HumanPSD database for the diseases, which were specified during the analysis launch. Genes, which are happen to be associated with the diseases, that were selected during the analysis launch, will receive additional weight (a multiplier coefficient of 2 will applied to genes that will be associated with the studied pathologies).

The final list of mutations with their weights are used then for the analyses of TF binding sites at the later steps of the pipeline. In the case if no other type of data but only genomic data are present the list of mutations is used then to generate the initial list of genes as follows.

The resulting table of genes and their weights will then be sorted by weights values and top 300 genes will be selected from for further analysis as top mutated genes. These genes will be saved to the table called 'All mutated genes with description ranked' which can be found in the results folder of the analysis from the Genome Enhancer Expert view. The selected 300 genes will then be filtered by genes in the baseline category (if two or more categories are used during the analysis launch) or taken as it is if only one category is used. This resulting list of genes will be compared to the list of housekeeping genes at the step of promoter analysis.

Metabolomics data

If there was no transcriptomics data submitted to the analysis, the metabolomics data will be taken instead of it. If transcriptomic data was submitted to the analysis, the metabolomics data will be added to it.

If there were 2 conditions selected during the analysis launch, then only the list of metabolites from the non-baseline condition will be considered for further analysis.

If there were 1 or 3 and more conditions (and clustering option is used) selected during the analysis launch, then all lists of metabolites will be considered.

All tables of metabolites will be converted to Recon Substances using the Convert table function. Next, a list of genes encoding enzymes that are acting on these metabolites is obtained using the Match genes to metabolites function.

All genes will then be joined within one condition (category for comparison) and compared to the genes in baseline condition.

If there was only 1 numerical column present in the metabolomics data, Genome Enhancer will treat such data as a plain list of metabolites

If there were 2 numerical columns present in the metabolomics data, Genome Enhancer will apply Limma*, EdgeR* or Fold Change, in the same way as it was described for the transcriptomics data above.

*Limma and EdgeR will be calculated on Recon substances and only then the resulting up/down regulated metabolites will be converted to genes.

Then, similar to the way the transcriptomics data was treated, Genome Enhancer will find the significant upregulated, significant downregulated and unchanged genes if Limma or EdgeR were applied and upregulated and downregulated genes if Fold Change was applied.

The comparison will be further done between the (significantly) upregulated genes vs. housekeeping genes and (significantly) downregulated genes vs. housekeeping genes at the next steps of the pipeline.

The retrieved lists of genes from metabolomics analysis will be added to the previously received transcriptomics data, or, if no transcriptomics data were present, they will be taken as they are for further analysis.

Step 2. Identification of regulatory regions of the selected genes

On this step of Genome Enhancer pipeline the regulatory regions of the genes, which were selected on Step 1 for further analysis, will be identified. This will be done with Find regulatory regions method, which creates a track of regulatory regions for a set of input genes.

If epigenomics data was processed on the Step 1 of Genome Enhancer pipeline, then a united track with ChIP-seq peaks for each of the categories is constructed. This track will then be used in Find regulatory regions method for selecting the regulatory regions, which will tend to cover peaks intervals.

The Find regulatory regions method will select the peaks inside the input genes extended with flanks (shift on left gene bound position: -5000; shift on right gene bound position: 5000). Then intervals around the peak centers will be created, overlapped, and cut to input size (promoter from: -1000 and promoter to: 100, 1100bp in total). Finally, the closest to TSS interval will be selected as the result.

The TSS will be taken from Fantom/TSS database (CAGE TSS database: Fantom5-Tissue-hg38/TSS).

If no peaks are found around the gene, Fantom promoter will be used if tissue type was specified during the analysis launch and Ensembl promoter will be used if no tissue type was selected.

If genomics data was processed on the Step 1 of Genome Enhancer pipeline, then a track of identified mutations was constructed. In this case the Find regulatory regions with mutations method will be applied, which will create a track of regulatory regions for input genes using information about genomic variations from mutation track. The method will scan genes extended with flanking regions by window of input size: promoter from: -1000, promoter to: 100. It will find a position with highest sum of mutation weights that got into the scanned "window". If no mutations will be found around the gene, Fantom promoter will be used if tissue type was specified during the analysis launch and Ensembl promoter will be used if no tissue type was selected.

If no epigenomics or genomics data was processed in the analysis, then the Find regulatory regions method will be simply performed on the list of genes, which describe the studied pathological process (list of genes, that was constructed on Step 1 of Genome Enhancer pipeline).

The retrieved regulatory regions will be used in the next step of Genome Enhancer pipeline for identification of transcription factors binding sites and their complexes.

Step 3. Identification of transcription factor binding sites and their complexes

On this step Genome Enhancer pipeline the regulatory regions, which were identified on Step 2, will be scanned for transcription factors binding sites enrichment and the modules of co-acting transcription factors that regulate the genes of the studied pathological process will be identified using the TRANSFAC® database and the Match and Composite Module Analyst (CMA) algorithms. In the Match and CMA algorithms the frequencies of TFBS and TFBS composite modules are compared in the regulatory regions of foreground set of genes (Yes-set) to the regulatory regions of the background set of genes (No-set). In each comparison of a category to the baseline two Yes-sets are formed - one is the list of up-regulated genes and second is the list of down-regulated genes. The No-set is prepared from the non-changed genes. In the case of one category only, the Yes-sets are formed from the top and bottom genes, and No-set is formed from the middle genes (in the case of numerical values associated with the genes). In the cases of analysis of gene lists without numerical values (see above) the No-set is formed from the housekeeping genes.

In details, the Site search on track method is applied to the regulatory regions, selected on step 2 of the Genome Enhancer pipeline, to predict the transcription factor binding sites. The retrieved result is further optimized with the use of Site search result optimization method, which tunes the matrix weights to minimize p-values.

Independently, in the case of genomic data are used, the revealed mutations that are mapped to the regulatory regions of the differentially expressed genes are analyzed by the method Compare TFBS mutations, which identified TFBS that are lost or gained due to the mutations. The list of PWMs of the top significant lost or gained TFBS is used then in CMA analysis for specifying the search of the composite modules.

Next, the CMA (Composite Module Analyst) analysis is be performed by running the Construct composite modules on tracks method. Composite modules are combinations of binding sites common for promoters of functionally related genes and responsible for the major component of the gene expression pattern of these genes. The retrieved modules are used on next step of the Genome Enhancer pipeline for identification of the key nodes, responsible for the regulation of the studied pathological process.

Further information on methods for identification of transcription factor binding sites and their complexes can be found in the ‘Methods for the analysis of enriched transcription factor binding sites and composite modules’ subsection of the Methods section in Genome Enhancer analysis report.

Step 4. Identification of key nodes, responsible for the regulation of the studied pathological process

On this step of Genome Enhancer pipeline the network analysis is performed and the key nodes, which are responsible for regulation of the studied pathological process, are identified.

For this the Molecular networks regulator search method will be applied to the set of transcription factors, selected on the Step 5 of Genome Enhancer pipeline. The regulator search will be performed with the use of TRANSPATH® database for molecules upstream of the input list of transcription factors. As the result, this method will generate a set of proteins or their encoding genes, which were predicted to play a key role in regulating a maximal number of transcription factors from the input list.

If proteomics data was processed on Step 1 of the Genome Enhancer pipeline, the list of context proteins, that refer to the studied pathological process, will be used for weighting the respective molecules on the reconstructed network of intracellular reactions. Proteins, belonging to the list of proteins, which characterize the studied pathological process, will have higher priority for appearing on the reconstructed signaling network.

Opposite to that, a list of heavily mutated signaling proteins, that were extracted from genomics data in case of its presence on Step 1 of the Genome Enhancer pipeline, will be excluded from the constructed signaling network due to the loss of their function in the studied pathological process.

Further information on methods for identification of key nodes, responsible for regulation of the studied pathological process, can be found in the ‘Methods for finding master regulators in networks’ subsection of the Methods section in Genome Enhancer analysis report.

Step 5. Identification of prospective drug candidates

Based on the key nodes, identified on Step 4 of Genome Enhancer pipeline, the drug candidates search will be performed on Step 5 of the pipeline with the use of HumanPSD™ database and PASS software.

For identification of already approved drugs and drugs undergoing clinical trials for the studied pathology and for other diseases the PSD pharmaceutical compounds analysis will be launched by Genome Enhancer pipeline. This method will seek for the optimal combination of molecular targets among the set of the input genes (key nodes, identified on Step 4 of the Genome Enhancer pipeline), that can potentially interact with pharmaceutical compounds from a library of known drugs, using information from HumanPSD™ database. As a result, this method will provide the identified lists of prospective drug targets and respective treatments that were proven in clinical trials to affect the identified key nodes and thus potentially block the studied pathological process.

Then the chemoinformatics analysis will be applied with the use of PASS software and chemoinformatically predicted prospective drug targets and treatments will be identified by Genome Enhancer pipeline. This will be done by performing the Pharmaceutical Compounds analysis which will seek for the optimal combination of molecular targets among input genes (key nodes, identified on Step 4 of the Genome Enhancer pipeline), that potentially interact with pharmaceutical compounds from a library of known drugs and biologically active chemical compounds predicted with the cheminformatics tool PASS. As a result, this method will provide the identified lists of prospective drug targets and respective treatments that were chemoinformatically predicted to affect the identified key nodes and thus potentially block the studied pathological process.

Further information on methods for identification of prospective drug candidates can be found in the ‘Methods for analysis of pharmaceutical compounds’ subsection of the Methods section in Genome Enhancer analysis report.