Installation
MODifierDev requires Python 3 with additional Python libraries scipy, sqlite3, numpy and networkx. They are a standard part of Anaconda so installing Anaconda is recommended. It can be downloaded here:
In addition, some R packages are required. They can be acquired with the following code:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("AnnotationDbi",
"MODA",
"STRINGdb",
"limma",
"org.Hs.eg.db",
"foreach",
"doParallel",
"Rcpp",
"dynamicTreeCut",
"flashClust",
"reticulate",
"plyr",
"parallel",
"igraph",
"WGCNA",
"RSQLite",
"stackoverflow",
"preprocessCore",
"DESeq2",
"edgeR",
"openxlsx"),
version = "3.8")
After that, the package can be installed from github:
devtools::install_git(url = "https://github.com/ddeweerd/MODifieRDev.git")
Introduction
Cells are organized in a modular fashion, where essential functions are carried out by functional modules. Modules can be described as clusters of genes, gene products or metabolites that interact together, are co-regulated or physically interacting. Complex diseases rarely arise from a single causal factor but rather from multiple factors, with large individual variation. This leads to dysregulation of parts of functional modules and thereby gives rise to a disease phenotype. The underlying perturbation in parts of the functional modules and the connectivity between them makes up a disease module.
To better understand complex diseases it is crucial to identify disease modules. Genes present in the module might not have a significant impact on the disease on its own. However, the cumulative effect of multiple low penetrance genes could play a major role in the pathogenesis of complex diseases. Network-based approaches could be key in detecting these low penetrance genes, which could potentially be novel bio markers or therapeutic targets.
Various disease module inference methods have been proposed earlier, using different approaches. Here, MODifieR is presented, an R package that bundles 8 different disease module inference methods into a single package that takes either micro array or RNA-seq data as input. The 8 methods methods can be classified into 3 different algorithmic classes: seed-based, clique based and co-expression based methods. MODifieR is available under the GNU GPL open source license.
Module inference methods included are:
Clique Sum
Correlation Clique
DIAMoND
DiffCoEx
MCODE
MODA
Module Discoverer
WGCNA trait-based
How to run
This is an explanation on how to run the package. It starts with a short description on the core objects followed by an explanation on how to use the functions. The package also includes example data enabling the user to follow this tutorial.
Core objects
The package is build on two core objects, an input and a module object.
Input object
The first object is an input object for the disease module inference methods, with class MODifieR_input
. This object can be used for all disease module inference methods.
The MODifieR_input
class has two subclasses, either MicroArray
or RNA-seq
depending on the input data type.
The input objects contain all necessary input data for the methods. Below is a table showing what methods use an expression matrix, Differentially Expressed Genes (DEGs) or both.
Method | Annotated expression matrix | DEGs |
---|---|---|
Clique Sum | X | |
Correlation Clique | X | X |
DIAMoND | X | |
DiffCoEx | X | |
MCODE | X | |
MODA | X | |
Module Discoverer | X | |
WGCNA trait-based | X |
Table 1: Input types taken by inference methods
Microarray input object
The MODifieR_input
with subclass MicroArray
object is a named list and can contain the following components:
diff_genes
A 2 two column data.frame where the first column are genes and the second column unadjusted p-values obtained by differential expression analysis.
limma_probe_table
A data.frame from limma topTable
with added gene annotation.
annotated_exprs_matrix
A matrix where the rows are genes and the columns samples. Probes have been collapsed into genes.
expression_matrix
The original input expression matrix.
annotation_table
A data.frame containing the annotation for the microarray probes. Also part of the input
group_indici
A named list containing 2 numeric vectors. The names are the group labels, for example “patient” and “control”. The values are the indici for each group, so the column indici for the expression matrix for each of the aforementioned groups.
settings
The parameters used to generate the object.
RNA-seq input object
The MODifieR_input
with subclass RNA-seq
object is a named list and can contain the following components:
diff_genes
A 2 two column data.frame where the first column are genes and the second column unadjusted p-values obtained by differential expression analysis.
edgeR_deg_table
A data.frame from edgeR glmQLFit.
annotated_exprs_matrix
A matrix where the rows are genes and the columns samples. Probes have been collapsed into genes.
count_matrix
The orginal input count matrix.
group_indici
A named list containing 2 numeric vectors. The names are the group labels, for example “patient”. and “control”. The values are the indici for each group, so the column indici for the expression matrix for each of the aforementioned groups.
settings
The parameters used to generate the object.
Module object
The second object, with class MODifieR_module
is an object that is produced by the disease module inference methods. They also have a subclassses according to the method and input type (Microarray or RNA-seq) that produced them.
Like the input object, the module objects are also named lists. Module objects differ between each function that generates them (hence the subclass) but have certain features in common:
The module object will always contain an element (character vector) called module_genes which are the genes that make up the final module
The module object will always contain a named list called settings which holds the parameters that have been used when generating the object
Creating an input object
In order to run an analysis, an input object has to be created. This can be done in two ways: The first is the usual way and involves high throughput data and in case of microarray also probe annotation. The second way is to create a custom object using previously obtained DEGs or expression data.
Using microarray data
To create an input object with microarray data, the function create_input_microarray
can be used. The function takes several inputs:
expression_matrix
A matrix with normalized expression values. The rows are probes and the columns are samples.annotation_table
An two columns dataframe providing annotation for the probes. The first column should contain the probe ID, and the second column its corresponding annotation.group1_indici
A vector that should have the indici in the expression matrix for the samples belonging to group 1.group2_indici
Same, but for group 2group1_label
Label to identify group 1, for example “Control” or “Patient”group2_label
Same, but for group 2expression
A Boolean value indicating if expression matrix data should be calculated.differential expression
A Boolean value indicating if DEGs data should be calculated.use_adjusted
use adjusted p value for differential expression analysis?method
When collapsing the probes, the collapseRows() method from WGCNA is used. This argument determines the collapsing method to use. Please see the WGCNA documentation for more information. Options are:- MaxMean Highest mean value
- MinMean Lowest mean value
- MaxRowVariance Highest variance
- absMaxMean Highest mean absolute value
- absMinMean Lowest mean absolute value
- ME Eigenrow (first principal component)
Using RNA-seq data
Creating an RNA-seq input object is quite similar to creating a microarray input object. The function is create_input_rnaseq
and the arguments are:
count matrix
Matrix containing raw RNA-seq countsgroup1_indici
A vector that should have the indici in the expression matrix for the samples belonging to group 1.group2_indici
Same, but for group 2group1_label
Label to identify group 1, for example “Control” or “Patient”group2_label
Same, but for group 2expression
A Boolean value indicating if expression matrix data should be calculated.normalize_quantiles
, Normalize quantiles for WGCNA-based methods?
Example dataset microarray
Example data has been included in the package. There is an example microarray expression matrix from Gene Expression Omnibus (GEO) with accession number GSE4588. This data set has 16 columns, so 16 samples in total with 9 systemic lupus erythematosus (SLE) patients and 7 healthy controls. The matrix has 52307 rows, meaning 52307 probes. The indici for the healthy controls are 1 to 9, and 10 to 16 make up the SLE patients. The expression matrix can be accessed using expression_matrix
We can inspect it using the head function:
head(expression_matrix)
GSM101962 GSM101963 GSM101964 GSM101965 GSM101966
AFFX-BioB-5_at 442.0 442.6 230.8 275.2 540.7
AFFX-BioB-M_at 620.7 565.5 256.9 299.8 693.7
AFFX-BioB-3_at 404.7 376.1 175.4 168.2 403.9
AFFX-BioC-5_at 1298.3 1201.6 742.4 781.2 1485.3
AFFX-BioC-3_at 1647.5 1566.8 966.2 1059.7 1743.9
AFFX-BioDn-5_at 2952.5 2819.6 2107.7 2444.5 3265.4
GSM101967 GSM101968 GSM101969 GSM101970 GSM102710
AFFX-BioB-5_at 432.7 476.4 953.9 1397.2 197.0
AFFX-BioB-M_at 658.5 605.7 1313.7 1863.5 256.8
AFFX-BioB-3_at 440.5 379.5 889.9 1277.7 159.4
AFFX-BioC-5_at 1406.0 1301.2 2368.0 3752.2 674.3
AFFX-BioC-3_at 1720.4 1598.2 2727.0 3771.3 867.4
AFFX-BioDn-5_at 3150.1 2911.8 5480.8 7806.6 2055.5
GSM102711 GSM102712 GSM102713 GSM102714 GSM102715
AFFX-BioB-5_at 184.2 415.8 499.4 1249.3 380.3
AFFX-BioB-M_at 237.2 560.7 729.1 1796.1 557.1
AFFX-BioB-3_at 190.6 359.2 445.2 1194.3 315.8
AFFX-BioC-5_at 573.9 1368.8 1485.1 3886.1 1300.5
AFFX-BioC-3_at 841.0 1476.6 1891.9 4485.8 1706.6
AFFX-BioDn-5_at 1774.7 2927.9 3254.2 8357.6 2781.7
GSM102716
AFFX-BioB-5_at 339.5
AFFX-BioB-M_at 576.0
AFFX-BioB-3_at 358.9
AFFX-BioC-5_at 1262.9
AFFX-BioC-3_at 1466.3
AFFX-BioDn-5_at 3058.1
The annotation for the probes is provided in a data.frame accessible through probe_annotation
. The first column contains the probe ids where the second column holds the corresponding annotation, in this case ENTREZ gene identifiers.
Again using head:
head(probe_annotation)
PROBEID ENTREZID
1 AFFX-BioB-5_at <NA>
2 AFFX-BioB-M_at <NA>
3 AFFX-BioB-3_at <NA>
4 AFFX-BioC-5_at <NA>
5 AFFX-BioC-3_at <NA>
6 AFFX-BioDn-5_at <NA>
Now we can put these things together to create an input object:
MODifieR_input_microarray <- create_input_microarray(expression_matrix = expression_matrix,
annotation_table = probe_annotation,
group1_indici = 1:9,
group2_indici = 10:16,
group1_label = "Control",
group2_label = "Patient")
When the function has finished, it might take a few minutes, we can inspect the object. Let’s start with the class:
[1] "MODifieR_input" "Expression" "MicroArray"
Now we can inspect the DEGs:
gene pvalue
1 1 0.976932866
2 10 0.347847172
3 100 0.006504456
4 1000 0.329104066
5 10000 0.115466316
6 100009676 0.433955896
The limma_probe_table is the table generated by limma with annotation for the probe added:
logFC AveExpr t P.Value adj.P.Val B
1 4455.8984 2263.4000 16.21819 6.973635e-11 2.143054e-06 -0.4889325
2 883.5587 710.8125 16.03447 8.194137e-11 2.143054e-06 -0.4961970
3 1995.8476 1046.2500 12.89384 1.721050e-09 2.255049e-05 -0.6651916
4 845.2206 549.2062 12.89198 1.724472e-09 2.255049e-05 -0.6653242
5 339.1746 240.1000 11.96197 4.802206e-09 5.023779e-05 -0.7382467
6 1924.1302 1276.3625 10.46105 2.902571e-08 2.280640e-04 -0.8907192
PROBEID IDENTIFIER
1 204439_at 10964
2 213294_at 5610
3 214453_s_at 10561
4 227609_at 94240
5 219352_at 55008
6 202086_at 4599
Finally, we can take a look at the groups indici and it’s names by looking at the group_indici component of the object:
$Control
[1] 1 2 3 4 5 6 7 8 9
$Patient
[1] 10 11 12 13 14 15 16
Now we have set up an microarray input object to use in the disease inference methods.
Example dataset RNA-seq
There is an example RNA-seq expression matrix from Gene Expression Omnibus (GEO) with accession number GSE123496. This data set has 10 columns, so 10 samples taken from the frontal cortex with in total 5 Multiple Sclerosis (MS) patients and 5 healthy controls. The matrix has 16569 rows, meaning the same amount of genes. The indici for the patients are 1 to 5, and 6 to 10 make up the controls. The count matrix can be accessed using count_matrix
We can inspect it using the head function:
head(count_matrix)
MS32CF MS38CF MS44CF MS50CF MS56CF NL74CF NL80CF NL86CF NL62CF
7105 251 90 209 336 358 191 189 172 124
8813 235 121 404 428 330 594 242 435 89
57147 224 64 280 437 387 545 259 478 86
55732 109 37 145 203 189 214 114 197 37
2268 336 237 227 345 733 262 482 130 457
3075 867 1181 951 1662 1825 769 1465 285 1235
NL68CF
7105 335
8813 324
57147 408
55732 203
2268 423
3075 579
To create the input object:
MODifieR_input_rna <- create_input_rnaseq(count_matrix = count_matrix,
group1_indici = 1:5,
group2_indici = 6:10,
group1_label = "Patient",
group2_label = "Control",
use_adjusted = F)
This dataset does not contain any significant genes after FDR correction, so unadjusted p values are used.
When the function has finished, we can inspect the object. Let’s start with the class:
[1] "MODifieR_input" "Expression" "RNA-seq"
Now we can inspect the DEGs:
gene pvalue
1 4241 8.162947e-05
2 105378049 1.623881e-04
3 100423062 2.969720e-04
4 155185 6.942163e-04
5 284 9.156738e-04
6 55791 9.238056e-04
The edgeR_deg_table is the table generated by edgeR:
logFC logCPM F PValue FDR
4241 1.071991 4.0142387 36.67993 8.162947e-05 0.9579268
105378049 -2.247071 0.9811993 31.20710 1.623881e-04 0.9579268
100423062 -5.161158 0.1725241 26.93895 2.969720e-04 0.9579268
155185 1.011908 3.4099991 21.68029 6.942163e-04 0.9579268
284 -1.264745 1.8260861 20.13550 9.156738e-04 0.9579268
55791 -0.897363 3.8180308 20.08746 9.238056e-04 0.9579268
Finally, we can take a look at the groups indici and it’s names by looking at the group_indici component of the object:
$Patient
[1] 1 2 3 4 5
$Control
[1] 6 7 8 9 10
Now we have set up an RNA-seq input object to use in the disease inference methods.
Creating a custom input object
There is also the possibility to create a custom input object. This can be useful in cases where there are already DEGs, an annotated expression or both available. In order to create such an object, the functions create_custom_microarray_input_object
or create_input_rna_input_object
can be used.
Below are example on how to use the functions. Some of the input data is optional, for example limma_probe_table/edgeR, expression_matrix/count_matrix and annotation_table are not used in any disease module inference method. Instead, they can optionally be used to update or recalculate an input object.
generic_input_microarray_object <- create_custom_input_object(diff_genes = NULL,
limma_probe_table = NULL,
annotated_exprs_matrix = NULL,
expression_matrix = NULL,
annotation_table = NULL,
group1_indici = NULL,
group2_indici = NULL,
group1_label = NULL,
group2_label = NULL)
generic_input_rna_object <- create_custom_rna_input_object(diff_genes = NULL,
edgeR_deg_table = NULL,
annotated_exprs_matrix = NULL,
count_matrix = NULL,
group1_indici = NULL,
group2_indici = NULL,
group1_label = NULL,
group2_label = NULL,
settings = NULL)
Please note that the specific data in the input object that is required differs per inference method, the reader is referred to table 1 in subsection ‘Input object’ to see what inference methods needs what input data fields. For example, to use the DIAMoND method, only the diff_genes input would suffice to run the analysis.
If the inference method requires annotated_exprs_matrix, group indici and labels are also required.
Inference methods
Disease module inference methods that have DEGs as input need a Protein-Protein interaction (PPi) network to overlay the DEGs on. As an example, a small STRING db PPi network has been included. This PPi network is from STRING version 7.1 and is filtered for interactions with a confidence score higher than 700. This small network only contains 5541 unique genes and 64672 interactions. The first column is interactor gene 1, the second column is interactor gene 2 and the third and last column gives the STRING confidence score. (Jensen et al. 2007) Using head again, we can inspect the first few rows:
ppi_head
entrez1 entrez2 SCORE
1 5341 381 896
2 23647 381 917
3 8853 381 973
4 5921 381 756
5 50807 381 894
6 3002 4074 735
In this tutorial the previously created microarray input will be used. This can easily be exchanged for the RNA-seq input object, the procedure is exactly the same, only MODifieR_input_microarray
is replaced by MODifieR_input_rna
.
Since this is a tutorial on how to run the inference methods, we can change parameters for some inference methods and also make the size of the input object a little smaller to reduce run time.
Also, we can lower the number of genes in the annotated expression matrix:
MODifieR_input_microarray$annotated_exprs_matrix <- MODifieR_input_microarray$annotated_exprs_matrix[1:10000, ]
In addition, some inference methods allow to modify the module object after analysis. If an inference method has these functions, they will be shortly discussed in separate subsections of the inference method.
Clique Sum
There are two different versions of the Clique Sum algorithm implemented, clique_sum_exact
and clique_sum_permutation
, that differ in the method to determine if a clique is significantly enriched with DEGs. Obtaining cliques is done in a separate function, build_clique_db
, that creates an SQLite database for the cliques. This is because obtaining cliques is a computationally intensive process, may clog the memory when using a large network and is static for each network.
Both versions take in the file name of this database instead of a PPi network.
build_clique_db
This function is used to create a clique database. db_folder
is the folder where database will be stored and db_name
is the name of the database. .sqlite will be appended to the database name.
build_clique_db(ppi_network = ppi_network,
db_folder = "/folder/name/",
db_name = "example")
clique_sum_exact
clique_sum_permutation is an implementation of the clique-based disease module inference method proposed by Barrenäs et al. (Barrenäs et al. 2012)
Enrichment of cliques is determined by a one-sided Fisher exact test.
In MODifieR, the function can be called using clique_sum_exact()
and using the file name of the database created in the function build_clique_db
as input for the db parameter.
clique_sum_exact_module <- clique_sum_exact(MODifieR_input = MODifieR_input_microarray,
db = "folder/name/example.sqlite")
clique_sum_permutation
clique_sum_permutation is a variation of the clique-based disease module inference method proposed by Barrenäs et al. (Barrenäs et al. 2012), (Gustafsson et al. 2014)
Enrichment of cliques is determined by comparing the sum of the -log 10 p-values for all genes in the clique and comparing that to a null distribution of cliques composed of random genes of the same size.
In MODifieR, the function can be called using clique_sum_permutation()
and using the file name of the database created in the function build_clique_db
as input for the db parameter.
clique_sum_permutation_module <- clique_sum_permutation(MODifieR_input = MODifieR_input_microarray,
db = "folder/name/example.sqlite")
Correlation Clique
The correlation clique is a clique-based algorithm using consensus clustering. The algorithm starts with calculating a correlation score between each interaction in the PPi network. The correlation score is obtained by subtracting the Pearson correlation p-value: \(correlation score = 1 - Pearson\: p\: value\)
Subsequently, the correlation score is multiplied by the correlation confidence. $edge score = $
When the edge scores are calculated the iterative part of the algorithm commences: All edge scores are scaled to so that the mean of all edge scores is equal to the fraction_of_iteration
parameter. Next, the newly scaled scores are compared to random variables from the uniform distribution between (0,1) and only interactions where the edge score is higher than the random variable are used to construct a new PPi network. Because the scores are scaled to fraction_of_iteration
, this is the average fraction of interactions used for this new network. Then, maximal cliques are inferred from this new network. The cliques are tested for significant enrichment of DEGs by Fisher’s exact test and the union of significant cliques is the disease module for this iteration. The final disease module will consist of genes that have been present in at least frequency cutoff
iterations
The function can be called using correlation_clique()
In the example code, the number of iterations has been reduced to 10 (from 50) to speed up the analysis
correlation_clique_module <- correlation_clique(MODifieR_input = MODifieR_input_microarray,
ppi_network = ppi_network,
iteration = 10)
Post-processing
Correlation clique offers two post-processing functions.
The first one, correlation_adjust_cutoff
, allows to adjust the frequency_cutoff
parameter. The default is 0.5, meaning the gene has to be present in at least 50 percent of iterations. In this example it is set to 0.2, so 20 percent of iterations.
correlation_clique_module <- correlation_adjust_cutoff(frequency_cutoff = 0.2,
correlation_module = correlation_clique_module)
The second post-processing function is correlation_set_module_size
. This function is intended to return a module where the number of module genes is closest to size
. In the example, we set the desired size to 10.
correlation_clique_module <- correlation_set_module_size(size = 10,
correlation_module = correlation_clique_module)
DIAMoND
A slightly modified version of the original DIAMOnD python script (Ghiassian, Menche, and Barabási 2015) is called from within R. The only change to the original algorithm is the option to include the seed genes to the module.
In MODifieR, the function can be called using diamond()
diamond_module <- diamond(MODifieR_input = MODifieR_input_microarray,
ppi_network = ppi_network)
Post-processing
Post-processing DIAMOnD modules entails either to remove or add the seed genes. To do this, there are 2 related functions, add_diamond_seed_genes
and remove_diamond_seed_genes
diamond_module <- diamond_add_seed_genes(diamond_module = diamond_module)
diamond_module <- diamond_remove_seed_genes(diamond_module = diamond_module)
DiffCoEx
DiffCoEx is a method for identifying correlation pattern changes, which builds on the commonly used Weighted Gene Co expression Network Analysis (WGCNA) framework for co expression analysis. (Tesson, Breitling, and Jansen 2010)
In MODifieR, the function can be called using diffcoex()
diffcoex_module <- diffcoex(MODifieR_input = MODifieR_input_microarray)
Post-processing
In the DiffCoEx algorithm, co-expression modules are denoted by color. The final disease module can be composed of multiple colors. The function diffcoex_split_module_by_color
allows to split each of these colors into separate MODifieR_module objects. The return object is a list of MODifieR module objects.
diffcoex_module_list <- diffcoex_split_module_by_color(diffcoex_module = diffcoex_module)
MCODE
A clique based algorithm to identify disease modules from differentially expressed genes originally by Bader et al. (Bader and Hogue 2003)
Much of the code and documentation has been taken from the now defunct package “ProNet”
In MODifieR, the function can be called using mod_mcode()
The MCODE module is the union of all submodules with a score above module_score
.
mcode_module <- mod_mcode(MODifieR_input = MODifieR_input_microarray,
ppi_network = ppi_network)
Post-processing
For MCODE, two post-processing functions are available. The first one, mcode_adjust_score
, allows to change the module_score
. In the example below the score is set to 12
mcode_module <- mcode_adjust_score(module_cutoff = 12, mcode_module = mcode_module)
With the second function, mcode_split_modules
, splits the mcode_module into the different submodules above threshold module_cutoff
. The return type is a list of MCODE MODifieR_modules.
mcode_module_list <- mcode_split_modules(module_cutoff = 3, mcode_module)
MODA
This implementation follows a workflow as described in the MODA vignette. (Li et al. 2016) First, two separate networks are constructed, a background network containing expression data from all samples and a condition specific network consisting of all samples minus the condition specific samples. Then, hierarchical clustering is performed and cutting height estimated from either maximal average density or modularity
Condition specific co-expression modules are then extracted using the Jaccard index and specificTheta.
The final module will consist of the co-expression module that has the minimal Jaccard index complemented by co-expression modules that have a Jaccard index below this minimal + specificTheta
. The group of interest signifies which group has the trait of interest. In this case that is group 2, the SLE patients.
In MODifieR, the function can be called using moda()
moda_module <- moda(MODifieR_input = MODifieR_input_microarray,
group_of_interest = 2)
Post-processing
After the MODA analysis is complete, the specificTheta
parameter can be adjusted with the function moda_change_specific_threshold
. In the example it is set to 0.05 (default is 0.1)
moda_module <- moda_change_specific_threshold(moda_module = moda_module,
specificTheta = 0.05)
Module Discoverer
This is an implementation of the single seed Module Discoverer algorithm. The code has been adapted from the original code by Vlaic et al. (Vlaic et al. 2018) For details, please see the paper referenced.
In MODifieR, the function can be called using modulediscoverer()
module_discoverer_module <- modulediscoverer(MODifieR_input = MODifieR_input_microarray,
ppi_network = ppi_network,
permutations = 1000,
repeats = 3)
WGCNA trait-based
Wgcna is an implementation of WGCNA (Langfelder and Horvath 2008), (Langfelder and Horvath 2012) that associates co-expression modules (denoted by color) to a trait. Co-expression modules with an adjusted p-value < pval_cutoff will make up the final disease module.
The algorithm infers co-expression modules from combined expression data set from both group1 and group2. Co-expression modules are then correlated to trait (group 1 ~ group 2). The group of interest signifies which group has the trait of interest. In this case that is group 2, the SLE patients.
In MODifieR, the function can be called using wgcna()
wgnca_module <- wgcna(MODifieR_input = MODifieR_input_microarray,
group_of_interest = 2)
Post-processing
There are 4 functions available to post-process WGCNA module objects.
The first one, wgcna_adjust_significance
, allows to adjust the p-value for inclusion of co-expression modules into the final disease module. The Boolean parameter use_unadjusted
denotes if the adjusted (FDR correction) or unadjusted p-value is to be used.
wgnca_module <- wgcna_adjust_significance(pval_cutoff = 0.1,
wgcna_module = wgnca_module,
use_unadjusted = FALSE)
With the second function wgcna_get_modules_genes_by_sign
returns a module object that either consists of co-expression colors positively or negatively correlated to the trait. To get only colors positively correlated to trait in the resulting module, the mode
parameter can be set to “p”. To only get negatively correlated colors, use “n”.
wgnca_module <- wgcna_get_module_genes_by_sign(wgcna_module = wgnca_module,
mode = "p")
In the wgcna algorithm, co-expression modules are denoted by color. The final disease module can be composed of multiple colors. The function wgcna_split_module_by_color
allows to split each of these colors into separate MODifieR_module objects. Only colors significantly associated to the trait will be used. The return object is a list of MODifieR module objects.
wgnca_module <- wgcna_split_module_by_color(wgcna_module = wgcna_module)
The second post-processing function is wgcna_set_module_size
. This function is intended to return a module where the number of module genes is closest to size
. In the example, we set the desired size to 200.
wgnca_module <- wgcna_set_module_size(size = 200,
wgcna_module = wgnca_module)
Consensus modules
Currently, there are two different methods implemented to derive a consensus module. The first is a count-based consensus between a list of modules and the second is specific betweenness (S2B) (Garcia-Vaquero et al. 2018)
Count based
A count based consensus module can be derived with the function create_module_set()
. It will return a module object of class MODifieR_module
, same as the individual methods. The subclass is Module_set
.
The method derives a consensus module from a set of modules is by tabulating the presence of each gene in the individual methods. The next step is to set a minimum for this presence; for example, a gene has to be present in at least 3 of the individual modules.
To start deriving a consensus module, it is convenient to wrap all the previously acquired modules into a list:
module_list <- list(clique_sum_permutation_module,
correlation_clique_module,
diamond_module,
diffcoex_module,
mcode_module,
moda_module,
module_discoverer_module,
wgcna_module)
Now the maximal number of modules that share at least one gene can be retrieved:
max_frequency <- get_max_frequency(module_list)
The maximum in this particular example turns out to be 5. This value can be used to retrieve all consensus modules, from a minimum presence of 1 (union of modules) up to 5:
consensus_modules <- lapply(X = 1:max_frequency, FUN = create_module_set,
module_list = module_list)
A list of length 5 with Module_set
objects is returned. In this tutorial the fourth element in the list, so the consensus module where genes are present in at least 4 individual methods will be inspected.
consensus_module4 <- consensus_modules[[4]]
Like other module objects, it has a vector containing the module genes and a list containing the settings. In addition, it has the following elements:
module_gene_list
A named list containing the module genes from the original modulesgene_frequency
Table containing all the genes present in the modules and their frequencymethod_by_gene
A named list where the elements are the modules the genes have been found in and the names are the gene namesgene_by_method
A table containing the gene frequencies by combination of methods
The resulting module object can now be used for further analysis.
In this example the module genes from consensus_module4 are written to a file:
write.table(x = consensus_module4$module_genes,
file = "consensus4_genes.txt",
quote = F,
row.names = F,
col.names = F)
And uploaded into String. This resulted in Figure 1 and Table 2.
pathway ID | pathway description | observed gene count | false discovery rate |
---|---|---|---|
GO.0005515 | protein binding | 39 | 1.18E-13 |
GO.0005102 | receptor binding | 19 | 4.56E-09 |
GO.0005126 | cytokine receptor binding | 8 | 5.35E-05 |
GO.0005125 | cytokine activity | 7 | 7.83E-05 |
GO.0005488 | binding | 42 | 0.000111 |
GO.0003684 | damaged DNA binding | 5 | 0.000192 |
GO.0030235 | nitric-oxide synthase regulator activity | 3 | 0.000192 |
GO.0008022 | protein C-terminus binding | 6 | 0.000392 |
GO.0043566 | structure-specific DNA binding | 7 | 0.000445 |
GO.0019899 | enzyme binding | 14 | 0.00062 |
Table 2: Excerpt from Molecular Function GO term table on the result page of String
Specific betweenness (S2B)
An S2B consensus module can be derived with the function S2B()
. It will return a module object of class MODifieR_module
, same as the individual methods. The subclass is S2B
.
The method derives a consensus module between two modules. Originally conceived by (Garcia-Vaquero et al. 2018) to find an overlap between two gene sets from related diseases, it can also be applied to find an overlap between gene sets tht have been inferred with different methods but are from the same disease.
The method works by prioritizing genes frequently and specifically present in shortest paths linking two disease modules. For a detailed description of the algorithm, see (Garcia-Vaquero et al. 2018).
The function is called using S2B()
In the example below, the clique_sum_permutation and MODA modules are used, and the number of iterations (nrep
and nrep2
) is set to 50 instead of 100 to shorten computation time.
S2B_module <- S2B(MODifieR_module1 = clique_sum_permutation_module,
MODifieR_module2 = moda_module,
ppi_network = ppi_network,
nrep = 50,
nrep2 = 50)
As with all module objects, the S2B object has a vector called module_genes
for the genes that make up the module and also a list containing the settings that have been used to generate the object.
In addition it has two related fields, input_class1
and input_class2
, that show the class of the MODifieR_input
objects that have been used as input. In the case of the tutorial they are Clique_Sum_permutation
and MODA
General functions
Plotting
A very crude plotting function has been implemented to plot MODifieR module objects. To do this, the regular function plot()
can be used. It takes an additional argument, a PPi network to overlay the module genes on. The size of the nodes on the plot is dependent on degree, nodes with a high degree are larger. In general, the plots are not great but can be used in an exploratory fashion when inferring modules.
Plotting consensus_module4 again with the example PPi included in MODifieRDev:
plot(consensus_module4, ppi_network = ppi_network)
Exporting
MODifieR objects can be exported to both csv and an excel workbook.
Exporting to csv
To export MODifieR objects to csv, the function export
is used. It takes a MODifieR object as primary input and also a folder to write the object to as a secondary input. If the folder does not exist, it will be created.
Every element of the object will be written to a separate csv file.
export(MODifieR_object = consensus_modules[[4]], folder = "modules/consensus_module_4")
The input object can also be exported:
export(MODifieR_object = MODifieR_input_microarray, folder = "inputs/microarray")
Exporting to an excel workbook
To export to an excel workbook, the function export_xlsx
is used. Like export, it takes a MODifieR object and a folder as inputs, and in addition an optional filename
. If none given, the name of the object will be the filename.
The output consists of a multi-sheet excel workbook, where every element of the object is a separate sheet.
export_xlsx(MODifieR_object = diamond_module, folder = "modules/diamond_module")
Also exporting the input object:
export_xlsx(MODifieR_object = MODifieR_input_microarray, folder = "inputs/microarray")
References
Bader, Gary D, and Christopher WV Hogue. 2003. “An automated method for finding molecular complexes in large protein interaction networks.” BMC Bioinformatics 4 (1): 2. doi:10.1186/1471-2105-4-2.
Barrenäs, Fredrik, Sreenivas Chavali, Alexessander Alves, Lachlan Coin, Marjo-Riitta Jarvelin, Rebecka Jörnsten, Michael a Langston, et al. 2012. “Highly interconnected genes in disease-specific networks are enriched for disease-associated polymorphisms.” Genome Biology 13 (6): R46. doi:10.1186/gb-2012-13-6-r46.
Garcia-Vaquero, Marina L., Margarida Gama-Carvalho, Javier De Las Rivas, and Francisco R. Pinto. 2018. “Searching the overlap between network modules with specific betweeness (S2B) and its application to cross-disease analysis.” Scientific Reports 8 (1). Springer US: 1–10. doi:10.1038/s41598-018-29990-7.
Ghiassian, Susan Dina, Jörg Menche, and Albert László Barabási. 2015. “A DIseAse MOdule Detection (DIAMOnD) Algorithm Derived from a Systematic Analysis of Connectivity Patterns of Disease Proteins in the Human Interactome.” PLoS Computational Biology 11 (4): 1–21. doi:10.1371/journal.pcbi.1004120.
Gustafsson, Mika, Måns Edström, Danuta Gawel, Colm E Nestor, Hui Wang, Huan Zhang, Fredrik Barrenäs, et al. 2014. “Integrated genomic and prospective clinical studies show the importance of modular pleiotropy for disease susceptibility, diagnosis and treatment.” Genome Medicine 6 (2): 17. doi:10.1186/gm534.
Jensen, Lars J, Michael Kuhn, Samuel Chaffron, Berend Snel, Peer Bork, Tobias Doerks, and Beate Kru. 2007. “STRING 7—recent developments in the integration and prediction of protein interactions” 35 (November 2006): 358–62. doi:10.1093/nar/gkl825.
Langfelder, Peter, and Steve Horvath. 2008. “WGCNA: An R Package for Weighted Correlation Network Analysis.” BMC Bioinformatics, no. 1: 559.
———. 2012. “Fast R Functions for Robust Correlations and Hierarchical Clustering.” Journal of Statistical Software 46 (11): 1–17. http://www.jstatsoft.org/v46/i11/.
Li, Dong, James B. Brown, Luisa Orsini, Zhisong Pan, Guyu Hu, and Shan He. 2016. MODA: MODA: MOdule Differential Analysis for Weighted Gene Co-Expression Network.
Tesson, Bruno M, Rainer Breitling, and Ritsert C Jansen. 2010. “DiffCoEx: a simple and sensitive method to find differentially coexpressed gene modules.” BMC Bioinformatics 11: 497. doi:10.1186/1471-2105-11-497.
Vlaic, Sebastian, Theresia Conrad, Christian Tokarski-Schnelle, Mika Gustafsson, Uta Dahmen, Reinhard Guthke, and Stefan Schuster. 2018. “ModuleDiscoverer: Identification of regulatory modules in protein-protein interaction networks.” Scientific Reports 8 (1): 433. doi:10.1038/s41598-017-18370-2.