MODifieRDev

Dirk de Weerd

2019-08-09

Installation

MODifierDev requires Python 3 with additional Python libraries scipy, sqlite3, numpy and networkx. They are a standard part of Anaconda so installing Anaconda is recommended. It can be downloaded here:

Anaconda

In addition, some R packages are required. They can be acquired with the following code:

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(c("AnnotationDbi",
                       "MODA", 
                       "STRINGdb", 
                       "limma", 
                       "org.Hs.eg.db",
                       "foreach", 
                       "doParallel", 
                       "Rcpp", 
                       "dynamicTreeCut", 
                       "flashClust", 
                       "reticulate", 
                       "plyr", 
                       "parallel", 
                       "igraph", 
                       "WGCNA",
                       "RSQLite",
                       "stackoverflow",
                       "preprocessCore",
                       "DESeq2",
                       "edgeR",
                       "openxlsx"), 
                     version = "3.8")

After that, the package can be installed from github:

devtools::install_git(url = "https://github.com/ddeweerd/MODifieRDev.git")

Introduction

Cells are organized in a modular fashion, where essential functions are carried out by functional modules. Modules can be described as clusters of genes, gene products or metabolites that interact together, are co-regulated or physically interacting. Complex diseases rarely arise from a single causal factor but rather from multiple factors, with large individual variation. This leads to dysregulation of parts of functional modules and thereby gives rise to a disease phenotype. The underlying perturbation in parts of the functional modules and the connectivity between them makes up a disease module.

To better understand complex diseases it is crucial to identify disease modules. Genes present in the module might not have a significant impact on the disease on its own. However, the cumulative effect of multiple low penetrance genes could play a major role in the pathogenesis of complex diseases. Network-based approaches could be key in detecting these low penetrance genes, which could potentially be novel bio markers or therapeutic targets.

Various disease module inference methods have been proposed earlier, using different approaches. Here, MODifieR is presented, an R package that bundles 8 different disease module inference methods into a single package that takes either micro array or RNA-seq data as input. The 8 methods methods can be classified into 3 different algorithmic classes: seed-based, clique based and co-expression based methods. MODifieR is available under the GNU GPL open source license.

Module inference methods included are:

Clique Sum
Correlation Clique
DIAMoND
DiffCoEx
MCODE
MODA
Module Discoverer
WGCNA trait-based

How to run

This is an explanation on how to run the package. It starts with a short description on the core objects followed by an explanation on how to use the functions. The package also includes example data enabling the user to follow this tutorial.

Core objects

The package is build on two core objects, an input and a module object.

Input object

The first object is an input object for the disease module inference methods, with class MODifieR_input. This object can be used for all disease module inference methods.

The MODifieR_input class has two subclasses, either MicroArray or RNA-seq depending on the input data type.

The input objects contain all necessary input data for the methods. Below is a table showing what methods use an expression matrix, Differentially Expressed Genes (DEGs) or both.

Method Annotated expression matrix DEGs
Clique Sum X
Correlation Clique X X
DIAMoND X
DiffCoEx X
MCODE X
MODA X
Module Discoverer X
WGCNA trait-based X

Table 1: Input types taken by inference methods

Microarray input object

The MODifieR_input with subclass MicroArray object is a named list and can contain the following components:

diff_genes A 2 two column data.frame where the first column are genes and the second column unadjusted p-values obtained by differential expression analysis.

limma_probe_table A data.frame from limma topTable with added gene annotation.

annotated_exprs_matrix A matrix where the rows are genes and the columns samples. Probes have been collapsed into genes.

expression_matrix The original input expression matrix.

annotation_table A data.frame containing the annotation for the microarray probes. Also part of the input

group_indiciA named list containing 2 numeric vectors. The names are the group labels, for example “patient” and “control”. The values are the indici for each group, so the column indici for the expression matrix for each of the aforementioned groups.

settings The parameters used to generate the object.

RNA-seq input object

The MODifieR_input with subclass RNA-seq object is a named list and can contain the following components:

diff_genes A 2 two column data.frame where the first column are genes and the second column unadjusted p-values obtained by differential expression analysis.

edgeR_deg_table A data.frame from edgeR glmQLFit.

annotated_exprs_matrix A matrix where the rows are genes and the columns samples. Probes have been collapsed into genes.

count_matrix The orginal input count matrix.

group_indiciA named list containing 2 numeric vectors. The names are the group labels, for example “patient”. and “control”. The values are the indici for each group, so the column indici for the expression matrix for each of the aforementioned groups.

settings The parameters used to generate the object.

Module object

The second object, with class MODifieR_module is an object that is produced by the disease module inference methods. They also have a subclassses according to the method and input type (Microarray or RNA-seq) that produced them.

Like the input object, the module objects are also named lists. Module objects differ between each function that generates them (hence the subclass) but have certain features in common:

  • The module object will always contain an element (character vector) called module_genes which are the genes that make up the final module

  • The module object will always contain a named list called settings which holds the parameters that have been used when generating the object

Creating an input object

In order to run an analysis, an input object has to be created. This can be done in two ways: The first is the usual way and involves high throughput data and in case of microarray also probe annotation. The second way is to create a custom object using previously obtained DEGs or expression data.

Using microarray data

To create an input object with microarray data, the function create_input_microarray can be used. The function takes several inputs:

  • expression_matrix A matrix with normalized expression values. The rows are probes and the columns are samples.

  • annotation_table An two columns dataframe providing annotation for the probes. The first column should contain the probe ID, and the second column its corresponding annotation.

  • group1_indici A vector that should have the indici in the expression matrix for the samples belonging to group 1.

  • group2_indici Same, but for group 2

  • group1_label Label to identify group 1, for example “Control” or “Patient”

  • group2_label Same, but for group 2

  • expression A Boolean value indicating if expression matrix data should be calculated.

  • differential expression A Boolean value indicating if DEGs data should be calculated.

  • use_adjusted use adjusted p value for differential expression analysis?

  • method When collapsing the probes, the collapseRows() method from WGCNA is used. This argument determines the collapsing method to use. Please see the WGCNA documentation for more information. Options are:
    • MaxMean Highest mean value
    • MinMean Lowest mean value
    • MaxRowVariance Highest variance
    • absMaxMean Highest mean absolute value
    • absMinMean Lowest mean absolute value
    • ME Eigenrow (first principal component)

Using RNA-seq data

Creating an RNA-seq input object is quite similar to creating a microarray input object. The function is create_input_rnaseq and the arguments are:

  • count matrix Matrix containing raw RNA-seq counts

  • group1_indici A vector that should have the indici in the expression matrix for the samples belonging to group 1.

  • group2_indici Same, but for group 2

  • group1_label Label to identify group 1, for example “Control” or “Patient”

  • group2_label Same, but for group 2

  • expression A Boolean value indicating if expression matrix data should be calculated.

  • normalize_quantiles, Normalize quantiles for WGCNA-based methods?

Example dataset microarray

Example data has been included in the package. There is an example microarray expression matrix from Gene Expression Omnibus (GEO) with accession number GSE4588. This data set has 16 columns, so 16 samples in total with 9 systemic lupus erythematosus (SLE) patients and 7 healthy controls. The matrix has 52307 rows, meaning 52307 probes. The indici for the healthy controls are 1 to 9, and 10 to 16 make up the SLE patients. The expression matrix can be accessed using expression_matrix

We can inspect it using the head function:

head(expression_matrix)
                GSM101962 GSM101963 GSM101964 GSM101965 GSM101966
AFFX-BioB-5_at      442.0     442.6     230.8     275.2     540.7
AFFX-BioB-M_at      620.7     565.5     256.9     299.8     693.7
AFFX-BioB-3_at      404.7     376.1     175.4     168.2     403.9
AFFX-BioC-5_at     1298.3    1201.6     742.4     781.2    1485.3
AFFX-BioC-3_at     1647.5    1566.8     966.2    1059.7    1743.9
AFFX-BioDn-5_at    2952.5    2819.6    2107.7    2444.5    3265.4
                GSM101967 GSM101968 GSM101969 GSM101970 GSM102710
AFFX-BioB-5_at      432.7     476.4     953.9    1397.2     197.0
AFFX-BioB-M_at      658.5     605.7    1313.7    1863.5     256.8
AFFX-BioB-3_at      440.5     379.5     889.9    1277.7     159.4
AFFX-BioC-5_at     1406.0    1301.2    2368.0    3752.2     674.3
AFFX-BioC-3_at     1720.4    1598.2    2727.0    3771.3     867.4
AFFX-BioDn-5_at    3150.1    2911.8    5480.8    7806.6    2055.5
                GSM102711 GSM102712 GSM102713 GSM102714 GSM102715
AFFX-BioB-5_at      184.2     415.8     499.4    1249.3     380.3
AFFX-BioB-M_at      237.2     560.7     729.1    1796.1     557.1
AFFX-BioB-3_at      190.6     359.2     445.2    1194.3     315.8
AFFX-BioC-5_at      573.9    1368.8    1485.1    3886.1    1300.5
AFFX-BioC-3_at      841.0    1476.6    1891.9    4485.8    1706.6
AFFX-BioDn-5_at    1774.7    2927.9    3254.2    8357.6    2781.7
                GSM102716
AFFX-BioB-5_at      339.5
AFFX-BioB-M_at      576.0
AFFX-BioB-3_at      358.9
AFFX-BioC-5_at     1262.9
AFFX-BioC-3_at     1466.3
AFFX-BioDn-5_at    3058.1

The annotation for the probes is provided in a data.frame accessible through probe_annotation. The first column contains the probe ids where the second column holds the corresponding annotation, in this case ENTREZ gene identifiers.

Again using head:

head(probe_annotation)
          PROBEID ENTREZID
1  AFFX-BioB-5_at     <NA>
2  AFFX-BioB-M_at     <NA>
3  AFFX-BioB-3_at     <NA>
4  AFFX-BioC-5_at     <NA>
5  AFFX-BioC-3_at     <NA>
6 AFFX-BioDn-5_at     <NA>

Now we can put these things together to create an input object:

MODifieR_input_microarray <- create_input_microarray(expression_matrix = expression_matrix,
                               annotation_table = probe_annotation,
                               group1_indici = 1:9,
                               group2_indici = 10:16,
                               group1_label = "Control",
                               group2_label = "Patient")

When the function has finished, it might take a few minutes, we can inspect the object. Let’s start with the class:

[1] "MODifieR_input" "Expression"     "MicroArray"    

Now we can inspect the DEGs:

       gene      pvalue
1         1 0.976932866
2        10 0.347847172
3       100 0.006504456
4      1000 0.329104066
5     10000 0.115466316
6 100009676 0.433955896

The limma_probe_table is the table generated by limma with annotation for the probe added:

      logFC   AveExpr        t      P.Value    adj.P.Val          B
1 4455.8984 2263.4000 16.21819 6.973635e-11 2.143054e-06 -0.4889325
2  883.5587  710.8125 16.03447 8.194137e-11 2.143054e-06 -0.4961970
3 1995.8476 1046.2500 12.89384 1.721050e-09 2.255049e-05 -0.6651916
4  845.2206  549.2062 12.89198 1.724472e-09 2.255049e-05 -0.6653242
5  339.1746  240.1000 11.96197 4.802206e-09 5.023779e-05 -0.7382467
6 1924.1302 1276.3625 10.46105 2.902571e-08 2.280640e-04 -0.8907192
      PROBEID IDENTIFIER
1   204439_at      10964
2   213294_at       5610
3 214453_s_at      10561
4   227609_at      94240
5   219352_at      55008
6   202086_at       4599

Finally, we can take a look at the groups indici and it’s names by looking at the group_indici component of the object:

$Control
[1] 1 2 3 4 5 6 7 8 9

$Patient
[1] 10 11 12 13 14 15 16

Now we have set up an microarray input object to use in the disease inference methods.

Example dataset RNA-seq

There is an example RNA-seq expression matrix from Gene Expression Omnibus (GEO) with accession number GSE123496. This data set has 10 columns, so 10 samples taken from the frontal cortex with in total 5 Multiple Sclerosis (MS) patients and 5 healthy controls. The matrix has 16569 rows, meaning the same amount of genes. The indici for the patients are 1 to 5, and 6 to 10 make up the controls. The count matrix can be accessed using count_matrix

We can inspect it using the head function:

head(count_matrix)
      MS32CF MS38CF MS44CF MS50CF MS56CF NL74CF NL80CF NL86CF NL62CF
7105     251     90    209    336    358    191    189    172    124
8813     235    121    404    428    330    594    242    435     89
57147    224     64    280    437    387    545    259    478     86
55732    109     37    145    203    189    214    114    197     37
2268     336    237    227    345    733    262    482    130    457
3075     867   1181    951   1662   1825    769   1465    285   1235
      NL68CF
7105     335
8813     324
57147    408
55732    203
2268     423
3075     579

To create the input object:

MODifieR_input_rna <- create_input_rnaseq(count_matrix = count_matrix, 
                                          group1_indici = 1:5, 
                                          group2_indici = 6:10, 
                                          group1_label = "Patient", 
                                          group2_label = "Control",
                                          use_adjusted = F)

This dataset does not contain any significant genes after FDR correction, so unadjusted p values are used.

When the function has finished, we can inspect the object. Let’s start with the class:

[1] "MODifieR_input" "Expression"     "RNA-seq"       

Now we can inspect the DEGs:

       gene       pvalue
1      4241 8.162947e-05
2 105378049 1.623881e-04
3 100423062 2.969720e-04
4    155185 6.942163e-04
5       284 9.156738e-04
6     55791 9.238056e-04

The edgeR_deg_table is the table generated by edgeR:

              logFC    logCPM        F       PValue       FDR
4241       1.071991 4.0142387 36.67993 8.162947e-05 0.9579268
105378049 -2.247071 0.9811993 31.20710 1.623881e-04 0.9579268
100423062 -5.161158 0.1725241 26.93895 2.969720e-04 0.9579268
155185     1.011908 3.4099991 21.68029 6.942163e-04 0.9579268
284       -1.264745 1.8260861 20.13550 9.156738e-04 0.9579268
55791     -0.897363 3.8180308 20.08746 9.238056e-04 0.9579268

Finally, we can take a look at the groups indici and it’s names by looking at the group_indici component of the object:

$Patient
[1] 1 2 3 4 5

$Control
[1]  6  7  8  9 10

Now we have set up an RNA-seq input object to use in the disease inference methods.

Creating a custom input object

There is also the possibility to create a custom input object. This can be useful in cases where there are already DEGs, an annotated expression or both available. In order to create such an object, the functions create_custom_microarray_input_object or create_input_rna_input_object can be used.

Below are example on how to use the functions. Some of the input data is optional, for example limma_probe_table/edgeR, expression_matrix/count_matrix and annotation_table are not used in any disease module inference method. Instead, they can optionally be used to update or recalculate an input object.

generic_input_microarray_object <- create_custom_input_object(diff_genes = NULL, 
                                                              limma_probe_table = NULL,
                                                              annotated_exprs_matrix = NULL, 
                                                              expression_matrix = NULL, 
                                                              annotation_table = NULL, 
                                                              group1_indici = NULL,
                                                              group2_indici = NULL, 
                                                              group1_label = NULL, 
                                                              group2_label = NULL)
generic_input_rna_object <- create_custom_rna_input_object(diff_genes = NULL, 
                                                           edgeR_deg_table = NULL, 
                                                           annotated_exprs_matrix = NULL, 
                                                           count_matrix = NULL, 
                                                           group1_indici = NULL, 
                                                           group2_indici = NULL, 
                                                           group1_label = NULL, 
                                                           group2_label = NULL, 
                                                           settings = NULL)

Please note that the specific data in the input object that is required differs per inference method, the reader is referred to table 1 in subsection ‘Input object’ to see what inference methods needs what input data fields. For example, to use the DIAMoND method, only the diff_genes input would suffice to run the analysis.

If the inference method requires annotated_exprs_matrix, group indici and labels are also required.

Inference methods

Disease module inference methods that have DEGs as input need a Protein-Protein interaction (PPi) network to overlay the DEGs on. As an example, a small STRING db PPi network has been included. This PPi network is from STRING version 7.1 and is filtered for interactions with a confidence score higher than 700. This small network only contains 5541 unique genes and 64672 interactions. The first column is interactor gene 1, the second column is interactor gene 2 and the third and last column gives the STRING confidence score. (Jensen et al. 2007) Using head again, we can inspect the first few rows:

ppi_head
  entrez1 entrez2 SCORE
1    5341     381   896
2   23647     381   917
3    8853     381   973
4    5921     381   756
5   50807     381   894
6    3002    4074   735

In this tutorial the previously created microarray input will be used. This can easily be exchanged for the RNA-seq input object, the procedure is exactly the same, only MODifieR_input_microarray is replaced by MODifieR_input_rna.

Since this is a tutorial on how to run the inference methods, we can change parameters for some inference methods and also make the size of the input object a little smaller to reduce run time.

Also, we can lower the number of genes in the annotated expression matrix:

MODifieR_input_microarray$annotated_exprs_matrix <- MODifieR_input_microarray$annotated_exprs_matrix[1:10000, ]

In addition, some inference methods allow to modify the module object after analysis. If an inference method has these functions, they will be shortly discussed in separate subsections of the inference method.

Clique Sum

There are two different versions of the Clique Sum algorithm implemented, clique_sum_exact and clique_sum_permutation, that differ in the method to determine if a clique is significantly enriched with DEGs. Obtaining cliques is done in a separate function, build_clique_db, that creates an SQLite database for the cliques. This is because obtaining cliques is a computationally intensive process, may clog the memory when using a large network and is static for each network.

Both versions take in the file name of this database instead of a PPi network.

build_clique_db

This function is used to create a clique database. db_folder is the folder where database will be stored and db_name is the name of the database. .sqlite will be appended to the database name.

build_clique_db(ppi_network = ppi_network, 
                db_folder = "/folder/name/", 
                db_name = "example")

clique_sum_exact

clique_sum_permutation is an implementation of the clique-based disease module inference method proposed by Barrenäs et al. (Barrenäs et al. 2012)

Enrichment of cliques is determined by a one-sided Fisher exact test.

In MODifieR, the function can be called using clique_sum_exact() and using the file name of the database created in the function build_clique_db as input for the db parameter.

clique_sum_exact_module <- clique_sum_exact(MODifieR_input = MODifieR_input_microarray, 
                                            db = "folder/name/example.sqlite")

clique_sum_permutation

clique_sum_permutation is a variation of the clique-based disease module inference method proposed by Barrenäs et al. (Barrenäs et al. 2012), (Gustafsson et al. 2014)

Enrichment of cliques is determined by comparing the sum of the -log 10 p-values for all genes in the clique and comparing that to a null distribution of cliques composed of random genes of the same size.

In MODifieR, the function can be called using clique_sum_permutation() and using the file name of the database created in the function build_clique_db as input for the db parameter.

clique_sum_permutation_module <- clique_sum_permutation(MODifieR_input = MODifieR_input_microarray, 
                                                        db = "folder/name/example.sqlite")

Correlation Clique

The correlation clique is a clique-based algorithm using consensus clustering. The algorithm starts with calculating a correlation score between each interaction in the PPi network. The correlation score is obtained by subtracting the Pearson correlation p-value: \(correlation score = 1 - Pearson\: p\: value\)

Subsequently, the correlation score is multiplied by the correlation confidence. $edge score = $

When the edge scores are calculated the iterative part of the algorithm commences: All edge scores are scaled to so that the mean of all edge scores is equal to the fraction_of_iteration parameter. Next, the newly scaled scores are compared to random variables from the uniform distribution between (0,1) and only interactions where the edge score is higher than the random variable are used to construct a new PPi network. Because the scores are scaled to fraction_of_iteration, this is the average fraction of interactions used for this new network. Then, maximal cliques are inferred from this new network. The cliques are tested for significant enrichment of DEGs by Fisher’s exact test and the union of significant cliques is the disease module for this iteration. The final disease module will consist of genes that have been present in at least frequency cutoff iterations

The function can be called using correlation_clique()

In the example code, the number of iterations has been reduced to 10 (from 50) to speed up the analysis

correlation_clique_module <- correlation_clique(MODifieR_input = MODifieR_input_microarray, 
                                                ppi_network = ppi_network, 
                                                iteration = 10)
Post-processing

Correlation clique offers two post-processing functions.

The first one, correlation_adjust_cutoff, allows to adjust the frequency_cutoff parameter. The default is 0.5, meaning the gene has to be present in at least 50 percent of iterations. In this example it is set to 0.2, so 20 percent of iterations.

correlation_clique_module <- correlation_adjust_cutoff(frequency_cutoff = 0.2, 
                                                       correlation_module = correlation_clique_module)

The second post-processing function is correlation_set_module_size. This function is intended to return a module where the number of module genes is closest to size. In the example, we set the desired size to 10.

correlation_clique_module <- correlation_set_module_size(size = 10, 
                                                         correlation_module = correlation_clique_module)

DIAMoND

A slightly modified version of the original DIAMOnD python script (Ghiassian, Menche, and Barabási 2015) is called from within R. The only change to the original algorithm is the option to include the seed genes to the module.

In MODifieR, the function can be called using diamond()

diamond_module <- diamond(MODifieR_input = MODifieR_input_microarray, 
                          ppi_network = ppi_network)
Post-processing

Post-processing DIAMOnD modules entails either to remove or add the seed genes. To do this, there are 2 related functions, add_diamond_seed_genes and remove_diamond_seed_genes

diamond_module <- diamond_add_seed_genes(diamond_module = diamond_module)

diamond_module <- diamond_remove_seed_genes(diamond_module = diamond_module)

DiffCoEx

DiffCoEx is a method for identifying correlation pattern changes, which builds on the commonly used Weighted Gene Co expression Network Analysis (WGCNA) framework for co expression analysis. (Tesson, Breitling, and Jansen 2010)

In MODifieR, the function can be called using diffcoex()

diffcoex_module <- diffcoex(MODifieR_input = MODifieR_input_microarray)
Post-processing

In the DiffCoEx algorithm, co-expression modules are denoted by color. The final disease module can be composed of multiple colors. The function diffcoex_split_module_by_color allows to split each of these colors into separate MODifieR_module objects. The return object is a list of MODifieR module objects.

diffcoex_module_list <- diffcoex_split_module_by_color(diffcoex_module = diffcoex_module)

MCODE

A clique based algorithm to identify disease modules from differentially expressed genes originally by Bader et al. (Bader and Hogue 2003)

Much of the code and documentation has been taken from the now defunct package “ProNet”

In MODifieR, the function can be called using mod_mcode()

The MCODE module is the union of all submodules with a score above module_score.

mcode_module <- mod_mcode(MODifieR_input = MODifieR_input_microarray, 
                               ppi_network = ppi_network)
Post-processing

For MCODE, two post-processing functions are available. The first one, mcode_adjust_score, allows to change the module_score. In the example below the score is set to 12

mcode_module <- mcode_adjust_score(module_cutoff = 12, mcode_module = mcode_module)

With the second function, mcode_split_modules, splits the mcode_module into the different submodules above threshold module_cutoff. The return type is a list of MCODE MODifieR_modules.

mcode_module_list <- mcode_split_modules(module_cutoff = 3, mcode_module)

MODA

This implementation follows a workflow as described in the MODA vignette. (Li et al. 2016) First, two separate networks are constructed, a background network containing expression data from all samples and a condition specific network consisting of all samples minus the condition specific samples. Then, hierarchical clustering is performed and cutting height estimated from either maximal average density or modularity

Condition specific co-expression modules are then extracted using the Jaccard index and specificTheta.

The final module will consist of the co-expression module that has the minimal Jaccard index complemented by co-expression modules that have a Jaccard index below this minimal + specificTheta. The group of interest signifies which group has the trait of interest. In this case that is group 2, the SLE patients.

In MODifieR, the function can be called using moda()

moda_module <- moda(MODifieR_input = MODifieR_input_microarray, 
                    group_of_interest = 2)
Post-processing

After the MODA analysis is complete, the specificTheta parameter can be adjusted with the function moda_change_specific_threshold. In the example it is set to 0.05 (default is 0.1)

moda_module <- moda_change_specific_threshold(moda_module = moda_module, 
                                              specificTheta = 0.05)

Module Discoverer

This is an implementation of the single seed Module Discoverer algorithm. The code has been adapted from the original code by Vlaic et al. (Vlaic et al. 2018) For details, please see the paper referenced.

In MODifieR, the function can be called using modulediscoverer()

module_discoverer_module <- modulediscoverer(MODifieR_input = MODifieR_input_microarray, 
                                             ppi_network = ppi_network, 
                                             permutations = 1000, 
                                             repeats = 3)

WGCNA trait-based

Wgcna is an implementation of WGCNA (Langfelder and Horvath 2008), (Langfelder and Horvath 2012) that associates co-expression modules (denoted by color) to a trait. Co-expression modules with an adjusted p-value < pval_cutoff will make up the final disease module.

The algorithm infers co-expression modules from combined expression data set from both group1 and group2. Co-expression modules are then correlated to trait (group 1 ~ group 2). The group of interest signifies which group has the trait of interest. In this case that is group 2, the SLE patients.

In MODifieR, the function can be called using wgcna()

wgnca_module <- wgcna(MODifieR_input = MODifieR_input_microarray,
                      group_of_interest = 2)
Post-processing

There are 4 functions available to post-process WGCNA module objects.

The first one, wgcna_adjust_significance, allows to adjust the p-value for inclusion of co-expression modules into the final disease module. The Boolean parameter use_unadjusted denotes if the adjusted (FDR correction) or unadjusted p-value is to be used.

wgnca_module <- wgcna_adjust_significance(pval_cutoff = 0.1, 
                                          wgcna_module = wgnca_module,
                                          use_unadjusted = FALSE)

With the second function wgcna_get_modules_genes_by_sign returns a module object that either consists of co-expression colors positively or negatively correlated to the trait. To get only colors positively correlated to trait in the resulting module, the mode parameter can be set to “p”. To only get negatively correlated colors, use “n”.

wgnca_module <- wgcna_get_module_genes_by_sign(wgcna_module = wgnca_module, 
                                               mode = "p")

In the wgcna algorithm, co-expression modules are denoted by color. The final disease module can be composed of multiple colors. The function wgcna_split_module_by_color allows to split each of these colors into separate MODifieR_module objects. Only colors significantly associated to the trait will be used. The return object is a list of MODifieR module objects.

wgnca_module <- wgcna_split_module_by_color(wgcna_module = wgcna_module)

The second post-processing function is wgcna_set_module_size. This function is intended to return a module where the number of module genes is closest to size. In the example, we set the desired size to 200.

wgnca_module <- wgcna_set_module_size(size = 200,
                                      wgcna_module = wgnca_module)

Consensus modules

Currently, there are two different methods implemented to derive a consensus module. The first is a count-based consensus between a list of modules and the second is specific betweenness (S2B) (Garcia-Vaquero et al. 2018)

Count based

A count based consensus module can be derived with the function create_module_set(). It will return a module object of class MODifieR_module, same as the individual methods. The subclass is Module_set.

The method derives a consensus module from a set of modules is by tabulating the presence of each gene in the individual methods. The next step is to set a minimum for this presence; for example, a gene has to be present in at least 3 of the individual modules.

To start deriving a consensus module, it is convenient to wrap all the previously acquired modules into a list:

module_list <- list(clique_sum_permutation_module, 
                    correlation_clique_module, 
                    diamond_module, 
                    diffcoex_module, 
                    mcode_module, 
                    moda_module, 
                    module_discoverer_module, 
                    wgcna_module)

Now the maximal number of modules that share at least one gene can be retrieved:

max_frequency <- get_max_frequency(module_list)

The maximum in this particular example turns out to be 5. This value can be used to retrieve all consensus modules, from a minimum presence of 1 (union of modules) up to 5:

consensus_modules <- lapply(X = 1:max_frequency, FUN = create_module_set, 
                            module_list = module_list)

A list of length 5 with Module_set objects is returned. In this tutorial the fourth element in the list, so the consensus module where genes are present in at least 4 individual methods will be inspected.

consensus_module4 <- consensus_modules[[4]]

Like other module objects, it has a vector containing the module genes and a list containing the settings. In addition, it has the following elements:

  • module_gene_list A named list containing the module genes from the original modules

  • gene_frequencyTable containing all the genes present in the modules and their frequency

  • method_by_gene A named list where the elements are the modules the genes have been found in and the names are the gene names

  • gene_by_method A table containing the gene frequencies by combination of methods

The resulting module object can now be used for further analysis.

In this example the module genes from consensus_module4 are written to a file:

write.table(x = consensus_module4$module_genes, 
            file = "consensus4_genes.txt", 
            quote = F, 
            row.names = F, 
            col.names = F)

And uploaded into String. This resulted in Figure 1 and Table 2.

Figure 1: Plot from https://string-db.org using the module genes from consensus_module4 as input

Figure 1: Plot from https://string-db.org using the module genes from consensus_module4 as input

pathway ID pathway description observed gene count false discovery rate
GO.0005515 protein binding 39 1.18E-13
GO.0005102 receptor binding 19 4.56E-09
GO.0005126 cytokine receptor binding 8 5.35E-05
GO.0005125 cytokine activity 7 7.83E-05
GO.0005488 binding 42 0.000111
GO.0003684 damaged DNA binding 5 0.000192
GO.0030235 nitric-oxide synthase regulator activity 3 0.000192
GO.0008022 protein C-terminus binding 6 0.000392
GO.0043566 structure-specific DNA binding 7 0.000445
GO.0019899 enzyme binding 14 0.00062

Table 2: Excerpt from Molecular Function GO term table on the result page of String

Specific betweenness (S2B)

An S2B consensus module can be derived with the function S2B(). It will return a module object of class MODifieR_module, same as the individual methods. The subclass is S2B.

The method derives a consensus module between two modules. Originally conceived by (Garcia-Vaquero et al. 2018) to find an overlap between two gene sets from related diseases, it can also be applied to find an overlap between gene sets tht have been inferred with different methods but are from the same disease.

The method works by prioritizing genes frequently and specifically present in shortest paths linking two disease modules. For a detailed description of the algorithm, see (Garcia-Vaquero et al. 2018).

The function is called using S2B()

In the example below, the clique_sum_permutation and MODA modules are used, and the number of iterations (nrep and nrep2) is set to 50 instead of 100 to shorten computation time.

 S2B_module <- S2B(MODifieR_module1 = clique_sum_permutation_module, 
                   MODifieR_module2 = moda_module, 
                   ppi_network = ppi_network, 
                   nrep = 50, 
                   nrep2 = 50)

As with all module objects, the S2B object has a vector called module_genes for the genes that make up the module and also a list containing the settings that have been used to generate the object.

In addition it has two related fields, input_class1 and input_class2, that show the class of the MODifieR_input objects that have been used as input. In the case of the tutorial they are Clique_Sum_permutation and MODA

General functions

Plotting

A very crude plotting function has been implemented to plot MODifieR module objects. To do this, the regular function plot() can be used. It takes an additional argument, a PPi network to overlay the module genes on. The size of the nodes on the plot is dependent on degree, nodes with a high degree are larger. In general, the plots are not great but can be used in an exploratory fashion when inferring modules.

Plotting consensus_module4 again with the example PPi included in MODifieRDev:

plot(consensus_module4, ppi_network = ppi_network)
Figure 2: Using the plot function in MODifieRDev with consensus_module4 and the included PPi network as inputs

Figure 2: Using the plot function in MODifieRDev with consensus_module4 and the included PPi network as inputs

Figure 3: Using the plot function in MODifieRDev with the diamond module and the included PPi network as inputs

Figure 3: Using the plot function in MODifieRDev with the diamond module and the included PPi network as inputs

Exporting

MODifieR objects can be exported to both csv and an excel workbook.

Exporting to csv

To export MODifieR objects to csv, the function export is used. It takes a MODifieR object as primary input and also a folder to write the object to as a secondary input. If the folder does not exist, it will be created.

Every element of the object will be written to a separate csv file.

export(MODifieR_object = consensus_modules[[4]], folder = "modules/consensus_module_4")

The input object can also be exported:

export(MODifieR_object = MODifieR_input_microarray, folder = "inputs/microarray")

Exporting to an excel workbook

To export to an excel workbook, the function export_xlsx is used. Like export, it takes a MODifieR object and a folder as inputs, and in addition an optional filename. If none given, the name of the object will be the filename.

The output consists of a multi-sheet excel workbook, where every element of the object is a separate sheet.

export_xlsx(MODifieR_object = diamond_module, folder = "modules/diamond_module")

Also exporting the input object:

export_xlsx(MODifieR_object = MODifieR_input_microarray, folder = "inputs/microarray")

References

Bader, Gary D, and Christopher WV Hogue. 2003. “An automated method for finding molecular complexes in large protein interaction networks.” BMC Bioinformatics 4 (1): 2. doi:10.1186/1471-2105-4-2.

Barrenäs, Fredrik, Sreenivas Chavali, Alexessander Alves, Lachlan Coin, Marjo-Riitta Jarvelin, Rebecka Jörnsten, Michael a Langston, et al. 2012. “Highly interconnected genes in disease-specific networks are enriched for disease-associated polymorphisms.” Genome Biology 13 (6): R46. doi:10.1186/gb-2012-13-6-r46.

Garcia-Vaquero, Marina L., Margarida Gama-Carvalho, Javier De Las Rivas, and Francisco R. Pinto. 2018. “Searching the overlap between network modules with specific betweeness (S2B) and its application to cross-disease analysis.” Scientific Reports 8 (1). Springer US: 1–10. doi:10.1038/s41598-018-29990-7.

Ghiassian, Susan Dina, Jörg Menche, and Albert László Barabási. 2015. “A DIseAse MOdule Detection (DIAMOnD) Algorithm Derived from a Systematic Analysis of Connectivity Patterns of Disease Proteins in the Human Interactome.” PLoS Computational Biology 11 (4): 1–21. doi:10.1371/journal.pcbi.1004120.

Gustafsson, Mika, Måns Edström, Danuta Gawel, Colm E Nestor, Hui Wang, Huan Zhang, Fredrik Barrenäs, et al. 2014. “Integrated genomic and prospective clinical studies show the importance of modular pleiotropy for disease susceptibility, diagnosis and treatment.” Genome Medicine 6 (2): 17. doi:10.1186/gm534.

Jensen, Lars J, Michael Kuhn, Samuel Chaffron, Berend Snel, Peer Bork, Tobias Doerks, and Beate Kru. 2007. “STRING 7—recent developments in the integration and prediction of protein interactions” 35 (November 2006): 358–62. doi:10.1093/nar/gkl825.

Langfelder, Peter, and Steve Horvath. 2008. “WGCNA: An R Package for Weighted Correlation Network Analysis.” BMC Bioinformatics, no. 1: 559.

———. 2012. “Fast R Functions for Robust Correlations and Hierarchical Clustering.” Journal of Statistical Software 46 (11): 1–17. http://www.jstatsoft.org/v46/i11/.

Li, Dong, James B. Brown, Luisa Orsini, Zhisong Pan, Guyu Hu, and Shan He. 2016. MODA: MODA: MOdule Differential Analysis for Weighted Gene Co-Expression Network.

Tesson, Bruno M, Rainer Breitling, and Ritsert C Jansen. 2010. “DiffCoEx: a simple and sensitive method to find differentially coexpressed gene modules.” BMC Bioinformatics 11: 497. doi:10.1186/1471-2105-11-497.

Vlaic, Sebastian, Theresia Conrad, Christian Tokarski-Schnelle, Mika Gustafsson, Uta Dahmen, Reinhard Guthke, and Stefan Schuster. 2018. “ModuleDiscoverer: Identification of regulatory modules in protein-protein interaction networks.” Scientific Reports 8 (1): 433. doi:10.1038/s41598-017-18370-2.