Matrix_eQTL_main {MatrixEQTL}R Documentation

Main function for fast eQTL analysis in MatrixEQTL package


Matrix_eQTL_engine function tests association of every row of the snps dataset with every row of the gene dataset using a linear regression model defined by the useModel parameter (see below).

The testing procedure accounts for extra covariates in cvrt parameter.

The errorCovariance parameter can be set to the error variance-covariance matrix to account for heteroskedastic and/or correlated errors.

Associations significant at pvOutputThreshold (pvOutputThreshold.cis) levels are saved to output_file_name (output_file_name.cis), with corresponding estimates of effect size (slope coefficient), test statistics, p-values, and q-values (false discovery rate).

Matrix eQTL can perform separate analysis for local (cis) and distant (trans) eQTLs.For such analysis one has to set the cis-analysis specific parameters pvOutputThreshold.cis > 0, cisDist, snpspos and genepos in the call of Matrix_eQTL_main function.A gene-SNP pair is considered local if the distance between them is less or equal to cisDist.The genomic location of genes and SNPs is defined by data frames snpspos and genepos.Depending on p-value thresholds pvOutputThreshold and pvOutputThreshold.cis Matrix eQTL runs in one of three different modes:

Matrix_eQTL_engine is a wrapper for Matrix_eQTL_main for eQTL analysis without regard to gene/SNP location and provided for compatibility with the previous versions of the package.

The parameter pvalue.hist allows to record information sufficient to create a histogram or QQ-plot of all the p-values (see plot).


Matrix_eQTL_main( snps, 
                  cvrt = SlicedData$new(), 
                  output_file_name = "", 
                  pvOutputThreshold = 1e-5,
                  useModel = modelLINEAR, 
                  errorCovariance = numeric(), 
                  verbose = TRUE, 
                  output_file_name.cis = "", 
                  pvOutputThreshold.cis = 0,
                  snpspos = NULL, 
                  genepos = NULL,
                  cisDist = 1e6,
                  pvalue.hist = FALSE,
         = FALSE,
                  noFDRsaveMemory = FALSE)

                   cvrt = SlicedData$new(), 
                   pvOutputThreshold = 1e-5, 
                   useModel = modelLINEAR, 
                   errorCovariance = numeric(), 
                   verbose = TRUE,
                   pvalue.hist = FALSE,
          = FALSE,
                   noFDRsaveMemory = FALSE)



SlicedData object with genotype information. Can be real-valued for linear models and must take at most 3 distinct values for ANOVA unless the number of ANOVA categories is set to a higher number (see useModel parameter).


SlicedData object with gene expression information. Must have columns matching those of snps.


SlicedData object with additional covariates. Can be an empty SlicedData object in case of no covariates.The constant is always included in the model and would cause an error if included in cvrt.The order of columns must match those in snps and gene.


character, connection, or NULL. If not NULL, significant associations are saved to this file (all significant associations if pvOutputThreshold=0 or only distant if pvOutputThreshold>0).If the file with this name exists, it is overwritten.


character, connection, or NULL.If not NULL, significant local associations are saved to this file.If the file with this name exists, it is overwritten.


numeric. Significance threshold for all/distant tests.


numeric. Same as pvOutputThreshold, but for local eQTLs.


integer. Eigher modelLINEAR, modelANOVA, or modelLINEAR_CROSS.

  1. Set useModel = modelLINEAR to model the effect of the genotype as additive linear and test for its significance using t-statistic.

  2. Set useModel = modelANOVA to treat genotype as a categorical variables and use ANOVA model and test for its significance using F-test. The default number of ANOVA categories is 3. Set otherwise like this: options(MatrixEQTL.ANOVA.categories=4).

  3. Set useModel = modelLINEAR_CROSS to add a new term to the modelequal to the product of genotype and the last covariate; the significance of this term is then tested using t-statistic.


numeric. The error covariance matrix. Use numeric() for homoskedastic independent errors.


logical. Set to TRUE to display more detailed report about the progress.


data.frame object with information about SNP locations, must have 3 columns - SNP name, chromosome, and position, like this:

snpid chr pos
Snp_01 1 721289
Snp_02 1 752565
... ... ...

data.frame with information about transcript locations, must have 4 columns - the name, chromosome, and positions of the left and right ends, like this:

geneid chr left right
Gene_01 1 721289 731289
Gene_02 1 752565 762565
... ... ... ...

numeric. SNP-gene pairs within this distance are considered local. The distance is measured from the nearest end of the gene. SNPs within a gene are always considered local.


logical, numerical, or "qqplot".Defines whether and how the distribution of p-values is recorded in the returned object.If pvalue.hist = FALSE, the information is not recorded and the analysis is performed faster. Alternatively, set pvalue.hist = "qqplot" to record information sufficient to create a QQ-plot of the p-values (use plot on the returned object to create the plot).To record information for a histogram set pvalue.hist to the desired number of bins of equal size. Finally, pvalue.hist can also be set to a custom set of bin edges.

logical. Set = TRUE to record the minimum p-value for each SNP and each gene in the returned object. The minimum p-values are recorded even if if they are above the corresponding thresholds of pvOutputThreshold and pvOutputThreshold.cis. The analysis runs faster when the parameter is set to FALSE.


logical. Set noFDRsaveMemory = TRUE to save significant gene-SNP pairs directly to the output files, reduce memory footprint and skip FDR calculation. The eQTLs are not recorded in the returned object if noFDRsaveMemory = TRUE.


Note that the columns of gene, snps, and cvrt must match.If they do not match in the input files, use ColumnSubsample method to subset and/or reorder them.


The detected eQTLs are saved in output_file_name and/or output_file_name.cis if they are not NULL.The method also returns a list with a summary of the performed analysis.


Keeps all input parameters and also records the number of degrees of freedom for the full model.

Time difference between the start and the end of the analysis (in seconds).


Information about all detected eQTLs.


Information about detected local eQTLs.


Information about detected distant eQTLs.

The elements all, cis, and trans may contain the following components


Total number of tests performed. This is used for FDR calculation.


Data frame with recorded significant associations. Not available if noFDRsaveMemory=FALSE


Number of significant associations recorded.


Histogram bins used for recording p-value distribution. See pvalue.hist parameter.


Number of p-value that fell in each histogram bin. See pvalue.hist parameter.


Vector with the best p-value for each SNP. See parameter.


Vector with the best p-value for each gene. See parameter.


Andrey Shabalin


The package website:

See Also

The code below is the sample code for eQTL analysis NOT using gene/SNP locations.

See MatrixEQTL_cis_code for sample code for eQTL analysis that separates local and distant tests.


# Matrix eQTL by Andrey A. Shabalin
# Be sure to use an up to date version of R and Matrix eQTL.

# source("Matrix_eQTL_R/Matrix_eQTL_engine.r");

## Location of the package with the data files.
base.dir = find.package('MatrixEQTL');

## Settings

# Linear model to use, modelANOVA, modelLINEAR, or modelLINEAR_CROSS
useModel = modelLINEAR; # modelANOVA, modelLINEAR, or modelLINEAR_CROSS

# Genotype file name
SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");

# Gene expression file name
expression_file_name = paste(base.dir, "/data/GE.txt", sep="");

# Covariates file name
# Set to character() for no covariates
covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");

# Output file name
output_file_name = tempfile();

# Only associations significant at this level will be saved
pvOutputThreshold = 1e-2;

# Error covariance matrix
# Set to numeric() for identity.
errorCovariance = numeric();
# errorCovariance = read.table("Sample_Data/errorCovariance.txt");

## Load genotype data

snps = SlicedData$new();
snps$fileDelimiter = "\t";      # the TAB character
snps$fileOmitCharacters = "NA"; # denote missing values;
snps$fileSkipRows = 1;          # one row of column labels
snps$fileSkipColumns = 1;       # one column of row labels
snps$fileSliceSize = 2000;      # read file in slices of 2,000 rows

## Load gene expression data

gene = SlicedData$new();
gene$fileDelimiter = "\t";      # the TAB character
gene$fileOmitCharacters = "NA"; # denote missing values;
gene$fileSkipRows = 1;          # one row of column labels
gene$fileSkipColumns = 1;       # one column of row labels
gene$fileSliceSize = 2000;      # read file in slices of 2,000 rows

## Load covariates

cvrt = SlicedData$new();
cvrt$fileDelimiter = "\t";      # the TAB character
cvrt$fileOmitCharacters = "NA"; # denote missing values;
cvrt$fileSkipRows = 1;          # one row of column labels
cvrt$fileSkipColumns = 1;       # one column of row labels
if(length(covariates_file_name)>0) {

## Run the analysis

me = Matrix_eQTL_engine(
    snps = snps,
    gene = gene,
    cvrt = cvrt,
    output_file_name = output_file_name,
    pvOutputThreshold = pvOutputThreshold,
    useModel = useModel, 
    errorCovariance = errorCovariance, 
    verbose = TRUE,
    pvalue.hist = TRUE, = FALSE,
    noFDRsaveMemory = FALSE);

## Results:

cat('Analysis done in: ', me$, ' seconds', '\n');
cat('Detected eQTLs:', '\n');

## Plot the histogram of all p-values


[Package MatrixEQTL version 2.1.1 Index]