R: Main function for fast eQTL analysis in MatrixEQTL package

Description

Matrix_eQTL_engine function tests association of every row of the snps dataset with every row of the gene dataset using a linear regression model defined by the useModel parameter (see below).

The testing procedure accounts for extra covariates in cvrt parameter.

The errorCovariance parameter can be set to the error variance-covariance matrix to account for heteroskedastic and/or correlated errors.

Associations significant at pvOutputThreshold (pvOutputThreshold.cis) levels are saved to output_file_name (output_file_name.cis), with corresponding estimates of effect size (slope coefficient), test statistics, p-values, and q-values (false discovery rate).

Matrix eQTL can perform separate analysis for local (cis) and distant (trans) eQTLs.For such analysis one has to set the cis-analysis specific parameters pvOutputThreshold.cis > 0, cisDist, snpspos and genepos in the call of Matrix_eQTL_main function.A gene-SNP pair is considered local if the distance between them is less or equal to cisDist.The genomic location of genes and SNPs is defined by data frames snpspos and genepos.Depending on p-value thresholds pvOutputThreshold and pvOutputThreshold.cis Matrix eQTL runs in one of three different modes:

Set pvOutputThreshold > 0 and pvOutputThreshold.cis = 0 (or use Matrix_eQTL_engine) to perform eQTL analysis without using gene/SNP locations. Associations significant at the pvOutputThreshold level are be recorded in output_file_name and in the returned object.
Set pvOutputThreshold = 0 and pvOutputThreshold.cis > 0 to perform eQTL analysis for local gene-SNP pairs only. Local associations significant at pvOutputThreshold.cis level will be recorded in output_file_name.cis and in the returned object.
Set pvOutputThreshold > 0 and pvOutputThreshold.cis > 0 to perform eQTL analysis with separate p-value thresholds for local and distant eQTLs. Distant and local associations significant at corresponding thresholds are recorded in output_file_name and output_file_name.cis respectively and in the returned object. In this case the false discovery rate is calculated separately for these two sets of eQTLs.

Matrix_eQTL_engine is a wrapper for Matrix_eQTL_main for eQTL analysis without regard to gene/SNP location and provided for compatibility with the previous versions of the package.

The parameter pvalue.hist allows to record information sufficient to create a histogram or QQ-plot of all the p-values (see plot).

Usage

Matrix_eQTL_main( snps, 
                  gene, 
                  cvrt = SlicedData$new(), 
                  output_file_name = "", 
                  pvOutputThreshold = 1e-5,
                  useModel = modelLINEAR, 
                  errorCovariance = numeric(), 
                  verbose = TRUE, 
                  output_file_name.cis = "", 
                  pvOutputThreshold.cis = 0,
                  snpspos = NULL, 
                  genepos = NULL,
                  cisDist = 1e6,
                  pvalue.hist = FALSE,
                  min.pv.by.genesnp = FALSE,
                  noFDRsaveMemory = FALSE)

Matrix_eQTL_engine(snps, 
                   gene, 
                   cvrt = SlicedData$new(), 
                   output_file_name, 
                   pvOutputThreshold = 1e-5, 
                   useModel = modelLINEAR, 
                   errorCovariance = numeric(), 
                   verbose = TRUE,
                   pvalue.hist = FALSE,
                   min.pv.by.genesnp = FALSE,
                   noFDRsaveMemory = FALSE)

Arguments

snps

SlicedData object with genotype information. Can be real-valued for linear models and must take at most 3 distinct values for ANOVA unless the number of ANOVA categories is set to a higher number (see useModel parameter).

gene

SlicedData object with gene expression information. Must have columns matching those of snps.

cvrt

SlicedData object with additional covariates. Can be an empty SlicedData object in case of no covariates.The constant is always included in the model and would cause an error if included in cvrt.The order of columns must match those in snps and gene.

output_file_name

character, connection, or NULL. If not NULL, significant associations are saved to this file (all significant associations if pvOutputThreshold=0 or only distant if pvOutputThreshold>0).If the file with this name exists, it is overwritten.

output_file_name.cis

character, connection, or NULL.If not NULL, significant local associations are saved to this file.If the file with this name exists, it is overwritten.

pvOutputThreshold

numeric. Significance threshold for all/distant tests.

pvOutputThreshold.cis

numeric. Same as pvOutputThreshold, but for local eQTLs.

useModel

integer. Eigher modelLINEAR, modelANOVA, or modelLINEAR_CROSS.

Set useModel = modelLINEAR to model the effect of the genotype as additive linear and test for its significance using t-statistic.
Set useModel = modelANOVA to treat genotype as a categorical variables and use ANOVA model and test for its significance using F-test. The default number of ANOVA categories is 3. Set otherwise like this: options(MatrixEQTL.ANOVA.categories=4).
Set useModel = modelLINEAR_CROSS to add a new term to the modelequal to the product of genotype and the last covariate; the significance of this term is then tested using t-statistic.

errorCovariance

numeric. The error covariance matrix. Use numeric() for homoskedastic independent errors.

verbose

logical. Set to TRUE to display more detailed report about the progress.

snpspos

data.frame object with information about SNP locations, must have 3 columns - SNP name, chromosome, and position, like this:

snpid	chr	pos
Snp_01	1	721289
Snp_02	1	752565
...	...	...

genepos

data.frame with information about transcript locations, must have 4 columns - the name, chromosome, and positions of the left and right ends, like this:

geneid	chr	left	right
Gene_01	1	721289	731289
Gene_02	1	752565	762565
...	...	...	...

cisDist

numeric. SNP-gene pairs within this distance are considered local. The distance is measured from the nearest end of the gene. SNPs within a gene are always considered local.

pvalue.hist

logical, numerical, or "qqplot".Defines whether and how the distribution of p-values is recorded in the returned object.If pvalue.hist = FALSE, the information is not recorded and the analysis is performed faster. Alternatively, set pvalue.hist = "qqplot" to record information sufficient to create a QQ-plot of the p-values (use plot on the returned object to create the plot).To record information for a histogram set pvalue.hist to the desired number of bins of equal size. Finally, pvalue.hist can also be set to a custom set of bin edges.

min.pv.by.genesnp

logical. Set min.pv.by.genesnp = TRUE to record the minimum p-value for each SNP and each gene in the returned object. The minimum p-values are recorded even if if they are above the corresponding thresholds of pvOutputThreshold and pvOutputThreshold.cis. The analysis runs faster when the parameter is set to FALSE.

noFDRsaveMemory

logical. Set noFDRsaveMemory = TRUE to save significant gene-SNP pairs directly to the output files, reduce memory footprint and skip FDR calculation. The eQTLs are not recorded in the returned object if noFDRsaveMemory = TRUE.

Details

Note that the columns of gene, snps, and cvrt must match.If they do not match in the input files, use ColumnSubsample method to subset and/or reorder them.

Value

The detected eQTLs are saved in output_file_name and/or output_file_name.cis if they are not NULL.The method also returns a list with a summary of the performed analysis.

`param`	Keeps all input parameters and also records the number of degrees of freedom for the full model.
`time.in.sec`	Time difference between the start and the end of the analysis (in seconds).
`all`	Information about all detected eQTLs.
`cis`	Information about detected local eQTLs.
`trans`	Information about detected distant eQTLs.

The elements all, cis, and trans may contain the following components

ntests: Total number of tests performed. This is used for FDR calculation.
eqtls: Data frame with recorded significant associations. Not available if noFDRsaveMemory=FALSE
neqtls: Number of significant associations recorded.
hist.bins: Histogram bins used for recording p-value distribution. See pvalue.hist parameter.
hist.counts: Number of p-value that fell in each histogram bin. See pvalue.hist parameter.
min.pv.snps: Vector with the best p-value for each SNP. See min.pv.by.genesnp parameter.
min.pv.gene: Vector with the best p-value for each gene. See min.pv.by.genesnp parameter.

Author(s)

Andrey Shabalin ashabalin@vcu.edu

References

The package website: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/

Examples

# Matrix eQTL by Andrey A. Shabalin
# http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/
# 
# Be sure to use an up to date version of R and Matrix eQTL.

# source("Matrix_eQTL_R/Matrix_eQTL_engine.r");
library(MatrixEQTL)

## Location of the package with the data files.
base.dir = find.package('MatrixEQTL');

## Settings

# Linear model to use, modelANOVA, modelLINEAR, or modelLINEAR_CROSS
useModel = modelLINEAR; # modelANOVA, modelLINEAR, or modelLINEAR_CROSS

# Genotype file name
SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");

# Gene expression file name
expression_file_name = paste(base.dir, "/data/GE.txt", sep="");

# Covariates file name
# Set to character() for no covariates
covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");

# Output file name
output_file_name = tempfile();

# Only associations significant at this level will be saved
pvOutputThreshold = 1e-2;

# Error covariance matrix
# Set to numeric() for identity.
errorCovariance = numeric();
# errorCovariance = read.table("Sample_Data/errorCovariance.txt");


## Load genotype data

snps = SlicedData$new();
snps$fileDelimiter = "\t";      # the TAB character
snps$fileOmitCharacters = "NA"; # denote missing values;
snps$fileSkipRows = 1;          # one row of column labels
snps$fileSkipColumns = 1;       # one column of row labels
snps$fileSliceSize = 2000;      # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name);

## Load gene expression data

gene = SlicedData$new();
gene$fileDelimiter = "\t";      # the TAB character
gene$fileOmitCharacters = "NA"; # denote missing values;
gene$fileSkipRows = 1;          # one row of column labels
gene$fileSkipColumns = 1;       # one column of row labels
gene$fileSliceSize = 2000;      # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name);

## Load covariates

cvrt = SlicedData$new();
cvrt$fileDelimiter = "\t";      # the TAB character
cvrt$fileOmitCharacters = "NA"; # denote missing values;
cvrt$fileSkipRows = 1;          # one row of column labels
cvrt$fileSkipColumns = 1;       # one column of row labels
if(length(covariates_file_name)>0) {
  cvrt$LoadFile(covariates_file_name);
}

## Run the analysis

me = Matrix_eQTL_engine(
    snps = snps,
    gene = gene,
    cvrt = cvrt,
    output_file_name = output_file_name,
    pvOutputThreshold = pvOutputThreshold,
    useModel = useModel, 
    errorCovariance = errorCovariance, 
    verbose = TRUE,
    pvalue.hist = TRUE,
    min.pv.by.genesnp = FALSE,
    noFDRsaveMemory = FALSE);
		
unlink(output_file_name);

## Results:

cat('Analysis done in: ', me$time.in.sec, ' seconds', '\n');
cat('Detected eQTLs:', '\n');
show(me$all$eqtls)

## Plot the histogram of all p-values

plot(me)

Main function for fast eQTL analysis in MatrixEQTL package