Getting started with Matrix eQTL

Sample code and toy data set

The best way to start using Matrix eQTL is to first run the sample code on provided toy dataset. The sample code and data are part of the package and are also available at this page. The package manual contains the sample code at the bottom of the help page for Matrix eQTL main function available via ?Matrix_eQTL_main command.

The sample code performs eQTL analysis of a toy data set consisting of three files: genotype, expression, and covariates. For every gene-SNP pair it runs linear regression analysis accounting for the set of covariates.

Let's go over the sample code line by line.

First step is to load the package:

• library("MatrixEQTL");

The toy data set files are stored with the package at the following location.

• base.dir = find.package("MatrixEQTL");

Then we set the parameters such as selected linear model and names of genotype and expression data files.

• useModel = modelLINEAR; # modelANOVA or modelLINEAR or modelLINEAR_CROSS
• SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");
• expression_file_name = paste(base.dir, "/data/GE.txt", sep="");

A separate file may be provided with extra covariates. In case of no covariates set the variable covariates_file_name to character().

• covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");
• output_file_name = tempfile();

The p-value threshold determines which gene-SNP associations are saved in the output file output_file_name. Note that for larger datasets the threshold should be lower. Setting the threshold to a high value for a large dataset may cause excessively large output files.

• pvOutputThreshold = 1e-2;

Finally, define the covariance matrix for the error term. This parameter is rarely used. If the covariance matrix is a multiple of identity, set it to numeric().

• errorCovariance = numeric();

The next section of the sample code contains three very similar parts loading the files with genotype, gene expression, and covariates. In each part one can set the file delimiter (i.e. tabulation "\t", comma ",", or space " "), the string representation for missing values, the number of rows with column labels, and the number of columns with row labels. Finally, one can change the number of the variables in a slice for the file reading procedure (do not change if not sure).

• snps = SlicedData$new();
• snps$fileDelimiter = "\t";      # the TAB character
• snps$fileOmitCharacters = "NA"; # denote missing values;
• snps$fileSkipRows = 1;          # one row of column labels
• snps$fileSkipColumns = 1;       # one column of row labels
• snps$fileSliceSize = 2000;      # read file in pieces of 2,000 rows
• snps$LoadFile( SNP_file_name );

Finally, the main Matrix eQTL function is called:

• me = Matrix_eQTL_engine( snps = snps, gene = gene, cvrt = cvrt, output_file_name = output_file_name, pvOutputThreshold = pvOutputThreshold, useModel = useModel, errorCovariance = errorCovariance, verbose = TRUE, pvalue.hist = TRUE, min.pv.by.genesnp = FALSE, noFDRsaveMemory = FALSE);

Each significant gene-SNP association is recorded in a separate line in the output file and in the returned object me. In case of cis/trans eQTL analysis described below, two output files are produced, one with cis-eQTLs, another only with trans. Every record contains a SNP name, a transcript name, estimate of the effect size, t- or F-statistic, p-value, and FDR.

Cis- and trans- eQTL analysis

Matrix eQTL can distinguish local (cis-) and distant (trans-) eQTLs and perform separate correction for multiple comparisons for those groups. The second sample code shows how to run such analysis. The main Matrix eQTL function Matrix_eQTL_main requires several extra parameters for cis/trans analysis:

pvOutputThreshold.cis – p-value threshold for cis-eQTLs.
output_file_name.cis – detected cis-eQTLs are saved in this file.
cisDist – maximum distance at which gene-SNP pair is considered local.
snpspos – data frame with information about SNP locations, must have 3 columns - SNP name, chromosome, and position. See sample SNP location file.
genepos – data frame with information about gene locations, must have 4 columns - the name, chromosome, and positions of the left and right ends. See sample gene location file.

For more information see Matrix eQTL reference manual via command ?Matrix_eQTL_main in R or click Matrix_eQTL_main.

Analyzing your own LARGE data (BLAS)

Matrix eQTL is designed to handle large genotype and expression data sets. They are loaded using SlicedData classes which store the data in slices of 1000 rows (default size). The analysis is then performed for each pair of slices of genotype and expression data sets.

Matrix eQTL gains efficiency by expressing the most computationally intensive part of the calculations in terms of large matrix operations, most importantly – matrix multiplications. Thus, using a fast BLAS in R, Revolution R, or Matlab, Matrix eQTL can outperform competitors by several orders of magnitude as presented in the summary table on the main page or the manuscript in Bioinformatics.

Unfortunately, the standard installation of R does not include a fast BLAS. If you are not using Revolution R (on Windows) you may be using an inefficient BLAS. You can improve the performance of large matrix multiplication in R up to 20 times by using a non-standard BLAS. The necessary steps depend on the platform you are using:

Revolution R for Windows: you experience the best performance already.
64 bit R for Windows: you can achieve best performance by using versions of Dr. Kazushige Goto's BLAS by Ei-Ji Nakama.
32 bit R for Windows: you can dowload ATLAS BLAS for 32-bit R.
Linux version of R: consult the R Installation and Administration manual and a relevant blog post by Allan Engelhardt.
Mac OS: probably the same as for Linux. Let me know if there is a better solution.

Getting started with Matrix eQTL

Contents

Installing Matrix eQTL

Sample code and toy data set

Cis- and trans- eQTL analysis

Analyzing your own data

Analyzing your own LARGE data (BLAS)

More Matrix eQTL