SCORE-Seq: Score-Type Tests for Detecting Disease Associations With Rare Variants in Sequencing Studies

SCORE-Seq is a command-line program which implements the methods of Lin and Tang (2011) for detecting disease associations with rare variants in sequencing studies. The mutation information is aggregated across multiple variant sites of a gene through a weighted linear combination and then related to disease phenotypes through appropriate regression models. The weights can be constant or dependent on allele frequencies and phenotypes. The association testing is based on score-type statistics. The allele-frequency threshold can be fixed or variable. Statistical significance can be assessed by using asymptotic normal approximation or resampling. A detailed description of the methods is given in Lin and Tang (2011). The current release covers binary and continuous traits with arbitrary covariates under case-control and cross-sectional sampling. The newest version was released on May 2, 2012 with some new features. We are working intensely to improve the capabilities of SCORE-Seq, so please check back frequently for updates.

General information

SCORE-Seq is a command-line program written in the C language which implements the methods of Lin and Tang (2011) for detecting disease associations with rare variants in sequencing studies. In the software, various tests are conducted for each gene. There are options for the minor allele frequency (MAF) upper bound, the call rate (CR) lower bound and the minor allele count (MAC) lower bound. A gene is removed from the analysis if the MAFs of all its variants are greater than the MAF upper bound, the CRs of all its variants are less than the CR lower bound or its MAC is less than the MAC lower bound. By default, the MAF upper bound is 0.05, the CR lower bound is 0 and the MAC lower bound is 5. The MAFs may be determined internally (i.e., calculated from the genotype file) or externally (i.e., input in the mapping file). Under the additive genetic model (default), the test statistics are based on one or several sets of genetic scores that are calculated by a weighted sum of mutation counts for each subject. A set of genetic scores corresponds to a specifically defined weight function. A description of the genetic score and weight function for each test is given in the OPTIONS section below. A fixed-threshold test only involves one set of genetic scores in the test statistic, while a variable threshold test involves multiple sets of genetic scores. We perform three fixed-threshold tests (T1, T5 and Fp) plus one variable-threshold test (VT test). The software also outputs the p-value of the EREC test (for detecting variants with opposite effects) if resampling is turned on. In addition, the T1, T5 and VT tests under the dominant genetic model can be obtained by using the option -CMC. In that case, all the tests based on the additive genetic model are suppressed. Besides the rare variant analysis described above, the user can conduct single variant analysis for common SNPs by using the option -com. To suppress the rare variant analysis, use the option -noRare. Asymptotic p-values are provided by default while resampling p-values can be generated by using the option -resample.

SYNOPSIS

SCORE-Seq [-CMC] [-pfile phenofile] [-gfile genofile] [-mfile mapfile] [-ofile outfile] [-vtlog vtlog] [-msglog msglog] [-MAF MAF_UB] [-MAC MAC_LB] [-CR CR_LB] [-resample R] [-EREC delta] [-noRare] [-com MAF_C] [-ofileC outfile_com]

OPTIONS

Option Parameter Default Description
-CMC No Use the dominant instead of the additive model.
-pfile {phenofile} pheno.txt Specify the phenotype-covariate file.
-gfile {genofile} geno.txt Specify the genotype file.
-mfile {mapfile} mapping.txt Specify the gene-SNP mapping file for the rare variant analysis.
-ofile {outfile} rare.out Specify the output file for the rare variant analysis.
-vtlog {vtlog} vt.log Specify the log file for VT tests.
-msglog {msglog} No Specify the log file for warning messages
-MAF {MAF_UB} 0.05 Specify the MAF upper bound, which is any number between 0 and 1.
-MAC {MAC_LB} 5 Specify the MAC lower bound, which is any integer.
-CR {CR_LB} 0 Specify the call rate lower bound, which is any number between 0 and 1.
-resample {R} No Turn on resampling and specify the maximum number of resamples. If R is set to -1, then the default of 1 million resamples is applied; otherwise, R should be an integer between 1 million and 100 millions. In the latter case, the software will perform resampling up to R times for any resampling test that has a p-value < 1e-4 after 1 million resamples.
-EREC {delta} 1 for binary trait;
2 for standardized    
continuous trait
Specify the constant delta for the EREC test. This option is effective only when resampling is turned on.
-noRare No Turn off the rare variant analysis.
-com {MAF_C} No Turn on the single variant analysis for common SNPs (MAF > MAF_C).
-ofileC {outfile_com} com.out Sepcify the output file for the single variant analysis.

Suppose that there are m SNPs with call rates greater than CR_LB within a gene. Under the additive model, the genetic score for subject i is calculated by the following formula:

Eq.1

where xji is the number of the minor allele for subject i at SNP j, ξj is the weight for SNP j. If xji is missing, its value is imputed by the mean number of the minor allele at SNP j.

T1 test: Set the weight to 1 if MAF < 1% and to 0 otherwise.

T5 test: Set the weight to 1 if MAF < 5% and to 0 otherwise.

Fp test: Set the weight to

1/sqrt(p*(1-p))

if MAF < MAF_UB and to 0 otherwise, where p is the estimated MAF with pseudo counts in the pooled sample if the MAFs are calculated internally; otherwise, p is the MAF from the external source.

VT test: Set the weight to 1 if MAF ≤ threshold and to 0 otherwise. The thresholds consist of the unique MAFs that are less than MAF_UB and aggregate at least MAC_LB mutations. If the number of qualified thresholds exceeds 20, we reduce the number to ≤ 20 for the asymptotic test by rounding the MAFs. If the number of thresholds is still more than 20 after rounding, we skip the lowest thresholds until the number of thresholds is 20.

EREC test: Set the weight to the estimated regression coefficient plus or minus delta if MAF < MAF_UB and to 0 otherwise. The test statistic is maximized over the two sets of weights.

Under the dominant model, the T1, T5 and VT tests remain valid and can be performed by specifying the option -CMC. We set si to 1 if there is at least one mutation among the aggregating variants and to 0 otherwise.

INPUT FILES

In all input files, the 1st column contains the ID (subject ID in the phenotype file; SNP ID in the genotype file; gene ID in the gene-SNP mapping file). No missing values are allowed in the ID. Field separator character should be tab-delimiter.

PHENOTYPE FILE

The rows represent subjects. The 1st column contains the subject ID; the 2nd column contains the quantitative or binary phenotype; the remaining columns are covariates if there are any. The format for the binary phenotype is 0/1. The order of subjects in the rows should be identical to the order of subjects in the columns of the genotype file. No missing values are allowed. All variables should be numeric.

GENOTYPE FILE

The rows represent SNPs and the columns represent subjects. The 1st column contains the SNP ID. The genotype format is 0/1/2. Missing values are denoted as “NA”. For each SNP, there is at lease one non-missing value across subjects.

MAPPING FILE

This file contains two or three columns. The first column is the gene ID, and the second column is the SNP ID. Each row represents a unique gene-SNP pair. No missing values are allowed. The third column contains the external MAFs if they are to be used in the analysis. If the user only requests the single variant analysis, the mapping file is not required.

OUTPUT

OUTPUT FILE OF THE RARE VARIANT ANALYSIS

If the output value is invalid for some reason, output ‘NA’. The format of the output file is as follows:

Output of the rare variant analysis

where R indicates resampling. Columns 2-10 show the p-values. The remaining columns contain the score statistics (U), their variance estimates (V), and test statistics (stat) for all the fix-threshold tests. Under the dominant model, only the results for T1, T5, and VT will be provided.

OUTPUT FILE OF THE SINGLE VARIANT ANALYSIS

Output of the single variant analysis

VT LOG

For each gene, VT LOG contains detailed information of VT tests. The following is an example for one gene:

************* gene SDHB *************
MAF threshold   0.001134   0.005669   0.006803   0.00907   0.010204   ...
#SNP included   7          8          9          10        12         ...
#minor allele   7          12         18         26        44         ...
score stat      1.41       1.10       0.93       1.40      1.53       ...

WARNING MESSAGES

In some situations, p-values cannot be calculated.

  1. If a gene only contains common SNPs (i.e., MAF > MAF_UB), all its SNPs have call rate less than CR_LB or its MAC is less than MAC_LB, then no p-value can be calculated. In this case, the gene will not show up in any output file (rare.out or vt.log).
    "GENE x Warning: All SNP MAFs > <MAF_UB> or call rate < <CR_LB> or minor allele count < <MAC_LB>, so pvalues are not calculated."
  2. For any of the observed test statistics, if the variance estimate is 0, no p-value is calculated for the test. In the following warning message, “<>” can be “T1”, “T5”, “Fp”, or “VT”.
    "GENE x Warning: Variance estimate=0 for observed <> test statistic, so the pvalue cannot be calculated."

In other situations, p-values can be calculated but may not be accurate.

  1. For asymptotic VT tests, the calculation of the p-value depends on an accurate multivariate normal (mvn) probability evaluation. If the mvn probability is not accurate, the p-value is problematic.
    "GENE x Warning: The asymptotic VT pvalue may not be accurate."
  2. If the default maximum number of resamples (i.e., 1 million) is applied and the resampling p-value < 1e-4, a warning message will alert you that the p-value may not be accurate. In the following warning message, “<>” can be “T1”, “T5”, “Fp”, “VT”, or “EREC”.
    "GENE x Warning: The resampling pvalue of <> test < 1e-4 after 1 million resamples, so the pvalue may not be accurate."

EXAMPLE

The example includes a genotype file "genoP.txt", a phenotype file "phenoP.txt", a mapping file "mappingP.txt". Enter the command

$ SCORE-Seq -pfile phenoP.txt -gfile genoP.txt -mfile mappingP.txt -ofile rareP.out -vtlog vtP.log -msglog msg.log -MAF 0.05 -MAC 5 -resample -1

to obtain the results given in "rareP.out" and "vtP.log", with the warning messages in "msg.log".

MULTIPLE CPUs

If there are a large number of genes, it is preferable to use multiple CPUs. To this end, the user can divide the mapping file into several files, each of which contains a subset of genes. Then the software can be run separately for each mapping file on a different CPU. In this way, it takes approximately 1/K (K is the number of jobs that are split into) of the original time to complete the whole set of analysis. We provide a R script to illustrate this approach. In that example, 100 mapping files are created from the original mapping file and 100 jobs are submitted simultaneously to multiple CPUs.

REFERENCE

Lin, D. Y. and Tang, Z. Z. (2011). A General Framework for Detecting Disease Associations With Rare Variants in Sequencing Studies. American Journal of Human Genetics, 89, 354-367.

DOWNLOAD

SCORE-Seq for 64-bit x86 based Linux [updated May 2, 2012]

executable (zip archive) » SCORE-Seq-3.0-linux-64.zip

Example files [updated May 2, 2012]

zip archive » SCORE-Seq-3.0-example.zip

VERSION HISTORY

Version Date Description
1 August 4, 2011 First version released
1.0.4 September 1, 2011 Revised version; added many new features
1.0.5 September 9, 2011 Minor updates
2 November 3, 2011 Changed the constant for the EREC from delta to +/- delta.
Allowed the use of external MAFs in the analysis.
Allowed single variant analysis for common SNPs.
2.0.1 December 6, 2011 Revised the format of the p-value output file.
Fixed a bug in handling output files.
3.0 May 2, 2012 Added the option “-CR” for the call rate lower bound.
Added T1, T5 and VT under the dominant model (-CMC).
Removed Tmax.
Output score statistics, their variance estimates and test statistics for fixed-threshold tests.
Output the original test statistics instead of their absolute values in the VT LOG file.
Invalid values are shown as “NA” instead of “99”.
may 11 2012
SCORE-SeqTDS
new software
may 2 2012
SCORE-Seq
new version
march 7 2012
MASS
new software
may 23 2011
SPREG 2.0
new version
july 13 2010
tagIMPUTE
new software
july 13 2010
SNPMStat 4.0
new version
february 23 2010
GWASelect
new software
february 17 2010
MAOS 1.4
new software version
october 14 2008
hapstat 3.0
command-line executable for Linux