SNPMStat: Statistical Analysis of SNP-Disease Association with Missing Genotype Data
SNPMStat is a program for the statistical analysis of SNP-disease association in case-control studies with potentially missing genotype data. For SNPs without missing data, the program performs the standard association analysis and provides the estimated odds ratios and standard error estimates, together with the Armitage trend tests and p-values. For typed SNPs with missing data or untyped SNPs, the program performs the maximum-likelihood analysis described in Lin, Hu and Huang (American Journal of Human Genetics, 2008) and provides the estimated odds ratios and standard error estimates, together with the Wald statistics and p-values. The current release performs single-SNP analysis under an additive, recessive or dominant mode of inheritance without environmental factors. The related software interface HAPSTAT allows very general analysis (including multiple-SNP analysis, all modes of inheritance, Hardy-Weinberg disequilibrium, all study designs and phenotypes, and gene-environment interactions), but requires the user to specify the set of SNPs used to infer the unknown genotypes of the SNP with missing data. We are working intensely to improve the capabilities of SNPMStat, so please check back frequently for updates.
General information
The program is written in standard C. Executable files are available for Windows and Linux in the download section.
Input
The program can handle both typed SNPs with partially missing data and untyped SNPs. A reference panel is used for untyped SNPs, but not for typed SNPs. The data from the case-control study and reference panel are placed in separate files. All input files are placed in the same directory as the executable file.
With reference panel (default)
For untyped SNPs, the program requires two input files named case_control.dat and reference.dat, and accepts an optional input file named phase.dat. Each file contains text data in a tabular (row-column) format, with rows representing SNPs and columns representing subjects (or chromosomes in phase.dat).
case_control.dat
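The first line of the input file case_control.dat provides the disease status information for all study subjects (1=disease, 0=no disease). In the main body of the file, each row describes one SNP: the SNP identifier, the physical position, strand_orientation, nucleotide1 and nucleotide2, followed by one genotype per subject.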
If strand orientation information is not available, all strand_orientation fields should be set to 0. If this information is available, a flag of 1 indicates that the strand orientation in the study data differs from that of the reference panel (so the program will switch the allele coding), and a flag of 0 indicates strand consistency. In particular, if all the genotypes in the reference panel are on the forward strand, then flag 1 means that the SNP in the case-control study was recorded on the reverse strand. The strand orientation information is required only for C/G and A/T SNPs. For all other types of SNPs, this field can be left as 0.
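The reason the flag matters only for C/G and A/T SNPs can be illustrated with a short sketch. Reading a SNP on the opposite strand replaces each allele with its complement, and for C/G and A/T SNPs the complemented pair has the same labels as the original pair, so the labels alone cannot reveal a strand mismatch. The function names below are illustrative and are not part of SNPMStat.

```python
# Watson-Crick complements: reading the other strand swaps A<->T and C<->G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip_strand(allele_a, allele_b):
    """Return the allele pair as it would be read on the opposite strand."""
    return COMPLEMENT[allele_a], COMPLEMENT[allele_b]


def is_strand_ambiguous(allele_a, allele_b):
    """True when the two alleles are complements of each other (A/T or C/G)."""
    return COMPLEMENT[allele_a] == allele_b


# An A/G SNP becomes T/C on the other strand: the labels expose the flip.
print(flip_strand("A", "G"), is_strand_ambiguous("A", "G"))
# A C/G SNP becomes G/C: same two labels, so the flip is invisible.
print(flip_strand("C", "G"), is_strand_ambiguous("C", "G"))
```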
nucleotide1 and nucleotide2 are the nucleotides of the SNP and should be in alphabetical order. For example, if the two nucleotides are G and A, then nucleotide1 is A and nucleotide2 is G. The genotypes are coded as 0, 1 and 2, referring to the count of nucleotide1. Missing genotypes should be coded as 9.
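A minimal sketch of this coding rule is shown here. It assumes that each raw genotype call is available as a two-letter string such as "AG", with an empty string for a missing call; that representation and the function names are assumptions made for illustration only.

```python
MISSING = 9  # code for a missing genotype, as described in the text


def order_alleles(allele_a, allele_b):
    """Return (nucleotide1, nucleotide2) in alphabetical order."""
    nucleotide1, nucleotide2 = sorted((allele_a, allele_b))
    return nucleotide1, nucleotide2


def code_genotype(call, nucleotide1, nucleotide2):
    """Code a two-letter call as the count of nucleotide1 (0, 1 or 2)."""
    if not call:
        return MISSING
    if len(call) != 2 or any(b not in (nucleotide1, nucleotide2) for b in call):
        raise ValueError(f"unexpected genotype call: {call!r}")
    return call.count(nucleotide1)


n1, n2 = order_alleles("G", "A")  # nucleotide1 is A, nucleotide2 is G
print([code_genotype(c, n1, n2) for c in ("AA", "AG", "GA", "GG", "")])
```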
After aligning the SNPs according to the strand information provided, the program performs an internal check of whether the strand alignment can be determined from (a) the allele labels (at SNPs other than A/T and C/G) and (b) the allele frequencies (at A/T and C/G SNPs). SNPs that cannot be aligned are removed from the data. Both the check and the removal can be turned off with command-line options (see List of options).
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | ··· |
| rs3892957 | 53295262 | 0 | A | G | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | ··· |
| rs570046 | 53295430 | 0 | C | T | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 0 | 1 | 1 | 2 | ··· |
| rs3894276 | 53295746 | 0 | G | T | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | ··· |
| ··· | ··· | ··· |
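Before running the program, it can be useful to check that a case_control.dat file has the layout described above. The sketch below assumes that fields are separated by whitespace (the delimiter is not stated in this document) and uses made-up toy rows. It does not restrict the values on the first line, since that line may hold arbitrary integers when only imputation is requested.

```python
def parse_case_control(text):
    """Parse case_control.dat content, assuming whitespace-separated fields."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("empty file")
    status = [int(token) for token in rows[0]]  # first line: one value per subject
    snps = []
    for fields in rows[1:]:
        snp_id, position, strand, n1, n2 = fields[:5]
        genotypes = [int(token) for token in fields[5:]]
        if len(genotypes) != len(status):
            raise ValueError(f"{snp_id}: {len(genotypes)} genotypes, "
                             f"{len(status)} subjects")
        if any(g not in (0, 1, 2, 9) for g in genotypes):
            raise ValueError(f"{snp_id}: genotypes must be 0, 1, 2 or 9")
        if strand not in ("0", "1"):
            raise ValueError(f"{snp_id}: strand_orientation must be 0 or 1")
        if n1 >= n2:
            raise ValueError(f"{snp_id}: nucleotides not in alphabetical order")
        snps.append({"id": snp_id, "position": int(position),
                     "strand": int(strand), "nucleotides": (n1, n2),
                     "genotypes": genotypes})
    return status, snps


# Toy data for illustration only (not taken from the example files).
toy = "0 1 1 0\nsnp_a 1000 0 A G 0 1 2 9\nsnp_b 2000 0 C T 2 2 1 0\n"
status, snps = parse_case_control(toy)
print(status, [s["id"] for s in snps])
```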
reference.dat
SNPMStat allows the subjects in the reference panel to be either trios (default) or unrelated (specified by the -ur option). For trios, each row of the file gives the SNP identifier, the physical position, nucleotide1 and nucleotide2, followed by the genotypes of the reference subjects.
Physical_position should be in the same ascending or descending order as in case_control.dat. nucleotide1, nucleotide2 and the genotypes follow the same rules as in case_control.dat. For each trio, the child's genotype is entered last. For a reference panel of unrelated subjects, the format is the same except that there is only one subject per family.
| rs3892957 | 53295262 | A | G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | ··· |
| rs570046 | 53295430 | C | T | 0 | 0 | 9 | 0 | 0 | 9 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ··· |
| rs8087214 | 53295793 | G | T | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | ··· |
| rs12963672 | 53295932 | A | T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ··· |
| ··· | ··· | ··· | ··· |
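Two properties of reference.dat described above are easy to check with a small script: the genotype columns of a trio panel come in groups of three with the child last, and the physical positions run in the same direction as in case_control.dat. The helper names below are hypothetical, and the order of the two parents within a trio is not specified in this document.

```python
def split_into_trios(genotypes):
    """Group a row's genotype columns into (parent, parent, child) triples."""
    if len(genotypes) % 3 != 0:
        raise ValueError("trio panel needs a multiple of 3 genotype columns")
    return [tuple(genotypes[i:i + 3]) for i in range(0, len(genotypes), 3)]


def direction(positions):
    """Return 'ascending', 'descending', or None if positions are not monotone."""
    pairs = list(zip(positions, positions[1:]))
    if all(a < b for a, b in pairs):
        return "ascending"
    if all(a > b for a, b in pairs):
        return "descending"
    return None


def same_direction(study_positions, reference_positions):
    """True when both files list SNPs in the same monotone order."""
    d = direction(study_positions)
    return d is not None and d == direction(reference_positions)


print(split_into_trios([0, 0, 0, 0, 1, 1]))
print(same_direction([100, 200, 300], [150, 250]))
```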
phase.dat (-p)
Phasing information in the reference panel is incorporated by using the option -p, which makes the program read the file phase.dat. The SNPs with phase information can be a subset of those with genotype information, which is a typical situation with the HapMap database. Each row of the file gives the SNP_identifier followed by the phase columns.
SNP_identifier should be in the same order as in reference.dat. Each subject contributes two columns (phase1_i phase2_i, i = 1, ..., n) with 0/1 coding, referring to the count of nucleotide1 as in reference.dat. Typically, only phasing information on founders (mothers and fathers) is provided.
| rs3892957 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | ··· |
| rs570046 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | ··· |
| rs8087214 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ··· |
| rs12963672 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ··· |
| ··· | ··· | ··· |
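Because each of a subject's two phase columns is the count (0 or 1) of nucleotide1 on one chromosome, their sum is that subject's genotype in the 0/1/2 coding. The sketch below shows this relationship for one row; it is an illustration of the coding, not of how SNPMStat uses the file internally.

```python
def haplotypes_to_genotypes(phase_values):
    """Sum consecutive column pairs (phase1_i, phase2_i) into genotypes."""
    if len(phase_values) % 2 != 0:
        raise ValueError("each subject must contribute two phase columns")
    if any(v not in (0, 1) for v in phase_values):
        raise ValueError("phase columns must be coded 0 or 1")
    return [phase_values[i] + phase_values[i + 1]
            for i in range(0, len(phase_values), 2)]


# Three subjects with haplotype pairs (0,0), (0,1) and (1,1).
print(haplotypes_to_genotypes([0, 0, 0, 1, 1, 1]))
```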
Without reference panel (-nr)
If the option -nr (no reference) is used, the program requires only case_control.dat. In this situation, the strand orientation entry has no meaning and can simply be filled with 0s.
Mode of inheritance
The additive model of inheritance is assumed by default. One can specify the recessive or dominant mode of inheritance by using the option -rec or -dom, respectively.
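The three modes differ in how a genotype is turned into the covariate of the association model. The sketch below uses the usual textbook coding with respect to nucleotide1; this document does not spell out the program's internal coding, so this is an assumption rather than a description of SNPMStat.

```python
def genetic_score(genotype, mode="additive"):
    """Map a 0/1/2 genotype (count of nucleotide1) to a model covariate."""
    if genotype not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2")
    if mode == "additive":
        return genotype                    # each copy adds the same effect
    if mode == "dominant":
        return 1 if genotype >= 1 else 0   # one copy is enough
    if mode == "recessive":
        return 1 if genotype == 2 else 0   # two copies are needed
    raise ValueError(f"unknown mode: {mode}")


for mode in ("additive", "dominant", "recessive"):
    print(mode, [genetic_score(g, mode) for g in (0, 1, 2)])
```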
Output (-out)
Computational results are written to the default output file results.out, unless the option -out is used to specify a different file name.
For each untyped SNP, the output shows the SNP identifier, physical position, nucleotide1, nucleotide2, the M_D measure between the SNP and a set of typed SNPs with the best prediction, the frequency of nucleotide1 in the reference panel, the estimate of the log odds ratio, its standard error estimate, the standard-normal test statistic, and the corresponding p-value. nucleotide2 is taken as the reference in the analysis. Any untyped SNPs with allele frequencies of 0.0 or 1.0 in the reference panel are excluded from the analysis.
For each genotyped SNP, the output shows the proportion of non-missing genotypes and the frequency of nucleotide1 in the case-control data in place of the M_D measure and the frequency of nucleotide1 in the reference panel. Any SNPs with allele frequencies of 0.0 or 1.0 in the case-control data are excluded.
For untyped SNPs and typed SNPs with missing data, the test statistic is the Wald statistic. For typed SNPs without missing data, the Armitage trend test is used. The results for very low minor allele frequencies may not be stable and should be viewed with great caution, especially for untyped SNPs or typed SNPs with substantial missingness.
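As a rough guide to how the reported columns relate, the sketch below computes a Wald statistic and two-sided p-value from a log odds ratio and its standard error, and a z-form Cochran-Armitage trend statistic with scores 0, 1, 2 from genotype counts. These are the standard textbook formulas; the exact variance convention used by SNPMStat is not stated in this document, so small numerical differences from the program's output are possible. The function names are illustrative.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard-normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def wald_test(log_odds_ratio, standard_error):
    """Wald statistic (estimate / standard error) and its p-value."""
    z = log_odds_ratio / standard_error
    return z, two_sided_p(z)


def armitage_trend_z(case_counts, control_counts):
    """Trend statistic from counts indexed by genotype code 0, 1, 2."""
    scores = (0, 1, 2)
    totals = [a + b for a, b in zip(case_counts, control_counts)]
    n, r = sum(totals), sum(case_counts)
    s_tr = sum(t * c for t, c in zip(scores, case_counts))
    s_tn = sum(t * c for t, c in zip(scores, totals))
    s_ttn = sum(t * t * c for t, c in zip(scores, totals))
    numerator = n * s_tr - r * s_tn
    denominator = r * (n - r) * (n * s_ttn - s_tn ** 2)
    return math.sqrt(n) * numerator / math.sqrt(denominator)


z, p = wald_test(0.5, 0.25)
print(z, p)
z_trend = armitage_trend_z((10, 20, 20), (20, 20, 10))
print(z_trend, two_sided_p(z_trend))
```

The square of the trend statistic is the familiar one-degree-of-freedom chi-square form of the test.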
Imputation (-impute)
SNPMStat also allows the imputation (using the option -impute) of genotypes for untyped SNPs and missing values for typed SNPs. The disease status line in case_control.dat is ignored in this task. Untyped SNPs with allele frequencies of 0.0 or 1.0 in the reference panel are excluded, and so are the typed SNPs with allele frequencies of 0.0 or 1.0 in the case-control data. The posterior probabilities of genotypes, in the order of AA, AG and GG for an A/G SNP, are output to imp_geno.out by default, unless the intended file name is specified by -out_imp. The first column shows the proportion of non-missing genotypes for typed SNPs and 'no' for untyped SNPs. In the presence of the option -impute, the association analysis can be suppressed by using -notest. This is mandatory when there is no information on disease status, in which case the study data file is still named "case_control.dat" and its first line contains an arbitrary integer (e.g., 0) for each subject.
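The three posterior probabilities for a SNP and subject can be summarized as an expected count of nucleotide1 or as a single most likely genotype. The sketch below assumes the probabilities have already been read from the imputation output as three numbers in the order described above (for an A/G SNP: AA, AG, GG); the column layout of the file beyond that order is not described here, and the function names are hypothetical.

```python
def expected_dosage(p_hom1, p_het, p_hom2, tolerance=1e-3):
    """Expected count of nucleotide1 given P(11), P(12), P(22)."""
    if abs(p_hom1 + p_het + p_hom2 - 1.0) > tolerance:
        raise ValueError("posterior probabilities should sum to 1")
    return 2.0 * p_hom1 + 1.0 * p_het


def best_guess(p_hom1, p_het, p_hom2):
    """Most probable genotype in the 0/1/2 coding (count of nucleotide1)."""
    posterior = {2: p_hom1, 1: p_het, 0: p_hom2}
    return max(posterior, key=posterior.get)


# Example for an A/G SNP: P(AA)=0.7, P(AG)=0.2, P(GG)=0.1.
print(expected_dosage(0.7, 0.2, 0.1), best_guess(0.7, 0.2, 0.1))
```

The expected dosage keeps the uncertainty of the imputation, whereas the best-guess genotype discards it.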
List of options
| Option | Default | Description | Example |
|---|---|---|---|
| -nr | | No reference panel used | |
| -ur | | Unrelated subjects in reference panel | |
| -p | | File phase.dat provided | |
| -rec | | Recessive model assumed | |
| -dom | | Dominant model assumed | |
| -out | results.out | Specify the output file for association test results | |
| | | Turn off internal check | |
| | | Turn off default removal of SNPs that cannot be aligned | |
| | | 1) Skip analysis of untyped SNPs not in phase.dat; 2) Exclude SNPs not in phase.dat as predictors. This will significantly speed up the program, but may lose important untyped SNPs. This flag is meaningful only when -p is specified. | |
| -impute | | Carry out imputation. | |
| -out_imp | imp_geno.out | Specify the output file for imputed genotypes. This flag works with -impute. | |
| -notest | | Suppress the association tests when only the genotype imputation is desired. This flag works with -impute. | |
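When the program is driven from a script, the options named in this document can be assembled programmatically. The sketch below is a hypothetical helper, not part of SNPMStat: the executable name `./SNPMStat` and the assumption that -out and -out_imp take the file name as the following argument are guesses that should be checked against the actual program. Only the dependency of -out_imp and -notest on -impute comes from the descriptions above.

```python
def build_command(executable="./SNPMStat", no_reference=False, unrelated=False,
                  phase=False, mode="additive", out=None,
                  impute=False, out_imp=None, notest=False):
    """Assemble an argument list from the options described in the text."""
    if (out_imp is not None or notest) and not impute:
        raise ValueError("-out_imp and -notest work only with -impute")
    flags = {"additive": None, "recessive": "-rec", "dominant": "-dom"}
    if mode not in flags:
        raise ValueError(f"unknown mode: {mode}")
    args = [executable]  # assumed executable name
    if no_reference:
        args.append("-nr")
    if unrelated:
        args.append("-ur")
    if phase:
        args.append("-p")
    if flags[mode]:
        args.append(flags[mode])
    if out is not None:
        args += ["-out", out]  # assumed syntax: file name follows the flag
    if impute:
        args.append("-impute")
    if out_imp is not None:
        args += ["-out_imp", out_imp]  # assumed syntax, as for -out
    if notest:
        args.append("-notest")
    return args


print(" ".join(build_command(unrelated=True, mode="dominant",
                             out="my_results.out")))
```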
Example
A simulated test data set mimicking the region of SNPs 1090-1190 in the RA study is provided in the case_control.dat example file. The reference.dat file contains the part of the HapMap data covering that chromosomal region and the phase.dat file provides the phasing information of reference.dat. The SNPMStat output is given in ra.out and ra_imp.out, which were obtained via the following command
It takes less than 3 minutes to finish the analysis on a computer with a 3.6 GHz CPU and 4 GB of memory. A plot of log10(p-values) is provided in ra.bmp.
The example files can also be run using other commands (results not shown). We list below some of the possibilities to illustrate the typical time range.
| Command | Data Files Used | Time Cost |
|---|---|---|
| | case_control.dat, reference.dat | 2.7 hrs |
| | case_control.dat | 1.8 hrs |
| | case_control.dat, reference.dat, phase.dat | 35 mins |
| | case_control.dat, reference.dat, phase.dat | 3 mins |
Reference
Lin DY, Hu Y, Huang BE. 2008. Simple and efficient analysis of SNP-disease association with missing genotype data. The American Journal of Human Genetics, 82: 444-452.
Download
SNPMStat for Windows [updated 13 January 2009]
SNPMStat for Linux [updated 13 January 2009]
static binary executable (zip archive) » SNPMStat_static.zip
Example files [updated 13 January 2009]
Version History
| Version | Date | Description |
|---|---|---|
| 1.0 | Oct. 2007 | First version released |
| 1.1 | May 8, 2008 | Bug fix |
| 2.0 | Jul. 9, 2008 | New features |
| 2.1 | Sep. 29, 2008 | Bug fix |
| 3.0 | Oct. 14, 2008 | |
| 3.1 | Jan. 13, 2009 | |
SNPMStat_HM
If the HapMap Phase I or II database is used, the input files pertaining to the reference panel can be prepared using our supplementary program SNPMStat_HM.
SNPMStat_CC
To facilitate the preparation of the case_control.dat file, we provide the supplementary program SNPMStat_CC to convert the transposed PED file (.tped and .tfam in PLINK file format) into case_control.dat.
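For readers who want to see what such a conversion involves, here is a minimal sketch based on the standard PLINK formats: a .tfam line has six fields with the phenotype last (1 = unaffected, 2 = affected), and a .tped line has chromosome, SNP identifier, genetic distance and base-pair position followed by two alleles per subject, with 0 for a missing allele. This is not the SNPMStat_CC implementation; it sets strand_orientation to 0 for every SNP and rejects SNPs that do not show exactly two alleles.

```python
def status_from_tfam(tfam_text):
    """Build the disease-status line (1=disease, 0=no disease) from .tfam."""
    status = []
    for line in tfam_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        phenotype = fields[5]
        if phenotype == "2":
            status.append(1)
        elif phenotype == "1":
            status.append(0)
        else:
            raise ValueError(f"missing phenotype for subject {fields[1]}")
    return status


def tped_line_to_row(line):
    """Convert one .tped line into the fields of a case_control.dat row."""
    fields = line.split()
    snp_id, position, alleles = fields[1], fields[3], fields[4:]
    observed = sorted(set(alleles) - {"0"})  # alphabetical order
    if len(observed) != 2:
        raise ValueError(f"{snp_id}: expected exactly two alleles")
    nucleotide1, nucleotide2 = observed
    genotypes = []
    for a, b in zip(alleles[0::2], alleles[1::2]):
        if "0" in (a, b):
            genotypes.append(9)  # missing genotype
        else:
            genotypes.append((a == nucleotide1) + (b == nucleotide1))
    return [snp_id, position, "0", nucleotide1, nucleotide2] + \
           [str(g) for g in genotypes]


print(status_from_tfam("f1 i1 0 0 1 2\nf2 i2 0 0 2 1"))
print(" ".join(tped_line_to_row("1 snp1 0 1000 A A A G G G 0 0")))
```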