SNPMStat: Statistical Analysis of SNP-Disease Association with Missing Genotype Data

SNPMStat is a program for the statistical analysis of SNP-disease association in case-control studies with potentially missing genotype data. For SNPs without missing data, the program performs the standard association analysis and provides the estimated odds ratios and standard error estimates, together with the Armitage trend tests and p-values. For typed SNPs with missing data or untyped SNPs, the program performs the maximum-likelihood analysis described in Lin, Hu and Huang (American Journal of Human Genetics, 2008) and provides the estimated odds ratios and standard error estimates, together with the Wald statistics and p-values.

The current release performs single-SNP analysis under the additive, recessive or dominant mode of inheritance without environmental factors. The related software interface HAPSTAT allows very general analysis (including multiple-SNP analysis, all modes of inheritance, Hardy-Weinberg disequilibrium, all study designs and phenotypes, and gene-environment interactions), but requires the user to specify the set of SNPs used to infer the unknown genotypes of the SNP with missing data. We are working intensely to improve the capabilities of SNPMStat, so please check back frequently for updates.

General information

The program is written in standard C. Executable files are available for Windows and Linux in the download section.

Input

The program can handle both typed SNPs with partially missing data and untyped SNPs. A reference panel is used for untyped SNPs, but not for typed SNPs. The data from the case-control study and reference panel are placed in separate files. All input files are placed in the same directory as the executable file.

With reference panel (default)

For untyped SNPs, the program requires two input files named case_control.dat and reference.dat, and accepts an optional input file named phase.dat. Each file contains text data in a tabular (row-column) format, with rows representing SNPs and columns representing subjects (or chromosomes in phase.dat).

case_control.dat

The first line of the input file case_control.dat provides the disease status information for all study subjects (1=disease, 0=no disease). The main body of the file follows the format

SNP_identifier physical_position strand_orientation nucleotide1 nucleotide2 genotype_1 ... genotype_n

If strand orientation information is not available, every strand_orientation field should be set to 0. If this information is available, flag 1 indicates that the strand orientation in the study data differs from that of the reference panel (so the program will switch the allele coding), and flag 0 indicates strand consistency. In particular, if all genotypes in the reference panel are on the forward strand, then flag 1 means that the SNP in the case-control study was recorded on the reverse strand. The strand orientation information is required only for C/G and A/T SNPs; for all other types of SNPs, this field can be left as 0.

nucleotide1 and nucleotide2 are the nucleotides of the SNP and should be in alphabetical order. For example, if the two nucleotides are G and A, then nucleotide1 is A and nucleotide2 is G. The genotypes are coded as 0, 1 and 2, referring to the count of nucleotide1. Missing genotypes should be coded as 9.

After aligning the SNPs according to the strand information provided, the program performs an internal check to see whether the strand alignment can be determined from (a) the allele labels (at SNPs other than A/T and C/G) and (b) the allele frequencies (at A/T and C/G SNPs). SNPs that cannot be aligned are removed from the data. The check and the removal can be turned off with the -no_fix and -no_remove flags, respectively, which are described in the list of options.
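The coding rules for a SNP row can be illustrated with a minimal Python sketch. The helper names (order_nucleotides, is_strand_ambiguous, encode_genotype) are hypothetical and are not part of SNPMStat, and the two-letter genotype strings such as "AG" are an assumed raw input format used only for illustration.

```python
MISSING = 9  # code for a missing genotype in case_control.dat

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def order_nucleotides(x, y):
    """Return (nucleotide1, nucleotide2) in alphabetical order."""
    n1, n2 = sorted((x.upper(), y.upper()))
    return n1, n2


def is_strand_ambiguous(nucleotide1, nucleotide2):
    """True for A/T and C/G SNPs, whose two alleles are complements of
    each other, so the allele labels alone cannot reveal the strand."""
    return COMPLEMENT[nucleotide1] == nucleotide2


def encode_genotype(alleles, nucleotide1, nucleotide2):
    """Code a raw two-letter genotype (assumed input, e.g. "AG") as the
    count of nucleotide1; None or "" is treated as missing."""
    if not alleles:
        return MISSING
    calls = alleles.upper()
    if len(calls) != 2 or any(c not in (nucleotide1, nucleotide2) for c in calls):
        raise ValueError(f"unexpected genotype {alleles!r}")
    return calls.count(nucleotide1)


n1, n2 = order_nucleotides("G", "A")
print(n1, n2, [encode_genotype(g, n1, n2) for g in ("AA", "AG", "GG", None)])
# prints: A G [2, 1, 0, 9]
```

The ambiguity helper shows why the strand flag matters only for A/T and C/G SNPs: for any other allele pair, a strand flip produces labels that no longer match the reference panel, so the flip is detectable from the labels alone.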

Example format for input file case_control.dat.
0 1 1 0 0 0 0 1 0 1 0 0 1 0 ···  
rs3892957   53295262   0   A G   0 0 0 0 0 1 1 0 0 0 0 0 2 1 1 0 ···
rs570046   53295430   0   C T   2 2 2 2 2 1 1 2 2 2 2 2 0 1 1 2 ···
rs3894276   53295746   0   G T   0 0 0 0 0 1 1 0 0 0 0 0 2 1 1 0 ···
··· ··· ···
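As a sanity check before running the program, a file in this layout can be read with a short Python sketch such as the one below. It assumes whitespace-separated fields exactly as described in this section; the names SnpRecord and parse_case_control are hypothetical, and the four-subject sample is made up for illustration.

```python
from typing import List, NamedTuple, Tuple


class SnpRecord(NamedTuple):
    snp_id: str
    position: int
    strand_flag: int
    nucleotide1: str
    nucleotide2: str
    genotypes: List[int]


def parse_case_control(text: str) -> Tuple[List[int], List[SnpRecord]]:
    """Parse the disease-status line and the SNP rows of case_control.dat."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    status = [int(tok) for tok in lines[0].split()]
    records = []
    for ln in lines[1:]:
        tok = ln.split()
        genotypes = [int(g) for g in tok[5:]]
        # Every SNP row must carry one genotype per study subject.
        if len(genotypes) != len(status):
            raise ValueError(f"{tok[0]}: expected {len(status)} genotypes")
        # Valid codes are 0, 1, 2 (count of nucleotide1) and 9 (missing).
        if any(g not in (0, 1, 2, 9) for g in genotypes):
            raise ValueError(f"{tok[0]}: genotype codes must be 0, 1, 2 or 9")
        records.append(
            SnpRecord(tok[0], int(tok[1]), int(tok[2]), tok[3], tok[4], genotypes)
        )
    return status, records


sample = """0 1 1 0
rs3892957 53295262 0 A G 0 0 1 9
rs570046 53295430 0 C T 2 2 1 2
"""
status, records = parse_case_control(sample)
print(status, records[0].snp_id, records[0].genotypes)
```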

reference.dat

SNPMStat allows the subjects in the reference panel to be either trios (default) or unrelated (specified by the -ur option). For trios, the file follows the format

SNP_identifier physical_position nucleotide1 nucleotide2 trio_1 trio_2 ···

physical_position should be in the same ascending or descending order as in case_control.dat. nucleotide1, nucleotide2 and the genotypes follow the same rules as in case_control.dat. For each trio, the child's genotype is entered last. For a reference panel of unrelated subjects, the format is the same except that there is only one subject per family.

Example format for input file reference.dat.
rs3892957 53295262   A G   0 0 0 0 0 0 0 1 1 1 1 1 1 ···
rs570046 53295430   C T   0 0 9 0 0 9 1 1 1 1 1 1 1 ···
rs8087214 53295793   G T   2 2 2 2 2 2 2 2 2 2 2 2 2 ···
rs12963672 53295932   A T   0 0 0 0 0 0 0 0 0 0 0 0 0 ···
··· ··· ··· ···
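The grouping of genotype columns into families can be sketched in Python as follows. This is only an illustration of the layout described above: the function name parse_reference_line is hypothetical, and because the text states only that the child comes last, the first two members of each trio are simply treated as the two parents in file order.

```python
def parse_reference_line(line, trios=True):
    """Split one reference.dat row into its header fields and families.

    With trios=True, each family is a tuple (parent, parent, child);
    with trios=False (the -ur layout), each family holds one subject.
    """
    tok = line.split()
    snp_id, position = tok[0], int(tok[1])
    nucleotide1, nucleotide2 = tok[2], tok[3]
    genotypes = [int(g) for g in tok[4:]]
    size = 3 if trios else 1
    if len(genotypes) % size:
        raise ValueError(f"{snp_id}: genotype count is not a multiple of {size}")
    families = [
        tuple(genotypes[i:i + size]) for i in range(0, len(genotypes), size)
    ]
    return snp_id, position, nucleotide1, nucleotide2, families


row = "rs570046 53295430 C T 0 0 9 0 0 9 1 1 1"
snp_id, pos, n1, n2, families = parse_reference_line(row)
children = [family[-1] for family in families]  # the child is entered last
print(snp_id, families, children)
```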

phase.dat (-p)

Phasing information in the reference panel is incorporated by using the option -p, which makes the program read the file phase.dat. The SNPs with phase information can be a subset of those with genotype information, which is a typical situation with the HapMap database. The file format is

SNP_identifier phase1_1 phase2_1 phase1_2 phase2_2 ... phase1_n phase2_n

SNP_identifier should be in the same order as in reference.dat. Each subject contributes two columns (phase1_i phase2_i, i = 1, ..., n) with 0/1 coding, referring to the count of nucleotide1 as in reference.dat. Typically, only phasing information on founders (mothers and fathers) is provided.

Example format for input file phase.dat.
rs3892957 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 1 1 0 0 0 ···
rs570046 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 0 0 0 1 1 1 ···
rs8087214 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ···
rs12963672 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 ···
··· ··· ···
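The two-columns-per-subject layout can be illustrated with a minimal Python sketch. The function names are hypothetical. The second helper rests on an inference from the coding rule rather than on documented program behavior: since each phase value counts nucleotide1 on one chromosome, the two values of a subject should sum to that subject's 0/1/2 genotype code.

```python
def parse_phase_line(line):
    """Return the SNP identifier and one (phase1, phase2) pair per subject."""
    tok = line.split()
    values = [int(v) for v in tok[1:]]
    if len(values) % 2:
        raise ValueError(f"{tok[0]}: expected two phase columns per subject")
    if any(v not in (0, 1) for v in values):
        raise ValueError(f"{tok[0]}: phase values must be 0 or 1")
    pairs = [(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    return tok[0], pairs


def genotypes_from_phase(pairs):
    """Unphased genotype implied by each pair (count of nucleotide1)."""
    return [first + second for first, second in pairs]


snp_id, pairs = parse_phase_line("rs3892957 0 0 0 1 1 1")
print(snp_id, pairs, genotypes_from_phase(pairs))
# prints: rs3892957 [(0, 0), (0, 1), (1, 1)] [0, 1, 2]
```

Comparing these implied genotypes with the founder genotypes in reference.dat is one way to confirm that the two files use the same nucleotide1 coding.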

Without reference panel (-nr)

If the option -nr (no reference) is used, the program requires only case_control.dat. In this situation, the strand orientation entry has no meaning and can simply be filled with 0s.

Mode of inheritance

The additive mode of inheritance is assumed by default. One can specify the recessive or dominant mode of inheritance by using the option -rec or -dom, respectively.
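For readers unfamiliar with these terms, the sketch below shows the conventional way a 0/1/2 genotype code is turned into a covariate under each mode. It assumes that the allele being counted (nucleotide1) is the effect allele; the program's internal coding is not documented here, so this illustrates the standard definitions rather than SNPMStat's implementation. The function name genotype_covariate is hypothetical.

```python
def genotype_covariate(genotype, mode="additive"):
    """Map a genotype (count of nucleotide1: 0, 1 or 2) to a covariate.

    additive : the allele count itself
    dominant : 1 if at least one copy is carried, else 0
    recessive: 1 only if two copies are carried, else 0
    """
    if genotype not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2")
    if mode == "additive":
        return genotype
    if mode == "dominant":
        return int(genotype >= 1)
    if mode == "recessive":
        return int(genotype == 2)
    raise ValueError(f"unknown mode {mode!r}")


for mode in ("additive", "dominant", "recessive"):
    print(mode, [genotype_covariate(g, mode) for g in (0, 1, 2)])
```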

Output (-out)

Computational results are written to the default output file results.out, unless the option -out is used to specify a different file name.

For each untyped SNP, the output shows the SNP identifier, physical position, nucleotide1, nucleotide2, the M_D measure between the SNP and a set of typed SNPs with the best prediction, the frequency of nucleotide1 in the reference panel, the estimate of the log odds ratio, its standard error estimate, the standard-normal test statistic, and the corresponding p-value. nucleotide2 is taken as the reference in the analysis. Any untyped SNPs with allele frequencies of 0.0 or 1.0 in the reference panel are excluded from the analysis.

For each genotyped SNP, the output shows the proportion of non-missing genotypes and the frequency of nucleotide1 in the case-control data in place of the M_D measure and the frequency of nucleotide1 in the reference panel. Any SNPs with allele frequencies of 0.0 or 1.0 in the case-control data are excluded.

For untyped SNPs and typed SNPs with missing data, the test statistic is the Wald statistic. For typed SNPs without missing data, the Armitage trend test is used. Results for SNPs with very low minor allele frequencies may be unstable and should be viewed with great caution, especially for untyped SNPs or typed SNPs with substantial missingness.
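For rows reporting a Wald statistic, the estimate, standard error, test statistic and p-value are linked by the usual textbook relations, sketched below in Python. The function name wald_summary is hypothetical, the two-sided p-value and the 95% interval are standard conventions assumed here rather than taken from the program, and the sketch does not apply to the Armitage trend test used for fully typed SNPs.

```python
import math

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution


def wald_summary(log_odds_ratio, standard_error):
    """Return (z, two-sided p-value, odds ratio, 95% CI for the odds ratio)."""
    if standard_error <= 0:
        raise ValueError("standard error must be positive")
    z = log_odds_ratio / standard_error
    # For Z ~ N(0, 1): P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    odds_ratio = math.exp(log_odds_ratio)
    ci = (
        math.exp(log_odds_ratio - Z_975 * standard_error),
        math.exp(log_odds_ratio + Z_975 * standard_error),
    )
    return z, p_value, odds_ratio, ci


z, p, odds_ratio, ci = wald_summary(0.35, 0.12)  # made-up values
print(f"z={z:.3f} p={p:.4g} OR={odds_ratio:.3f} CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

Because nucleotide2 is the reference, an odds ratio above 1 corresponds to higher disease odds associated with nucleotide1.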

Imputation (-impute)

SNPMStat also allows the imputation (using the option -impute) of genotypes for untyped SNPs and missing values for typed SNPs. The disease status line in case_control.dat is ignored in this task. Untyped SNPs with allele frequencies of 0.0 or 1.0 in the reference panel are excluded, and so are the typed SNPs with allele frequencies of 0.0 or 1.0 in the case-control data. The posterior probabilities of genotypes, in the order of AA, AG and GG for an A/G SNP, are output to imp_geno.out by default, unless the intended file name is specified by -out_imp. The first column shows the proportion of non-missing genotypes for typed SNPs and 'no' for untyped SNPs. In the presence of the option -impute, the association analysis can be suppressed by using -notest. This is mandatory when there is no information on disease status, in which case the study data file is still named "case_control.dat" and its first line contains an arbitrary integer (e.g., 0) for each subject.
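Downstream tools often need a single number per genotype rather than three probabilities. The Python sketch below derives two common summaries from one triple of posterior probabilities given in the order described above (for an A/G SNP: AA, AG, GG, that is, two, one and zero copies of nucleotide1). The function name summarize_posterior and the tolerance are hypothetical choices; the sketch takes the three probabilities directly and does not parse imp_geno.out, whose complete column layout is not described here.

```python
def summarize_posterior(probs, tol=1e-3):
    """Return (expected count of nucleotide1, most probable genotype code).

    probs is (P(two copies), P(one copy), P(zero copies)) of nucleotide1.
    """
    if len(probs) != 3 or any(p < 0 for p in probs):
        raise ValueError("need three non-negative probabilities")
    if abs(sum(probs) - 1.0) > tol:
        raise ValueError("probabilities should sum to 1")
    copies = (2, 1, 0)  # genotype code matching each position in probs
    dosage = sum(c * p for c, p in zip(copies, probs))
    best_guess = copies[max(range(3), key=lambda i: probs[i])]
    return dosage, best_guess


dosage, best_guess = summarize_posterior((0.7, 0.2, 0.1))  # made-up values
print(round(dosage, 3), best_guess)
# prints: 1.6 2
```

The expected count keeps the uncertainty of the imputation, whereas the most probable genotype discards it, so the former is usually the safer input for further analysis.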

List of options

| Option | Default | Description | Example |
|---|---|---|---|
| -nr | | No reference panel used | SNPMStat -nr |
| -ur | | Unrelated subjects in reference panel | SNPMStat -ur |
| -p | | File phase.dat provided | SNPMStat -p |
| -rec | | Recessive model assumed | SNPMStat -rec |
| -dom | | Dominant model assumed | SNPMStat -dom |
| -out | results.out | Specify the output file for association test results | SNPMStat -out ra.out |
| -no_fix | | Turn off internal check | SNPMStat -no_fix |
| -no_remove | | Turn off default removal of SNPs that cannot be aligned | SNPMStat -no_remove |
| -speed | | 1) Skip analysis of untyped SNPs not in phase.dat; 2) exclude SNPs not in phase.dat as predictors. This significantly speeds up the program, but may lose important untyped SNPs. This flag is meaningful only when -p is specified. | SNPMStat -p -speed |
| -impute | | Carry out imputation. | SNPMStat -p -speed -impute |
| -out_imp | imp_geno.out | Specify the output file for imputed genotypes. This flag works with -impute. | SNPMStat -impute -out_imp ra_imp.out |
| -notest | | Suppress the association tests when only the genotype imputation is desired. This flag works with -impute. | SNPMStat -impute -notest |
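When SNPMStat is driven from a script, the dependencies between options can be checked before launching a long run. The Python sketch below only assembles an argument list. The function name build_command and its keyword names are hypothetical, treating a violated dependency as an error is a design choice of this sketch (the program itself may simply ignore such flags), and whether the program is sensitive to the order of options is not documented.

```python
def build_command(executable="SNPMStat", *, no_reference=False, unrelated=False,
                  phase=False, no_fix=False, no_remove=False, speed=False,
                  mode="additive", out=None, impute=False, out_imp=None,
                  notest=False):
    """Assemble an SNPMStat argument list from the documented options."""
    if mode not in ("additive", "rec", "dom"):
        raise ValueError("mode must be 'additive', 'rec' or 'dom'")
    if speed and not phase:
        raise ValueError("-speed is meaningful only when -p is specified")
    if (out_imp or notest) and not impute:
        raise ValueError("-out_imp and -notest work with -impute")
    args = [executable]
    for enabled, flag in ((no_reference, "-nr"), (unrelated, "-ur"),
                          (phase, "-p"), (no_fix, "-no_fix"),
                          (no_remove, "-no_remove"), (speed, "-speed")):
        if enabled:
            args.append(flag)
    if mode != "additive":  # additive is the default and needs no flag
        args.append("-" + mode)
    if out:
        args += ["-out", out]
    if impute:
        args.append("-impute")
    if out_imp:
        args += ["-out_imp", out_imp]
    if notest:
        args.append("-notest")
    return args


print(" ".join(build_command(phase=True, speed=True, impute=True,
                             out_imp="ra_imp.out")))
# prints: SNPMStat -p -speed -impute -out_imp ra_imp.out
```

The returned list can be passed to subprocess.run from the directory that holds the executable and the input files.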

Example

A simulated test data set mimicking the region of SNPs 1090-1190 in the RA study is provided in the case_control.dat example file. The reference.dat file contains the part of the HapMap data covering that chromosomal region, and the phase.dat file provides the phasing information for reference.dat. The SNPMStat output is given in ra.out and ra_imp.out, which were obtained via the following command:

SNPMStat -p -no_remove -speed -out ra.out -impute -out_imp ra_imp.out

It takes less than 3 minutes to finish the analysis on a computer with a 3.6GHz CPU and 4GB of memory. A plot of log10(p-values) is provided in ra.bmp.

The example files can also be run using other commands (results not shown). We list below some of the possibilities to illustrate the typical time range.

| Command | Data Files Used | Time Cost |
|---|---|---|
| SNPMStat | case_control.dat, reference.dat | 2.7 hrs |
| SNPMStat -nr | case_control.dat | 1.8 hrs |
| SNPMStat -p -no_remove | case_control.dat, reference.dat, phase.dat | 35 mins |
| SNPMStat -p -no_remove -speed | case_control.dat, reference.dat, phase.dat | 3 mins |

Reference

Lin DY, Hu Y, Huang BE. 2008. Simple and efficient analysis of SNP-disease association with missing genotype data. The American Journal of Human Genetics, 82: 444-452.

Download

SNPMStat for Windows [updated 13 January 2009]

executable » SNPMStat.exe

SNPMStat for Linux [updated 13 January 2009]

executable (zip archive) » SNPMStat.zip
static binary executable (zip archive) » SNPMStat_static.zip

Example files [updated 13 January 2009]

zip archive » SNPMStat.example.zip

Version History

Version 1.0 (Oct. 2007)
  • First version released.

Version 1.1 (May 8, 2008)
  Bug fix:
  • Fixed a bug in the logistic regression analysis. This bug only affected the results for typed SNPs without missing values.

Version 2.0 (Jul. 9, 2008)
  New features:
  • Requires additional columns in the input data files for nucleotide information and strand information.
  • Added an internal check of the strand orientation between the reference panel and the case-control data files. Added -no_fix and -no_remove to control the internal check.
  • Revised the example.
  • Added the supplementary program SNPMStat_HM to convert the reference panel from the HapMap database to the required format.

Version 2.1 (Sep. 29, 2008)
  Bug fix:
  • Fixed a bug in reading case_control.dat when no reference panel is available. This bug only affected the results when the option -nr was specified.

Version 3.0 (Oct. 14, 2008)
  Bug fix:
  • SNPs that are in case_control.dat but not in reference.dat are now included in the analysis.
  New features:
  • Added the supplementary program SNPMStat_CC to convert the case-control data from PLINK format to the required format.
  • Required the physical position column for case_control.dat.
  • Changed the format of case_control.dat when there is no reference panel to be the same as the one with a reference panel.
  • Relaxed the requirement of an rs number to allow any SNP identifier.
  • Added a column with nucleotide coding information to the output file.
  • Added the option -dom to allow the analysis of dominant effects.
  • Added the option -speed.

Version 3.1 (Jan. 13, 2009)
  New features:
  • Added the option -impute to allow the imputation of untyped SNPs or missing values of typed SNPs.
  • Added the option -out_imp to specify the output file for imputed genotypes.
  • Added the option -notest to allow the suppression of the association analysis.

SNPMStat_HM

If the HapMap Phase I or II database is used, the input files pertaining to the reference panel can be prepared using our supplementary program SNPMStat_HM.

SNPMStat_CC

To facilitate the preparation of the case_control.dat file, we provide the supplementary program SNPMStat_CC to convert the transposed PED file (.tped and .tfam in PLINK file format) into case_control.dat.
