SNPMStat :: Statistical Analysis of SNP-Disease Association with Missing Genotype Data

SNPMStat is a program for the statistical analysis of SNP-disease association in case-control studies with potentially missing genotype data. For SNPs without missing data, the program performs the standard association analysis and provides the estimated odds ratios and standard error estimates, together with the Armitage trend tests and p-values. For typed SNPs with missing data or untyped SNPs, the program performs the maximum-likelihood analysis described in Lin, Hu and Huang (American Journal of Human Genetics, 2008) and provides the estimated odds ratios and standard error estimates, together with the Wald statistics and p-values. The current release performs single-SNP analysis under either additive or recessive mode of inheritance without environmental factors. The related software interface HAPSTAT allows very general analysis (including multiple-SNP analysis, all modes of inheritance, Hardy-Weinberg disequilibrium, all study designs and phenotypes, and gene-environment interactions), but requires the user to specify the set of SNPs used to infer the unknown genotypes of the SNP with missing data. We are working intensely to improve the capabilities of SNPMStat, so please check back frequently for updates.

General information

The program is written in standard C. Executable files are available for Windows and Linux in the download section.

Input

The program can handle both typed SNPs with partially missing data and untyped SNPs. A reference panel is used for untyped SNPs, but not for typed SNPs. The data from the case-control study and reference panel are placed in separate files. All input files are placed in the same directory as the executable file.

With reference panel (default)

For untyped SNPs, the program requires two input files named case_control.dat and reference.dat, and accepts an optional input file named phase.dat. Each file contains text data in a tabular (row-column) format, with rows representing SNPs and columns representing subjects (or chromosomes in phase.dat).

case_control.dat

The first line of the input file case_control.dat provides the disease status information for all study subjects (1=disease, 0=no disease). The main body of the file follows the format

rsnumber strand_orientation nucleotide1 nucleotide2 genotype_1 ... genotype_n.

The rsnumber field should omit the 'rs' prefix. If no strand orientation information is available, set every strand_orientation field to 0. If the information is available, flag 1 indicates that the strand orientation in the study data differs from that of the reference panel (so the program switches the allele coding), and flag 0 indicates that the strands are consistent. In particular, if all genotypes in the reference panel are on the forward strand, then flag 1 means that the SNP in the study was recorded on the reverse strand. Strand orientation information is required only for C/G and A/T SNPs; for all other types of SNPs, this field can be left as 0.

nucleotide1 and nucleotide2 are the two nucleotides of the SNP, listed in alphabetical order. For example, if the two nucleotides are G and A, then nucleotide1 is A and nucleotide2 is G. Genotypes are coded as 0, 1 or 2, the number of copies of nucleotide1. Missing genotypes should be coded as 9.

After aligning the SNPs according to the strand information provided, the program performs an internal check to see whether the strand alignment can be determined from (a) the allele labels (at SNPs other than A/T and C/G) and (b) the allele frequencies (at A/T and C/G SNPs). SNPs that cannot be aligned are removed from the data. The check and the removal can be turned off with the -no_fix and -no_remove options, respectively; see the summary of options.
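The genotype coding can be illustrated with a short Python sketch. It is not part of SNPMStat; the function names encode_genotype and ordered_nucleotides are hypothetical helpers introduced here only to show how a pair of observed alleles maps to the 0/1/2/9 coding when nucleotide1 and nucleotide2 are in alphabetical order.

```python
MISSING = 9  # code for a missing genotype


def ordered_nucleotides(x, y):
    """Return the two nucleotides in alphabetical order: (nucleotide1, nucleotide2)."""
    return tuple(sorted((x, y)))


def encode_genotype(allele_a, allele_b, nucleotide1, nucleotide2):
    """Return the number of copies of nucleotide1 (0, 1 or 2).

    Returns 9 when either allele is not one of the two nucleotides,
    which this sketch treats as a missing call (an assumption).
    """
    if nucleotide1 >= nucleotide2:
        raise ValueError("nucleotide1 and nucleotide2 must be in alphabetical order")
    valid = {nucleotide1, nucleotide2}
    if allele_a not in valid or allele_b not in valid:
        return MISSING
    # Booleans add as integers, so this counts copies of nucleotide1.
    return (allele_a == nucleotide1) + (allele_b == nucleotide1)
```

For a SNP with nucleotides G and A, ordered_nucleotides("G", "A") gives ("A", "G"), and a heterozygous subject is coded as 1.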

Example format for input file case_control.dat with reference panel.
0 1 1 0 0 0 0 1 0 1 0 0 1 0 ···  
3892957   0   A G   0 0 0 0 0 1 1 0 0 0 0 0 2 1 1 0 ···
570046   0   C T   2 2 2 2 2 1 1 2 2 2 2 2 0 1 1 2 ···
3894276   0   G T   0 0 0 0 0 1 1 0 0 0 0 0 2 1 1 0 ···
··· ··· ···
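As a way to check a file before running the program, the layout above can be read with a few lines of Python. This is a sketch under the format described in this section, not code from SNPMStat; the names StudySnp and parse_case_control are hypothetical, and the sketch assumes whitespace-separated fields and a complete file (no elided columns).

```python
from typing import NamedTuple


class StudySnp(NamedTuple):
    rsnumber: str
    strand_flag: int
    nucleotide1: str
    nucleotide2: str
    genotypes: list


def parse_case_control(lines):
    """Parse case_control.dat (with reference panel) from an iterable of lines.

    Returns (status, snps): the 0/1 disease status of each subject and
    one StudySnp record per SNP row.
    """
    rows = [line.split() for line in lines if line.strip()]
    status = [int(tok) for tok in rows[0]]
    if any(s not in (0, 1) for s in status):
        raise ValueError("disease status must be 0 or 1")
    snps = []
    for fields in rows[1:]:
        rs, flag, n1, n2 = fields[:4]
        genotypes = [int(tok) for tok in fields[4:]]
        if len(genotypes) != len(status):
            raise ValueError("SNP %s: expected %d genotypes" % (rs, len(status)))
        if any(g not in (0, 1, 2, 9) for g in genotypes):
            raise ValueError("SNP %s: genotypes must be 0, 1, 2 or 9" % rs)
        snps.append(StudySnp(rs, int(flag), n1, n2, genotypes))
    return status, snps
```

A typical use would be to open the file and pass the file object directly, for example parse_case_control(open("case_control.dat")).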

reference.dat

SNPMStat allows the subjects in the reference panel to be either trios (default) or unrelated (specified by the -ur option). For trios, the file follows the format

rsnumber physical_position nucleotide1 nucleotide2 Fathergenotype Mothergenotype Childgenotype.

Again, the rsnumber field should omit the “rs” prefix. physical_position should be in either ascending or descending order. nucleotide1, nucleotide2 and the genotypes follow the same rules as in case_control.dat. For each trio, the child's genotype is entered last. For a reference panel of unrelated subjects, the format is the same except that there is only one subject per family.

Example format for input file reference.dat.
3892957 53295262   A G   0 0 0 0 0 0 0 1 1 1 1 1 1 ···
570046 53295430   C T   2 2 2 2 2 2 2 1 1 1 1 1 1 ···
8087214 53295793   G T   2 2 2 2 2 2 2 2 2 2 2 2 2 ···
12963672 53295932   A T   0 0 0 0 0 0 0 0 0 0 0 0 0 ···
··· ··· ··· ···
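The trio layout can be made concrete with a small Python sketch. It is illustrative only and not taken from SNPMStat; parse_reference_line, split_trios and is_monotonic are hypothetical names. The sketch groups the genotype columns into (father, mother, child) triples, or into single-subject families when the panel is unrelated, and checks the ordering requirement on physical_position.

```python
def split_trios(genotypes):
    """Group genotype columns into (father, mother, child) triples."""
    if len(genotypes) % 3 != 0:
        raise ValueError("trio data must contain a multiple of 3 genotypes")
    return [tuple(genotypes[i:i + 3]) for i in range(0, len(genotypes), 3)]


def parse_reference_line(line, unrelated=False):
    """Parse one SNP row of reference.dat.

    Returns (rsnumber, physical_position, nucleotide1, nucleotide2, families).
    """
    fields = line.split()
    rs, pos, n1, n2 = fields[0], int(fields[1]), fields[2], fields[3]
    genotypes = [int(tok) for tok in fields[4:]]
    if unrelated:
        families = [(g,) for g in genotypes]  # one subject per family
    else:
        families = split_trios(genotypes)
    return rs, pos, n1, n2, families


def is_monotonic(positions):
    """True if positions are in ascending or in descending order."""
    pairs = list(zip(positions, positions[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)
```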

phase.dat

Phasing information for the reference panel is incorporated with the option -p, which tells the program to read the file phase.dat. The SNPs with phase information can be a subset of those with genotype information, which is the typical situation with the HapMap database. The file format is

rsnumber phase1_1 phase2_1 phase1_2 phase2_2 ... phase1_n phase2_n.

The rsnumber values should appear in the same order as in reference.dat. Each subject contributes two columns (phase1_i phase2_i, i = 1, ..., n) with 0/1 coding, referring to the count of nucleotide1 as in reference.dat. Typically, only phasing information on founders (mothers and fathers) is provided.

Example format for input file phase.dat.
3892957 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 1 1 0 0 0 ···
570046 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 0 0 0 1 1 1 ···
8087214 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ···
12963672 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 ···
··· ··· ···
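The two-columns-per-subject layout can be sketched in Python as follows. This is an illustration, not SNPMStat code; parse_phase_line and genotypes_from_phase are hypothetical names. Because each phase value is the count of nucleotide1 on one chromosome, the sketch assumes that the two values of a subject sum to that subject's 0/1/2 genotype.

```python
def parse_phase_line(line):
    """Parse one SNP row of phase.dat.

    Returns (rsnumber, pairs), where pairs holds one
    (phase1_i, phase2_i) tuple per subject.
    """
    fields = line.split()
    rs = fields[0]
    alleles = [int(tok) for tok in fields[1:]]
    if len(alleles) % 2 != 0:
        raise ValueError("each subject must contribute two phase columns")
    if any(a not in (0, 1) for a in alleles):
        raise ValueError("phase values must be 0 or 1")
    # Even-indexed columns are phase1_i, odd-indexed columns are phase2_i.
    pairs = list(zip(alleles[0::2], alleles[1::2]))
    return rs, pairs


def genotypes_from_phase(pairs):
    """Count of nucleotide1 per subject, obtained by summing the two chromosomes."""
    return [a + b for a, b in pairs]
```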

Without reference panel

If the option -nr (no reference) is used, the program requires only one input file, case_control.dat, and the optional phase.dat is not used. In this case, the second column of case_control.dat holds the physical_position of the SNP instead of the strand_orientation.

Example format for input file case_control.dat without reference panel.
0 1 1 0 0 0 0 1 0 1 0 0 1 0 ···  
3892957 53295262   A G   0 0 0 0 0 1 1 0 0 0 0 0 2 1 1 0 ···
570046 53295430   C T   2 2 2 2 2 1 1 2 2 2 2 2 0 1 1 2 ···
3894276 53296375   G T   0 0 0 0 0 1 1 0 0 0 0 0 2 1 1 0 ···
··· ··· ··· ···

Mode of inheritance

The additive model of inheritance is assumed by default. One can specify the recessive mode of inheritance by using the option -rec.
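The difference between the two modes can be pictured as a difference in how a genotype enters the model. The Python sketch below uses the textbook definitions and is not taken from the SNPMStat source: it assumes that the genotype is scored on the count of nucleotide1, that the additive score is that count, and that the recessive score is 1 only for two copies. The function name genotype_score is hypothetical, and which allele the program treats as the reference allele is not stated here.

```python
def genotype_score(genotype, recessive=False):
    """Covariate value for a 0/1/2 genotype under a mode of inheritance.

    Additive (default): the number of copies of the allele.
    Recessive: 1 if the subject carries two copies, otherwise 0.
    Both definitions are the usual textbook ones (an assumption).
    """
    if genotype not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2")
    if recessive:
        return 1 if genotype == 2 else 0
    return genotype
```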

Output

Computational results are written to the file results.out by default. Use the option -out to specify a different file name.

For each untyped SNP, the output shows the rs number, chromosomal position, M_D measure between this SNP and a set of typed SNPs with the best prediction, frequency of nucleotide1 in the reference panel, estimate of the log odds ratio, standard error estimate, standard-normal test statistic, and the corresponding p-value. Any untyped SNPs with allele frequencies of 0.0 or 1.0 in the reference panel are excluded from the analysis.

For each genotyped SNP, the output shows the proportion of non-missing genotypes, rs number, chromosomal position, frequency of nucleotide1 in the case-control data, estimate of the log odds ratio, standard error estimate, standard-normal test statistic, and the corresponding p-value. Any SNPs with allele frequencies of 0.0 or 1.0 in the case-control data are excluded.

For untyped SNPs and typed SNPs with missing data, the test statistic is the Wald statistic. For typed SNPs with no missing data, the Armitage trend test is used. The results for very low minor allele frequencies may not be stable and should be viewed with great caution, especially for untyped SNPs or typed SNPs with substantial missingness.
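The relationship among the reported columns can be sketched in Python. This is a generic illustration using standard formulas, not the SNPMStat implementation: the Wald statistic is taken as the log odds ratio divided by its standard error, and the Armitage trend statistic is computed through the well-known identity chi-square = N r^2, where r is the correlation between genotype and disease status. The function names are hypothetical, and reporting the trend test as a signed square root is a choice made for this sketch.

```python
import math


def two_sided_normal_p(z):
    """Two-sided p-value for a standard-normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def wald_test(log_odds_ratio, standard_error):
    """Return (z, p) for a Wald test of log odds ratio = 0."""
    if standard_error <= 0:
        raise ValueError("standard error must be positive")
    z = log_odds_ratio / standard_error
    return z, two_sided_normal_p(z)


def armitage_trend_test(genotypes, status):
    """Return (z, p) for the trend test on complete 0/1/2 genotypes.

    Uses chi-square = N * r^2 and reports z as the square root of the
    chi-square, signed by the direction of the association.
    """
    if len(genotypes) != len(status):
        raise ValueError("genotypes and status must have the same length")
    if any(g not in (0, 1, 2) for g in genotypes):
        raise ValueError("the trend test here requires no missing genotypes")
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_d = sum(status) / n
    sgg = sum((g - mean_g) ** 2 for g in genotypes)
    sdd = sum((d - mean_d) ** 2 for d in status)
    sgd = sum((g - mean_g) * (d - mean_d) for g, d in zip(genotypes, status))
    if sgg == 0 or sdd == 0:
        raise ValueError("genotype and status must both vary")
    chi2 = n * sgd ** 2 / (sgg * sdd)
    z = math.copysign(math.sqrt(chi2), sgd)
    return z, two_sided_normal_p(z)
```

For example, wald_test(0.5, 0.25) gives z = 2.0 and a two-sided p-value of about 0.0455.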

Summary of options

Option       Description                                                Example
-nr          No reference panel used.                                   SNPMStat -nr
-ur          Unrelated subjects in reference panel.                     SNPMStat -ur
-p           File phase.dat provided.                                   SNPMStat -p
-rec         Recessive model assumed.                                   SNPMStat -rec
-out         Specify output file.                                       SNPMStat -out filename
-no_fix      Turn off internal check.                                   SNPMStat -no_fix
-no_remove   Turn off default removal of SNPs that cannot be aligned.   SNPMStat -no_remove
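When the program is driven from a script, the options can be assembled programmatically. The Python sketch below is a hypothetical wrapper, not something shipped with SNPMStat: build_command and its keyword names are invented for illustration, and only the flags listed in the table are used. It assumes the flags can be combined freely, which should be checked against the program's behavior.

```python
def build_command(executable="SNPMStat", no_reference=False, unrelated=False,
                  phase=False, recessive=False, no_fix=False,
                  no_remove=False, out=None):
    """Return the argument list for one SNPMStat run."""
    argv = [executable]
    if no_reference:
        argv.append("-nr")
    if unrelated:
        argv.append("-ur")
    if phase:
        argv.append("-p")
    if recessive:
        argv.append("-rec")
    if no_fix:
        argv.append("-no_fix")
    if no_remove:
        argv.append("-no_remove")
    if out is not None:
        argv.extend(["-out", out])  # -out takes the file name as the next argument
    return argv
```

The resulting list can be passed to subprocess.run from the directory that holds the executable and the input files; on Linux the executable would typically be given as "./SNPMStat".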

Example

A simulated test data set mimicking the region of SNPs 1090-1190 in the RA study is provided in the case_control.dat example file. The reference.dat file contains the part of the HapMap data covering that chromosomal region and the phase.dat file provides the phasing information of the reference panel. The command

SNPMStat -p -no_remove -out ra.out

gives the output in file ra.out. A plot of log10 (p-values) is provided in ra.bmp.

SNPMStat_HM

If the HapMap database is used, the input files pertaining to the reference panel can be prepared using our supplementary program SNPMStat_HM.

Download

SNPMStat for Windows [updated 09 July 2008]

executable » SNPMStat.exe

SNPMStat for Linux [updated 09 July 2008]

executable » SNPMStat
static binary executable » SNPMStat.static

Example files [updated 09 July 2008]

zip archive » SNPMStat.files.zip

Reference

Lin DY, Hu Y, Huang BE. 2008. Simple and efficient analysis of SNP-disease association with missing genotype data. The American Journal of Human Genetics, 82: 444-452.
