SNPMStat :: Statistical Analysis of SNP-Disease Association with Missing Genotype Data
SNPMStat is a program for the statistical analysis of SNP-disease association in case-control studies with potentially missing genotype data. For SNPs without missing data, the program performs the standard association analysis and provides the estimated odds ratios and standard error estimates, together with the Armitage trend tests and p-values. For typed SNPs with missing data or untyped SNPs, the program performs the maximum-likelihood analysis described in Lin, Hu and Huang (American Journal of Human Genetics, 2008) and provides the estimated odds ratios and standard error estimates, together with the Wald statistics and p-values. The current release performs single-SNP analysis under either additive or recessive mode of inheritance without environmental factors. The related software interface HAPSTAT allows very general analysis (including multiple-SNP analysis, all modes of inheritance, Hardy-Weinberg disequilibrium, all study designs and phenotypes, and gene-environment interactions), but requires the user to specify the set of SNPs used to infer the unknown genotypes of the SNP with missing data. We are working intensely to improve the capabilities of SNPMStat, so please check back frequently for updates.
General information
The program is written in standard C. Executable files are available for Windows and Linux in the download section.
Input
The program can handle both typed SNPs with partially missing data and untyped SNPs. A reference panel is used for untyped SNPs, but not for typed SNPs. The data from the case-control study and reference panel are placed in separate files. All input files are placed in the same directory as the executable file.
With reference panel (default)
For untyped SNPs, the program requires two input files named case_control.dat and reference.dat, and accepts an optional input file named phase.dat. Each file contains text data in a tabular (row-column) format, with rows representing SNPs and columns representing subjects (or chromosomes in phase.dat).
case_control.dat
The first line of the input file case_control.dat provides the disease status information for all study subjects (1=disease, 0=no disease). The main body of the file follows the format
The rsnumber field should omit the prefix 'rs'. If no strand orientation information is available, set every strand_orientation field to 0. If the information is available, flag 1 indicates that the strand orientation in the study data differs from that in the reference panel (so the program switches the allele coding), and flag 0 indicates that the strands are consistent. In particular, if all genotypes in the reference panel are on the forward strand, then flag 1 means that the SNP in the study was recorded on the reverse strand. Strand orientation information is required only for C/G and A/T SNPs; for all other SNPs, this field can be left as 0.
nucleotide1 and nucleotide2 are the two nucleotides of the SNP and should be in alphabetical order. For example, if the two nucleotides are G and A, then nucleotide1 is A and nucleotide2 is G. Genotypes are coded as 0, 1 and 2, referring to the count of nucleotide1. Missing genotypes should be coded as 9.
After aligning the SNPs according to the strand information provided, the program performs an internal check to see whether the strand alignment can be determined from (a) the allele labels (at SNPs other than A/T and C/G) and (b) the allele frequencies (at A/T and C/G SNPs). SNPs that cannot be aligned are removed from the data. The checks can be turned off using the
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | ··· |
| 3892957 | 0 | A | G | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | ··· |
| 570046 | 0 | C | T | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 0 | 1 | 1 | 2 | ··· |
| 3894276 | 0 | G | T | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | ··· |
| ··· | ··· | ··· |
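The layout described above can be checked before running the program. The following is a minimal sketch, not part of SNPMStat: it assumes the fields are separated by whitespace (the delimiter is not stated here), and the names `SnpRow` and `parse_case_control` are hypothetical.

```python
from typing import List, NamedTuple, Tuple

MISSING = 9  # missing-genotype code given in the documentation


class SnpRow(NamedTuple):
    rsnumber: str          # rs number without the 'rs' prefix
    strand_flag: int       # 0 = consistent (or unknown), 1 = different strand
    nucleotide1: str
    nucleotide2: str
    genotypes: List[int]   # counts of nucleotide1, or 9 if missing


def parse_case_control(text: str) -> Tuple[List[int], List[SnpRow]]:
    """Parse case_control.dat content (assumed whitespace-separated)."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty input")
    # First line: disease status of every subject (1 = disease, 0 = no disease).
    status = [int(v) for v in lines[0]]
    if any(s not in (0, 1) for s in status):
        raise ValueError("disease status must be 0 or 1")
    rows = []
    for fields in lines[1:]:
        rs, flag, n1, n2 = fields[0], int(fields[1]), fields[2], fields[3]
        genos = [int(v) for v in fields[4:]]
        if rs.lower().startswith("rs"):
            raise ValueError(f"{rs}: drop the 'rs' prefix")
        if flag not in (0, 1):
            raise ValueError(f"{rs}: strand flag must be 0 or 1")
        if n1 >= n2:
            raise ValueError(f"{rs}: nucleotides must be in alphabetical order")
        if any(g not in (0, 1, 2, MISSING) for g in genos):
            raise ValueError(f"{rs}: genotypes must be 0, 1, 2 or 9")
        if len(genos) != len(status):
            raise ValueError(f"{rs}: one genotype is needed per subject")
        rows.append(SnpRow(rs, flag, n1, n2, genos))
    return status, rows
```

Calling `parse_case_control(open("case_control.dat").read())` would return the disease status vector and one record per SNP, or raise an error that names the offending SNP.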
reference.dat
SNPMStat allows the subjects in the reference panel to be either trios (default) or unrelated (specified by the -ur option). For trios, the file follows the format
Again, the rsnumber field should omit the prefix “rs”. physical_position should be in either ascending or descending order. nucleotide1, nucleotide2 and the genotypes follow the same rules as in case_control.dat. For each trio, the child's genotype is entered last. For a reference panel of unrelated subjects, the format is the same except that there is only one subject per family.
| 3892957 | 53295262 | A | G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | ··· |
| 570046 | 53295430 | C | T | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | ··· |
| 8087214 | 53295793 | G | T | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | ··· |
| 12963672 | 53295932 | A | T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ··· |
| ··· | ··· | ··· | ··· |
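Because each trio occupies three consecutive genotype columns with the child last, the columns of one SNP row can be regrouped and screened for Mendelian errors. This is an illustrative sketch only: the consistency check is a common quality-control step and is not described as something SNPMStat performs, and the function names are hypothetical.

```python
from typing import List, Tuple

MISSING = 9  # same missing code as in case_control.dat


def split_trios(genotypes: List[int]) -> List[Tuple[int, int, int]]:
    """Group one SNP's genotype columns into (parent, parent, child) trios."""
    if len(genotypes) % 3 != 0:
        raise ValueError("trio data need a multiple of three genotype columns")
    return [tuple(genotypes[i:i + 3]) for i in range(0, len(genotypes), 3)]


def transmissible(parent_genotype: int) -> set:
    """Copies of nucleotide1 (0 or 1) that a parent can pass to a child."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[parent_genotype]


def mendel_consistent(trio: Tuple[int, int, int]) -> bool:
    """True if the child's genotype can arise from the two parents.

    Trios with any missing genotype are not checked and count as consistent.
    """
    parent_a, parent_b, child = trio
    if MISSING in trio:
        return True
    possible = {a + b
                for a in transmissible(parent_a)
                for b in transmissible(parent_b)}
    return child in possible
```

For example, `[t for t in split_trios(row) if not mendel_consistent(t)]` lists the trios of a row that deserve a second look.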
phase.dat
Phasing information for the reference panel is incorporated with the option -p, which tells the program to read the file phase.dat. The SNPs with phase information may be a subset of those with genotype information, which is the typical situation with the HapMap database. The file format is
rsnumber should be in the same order as in reference.dat. Each subject contributes two columns (phase1_i phase2_i, i = 1, ..., n) with 0/1 coding, referring to the count of nucleotide1 as in reference.dat. Typically, phasing information is provided only for founders (mothers and fathers).
| 3892957 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | ··· |
| 570046 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | ··· |
| 8087214 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ··· |
| 12963672 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ··· |
| ··· | ··· | ··· |
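Since each subject contributes two 0/1 columns that count nucleotide1 on each chromosome, the sum of the pair is that subject's genotype in the 0/1/2 coding of reference.dat. The sketch below illustrates the relationship; the function names are hypothetical and the code is not taken from SNPMStat.

```python
from typing import List, Tuple


def haplotype_pairs(alleles: List[int]) -> List[Tuple[int, int]]:
    """Group one SNP's phase columns into (phase1_i, phase2_i) pairs."""
    if len(alleles) % 2 != 0:
        raise ValueError("each subject must contribute two columns")
    if any(a not in (0, 1) for a in alleles):
        raise ValueError("phase columns must be coded 0 or 1")
    return [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]


def genotypes_from_phase(alleles: List[int]) -> List[int]:
    """Unphased genotype (count of nucleotide1) implied by each pair."""
    return [h1 + h2 for h1, h2 in haplotype_pairs(alleles)]
```

Comparing `genotypes_from_phase(...)` for the founders with the corresponding columns of reference.dat is one way to confirm that the two files use the same allele coding.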
Without reference panel
If the option -nr (no reference) is used, the program requires only one input file, named case_control.dat, and the optional phase.dat is not used. In this situation, the second column of case_control.dat should contain the physical_position of the SNP instead of the strand_orientation flag used when reference.dat is present.
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | ··· |
| 3892957 | 53295262 | A | G | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | ··· |
| 570046 | 53295430 | C | T | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 0 | 1 | 1 | 2 | ··· |
| 3894276 | 53296375 | G | T | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | ··· |
| ··· | ··· | ··· | ··· |
Mode of inheritance
The additive model of inheritance is assumed by default. One can specify the recessive mode of inheritance by using the option -rec.
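To make the two modes concrete, the sketch below shows the usual textbook coding of a genotype as a regression covariate. It is an assumption that SNPMStat defines both modes with respect to nucleotide1 in exactly this way; the function name `code_genotype` is hypothetical.

```python
from typing import Optional

MISSING = 9  # missing-genotype code used in the input files


def code_genotype(g: int, mode: str = "additive") -> Optional[int]:
    """Covariate value for a genotype g (count of nucleotide1).

    additive : the allele count itself (0, 1 or 2)
    recessive: 1 only when both copies are nucleotide1, otherwise 0
    Missing genotypes (9) map to None.
    """
    if g == MISSING:
        return None
    if g not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1, 2 or 9")
    if mode == "additive":
        return g
    if mode == "recessive":
        return 1 if g == 2 else 0
    raise ValueError("mode must be 'additive' or 'recessive'")
```

Under the additive coding, each extra copy of the allele shifts the log odds of disease by the same amount; under the recessive coding, only homozygous carriers differ from the rest.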
Output
Computational results are written to the file results.out by default. Use the option -out to specify a different file name.
For each untyped SNP, the output shows the rs number, chromosomal position, M_D measure between this SNP and a set of typed SNPs with the best prediction, frequency of nucleotide1 in the reference panel, estimate of the log odds ratio, standard error estimate, standard-normal test statistic, and the corresponding p-value. Untyped SNPs with allele frequencies of 0.0 or 1.0 in the reference panel are excluded from the analysis.
For each genotyped SNP, the output shows the proportion of non-missing genotypes, rs number, chromosomal position, frequency of nucleotide1 in the case-control data, estimate of the log odds ratio, standard error estimate, standard-normal test statistic, and the corresponding p-value. SNPs with allele frequencies of 0.0 or 1.0 in the case-control data are excluded.
For untyped SNPs and typed SNPs with missing data, the test statistic is the Wald statistic. For typed SNPs with no missing data, the Armitage trend test is used. Results for SNPs with very low minor allele frequencies may be unstable and should be viewed with great caution, especially for untyped SNPs or typed SNPs with substantial missingness.
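The reported quantities are related in a simple way when the test statistic is a Wald statistic. The sketch below recomputes the statistic and p-value from a log odds ratio estimate and its standard error. It assumes the p-value is two-sided, which is not stated explicitly here, and `wald_summary` is a hypothetical helper, not part of SNPMStat.

```python
import math
from typing import Dict


def wald_summary(log_odds_ratio: float, std_error: float) -> Dict[str, float]:
    """Wald statistic, two-sided normal p-value and odds ratio."""
    if std_error <= 0:
        raise ValueError("standard error must be positive")
    z = log_odds_ratio / std_error
    # Two-sided tail area of the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return {
        "z": z,
        "p_value": p_value,
        "odds_ratio": math.exp(log_odds_ratio),
    }
```

For instance, `wald_summary(0.35, 0.12)` converts one line of output into an odds ratio and lets the reported p-value be checked against the normal approximation.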
Summary of options
| Option | Description | Example |
|---|---|---|
| -nr | No reference panel used. | |
| -ur | Unrelated subjects in reference panel. | |
| -p | File phase.dat provided. | |
| -rec | Recessive model assumed. | |
| -out | Specify output file. | |
| | Turn off internal check. | |
| | Turn off default removal of SNPs that cannot be aligned. | |
Example
A simulated test data set mimicking the region of SNPs 1090-1190 in the RA study is provided in the case_control.dat example file. The reference.dat file contains the part of the HapMap data covering that chromosomal region and the phase.dat file provides the phasing information of the reference panel. The command
gives the output in file ra.out. A plot of log10 (p-values) is provided in ra.bmp.
SNPMStat_HM
If the HapMap database is used, the input files pertaining to the reference panel can be prepared using our supplementary program SNPMStat_HM.
Download
SNPMStat for Windows [updated 09 July 2008]
SNPMStat for Linux [updated 09 July 2008]
Example files [updated 09 July 2008]
Reference
Lin DY, Hu Y, Huang BE. 2008. Simple and efficient analysis of SNP-disease association with missing genotype data. The American Journal of Human Genetics, 82: 444-452.