
Input data

HAPSTAT supports data from cross-sectional, longitudinal, case-control, and cohort (including case-cohort and nested case-control) association studies. The cross-sectional and longitudinal designs collect data on a random sample of individuals. In a cross-sectional study, the response variable is measured only once on all study subjects; in a longitudinal study, the response variable is measured repeatedly through time. In the case-control design, data is collected on a sample of diseased individuals (cases) and a sample of disease-free individuals (controls). The cohort design follows a sample of at-risk individuals over time and records their times of disease occurrence. Individuals who withdraw prematurely or who are disease-free at the end of the cohort study have censored observations: their ages at onset are known only to exceed their durations of follow-up. HAPSTAT also supports the inclusion of external genotype data collected on family trios or unrelated individuals, allowing for analysis of untyped SNPs.

File format

HAPSTAT accepts ANSI-encoded text files containing data in a tabular (row-column) format. Each row contains space- or tab-delimited data specific to an individual. Column titles may be specified in the first line of the file and must also be space- or tab-delimited. If using external data, column titles must be specified in both the study file and the external file. Files may contain columns of extraneous data. There are no requirements on the ordering of the columns.
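As an illustration of this layout, the following Python sketch reads space- or tab-delimited text whose first line holds the column titles. It is not part of HAPSTAT; the function name `read_table` and the sample values are made up for this example, and it assumes every row has the same number of fields as the title line.

```python
import io

def read_table(stream):
    """Read whitespace-delimited rows; the first non-empty line is taken as column titles."""
    lines = [ln for ln in stream.read().splitlines() if ln.strip()]
    header = lines[0].split()            # split() accepts spaces and tabs alike
    rows = []
    for n, ln in enumerate(lines[1:], start=1):
        fields = ln.split()
        if len(fields) != len(header):
            raise ValueError(f"data row {n}: expected {len(header)} fields, got {len(fields)}")
        rows.append(dict(zip(header, fields)))
    return header, rows

# Made-up sample mixing tabs and spaces as delimiters.
sample = "Status\tAge Gender SNP1\n1 54.2 0 2\n0\t61.0\t1\t.\n"
header, rows = read_table(io.StringIO(sample))
print(header)
print(rows[1]["SNP1"])
```

Because each row is keyed by column title, the column order and any extraneous columns do not affect later lookups.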

The study file must contain one or more columns representing the multi-SNP genotype. Optionally, the file may include one or more columns of environmental covariates. Study-dependent data requirements are as follows:

Cross-sectional
The file must contain one column describing the trait value of the individual.
Longitudinal
File data must be tab delimited. The file must contain two columns of data per individual providing an identifier unique to that individual and the observed trait value. For each individual, each observation is recorded on a separate row. At least some individuals have more than one observation. Data that is constant over all observations, such as the multi-SNP genotype, need only be specified for one observation. An identifier must be specified for each row of observed data. There are no requirements on the ordering of the rows.
Case-control
The file must contain one column describing the disease status of the individual.
Cohort
The file must contain two columns of data per individual providing the observation time and event indicator.
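The longitudinal layout, with one row per observation and rows in any order, can be sketched in Python as follows. The column titles `ID` and `Trait` and the function name are hypothetical choices for this example, not names required by HAPSTAT.

```python
from collections import defaultdict

def group_observations(rows, id_col="ID", trait_col="Trait"):
    """Collect the trait values of each individual from one-row-per-observation data."""
    groups = defaultdict(list)
    for row in rows:
        if not row.get(id_col):
            raise ValueError("every observation row needs an identifier")
        groups[row[id_col]].append(float(row[trait_col]))
    # A longitudinal data set is expected to have repeated measurements.
    if not any(len(values) > 1 for values in groups.values()):
        raise ValueError("no individual has more than one observation")
    return dict(groups)

# Rows may appear in any order; the identifier ties them together.
rows = [
    {"ID": "s1", "Trait": "1.5"},
    {"ID": "s2", "Trait": "0.7"},
    {"ID": "s1", "Trait": "2.0"},
]
print(group_observations(rows))
```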

The external file contains data collected on either trios or unrelated individuals. The file must contain one or more columns representing each of the SNP sites intended for analysis in the study file and all untyped SNPs of interest. For trio data, HAPSTAT requires three rows of data per family, where data for mothers and fathers is in the first two rows and the child's data is in the third. The correspondence between SNP sites in the study data and external data is determined by the column title. For both the study file and the external file, column titles must be specified in the first row of the file and the corresponding SNP sites must be titled the same in both files.
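The three-rows-per-family layout for trio data can be checked with a short Python sketch such as the one below. The function name and the sample rows are illustrative assumptions; the sketch only relies on the row order described above (two parent rows, then the child row).

```python
def split_trios(rows):
    """Group consecutive rows into families: two parent rows followed by one child row."""
    if len(rows) % 3 != 0:
        raise ValueError("trio data must contain exactly three rows per family")
    return [
        {"parents": rows[i:i + 2], "child": rows[i + 2]}
        for i in range(0, len(rows), 3)
    ]

# Made-up genotype rows for two families at two SNP sites.
rows = [
    [0, 1], [1, 1], [0, 2],   # family 1: parent, parent, child
    [2, 0], [1, 0], [2, 0],   # family 2: parent, parent, child
]
families = split_trios(rows)
print(len(families))
print(families[0]["child"])
```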

Data specification

The multi-SNP genotype is represented by a sequence of the values 0, 1 and 2, corresponding to the number of occurrences of a specific allele at each SNP site. Any value other than 0, 1 or 2 is assumed to indicate missing SNP data. Individuals are allowed to have missing data at all SNP sites. Environmental covariates are represented by decimal values and may not contain missing values. The representation of data not intended for analysis is unimportant.
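To make the 0/1/2 coding concrete, here is a small Python sketch. Both helper names are hypothetical, and the two-letter genotype strings are only an assumed starting format. The sketch treats exactly the strings "0", "1" and "2" as valid codes; how HAPSTAT itself handles variants such as "1.0" is not stated here.

```python
def count_allele(genotype, allele):
    """Count copies of `allele` in a two-letter genotype such as 'AG'."""
    return sum(1 for a in genotype if a == allele)

def parse_snp(field):
    """Return 0, 1 or 2 for a valid code, or None for anything else (treated as missing)."""
    return int(field) if field in ("0", "1", "2") else None

# Coding genotypes by the number of copies of allele G.
codes = [count_allele(g, "G") for g in ("AA", "AG", "GG")]
print(codes)

# Any value other than 0, 1 or 2 is read as missing.
parsed = [parse_snp(f) for f in ("0", "2", ".", "NA", "3")]
print(parsed)
```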

In cross-sectional studies, disease-related traits are represented by decimal values. In longitudinal studies, the identifier is represented as a string value and disease-related traits are represented by decimal values. HAPSTAT allows both quantitative and binary traits in cross-sectional studies and only quantitative traits in longitudinal studies. In case-control studies, the disease status is specified by 1 for cases or 0 for controls. In cohort studies, decimal values represent the observation times and a binary event indicator distinguishes between uncensored and censored individuals by the values 1 and 0, respectively.
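The cohort coding can be illustrated with a short Python sketch that derives the observation time and event indicator from a follow-up duration and an optional onset time. The function name and arguments are assumptions made for this example.

```python
def cohort_record(follow_up, onset_time=None):
    """Return (observation_time, event_indicator) for one individual.

    Event indicator 1 marks an observed onset (uncensored);
    0 marks a censored observation, where onset is only known to lie beyond follow-up.
    """
    if onset_time is not None and onset_time <= follow_up:
        return onset_time, 1
    return follow_up, 0

print(cohort_record(10.0, onset_time=6.5))   # onset observed during follow-up
print(cohort_record(10.0))                   # disease-free at end of study
print(cohort_record(4.2))                    # withdrawn early
```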

The file case-control.dat, shown below, contains simulated data for a case-control study of 2000 individuals genotyped at five SNPs, where some SNP values are missing.

case-control.dat:  Format of case-control data for HAPSTAT input.

The disease status is specified in the first column, titled “Status”. The columns “Age” and “Gender” contain environmental covariate data, and the columns SNP1-SNP5 represent the five SNP sites. The ‘·’ character indicates a missing SNP value.
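A file with this column layout could be produced with a Python sketch like the following. The records are made-up values, not the contents of case-control.dat, and the `.` missing-value marker is an assumption (any value other than 0, 1 or 2 is treated as missing).

```python
def format_case_control(records, n_snps=5, missing="."):
    """Render records as tab-delimited text with a title line."""
    header = ["Status", "Age", "Gender"] + [f"SNP{i}" for i in range(1, n_snps + 1)]
    lines = ["\t".join(header)]
    for status, age, gender, snps in records:
        fields = [str(status), str(age), str(gender)]
        fields += [missing if g is None else str(g) for g in snps]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"

# Two made-up individuals: one case and one control, each with a missing SNP.
records = [
    (1, 54.2, 0, [2, 1, None, 0, 0]),
    (0, 61.0, 1, [0, 0, 1, None, 2]),
]
print(format_case_control(records), end="")
```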

For external data, the sequence of values representing the genotype must have the same allele correspondence as in the study data. The file trio.dat, shown below, contains simulated data for 50 families genotyped at five SNPs. Columns SNP1-SNP5 correspond to the five identically titled SNP sites in case-control.dat.

trio.dat:  External data file of family trios.

Study data import

To open a study file in HAPSTAT, select the menu option File»Open and choose the study type corresponding to your data from the submenu. Browse to the directory where your data file resides, select your file and click the Open button. The HAPSTAT display after importing case-control.dat is shown in Figure 1.1.

You may have only one file open in HAPSTAT at any time. To open a new file, select the menu option File»Open or click the corresponding icon on the toolbar. HAPSTAT will prompt you to save your results before closing the current file. You may also close a file via the menu option File»Close or the corresponding icon on the toolbar.

Transpose

Use the Settings»Transpose menu option to transpose rows and columns.
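For files that need transposing before import, the operation amounts to the following Python sketch. It is an illustration of what a transpose does, not HAPSTAT's own code, and it assumes all rows have equal length.

```python
def transpose(rows):
    """Swap rows and columns of a rectangular table."""
    return [list(col) for col in zip(*rows)]

# A table stored with one variable per row becomes one variable per column.
table = [["SNP1", "0", "1"], ["SNP2", "2", "1"]]
print(transpose(table))
```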

Variable selection

To specify the columns that correspond to the variables HAPSTAT should use for analysis, first click inside the text area of the variable you wish to set in the Variables box on the right panel. Then select the columns of data corresponding to that variable by clicking on the column labels on the left panel. Use the toolbar icons to show or hide unselected columns. The selection of variables for the single-gene analysis of case-control.dat is shown in Figure 1.2. After completing your selection, click Continue to proceed.

Multiple genes

Click the icon on the toolbar to create multiple genes. Figure 1.3 shows the choices of SNP1-SNP3 and SNP4-SNP5 as Gene 1 and Gene 2, respectively.

Header data

HAPSTAT will attempt to detect whether the first line of your file contains column titles or actual data. You can override its decision by checking or unchecking the menu option Settings»Include header.

Edit/clear selection

To change your variable selections after the Continue button is clicked, return to the file tab and click the Edit button. The Settings»Clear menu option will clear the current selection.

Sorting

Right-click on the title of the column you wish to sort by and select Sort ascending or Sort descending. Entire rows are reordered, so all columns remain aligned.

Excluding rows

All rows are included by default. To exclude a particular row from your analysis, right-click on the row number and select Exclude.

External data import

You may import external data at any time. To open an external file, select the menu option File»Open»External data file and choose the external type corresponding to your data from the submenu. Browse to the directory where your data file resides, select your file and click the Open button. The HAPSTAT display after importing trio.dat is shown in Figure 1.4.

If a study file is already open in HAPSTAT, all gene variables selected from the study file will be pre-selected from the external file (if present). You may add or remove SNP variables from the gene selection as described above. SNPs selected from the study file must be a subset of those selected from the external file. SNPs selected from the external file that are not selected or not present in the study file are considered to be untyped.
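The relationship between the two selections can be expressed as a short Python sketch. The function name is hypothetical; the logic simply restates the two rules above: study SNPs must all appear in the external selection, and external SNPs absent from the study selection are untyped.

```python
def untyped_snps(study_selected, external_selected):
    """Return external SNP titles that are untyped in the study, preserving their order."""
    not_in_external = [s for s in study_selected if s not in external_selected]
    if not_in_external:
        raise ValueError(f"study SNPs missing from external selection: {not_in_external}")
    return [s for s in external_selected if s not in study_selected]

study = ["SNP1", "SNP2", "SNP3"]
external = ["SNP1", "SNP2", "SNP3", "SNP4"]
print(untyped_snps(study, external))
```

Matching is done by column title, so the titles must be identical in both files.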
