Using Standalone Tassel for Multi-Locus Trait Mapping

This example will show how to perform association mapping of a quantitative trait using tassel-5-standalone. tassel_5_standalone TASSEL requires certain types of input for its mixed linear model mode.

  • Hapmap formatted data
  • Kinship Matrix
  • Population Structure
  • Trait

saegus has functions to calculate the (kinship) relationship matrix using marker data [VanRaden2008]. Population structure is computed using [Patterson2006]. We will give an example of hapmap formatted output. The trait data is formatted by adding specific tags to the top of a trait vector. saegus also writes an xml file for TASSEL as a config file. The config file keeps the data grouped together appropriately and keeps you from having to re-run the same commands over and over again.

Module imports
>>> import simuOpt
>>> simuOpt.setOptions(alleleType='short', quiet=True)
>>> import simuPOP as sim
>>> import pandas as pd
>>> import numpy as np
>>> import random
>>> from saegus import analyze, operators, parameters
>>> np.set_printoptions(suppress=True, precision=5)

Setting up the Population for Analysis

We will load the population and compute the required information as in previous examples.

Preparing the population for TASSEL
>>> example_pop = sim.loadPopulation('example_pop.pop')
>>> example_pop.addInfoFields('ind_id', 'mother_id', 'father_id', 'g', 'p')
>>> sim.tagID(example_pop)
>>> sim.stat(example_pop, numOfSegSites=sim.ALL_AVAIL, vars=['segSites'])
>>> segregating_loci = example_pop.dvars().segSites

Hapmap Formatted Data

Hapmap formatted data consists of 11 columns followed by the genotype data.

  • rs#
  • alleles
  • chrom
  • pos
  • strand
  • assembly#
  • center
  • protLSID
  • assayLSID
  • panelLSID
  • QCcode

Each column corresponds to an individual. Each row corresponds to a locus. Most of these columns we will fill with NA because we are only interested in the association mapping aspect of TASSEL. In the hapmap format a genotype is given as two upper case letters corresponding to a nucleotide or code.

A genotype is given in terms of two uppercase letters.

Code Meaning
A Thymine
C Cytosine
G Guanine
T Thymine
R A or G
Y C or T
S G or C
W A or T
K G or T
M A or C
B C or G or T
D A or G or T
H A or C or T
V A or C or G
N any base
. gap
/ gap

saegus automatically does the conversion into hapmap format provided with a dict to code the allele of each individual into a letter. For these examples the allele states (1, 2, 3, 4) have nothing to do with the corresponding nucleotides of the hapmap format. The GWAS automatically creates an attribute individual names which formats the ind_id for TASSEL.

Formatting genotype matrix in hapmap format
>>> gwas = analyze.GWAS(example_pop, segregating_loci, 'example')
>>> print(gwas.individual_names)
['I1' 'I2' 'I3' 'I4' 'I5' 'I6' 'I7' 'I8' 'I9' 'I10' 'I11' 'I12' 'I13' 'I14'
 'I15' 'I16' 'I17' 'I18' 'I19' 'I20' 'I21' 'I22' 'I23' 'I24' 'I25' 'I26'
 'I27' 'I28' 'I29' 'I30' 'I31' 'I32' 'I33' 'I34' 'I35' 'I36' 'I37' 'I38'
 'I39' 'I40' 'I41' 'I42' 'I43' 'I44' 'I45' 'I46' 'I47' 'I48' 'I49' 'I50'
 'I51' 'I52' 'I53' 'I54' 'I55' 'I56' 'I57' 'I58' 'I59' 'I60' 'I61' 'I62'
 'I63' 'I64' 'I65' 'I66' 'I67' 'I68' 'I69' 'I70' 'I71' 'I72' 'I73' 'I74'
 'I75' 'I76' 'I77' 'I78' 'I79' 'I80' 'I81' 'I82' 'I83' 'I84' 'I85' 'I86'
 'I87' 'I88' 'I89' 'I90' 'I91' 'I92' 'I93' 'I94' 'I95' 'I96' 'I97' 'I98'
 'I99' 'I100' 'I101' 'I102' 'I103' 'I104' 'I105']
>>> hapmap_columns = ['rs', 'alleles', 'chrom', 'pos',
...              'strand', 'assembly', 'center',
...              'center', 'protLSID', 'assayLSID',
...              'panelLSID', 'QCode'] + list(gwas.individual_names)
...              ]
>>> hapmap_matrix = pd.DataFrame(columns=hapmap_columns)
>>> hapmap.rs = segregating_loci
>>> hapmap.alleles = segregating_minor_alleles
>>> chromosomes = np.array([example_pop.chromLocusPair(locus)[0] + 1
...                          for locus in segregating_loci], dtype=np.int8)
>>> hapmap.chrom = chromosomes
>>> hapmap_matrix.pos = np.arange(segregating_loci.shape)
>>> hapmap_matrix.loc[:, 'strand':'QCode'] = np.core.defchararray.array(
...                          [['NA']*42837]*8).T
>>> for i, ind in enumerate(example_pop.individuals()):
...       hapmap_matrix.loc[:, gwas.individual_names[i]] = [
...           ''.join(sorted(gwas.int_to_snp_conversions[a] +
...                     gwas.int_to_snp_conversions[b]))
...           for a, b, in zip(
...              np.array(ind.genotype(ploidy=0))[segregating_loci],
...              np.array(ind.genotype(ploidy=1))[segregating_loci])
...               ]
>>> print(np.array(hapmap_matrix))
[[0 1 1 ..., 'GG' 'GG' 'GG']
 [1 2 1 ..., 'GT' 'GG' 'TT']
 [2 3 1 ..., 'GG' 'GG' 'GG']
 ...,
 [44442 2 10 ..., 'GG' 'CG' 'CG']
 [44443 3 10 ..., 'CT' 'CC' 'CT']
 [44444 1 10 ..., 'CC' 'TT' 'TT']]

The final step is to write the hapmap to a txt file with the appropriate rows and columns.

Writing the hapmap file for TASSEL
>>> with open('example_hapmap.txt', 'w') as hapmap_file:
...         hapmap_matrix.to_csv(hapmap_file, sep='\t', index=False)

All of the operations to format the genotypes into hapmap format are encapsulated inside of hapmap_formatter().

Population Structure

Population structure is used as a covariate. For the past examples the first eigenvector explains the overwhelming majority of the variation. However, you saegus has functions to compute a test statistic. The value of the test statistic must be compared against the Tracy-Widom manually as there is not a distribution implemented in Python. We implement the computation in Patterson2006. Let:

  • \(m\) be the number of individuals
  • \(n\) the number of loci (called markers in the paper)
  • \(a\) is the minor allele at each locus
  • \(\mathbf{C}\) is a matrix whose entires are the counts of the minor allele for each individual for each locus

Hence \(\mathbf{C}(i, j)\) is how many copies of the minor allele, \(a\) an individual, \(i\), has at locus \(j\). For example:

Example of count matrix
>>> sim.stat(example_pop, alleleFreq=sim.ALL_AVAIL)
>>> allele_states = analyze.gather_allele_data(example_pop)
>>> minor_alleles = allele_states[:, 3]  # column corresponding to minor alleles
>>> segregating_minor_alleles = minor_alleles[segregating_loci]
>>> gwas = analyze.GWAS(example_pop, segregating_loci, 'example')
>>> count_matrix = gwas.calculate_count_matrix(segregating_minor_alleles, segregating_loci)
>>> count_matrix.shape
(105, 42837)
>>> print(count_matrix)
[[1 1 1 ..., 1 1 1]
 [0 0 0 ..., 1 0 0]
 [1 0 0 ..., 1 0 0]
 ...,
 [0 1 0 ..., 2 1 2]
 [0 2 0 ..., 1 0 0]
 [0 0 0 ..., 1 1 0]]

Calculate the mean of each column: \(\mathbf{C_{\mu}}(j)\):

\[\mathbf{C}_{\mu}(j) = \frac{\sum_{i=1}^{m}\mathbf{C}(i, j)}{m}\]

We “correct” each entry of \(\mathbf{C}(i, j)\) by the following process. Let \(p(j)\) be equal to \(\frac{\mathbf{C_{\mu}}(j)}{2}\). We obtain the matrix \(\mathbf{M}\) after the “correction” process.

\[\mathbf{M}(i,j) = \frac{\mathbf{C}(i, j) - \mathbf{C}_{\mu}(j)}{\sqrt{p(j)(1-p(j))}}\]

We are interested in the principal components of the matrix:

\[\mathbf{X} = (\frac{1}{n})\mathbf{M}\mathbf{M^T}\]

Hence we perform the eigenvalue decomposition of mathbf{X}. We will use the principal components as co-variates for TASSEL’s mixed linear model. But before that let us see if our results agree. Does this population seem to be descended from six subpopulations?

_images/PC1xPC2.svg

There appear to be six subgroups in the plot. Which is exactly what we obtained from a different method. For the time being we will use the first two principal components; however, the Patterson paper describes a test statistic to test the significance of each principal component.

Population structure calculation
>>> column_means = np.apply_along_axis(np.mean, axis=0, arr=count_matrix)
>>> shifted = np.array(
...    [count_matrix[:, i] - column_means[i] for i in range(42837)]).T
>>> P = column_means/2
>>> scale = np.sqrt(P*(1-P))
>>> M = np.matrix(
...     np.array([shifted[:, i] / scale[i] for i in range(42837)]).T)
>>> X = (1/42837)*(M * M.T)
>>> eigendata = linalg.eig(X)
>>> eigenvalues = np.array(eigendata[0], dtype=np.float)
>>> eigenvectors = np.array(eigendata[1], dtype=np.float)

All of the above calculations are performed by pop_struct_eigendecomp(). The file for the population structure covariates is written by population_structure_formatter()

Population structure file
>>> gwas.population_structure_formatter(eigenvalues, eigenvectors,
...    number_of_pcs=2, pop_struct_file_name='example_structure.txt')

Kinship Matrix

The kinship matrix is calculated via the method given in VanRaden2008. It is the same method implemented in Synbreed. The marker allele is interpreted as the minor allele. The elements of \(\mathbf{V_{n x m}}\) are \(-1\) for the minor allele homozygote, \(0\) for the heterozygote and \(1\) for the major allele homozygote.

  • \(n\) The number of individuals
  • \(m\) The number of loci
  • \(p_i\) Frequency of the minor allele at locus i

We can obtain \(\mathbf{V}\) by multiplying \(\mathbf{C}\) by \(-1\) and adding \(1\). Recall that \(\mathbf{C}\) is given by calculate_count_matrix() in terms of minor alleles.

\[\begin{split}\left[ \begin{array} ((-1)*2 + 1 = (-1) \\ (-1)*1 + 1 = 0 \\ (-1)*0 + 1 = 1 \end{array} \right]\end{split}\]

Which is exactly what is required. We use method (1) to compute \(\mathbf{G}\). We only need to add individual_names to the header of \(\mathbf{G}\) for use in TASSEL.

\[\mathbf{P} = 2(p_i - \frac{1}{2})\]
\[\mathbf{Z}_{n \times m} = \mathbf{V_j} - \mathbf{P}\]
\[\mathbf{G} = \frac{\mathbf{Z}\mathbf{Z^T}}{2\sum_{i}p_i(1-p_i)}\]
Calculation of the kinship matrix
>>> V = (-1)*count_matrix + 1  # count_matrix was computed in population structure
>>> P = np.array([gwas.pop.dvars().alleleFreq[locus][allele]
...                       for locus, allele in zip(self.segregating_loci,
...                            gwas.segregating_minor_alleles)])
>>> Z = np.zeros((gwas.pop.popSize(), len(gwas.segregating_loci)))
>>> for i in range(gwas.pop.popSize()):
...      Z[i, :] = V[i, :] - 2*(P - 1/2)
>>> G = (Z * Z.T) / (2 * np.sum((P * (1 - P))))

All of the calculation are performed by calculate_kinship_matrix(). This block is to show the calculation explicitly without having to source dive. Now we write the file for TASSEL.

Write kinship matrix to file
>>> G = pd.DataFrame(G, index=gwas.individual_names)
>>> header = "{}\n".format(gwas.pop.popSize())
>>> with open('example_kinship.txt', 'w') as kinship_file:
...      kinship_file.write(header)
...      G.to_csv(kinship_file, sep='\t', index=True, header=False
...                      float_format='%.5f')

Comment About Synbreed

Accoding to VanRaden2008 the vector P is defined as allele frequencies expressed as a difference from 0.5 and multipled by 2. However the Synbreed documentation defines P as

If ret=”realized”, the realized relatedness between individuals is computed according to the formulas in Habier et al. (2007) or vanRaden (2008)

\[\mathbf{U} = \frac{\mathbf{ZZ^T}}{2\sum_i p_i(1-p_i)}\]

where Z = W − P, W is the marker matrix, P contains the allele frequencies multiplied by 2, \(p_i\) is the allele frequency of marker i, and the sum is over all loci.

For the time being I have adjusted the calculation to match Synbreed; however, it is not what is written in the VanRaden2008 paper.

Trait

Our simulated trait data is given a header and output as a text file. TASSEL uses html style tags <tag> in the header to label the input of each file. We will use exponentially distributed allele effects with 20 qtl

\[G \sim Exp(1)\]
Functions for handling trait data
>>> qtl = sorted(random.sample(segregating_loci, 20))
>>> trait = parameters.Trait()
>>> allele_effects_table = trait.construct_allele_effects_table(example_pop, qtl, random.expovariate, 1)
>>> allele_effects_array = trait.construct_ae_array(allele_effects_table, qtl)
>>> heritability = 0.7
>>> operators.calculate_g(example_pop, allele_effects_array)
>>> operators.calculate_error_variance(example_pop, heritability)
>>> operators.calculate_p(example_pop)
>>> tassel_trait = gwas.trait_formatter(trait_file_name='example_trait.txt')

For analysis of the TASSEL output we need to store the information about QTL and allele effects.

Storing qtl, allele effects and allele frequencies
>>> example_trait_data = h5py.File('example_quantitative_trait.hdf5')
>>> example_trait_data['qtl'] = np.array(qtl)
>>> example_trait_data['allele/effects'] = allele_effects_table
>>> allele_frequencies = analyze.gather_allele_frequencies(example_pop, allele_states)
>>> example_trait_data['allele/frequencies'] = allele_frequencies
>>> example_trait_data.close()

TASSEL Config File

The final component is a xml file which specifies the protocol for TASSEL to run. The config file can be in terms of relative or absolute paths for its input. All TASSEL options can be specified in the config file. This is the content of the xml file.

Contents of the conig file
>>> <?xml version='1.0' encoding='UTF-8' standalone='no'?>
...   <TasselPipeline>
...    <fork1>
...        <h>example_hapmap.txt</h>
...    </fork1>
...    <fork2>
...        <t>example_trait.txt</t>
...    </fork2>
...    <fork3>
...        <q>example_structure.txt</q>
...    </fork3>
...    <fork4>
...        <k>example_kinship.txt</k>
...    </fork4>
...    <combine5>
...        <input1/>
...        <input2/>
...        <input3/>
...        <intersect/>
...    </combine5>
...    <combine6>
...        <input5/>
...        <input4/>
...        <mlm/>
...        <mlmCompressionLevel>
...            None
...        </mlmCompressionLevel>
...        <export>example_out__</export>
...    </combine6>
...    <runfork1/>
...    <runfork2/>
...    <runfork3/>
...    <runfork4/>
... </TasselPipeline>

In this case the “forks” are only using the filename; however, in actual use the entire absolute path is given for the input files. The tassel_gwas_config() creates the file for each run given the file names. saegus attaches the full absolute path using the platform independent os module. The config file approach for TASSEL allows the simulation and analysis procedure to be automated by a shell script or batch file.

Running TASSEL Mixed Linear Model

We are using tassel-5-standalone. This version requires Java 8 and can be downloaded at http://www.maizegenetics.net/tassel. The command to run TASSEL using the config file is very simple. I usually copy the config file to the TASSEL directory; however, you can simply specify the absolute path. Then run the following command

Running TASSEL standalone
>>> ./run_pipeline.pl -Xmx6g -configFile example_gwas_pipeline.xml
[Patterson2006]Patterson, N, Price, A, Reich, D. (2006). Population Structure and Eigenanalysis. PLOS Genetics, 2(12). doi:10:1371/journal.pgen.0020190
[VanRaden2008]VanRaden, P. M. (2008). Efficient methods to compute genomic predictions. Journal of Dairy Science, 91(11), 4414–23. doi:10.3168/jds.2007-0980