.. _analysis_using_tassel: ##################################################### Using Standalone Tassel for Multi-Locus Trait Mapping ##################################################### This example will show how to perform association mapping of a quantitative trait using tassel-5-standalone. tassel_5_standalone_ TASSEL requires certain types of input for its mixed linear model mode. .. _tassel_5_standalone: http://www.maizegenetics.net/tassel * Hapmap formatted data * Kinship Matrix * Population Structure * Trait :py:mod:`saegus` has functions to calculate the (kinship) relationship matrix using marker data [VanRaden2008]_. Population structure is computed using [Patterson2006]_. We will give an example of hapmap formatted output. The trait data is formatted by adding specific tags to the top of a trait vector. :py:mod:`saegus` also writes an ``xml`` file for TASSEL as a config file. The config file keeps the data grouped together appropriately and keeps you from having to re-run the same commands over and over again. .. code-block:: python :caption: Module imports >>> import simuOpt >>> simuOpt.setOptions(alleleType='short', quiet=True) >>> import simuPOP as sim >>> import pandas as pd >>> import numpy as np >>> import random >>> from saegus import analyze, operators, parameters >>> np.set_printoptions(suppress=True, precision=5) .. _setting_up_for_association_analysis: Setting up the Population for Analysis ====================================== We will load the population and compute the required information as in previous examples. .. code-block:: python :caption: Preparing the population for TASSEL >>> example_pop = sim.loadPopulation('example_pop.pop') >>> example_pop.addInfoFields('ind_id', 'mother_id', 'father_id', 'g', 'p') >>> sim.tagID(example_pop) >>> sim.stat(example_pop, numOfSegSites=sim.ALL_AVAIL, vars=['segSites']) >>> segregating_loci = example_pop.dvars().segSites .. _hapmap_formatted_data: Hapmap Formatted Data ===================== Hapmap formatted data consists of 11 columns followed by the genotype data. * rs# * alleles * chrom * pos * strand * assembly# * center * protLSID * assayLSID * panelLSID * QCcode Each column corresponds to an individual. Each row corresponds to a locus. Most of these columns we will fill with NA because we are only interested in the association mapping aspect of TASSEL. In the hapmap format a genotype is given as two upper case letters corresponding to a nucleotide or code. A genotype is given in terms of two uppercase letters. +------+-------------+ | Code | Meaning | +======+=============+ | A | Thymine | +------+-------------+ | C | Cytosine | +------+-------------+ | G | Guanine | +------+-------------+ | T | Thymine | +------+-------------+ | R | A or G | +------+-------------+ | Y | C or T | +------+-------------+ | S | G or C | +------+-------------+ | W | A or T | +------+-------------+ | K | G or T | +------+-------------+ | M | A or C | +------+-------------+ | B | C or G or T | +------+-------------+ | D | A or G or T | +------+-------------+ | H | A or C or T | +------+-------------+ | V | A or C or G | +------+-------------+ | N | any base | +------+-------------+ | . | gap | +------+-------------+ | / | gap | +------+-------------+ ``saegus`` automatically does the conversion into hapmap format provided with a ``dict`` to code the allele of each individual into a letter. For these examples the allele states (``1, 2, 3, 4``) have nothing to do with the corresponding nucleotides of the hapmap format. The :py:class:`GWAS` automatically creates an attribute ``individual names`` which formats the ``ind_id`` for TASSEL. .. code-block:: python :caption: Formatting genotype matrix in hapmap format >>> gwas = analyze.GWAS(example_pop, segregating_loci, 'example') >>> print(gwas.individual_names) ['I1' 'I2' 'I3' 'I4' 'I5' 'I6' 'I7' 'I8' 'I9' 'I10' 'I11' 'I12' 'I13' 'I14' 'I15' 'I16' 'I17' 'I18' 'I19' 'I20' 'I21' 'I22' 'I23' 'I24' 'I25' 'I26' 'I27' 'I28' 'I29' 'I30' 'I31' 'I32' 'I33' 'I34' 'I35' 'I36' 'I37' 'I38' 'I39' 'I40' 'I41' 'I42' 'I43' 'I44' 'I45' 'I46' 'I47' 'I48' 'I49' 'I50' 'I51' 'I52' 'I53' 'I54' 'I55' 'I56' 'I57' 'I58' 'I59' 'I60' 'I61' 'I62' 'I63' 'I64' 'I65' 'I66' 'I67' 'I68' 'I69' 'I70' 'I71' 'I72' 'I73' 'I74' 'I75' 'I76' 'I77' 'I78' 'I79' 'I80' 'I81' 'I82' 'I83' 'I84' 'I85' 'I86' 'I87' 'I88' 'I89' 'I90' 'I91' 'I92' 'I93' 'I94' 'I95' 'I96' 'I97' 'I98' 'I99' 'I100' 'I101' 'I102' 'I103' 'I104' 'I105'] >>> hapmap_columns = ['rs', 'alleles', 'chrom', 'pos', ... 'strand', 'assembly', 'center', ... 'center', 'protLSID', 'assayLSID', ... 'panelLSID', 'QCode'] + list(gwas.individual_names) ... ] >>> hapmap_matrix = pd.DataFrame(columns=hapmap_columns) >>> hapmap.rs = segregating_loci >>> hapmap.alleles = segregating_minor_alleles >>> chromosomes = np.array([example_pop.chromLocusPair(locus)[0] + 1 ... for locus in segregating_loci], dtype=np.int8) >>> hapmap.chrom = chromosomes >>> hapmap_matrix.pos = np.arange(segregating_loci.shape) >>> hapmap_matrix.loc[:, 'strand':'QCode'] = np.core.defchararray.array( ... [['NA']*42837]*8).T >>> for i, ind in enumerate(example_pop.individuals()): ... hapmap_matrix.loc[:, gwas.individual_names[i]] = [ ... ''.join(sorted(gwas.int_to_snp_conversions[a] + ... gwas.int_to_snp_conversions[b])) ... for a, b, in zip( ... np.array(ind.genotype(ploidy=0))[segregating_loci], ... np.array(ind.genotype(ploidy=1))[segregating_loci]) ... ] >>> print(np.array(hapmap_matrix)) [[0 1 1 ..., 'GG' 'GG' 'GG'] [1 2 1 ..., 'GT' 'GG' 'TT'] [2 3 1 ..., 'GG' 'GG' 'GG'] ..., [44442 2 10 ..., 'GG' 'CG' 'CG'] [44443 3 10 ..., 'CT' 'CC' 'CT'] [44444 1 10 ..., 'CC' 'TT' 'TT']] The final step is to write the hapmap to a ``txt`` file with the appropriate rows and columns. .. code-block:: python :caption: Writing the hapmap file for TASSEL >>> with open('example_hapmap.txt', 'w') as hapmap_file: ... hapmap_matrix.to_csv(hapmap_file, sep='\t', index=False) All of the operations to format the genotypes into hapmap format are encapsulated inside of :func:`hapmap_formatter`. .. _calculating_population_structure: Population Structure ==================== Population structure is used as a covariate. For the past examples the first eigenvector explains the overwhelming majority of the variation. However, you :py:mod:`saegus` has functions to compute a test statistic. The value of the test statistic must be compared against the *Tracy-Widom* manually as there is not a distribution implemented in Python. We implement the computation in Patterson2006_. Let: * :math:`m` be the number of individuals * :math:`n` the number of loci (called markers in the paper) * :math:`a` is the minor allele at each locus * :math:`\mathbf{C}` is a matrix whose entires are the counts of the minor allele for each individual for each locus Hence :math:`\mathbf{C}(i, j)` is how many copies of the minor allele, :math:`a` an individual, :math:`i`, has at locus :math:`j`. For example: .. code-block:: python :caption: Example of count matrix >>> sim.stat(example_pop, alleleFreq=sim.ALL_AVAIL) >>> allele_states = analyze.gather_allele_data(example_pop) >>> minor_alleles = allele_states[:, 3] # column corresponding to minor alleles >>> segregating_minor_alleles = minor_alleles[segregating_loci] >>> gwas = analyze.GWAS(example_pop, segregating_loci, 'example') >>> count_matrix = gwas.calculate_count_matrix(segregating_minor_alleles, segregating_loci) >>> count_matrix.shape (105, 42837) >>> print(count_matrix) [[1 1 1 ..., 1 1 1] [0 0 0 ..., 1 0 0] [1 0 0 ..., 1 0 0] ..., [0 1 0 ..., 2 1 2] [0 2 0 ..., 1 0 0] [0 0 0 ..., 1 1 0]] Calculate the mean of each column: :math:`\mathbf{C_{\mu}}(j)`: .. math:: \mathbf{C}_{\mu}(j) = \frac{\sum_{i=1}^{m}\mathbf{C}(i, j)}{m} We "correct" each entry of :math:`\mathbf{C}(i, j)` by the following process. Let :math:`p(j)` be equal to :math:`\frac{\mathbf{C_{\mu}}(j)}{2}`. We obtain the matrix :math:`\mathbf{M}` after the "correction" process. .. math:: \mathbf{M}(i,j) = \frac{\mathbf{C}(i, j) - \mathbf{C}_{\mu}(j)}{\sqrt{p(j)(1-p(j))}} We are interested in the principal components of the matrix: .. math:: \mathbf{X} = (\frac{1}{n})\mathbf{M}\mathbf{M^T} Hence we perform the eigenvalue decomposition of \mathbf{X}. We will use the principal components as co-variates for TASSEL's mixed linear model. But before that let us see if our results agree. Does this population seem to be descended from six subpopulations? .. image:: /images/PC1xPC2.svg :align: center There appear to be six subgroups in the plot. Which is exactly what we obtained from a different method. For the time being we will use the first two principal components; however, the Patterson paper describes a test statistic to test the significance of each principal component. .. code-block:: python :caption: Population structure calculation >>> column_means = np.apply_along_axis(np.mean, axis=0, arr=count_matrix) >>> shifted = np.array( ... [count_matrix[:, i] - column_means[i] for i in range(42837)]).T >>> P = column_means/2 >>> scale = np.sqrt(P*(1-P)) >>> M = np.matrix( ... np.array([shifted[:, i] / scale[i] for i in range(42837)]).T) >>> X = (1/42837)*(M * M.T) >>> eigendata = linalg.eig(X) >>> eigenvalues = np.array(eigendata[0], dtype=np.float) >>> eigenvectors = np.array(eigendata[1], dtype=np.float) All of the above calculations are performed by :func:`pop_struct_eigendecomp`. The file for the population structure covariates is written by :func:`population_structure_formatter` .. code-block:: python :caption: Population structure file >>> gwas.population_structure_formatter(eigenvalues, eigenvectors, ... number_of_pcs=2, pop_struct_file_name='example_structure.txt') .. _calculating_the_kinship_matrix: Kinship Matrix ============== The kinship matrix is calculated via the method given in VanRaden2008_. It is the same method implemented in Synbreed. The marker allele is interpreted as the minor allele. The elements of :math:`\mathbf{V_{n x m}}` are :math:`-1` for the minor allele homozygote, :math:`0` for the heterozygote and :math:`1` for the major allele homozygote. * :math:`n` The number of individuals * :math:`m` The number of loci * :math:`p_i` Frequency of the minor allele at locus i We can obtain :math:`\mathbf{V}` by multiplying :math:`\mathbf{C}` by :math:`-1` and adding :math:`1`. Recall that :math:`\mathbf{C}` is given by :func:`calculate_count_matrix` in terms of minor alleles. .. math:: \left[ \begin{array} ((-1)*2 + 1 = (-1) \\ (-1)*1 + 1 = 0 \\ (-1)*0 + 1 = 1 \end{array} \right] Which is exactly what is required. We use method (1) to compute :math:`\mathbf{G}`. We only need to add ``individual_names`` to the header of :math:`\mathbf{G}` for use in TASSEL. .. math:: \mathbf{P} = 2(p_i - \frac{1}{2}) .. math:: \mathbf{Z}_{n \times m} = \mathbf{V_j} - \mathbf{P} .. math:: \mathbf{G} = \frac{\mathbf{Z}\mathbf{Z^T}}{2\sum_{i}p_i(1-p_i)} .. code-block:: python :caption: Calculation of the kinship matrix >>> V = (-1)*count_matrix + 1 # count_matrix was computed in population structure >>> P = np.array([gwas.pop.dvars().alleleFreq[locus][allele] ... for locus, allele in zip(self.segregating_loci, ... gwas.segregating_minor_alleles)]) >>> Z = np.zeros((gwas.pop.popSize(), len(gwas.segregating_loci))) >>> for i in range(gwas.pop.popSize()): ... Z[i, :] = V[i, :] - 2*(P - 1/2) >>> G = (Z * Z.T) / (2 * np.sum((P * (1 - P)))) All of the calculation are performed by :func:`calculate_kinship_matrix`. This block is to show the calculation explicitly without having to source dive. Now we write the file for TASSEL. .. code-block:: python :caption: Write kinship matrix to file >>> G = pd.DataFrame(G, index=gwas.individual_names) >>> header = "{}\n".format(gwas.pop.popSize()) >>> with open('example_kinship.txt', 'w') as kinship_file: ... kinship_file.write(header) ... G.to_csv(kinship_file, sep='\t', index=True, header=False ... float_format='%.5f') .. _comment_about_synbreed: Comment About Synbreed ---------------------- Accoding to VanRaden2008_ the vector P is defined as allele frequencies expressed as a difference from 0.5 and multipled by 2. However the Synbreed_ documentation defines P as If ret="realized", the realized relatedness between individuals is computed according to the formulas in Habier et al. (2007) or vanRaden (2008) .. math:: \mathbf{U} = \frac{\mathbf{ZZ^T}}{2\sum_i p_i(1-p_i)} where Z = W − P, W is the marker matrix, P contains the allele frequencies multiplied by 2, :math:`p_i` is the allele frequency of marker i, and the sum is over all loci. For the time being I have adjusted the calculation to match :mod:`Synbreed`; however, it is not what is written in the VanRaden2008_ paper. .. _formatting_trait_data: Trait ===== Our simulated trait data is given a header and output as a text file. TASSEL uses html style tags in the header to label the input of each file. We will use exponentially distributed allele effects with 20 ``qtl`` .. math:: G \sim Exp(1) .. code-block:: python :caption: Functions for handling trait data >>> qtl = sorted(random.sample(segregating_loci, 20)) >>> trait = parameters.Trait() >>> allele_effects_table = trait.construct_allele_effects_table(example_pop, qtl, random.expovariate, 1) >>> allele_effects_array = trait.construct_ae_array(allele_effects_table, qtl) >>> heritability = 0.7 >>> operators.calculate_g(example_pop, allele_effects_array) >>> operators.calculate_error_variance(example_pop, heritability) >>> operators.calculate_p(example_pop) >>> tassel_trait = gwas.trait_formatter(trait_file_name='example_trait.txt') For analysis of the TASSEL output we need to store the information about QTL and allele effects. .. code-block:: python :caption: Storing qtl, allele effects and allele frequencies >>> example_trait_data = h5py.File('example_quantitative_trait.hdf5') >>> example_trait_data['qtl'] = np.array(qtl) >>> example_trait_data['allele/effects'] = allele_effects_table >>> allele_frequencies = analyze.gather_allele_frequencies(example_pop, allele_states) >>> example_trait_data['allele/frequencies'] = allele_frequencies >>> example_trait_data.close() .. _tassel_config_file: TASSEL Config File ================== The final component is a ``xml`` file which specifies the protocol for TASSEL to run. The config file can be in terms of relative or absolute paths for its input. All TASSEL options can be specified in the config file. This is the content of the ``xml`` file. .. code-block:: xml :caption: Contents of the conig file >>> ... ... ... example_hapmap.txt ... ... ... example_trait.txt ... ... ... example_structure.txt ... ... ... example_kinship.txt ... ... ... ... ... ... ... ... ... ... ... ... ... None ... ... example_out__ ... ... ... ... ... ... In this case the "forks" are only using the filename; however, in actual use the entire absolute path is given for the input files. The :func:`tassel_gwas_config` creates the file for each run given the file names. :mod:`saegus` attaches the full absolute path using the platform independent :mod:`os` module. The config file approach for TASSEL allows the simulation and analysis procedure to be automated by a shell script or batch file. .. _running_tassel_mlm: Running TASSEL Mixed Linear Model ================================= We are using :mod:`tassel-5-standalone`. This version requires Java 8 and can be downloaded at http://www.maizegenetics.net/tassel. The command to run TASSEL using the config file is very simple. I usually copy the config file to the TASSEL directory; however, you can simply specify the absolute path. Then run the following command .. code-block:: shell :caption: Running TASSEL standalone >>> ./run_pipeline.pl -Xmx6g -configFile example_gwas_pipeline.xml .. _Synbreed: https://cran.r-project.org/web/packages/synbreed/synbreed.pdf .. [Patterson2006] Patterson, N, Price, A, Reich, D. (2006). Population Structure and Eigenanalysis. PLOS Genetics, 2(12). doi:10:1371/journal.pgen.0020190 .. [VanRaden2008] VanRaden, P. M. (2008). Efficient methods to compute genomic predictions. Journal of Dairy Science, 91(11), 4414–23. doi:10.3168/jds.2007-0980