.. _additive_trait_parameterization:

###############################
Additive Trait Parameterization
###############################

.. code-block:: python
   :caption: Modules we will need for this example

   import simuOpt
   simuOpt.setOptions(alleleType='short', quiet=True)
   import simuPOP as sim
   import numpy as np, pandas as pd
   import random
   from saegus import analyze, operators, parameters
   np.set_printoptions(suppress=True, precision=5)

.. _overview_of_additive_trait_example:

Our goal is to show how to parameterize an additive trait in :mod`saegus` code.
We will make use of :mod:`simuPOP` ``infoFields`` to store the information about
the additive trait for each individual. Also we will add a unique identifier
for each individual which will help us verify that the trait is being calculated
correctly. In this example we will randomly choose quantitative trait loci from
among loci which are segregating in ``example_pop``.

.. _preparing_the_population:

Preparing the Population
########################

We will re-load the population and add information fields.
``example_pop`` needs the information fields ``ind_id``, ``g`` and ``p``.

.. _load_the_population:

Load the Population
===================

We will use the population we created in the last step instead of creating
a new population.

.. code-block:: python
   :caption: Loading our example population from a file

   example_pop = sim.loadPopulation('example_pop.pop')


.. _add_information_fields:

Add Information Fields
======================

At present our population has no special information assigned to its members.
Each individual is only a genotype. By default :mod:`simuPOP` uses ``ind_id``,
``father_id``, ``mother_id``, ``father_idx`` and ``mother_idx`` for its very
useful _Tagger functions. We can save some hassle by using these for
identifying individuals.

.. _Tagger: http://simupop.sourceforge.net/manual_svn/build/refManual_ch3_sec10.html

.. code-block:: python
   :caption: Add the ``infoFields`` ``ind_id``, ``g`` and ``p``

   example_pop.infoFields()
   # ()
   example_pop.addInfoFields(['ind_id', 'g', 'p'])
   example_pop.infoFields()
   # ('ind_id', 'g', 'p')

By default information fields are set to ``0.0``. We can initialize the
``ind_id`` field using a :mod:`simuPOP` function.

.. code-block:: python
   :caption: Initialize individual identifiers

   print(np.array(example_pop.indInfo('ind_id')))
   # [ 0.  0.  0.  ...  0.]
   sim.tagID(example_pop)
   print(np.array(example_pop.indInfo('ind_id')))
   # [ 1.  2.  3.  ...  105.]

.. note::
   ::
   In this step we converted the output into a np.array for aesthetics
   
.. _calculate_allele_frequencies:

Calculate Allele Frequencies
==========================

.. code-block:: python
   :caption: Using :mod:`simuPOP` to compute allele frequencies
   
   sim.stat(example_pop, alleleFreq=sim.ALL_AVAIL)

.. _determine_segregating_loci:

Determine Segregating Loci
==========================

For simplicity we will use loci which have more than one allele i.e.
segregating.

.. code-block:: python
   :caption: Using :mod:`simuPOP` to find segregating loci

   sim.stat(example_pop, numOfSegSites=sim.ALL_AVAIL,
                         vars=['numOfSegSites', 'segSites', 'fixedSites'])
   example_pop.dvars().numOfSegSites
   # 42837
   print(example_pop.dvars().segSites[::1000]) # every 1000th segregating locus
   # [0, 1040, 2072, ..., 43578]

There are 42,837 segregating loci in this population. ``saegus`` has a function
to put the alleles into an array and assign the alleles at ``qtl`` an effect as
a draw from a specified distribution.

.. _additive_trait:

Additive Trait
##############

We have all the information we need from the previous steps. We will randomly
choose ``5`` QTL from the segregating loci. Both alleles at each QTL are
assigned an effect as a random draw with an exponential distribution.

.. _choose_QTL:

Choosing QTL and Assign Effects
===============================

For this example we will pick 5 loci to designate as quantitative trait loci.

.. code-block:: python
   :caption: Choosing QTL and assigning allele effects

   segregating_loci = example_pop.dvars().segSites
   qtl = sorted(random.sample(segregating_loci, 5))
   print(qtl)
   # [6, 2972, 12694, 30642, 34123]

Every allele is initially assigned an effect of ``0``. Now alleles only at each QTL 
will be assigned a non-zero effect drawn from the Exponential distribution.

.. code-block:: python
   :caption: Assign allele effects using an exponential distribution

   example_run = analyze.Study('example_pop')
   allele_states = example_run.gather_allele_data(example_pop)
   alleles = np.array([allele_states[:, 1], allele_states[:, 2]]).T
   trait = parameters.Trait()
   ae_table = trait.construct_allele_effects_table(alleles, qtl, random.expovariate, 1)
   print(ae_table[qtl]) # qtl only
   # [[    6.          1.          0.47333     3.          1.1387 ]
   #  [ 2972.          1.          0.50155     2.          0.81906]
   #  [12694.          1.          0.41925     3.          1.32648]
   #  [30642.          1.          0.70116     3.          0.16591]
   #  [34123.          1.          3.27972     3.          0.33993]]
   print(ae_table) # all loci
   # [[     0.      1.      0.      2.      0.]
   #  [     1.      2.      0.      3.      0.]
   #  [     2.      2.      0.      3.      0.]
   #  ...,
   #  [ 44442.      1.      0.      2.      0.]
   #  [ 44443.      1.      0.      3.      0.]
   #  [ 44444.      1.      0.      3.      0.]]

Alternatively, we could use another distribution, such as the Normal.
This overwrites the previously assigned effects.

.. code-block:: python
   :caption: Assign allele effects using a normal distribution

   ae_table = trait.construct_allele_effects_table(alleles, qtl, random.normalvariate, 0, 1)
   print(ae_table[qtl]) # qtl only
   # [[    6.          1.          0.09821     3.          0.19477]
   #  [ 2972.          1.         -1.48559     2.          1.47764]
   #  [12694.          1.         -1.16001     3.          0.09613]
   #  [30642.          1.          0.44827     3.          0.11772]
   #  [34123.          1.          2.15811     3.          0.99274]]

For speed of computation we construct an array of allele effects where the row
of the array corresponds to the locus and the column corresponds to the integer
representing the allele state.

.. code-block:: python
   :caption: Putting the allele effects in an array for speed of computation

   ae_array = trait.construct_ae_array(ae_table, qtl)
   print(ae_array[qtl])
   # [[ 0.       0.09821  0.       0.19477  0.       0.     ]
   #  [ 0.      -1.48559  1.47764  0.       0.       0.     ]
   #  [ 0.      -1.16001  0.       0.09613  0.       0.     ]
   #  [ 0.       0.44827  0.       0.11772  0.       0.     ]
   #  [ 0.       2.15811  0.       0.99274  0.       0.     ]]

.. _definition_of_g:

Definition of ``g``
===================

``g`` is the sum of the allele effects of an individual's genotype. There is
no noise or error in ``g`` because we have *a priori* determined the allele
effects.

.. code-block:: python
   :caption: Calculating g values

   operators.calculate_g(example_pop, ae_array)
   print(np.array(example_pop.indInfo('g')))
   # [ 3.7728   5.66723  0.90614  ...  6.83259]

.. _calculating_error:

Calculation of Error Term
=========================

To simulate the experimental noise a term :math:`\epsilon` is added to each
individual's ``a`` value.
:math:`\epsilon` is a random variable with a normal distribution given by
mean :math:`0` and variance given by:

.. math::
 
   \sigma^2_\epsilon = \sigma^2_a(\frac{1}{h^2}-1);

where :math:`\sigma^2_a` is the variance of ``a`` and :math:`h^2` is the
narrow sense heritability.

.. math::

   \varepsilon \sim \mathcal{N} (0,\sigma^2_\epsilon)
   
.. Now that we have an appropriately g-scaled, genome-wide error variance,
.. the locus-specific variance is computed as:

.. .. math::

..   \sigma^2_\epsilon_\l = \frac{sigma^2_\epsilon}{l};

.. where l is the total number of segregating sites.

Hence, an individual's value of ``p`` is calculated by

.. math::

   p = a + \epsilon

.. _calculating_p:

Calculating ``p``
=================

It is straightforward to calculate ``p`` for the population but we already
have a function to make it even easier for ourselves.

.. code-block:: python
   :caption: Computing ``p`` for each individual

   heritability = 0.7
   operators.calculate_error_variance(example_pop, heritability)
   operators.calculate_p(example_pop)
   print(np.array(example_pop.indInfo('p')))
   # [ 4.70345  8.28645  0.7787  ...  6.9911 ]
   
.. _validating_h2:

Validating the ``h2`` Function
=======================================
Becuase :math:`\epsilon` is a random variable, we will compute 
median h2 from 30 replications (median b/c h2 is bounded)

.. code-block:: python
   :caption: Validating the calculation of ``h2``
   
   check_h2 = []
   for x in range(0, 30):
       ae_table = trait.construct_allele_effects_table(alleles, qtl, random.normalvariate, 0, 1)
       ae_array = trait.construct_ae_array(ae_table, qtl)
       operators.calculate_g(example_pop, ae_array)
       operators.calculate_error_variance(example_pop, heritability)
       operators.calculate_p(example_pop)
       check_h2.append(np.var(example_pop.indInfo('g')) / np.var(example_pop.indInfo('p')))
   
   check_h2
   # [0.8594594061513856, 0.547350607012552, 0.8588487371957536, ..., 0.4640558795605753]
   np.median(check_h2_v2)
   # 0.6260138213421671
   
.. _validating_the_calculate_g_function:

Validating the ``calculate_g`` Function
=======================================

Let's make sure that our function is correctly matching allele to its effect and
summing the effects correctly. We will look at the alleles individual ``1`` of
``example_pop`` at the QTL. Then we will sum the effects and compare the result
with our function :func:`calculate_g`.

.. code-block:: python
   :caption: Validating the calculation of ``g``

   example_ind = example_pop.individual(0)
   alpha_qtl_alleles = np.array(example_ind.genotype(ploidy=0))[qtl]
   omega_qtl_alleles = np.array(example_ind.genotype(ploidy=1))[qtl]
   example_g = [[], []]
   for locus, alpha, omega in zip(qtl, alpha_qtl_alleles, omega_qtl_alleles):
       print(locus, alpha, ae_array[locus, alpha], omega, ae_array[locus, omega])
       example_g[0].append(ae_array[locus, alpha])
       example_g[1].append(ae_array[locus, omega])
   
   sum(example_g[0]) + sum(example_g[1])
   # 0.8047750628903483
   example_pop.indByID(1).g
   # 0.8047750628903477