.. _structured_population_expansion: #################################################### Expanding A Population According to Structure Matrix #################################################### This example uses :py:mod:`simuPOP`'s capability to perform non-random mating schemes We have inferred the population structure of ``example_pop``. The goal of this example is to use the population structure to create a simulated population with similar structure to the original population. The genome of each individual in ``example_pop`` derives from a single sub-population or a combination of sub-populations. ``population_structure_matrix.txt`` defines the proportion of the genome inherited by an individual from each sub-population. For our example: each individual is assigned a ``primary`` sub-population. The primary sub-population is the sub-population from which the individual inherited the largest proportion of their genome. The proportions of inheritance are interpreted as probabilities for determining which sub-population the mate will derive from. .. code-block:: python :caption: Module imports import simuOpt simuOpt.setOptions(alleleType='short', quiet=True) import simuPOP as sim import numpy as np, pandas as pd import collections as col from scipy import stats from saegus import parameters, breed, parse np.set_printoptions(suppress=True, precision=3) We will continue to use the same population as the rest of our examples. .. code-block:: python :caption: Load the population from ``example_pop.pop`` example_pop = sim.loadPopulation('example_pop.pop') We will add information fields to the population so that we can track the pedigree. If we wanted to analyze the pedigree we could look at the mother and father of each individual. .. code-block:: python :caption: Adding information fields example_pop.addInfoFields(['ind_id', 'mother_id', 'father_id', 'primary']) sim.tagID(example_pop) example_pop.indInfo('ind_id') # (1.0, 2.0, 3.0, 4.0, 5.0, ... 105.0) We will import a file that tells us the likely mating structure of each of the 105 individuals of our population. .. code-block:: python :caption: Importing the population structure matrix structure_matrix = pd.read_csv('example_population_structure_matrix.txt', index_col=0) popst = parameters.PopulationStructure(example_pop) proportions = np.array(np.array(structure_matrix)[:, 1:7]) print(proportions) # [[0.0, 0.9996, 0.0, 0.0004, 0.0, 0.0], # [0.011000000000000001, 0.0015, 0.0004, 0.1047, 0.0, 0.8824], # [0.0, 0.3832, 0.0, 0.6168, 0.0, 0.0], # ... # [0.0, 0.4602, 0.0, 0.5398, 0.0, 0.0]] There is a very small amount of rounding error in the proportions for some individuals. If the proportions do not sum to ``1`` then we cannot use them to make a probability mass function. For example: .. code-block:: python :caption: Example of rounding error proportions[33] # array([0.8856999999999998, 0.0016, 0.0009, 0.1065, 0.0042, 0.0011], dtype=object) sum(proportions[33]) # 1.0000000000000002 So we will use a function to adjust the small difference from ``1`` by adding or subtracting from the ``primary`` sub-population proportion. .. code-block:: python :caption: Correcting the rounding error corrected_proportions = popst.correct_rounding_error(proportions) sum(corrected_proportions[33]) # 0.9999999999999999 Apparently the result of ``0.9999999999999999`` is close enough for the ``scipy.stats`` module we are about to use. For peace of mind, we can use the ``name`` attribute of the ``stats.rv_discrete`` function to match the ``ind_id`` with the corresponding probabilities. .. code-block:: python :caption: Creating the probability mass functions mating_pmfs = {} for i, ind in enumerate(example_pop.individuals()): mating_pmfs[ind.ind_id] = stats.rv_discrete(values=([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], corrected_proportions[i]), name=str(ind.ind_id)) example_pop.dvars().mating_probabilities = mating_pmfs .. _validating_the_mating_probabilities: Validating the Mating Probabilities ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Before we proceed we should check the empirical distributions of the probability mass functions. We will use an example individual who is quite diverse in its lineage. .. code-block:: python :caption: Comparing empirical results versus pmf corrected_proportions[5] # array([0.2195, 0.0198, 0.021, 0.2371, 0.1295, 0.3731], dtype=object) mating_pmfs[6].pk # corresponding mating pmf # array([0.2195, 0.0198, 0.021, 0.2371, 0.1295, 0.3731], dtype=object) mating_pmfs[6].name # 6.0 This individual is composed from all six sub-populations. We will draw 1000 times from the corresponding probability mass function and compare the results. .. code-block:: python :caption: Comparing empirical distribution draw_results = mating_pmfs[6].rvs(size=1000) draw_results # array([4, 3, 5, 3, 3, ... 4]) draw_counts = col.Counter(draw_results) draw_frequencies = [] for sp in range(6): draw_frequencies.append(draw_counts[sp]/1000) Finally let's compare the ``1000`` draws with the probabilities. .. code-block:: python :caption: Are they close? draw_frequencies # [0.219, 0.017, 0.021, 0.223, 0.148, 0.372] corrected_proportions[5] # array([0.2195, 0.0198, 0.021, 0.2371, 0.1295, 0.3731], dtype=object) The draw frequencies are pretty close to the probability mass function. If we increased the number of draws to 10,000 the differences would become even smaller. .. _assigning_primary_subpopulations: Assigning Primary Subpopulations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We will continue by assigning each individual a primary sub-population. The primary sub-population is the sub-population from which the majority of their genome is derived. .. code-block:: python :caption: Assignment of Primary Sub-Populations primary_subpops = {ind.ind_id: float(np.argmax(corrected_proportions[i])) for i, ind in enumerate(example_pop.individuals())} for ind in example_pop.individuals(): ind.primary = primary_subpops[ind.ind_id] example_pop.indInfo('primary') # (1.0, 5.0, 3.0, ..., 3.0) Then we will use the virtual sub-population feature of ``simuPOP`` to group the individuals without restricting mating between groups. .. code-block:: python :caption: Split ``example_pop`` into virtual sub-populations primary_subpopulation_splitter = sim.InfoSplitter(field='primary', values=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0]) example_pop.setVirtualSplitter(primary_subpopulation_splitter) .. _parent_chooser_and_recombination_map: Parent Chooser and Recombination Map ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The class containing the parent chooser function must be instantiated with the expanded population size. The recombination map will be parsed with an older function. We will explain in a later section more details about recombination in :py:mod:`simuPOP`. .. code-block:: python :caption: Instantiating parent chooser and parsing recombination map popst_parent_chooser = breed.ForcedPopulationStructureParentChooser(1000, mating_pmfs) tf = parse.TusonFounders() recom_rates = tf.parse_recombination_rates('genetic_map.txt') recom_rates # [0.0020926625899999962, 2.2615580000007186e-05, 0.00042822784999999361, ..., 0.0] .. _expanding_the_population: Expanding the Population ~~~~~~~~~~~~~~~~~~~~~~~~ Finally we can expand the population via mating according to the population structure probability mass functions. Each mating event follows this process: 1. Randomly draw the first parent 2. Given the mating probability mass function of the first parent: draw the second parent from the probability mass function of the first parent 3. Cross the two parents This procedure is repeated 1,000 times because each mating event produces a single offspring. .. code-block:: python :caption: Expand the population to ``1000`` individuals example_pop.evolve( matingScheme=sim.HomoMating( sim.PyParentsChooser(popst_parent_chooser.forced_structure_parent_chooser), sim.OffspringGenerator( ops=[sim.IdTagger(), sim.PedigreeTagger(), sim.Recombinator(recom_rates)], numOffspring=1), subPopSize=1000 ), gen=1 ) If we wanted to analyze the specific crosses we can create a pedigree using the ``ind_id``, ``mother_id`` and ``father_id`` fields. .. code-block:: python :caption: Create a pedigree pedigree = np.array((example_pop.indInfo('ind_id'), example_pop.indInfo('mother_id'), example_pop.indInfo('father_id'))).T print(pedigree) # [[ 106., 45., 86.], # [ 107., 26., 70.], # [ 108., 60., 31.], # ..., # [ 1105., 39., 40.]]