Expanding A Population According to Structure Matrix

This example uses simuPOP’s capability to perform non-random mating schemes We have inferred the population structure of example_pop. The goal of this example is to use the population structure to create a simulated population with similar structure to the original population. The genome of each individual in example_pop derives from a single sub-population or a combination of sub-populations.

population_structure_matrix.txt defines the proportion of the genome inherited by an individual from each sub-population. For our example: each individual is assigned a primary sub-population. The primary sub-population is the sub-population from which the individual inherited the largest proportion of their genome. The proportions of inheritance are interpreted as probabilities for determining which sub-population the mate will derive from.

Module imports
import simuOpt
simuOpt.setOptions(alleleType='short', quiet=True)
import simuPOP as sim
import numpy as np, pandas as pd
import collections as col
from scipy import stats
from saegus import parameters, breed, parse
np.set_printoptions(suppress=True, precision=3)

We will continue to use the same population as the rest of our examples.

Load the population from example_pop.pop
example_pop = sim.loadPopulation('example_pop.pop')

We will add information fields to the population so that we can track the pedigree. If we wanted to analyze the pedigree we could look at the mother and father of each individual.

Adding information fields
example_pop.addInfoFields(['ind_id', 'mother_id', 'father_id', 'primary'])
sim.tagID(example_pop)
example_pop.indInfo('ind_id')
# (1.0, 2.0, 3.0, 4.0, 5.0, ... 105.0)

We will import a file that tells us the likely mating structure of each of the 105 individuals of our population.

Importing the population structure matrix
structure_matrix = pd.read_csv('example_population_structure_matrix.txt', index_col=0)
popst = parameters.PopulationStructure(example_pop)
proportions = np.array(np.array(structure_matrix)[:, 1:7])
print(proportions)
# [[0.0, 0.9996, 0.0, 0.0004, 0.0, 0.0],
# [0.011000000000000001, 0.0015, 0.0004, 0.1047, 0.0, 0.8824],
# [0.0, 0.3832, 0.0, 0.6168, 0.0, 0.0],
# ...
# [0.0, 0.4602, 0.0, 0.5398, 0.0, 0.0]]

There is a very small amount of rounding error in the proportions for some individuals. If the proportions do not sum to 1 then we cannot use them to make a probability mass function. For example:

Example of rounding error
proportions[33]
# array([0.8856999999999998, 0.0016, 0.0009, 0.1065, 0.0042, 0.0011], dtype=object)
sum(proportions[33])
# 1.0000000000000002

So we will use a function to adjust the small difference from 1 by adding or subtracting from the primary sub-population proportion.

Correcting the rounding error
corrected_proportions = popst.correct_rounding_error(proportions)
sum(corrected_proportions[33])
# 0.9999999999999999

Apparently the result of 0.9999999999999999 is close enough for the scipy.stats module we are about to use. For peace of mind, we can use the name attribute of the stats.rv_discrete function to match the ind_id with the corresponding probabilities.

Creating the probability mass functions
mating_pmfs = {}
for i, ind in enumerate(example_pop.individuals()):
    mating_pmfs[ind.ind_id] = stats.rv_discrete(values=([0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        corrected_proportions[i]), name=str(ind.ind_id))

example_pop.dvars().mating_probabilities = mating_pmfs

Validating the Mating Probabilities

Before we proceed we should check the empirical distributions of the probability mass functions. We will use an example individual who is quite diverse in its lineage.

Comparing empirical results versus pmf
corrected_proportions[5]
# array([0.2195, 0.0198, 0.021, 0.2371, 0.1295, 0.3731], dtype=object)
mating_pmfs[6].pk # corresponding mating pmf
# array([0.2195, 0.0198, 0.021, 0.2371, 0.1295, 0.3731], dtype=object)
mating_pmfs[6].name
# 6.0

This individual is composed from all six sub-populations. We will draw 1000 times from the corresponding probability mass function and compare the results.

Comparing empirical distribution
draw_results = mating_pmfs[6].rvs(size=1000)
draw_results
# array([4, 3, 5, 3, 3, ... 4])
draw_counts = col.Counter(draw_results)
draw_frequencies = []
for sp in range(6):
    draw_frequencies.append(draw_counts[sp]/1000)

Finally let’s compare the 1000 draws with the probabilities.

Are they close?
draw_frequencies
# [0.219, 0.017, 0.021, 0.223, 0.148, 0.372]
corrected_proportions[5]
# array([0.2195, 0.0198, 0.021, 0.2371, 0.1295, 0.3731], dtype=object)

The draw frequencies are pretty close to the probability mass function. If we increased the number of draws to 10,000 the differences would become even smaller.

Assigning Primary Subpopulations

We will continue by assigning each individual a primary sub-population. The primary sub-population is the sub-population from which the majority of their genome is derived.

Assignment of Primary Sub-Populations
primary_subpops = {ind.ind_id: float(np.argmax(corrected_proportions[i]))
                   for i, ind in enumerate(example_pop.individuals())}

for ind in example_pop.individuals():
    ind.primary = primary_subpops[ind.ind_id]

example_pop.indInfo('primary')
# (1.0, 5.0, 3.0, ..., 3.0)

Then we will use the virtual sub-population feature of simuPOP to group the individuals without restricting mating between groups.

Split example_pop into virtual sub-populations
primary_subpopulation_splitter = sim.InfoSplitter(field='primary', values=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
example_pop.setVirtualSplitter(primary_subpopulation_splitter)

Parent Chooser and Recombination Map

The class containing the parent chooser function must be instantiated with the expanded population size. The recombination map will be parsed with an older function. We will explain in a later section more details about recombination in simuPOP.

Instantiating parent chooser and parsing recombination map
popst_parent_chooser = breed.ForcedPopulationStructureParentChooser(1000, mating_pmfs)
tf = parse.TusonFounders()
recom_rates = tf.parse_recombination_rates('genetic_map.txt')
recom_rates
# [0.0020926625899999962, 2.2615580000007186e-05, 0.00042822784999999361, ..., 0.0]

Expanding the Population

Finally we can expand the population via mating according to the population structure probability mass functions. Each mating event follows this process:

  1. Randomly draw the first parent
  2. Given the mating probability mass function of the first parent: draw the second parent from the probability mass function of the first parent
  3. Cross the two parents

This procedure is repeated 1,000 times because each mating event produces a single offspring.

Expand the population to 1000 individuals
example_pop.evolve(
    matingScheme=sim.HomoMating(
        sim.PyParentsChooser(popst_parent_chooser.forced_structure_parent_chooser),
        sim.OffspringGenerator(
            ops=[sim.IdTagger(), sim.PedigreeTagger(), sim.Recombinator(recom_rates)],
                numOffspring=1),
        subPopSize=1000
    ),
    gen=1
)

If we wanted to analyze the specific crosses we can create a pedigree using the ind_id, mother_id and father_id fields.

Create a pedigree
pedigree = np.array((example_pop.indInfo('ind_id'),
                     example_pop.indInfo('mother_id'),
                     example_pop.indInfo('father_id'))).T
print(pedigree)
# [[  106.,    45.,    86.],
# [  107.,    26.,    70.],
# [  108.,    60.,    31.],
# ...,
# [ 1105.,    39.,    40.]]