Collect and Store Data

This is probably the most under-developed aspect of saegus. I have written up the data model (the way the data is stored) in a Google Doc. For the time being we will design the functions to store and access the data around this data model. I hope to move to a more permanent solution such as SciDB sooner or later.

Categories of Data

There is allele data, genotype data and possibly haplotype data. This example creates an HDF5 file and adds each kind of data as it is generated. Then we will show how to re-open the HDF5 file and retrieve the data in both Python and R.

Note

The Google Doc about the data model goes into greater detail and gives examples.

This example is forcing me to fully develop the HDF5 storage functions.

Generating the Data

Since this is a simulation program we are free to make our own data up. Let’s import the necessary modules and load the example population like usual.

Module imports
import simuOpt
simuOpt.setOptions(alleleType='short', quiet=True, numThreads=4)
import simuPOP as sim
import pandas as pd
import numpy as np
from saegus import analyze, parse
np.set_printoptions(precision=3, suppress=True)

Load the example population:

Load the example population, recombination map and initialize sex
example_pop = sim.loadPopulation('example_pop.pop')
example_pop.addInfoFields(['ind_id', 'mother_id', 'father_id'])
sim.tagID(example_pop)
sim.initSex(example_pop)
tf = parse.TusonFounders()
recom_map = tf.parse_recombination_rates('genetic_map.txt')

Allele Data

Now we have to determine the allele frequencies and declare the minor _vs_ major alleles and alpha _vs_ omega alleles. A saegus function returns an array with columns. Tied loci (loci with both alleles at 0.5 frequency) have the alpha allele set at the minor allele and the omega allele set as the major allele.

  • locus | alpha | omega | minor | major
Allele data
sim.stat(example_pop, alleleFreq=sim.ALL_AVAIL, genoFreq=sim.ALL_AVAIL)
example_run = analyze.Study('example_pop')
allele_data_table = example_run.gather_allele_data(example_pop)
print(allele_data_table)
# [[     0.      1.      2.      1.      2.]
#  [     1.      2.      3.      2.      3.]
#  [     2.      2.      3.      3.      2.]
#  ...,
#  [ 44442.      1.      2.      2.      1.]
#  [ 44443.      1.      3.      3.      1.]
#  [ 44444.      1.      3.      1.      3.]]

The array for allele frequencies has the same ordering of columns:

Allele frequencies
allele_frequencies = example_run.gather_allele_frequencies(example_pop, allele_data_table)
print(allele_frequencies)
# [[     0.         0.319      0.681      0.319      0.681]
#  [     1.         0.219      0.781      0.219      0.781]
#  [     2.         0.938      0.062      0.062      0.938]
#  ...,
#  [ 44442.         0.533      0.467      0.467      0.533]
#  [ 44443.         0.738      0.262      0.262      0.738]
#  [ 44444.         0.267      0.733      0.267      0.733]]

Genotype Data

Genotype data is stored in a different way than allele data. Genotype frequencies are stored in a 3-dimensional array with axes:

locus x alpha x omega

Where the frequency of genotype (1, 1) at locus 0 is (0, 1, 1). The frequency data is stored in a numpy.ndarray. We can collect the genotype frequency array by using a saegus function.

Structure of genotype frequency data
genotype_frequencies = example_run.gather_genotype_frequencies(example_pop)
print(genotype_frequencies)
# [[[ 0.     0.     0.     0.     0.   ]
#   [ 0.     0.133  0.     0.     0.   ]
#   [ 0.     0.371  0.495  0.     0.   ]
#   [ 0.     0.     0.     0.     0.   ]
#   [ 0.     0.     0.     0.     0.   ]]
#
# [[ 0.     0.     0.     0.     0.   ]
#  [ 0.     0.     0.     0.     0.   ]
#  [ 0.     0.     0.086  0.     0.   ]
#  [ 0.     0.     0.267  0.648  0.   ]
#  [ 0.     0.     0.     0.     0.   ]]
#
# [[ 0.     0.     0.     0.     0.   ]
#  [ 0.     0.     0.     0.     0.   ]
#  [ 0.     0.     0.886  0.105  0.   ]
#  [ 0.     0.     0.     0.01   0.   ]
#  [ 0.     0.     0.     0.     0.   ]]
#  ...,
# [[ 0.     0.     0.     0.     0.   ]
#  [ 0.     0.305  0.457  0.     0.   ]
#  [ 0.     0.     0.238  0.     0.   ]
#  [ 0.     0.     0.     0.     0.   ]
#  [ 0.     0.     0.     0.     0.   ]]
#
# [[ 0.     0.     0.     0.     0.   ]
#  [ 0.     0.562  0.     0.352  0.   ]
#  [ 0.     0.     0.     0.     0.   ]
#  [ 0.     0.     0.     0.086  0.   ]
#  [ 0.     0.     0.     0.     0.   ]]
#
# [[ 0.     0.     0.     0.     0.   ]
#  [ 0.     0.143  0.     0.     0.   ]
#  [ 0.     0.     0.     0.     0.   ]
#  [ 0.     0.248  0.     0.61   0.   ]
#  [ 0.     0.     0.     0.     0.   ]]]

The syntax to access the frequency of genotype (1, 1) at locus 0 is

Accessing genotype frequencies
print(genotype_frequencies[0, 1, 1])
# 0.133333333333

Unlike the allele data we do not have an organized array of genotypes by locus. However, we can obtain all the genotypes as a set of coordinates by locus using a very simple manipulation.

Genotypes as coordinates
genotypes_by_locus = np.array(np.ndarray.nonzero(genotype_frequencies)).T
print(genotypes_by_locus)
# [[    0     1     1]
#  [    0     2     1]
#  [    0     2     2]
#  ...,
#  [44444     1     1]
#  [44444     3     1]
#  [44444     3     3]]

Note

simuPOP considers (2, 1) and (1, 2) as distinct genotypes

This tells us that at locus 0 there are genotypes: (1, 1), (2, 1) and (2, 2). genotypes_by_locus is a 2-dimensional array. There are a variable number of genotypes at each locus. At fixed sites there is only one genotype. At segregating sites there may be up to 4 genotypes because simuPOP orders genotypes. Therefore, genotypes_by_locus has more rows than the number of loci.

Variable number of genotypes by locus
print(genotypes_by_locus.shape)
# (122993, 3)

It is clear that the locus index will not match the genotypes_by_locus index. If we wanted to see the genotypes at a specific locus we can use the np.where function. For example if we wanted the genotypes present at locus 5 we would do:

Retrieve genotypes by locus
locus_five_genotypes = np.array(np.where(genotypes_by_locus[:, 0] == 5))
print(locus_five_genotypes)
# [14, 15, 16]
print(genotypes_by_locus[locus_five_genotypes])
# [[5 1 1]
#  [5 1 3]
#  [5 3 3]]
print(genotypes_by_locus[locus_five_genotypes][:, 1:]) # without locus
# [[1 1]
#  [1 3]
#  [3 3]]

This tells us that at locus 5 there are genotypes (1, 1), (1, 3) and (3, 3). Let’s check their frequencies.

Checking genotypic frequencies at locus 5
print(genotype_frequencies[5, 1, 1])
# 0.904761904762
print(genotype_frequencies[5, 1, 3])
# 0.0857142857143
print(genotype_frequencies[5, 3, 3])
# 0.00952380952381

Storing Data in HDF5 Files

Our data take the form of arrays. Hierarchical Data Format 5 (HDF5) is a file format optimized for ‘lookup’ operations. HDF5 allow for \(n\)-dimensional arrays as well as metadata attached to HDF5 Groups. This part of this guide will demonstrate how to store allele data, genotype data and the corresponding metadata.

Basics of Working with HDF5 and h5py

HDF5 files can be navigated the same way as a directory. Every file has at minimum a root directory: '/'. numpy arrays can be directly stored into HDF5 files as if you were working with a dict.

Creating an HDF5 file
import h5py
example_data = h5py.File('example_data.hdf5')
allele_group = example_data.create_group('allele')
allele_group['states'] = allele_data # store data
print(allele_group['states'])
#  <HDF5 dataset "states": shape (44445, 5), type "<f8">
print(np.array(allele_group['states'])) # retrieve the data
# [[     0.      1.      2.      1.      2.]
#  [     1.      2.      3.      2.      3.]
#  [     2.      2.      3.      3.      2.]
#  ...,
#  [ 44442.      1.      2.      2.      1.]
#  [ 44443.      1.      3.      3.      1.]
#  [ 44444.      1.      3.      1.      3.]]

It is best to think of an HDF5 file as its very own directory. So we can use an absolute path to get to data or we can use the relative path. A “relative” path means using the allele_group object versus using the example_data object.

Absolute versus relative paths in HDF5
print(example_data['allele/states']) # absolute path to dataset
# <HDF5 dataset "states": shape (44445, 5), type "<f8">
print(allele_group['states']) # relative path to dataset
# <HDF5 dataset "states": shape (44445, 5), type "<f8">

Groups, Datasets and Metadata

A group is a sub-directory and a dataset is an array of data. A sub-directory has metadata: size measured in bytes and access permissions. An HDF5 group in HDF5 can have metadata; however, a dataset can also have metadata.

HDF5 groups versus datasets
print(example_data)
# <HDF5 file "example_data.hdf5" (mode r+)>
print(allele_group)
# <HDF5 group "/allele" (1 members)>
print(type(allele_group))
# <class 'h5py._hl.group.Group'>
allele_group['states'].attrs['columns'] = list(map(np.string_, ['locus',  # metadata attached to dataset
                                                                'alpha',
                                                                'omega',
                                                                'minor',
                                                                'major' ]))
print([name.decode('UTF-8') for name in allele_group['states'].attrs['columns']])
# ['locus', 'alpha', 'omega', 'minor', 'major']
allele_group.attrs['info'] = list(map(np.string_, # metadata attached to group
                                      ['Declaration of alpha, omega, minor and major alleles']))
print(allele_group.attrs['info'])
# [b'Declaration of alpha, omega, minor and major alleles']
allele_group.attrs['info'][0].decode('UTF-8')
# Declaration of alpha, omega, minor and major alleles

Storing Frequency Data

We can store the allele frequency data and genotype frequency data in their own groups.

Storing frequency data
allele_group['generation/founder'] = allele_frequencies
genotype_group = example_data.create_group('genotype')
genotype_group['generation/founder'] = genotype_frequencies # store
print(np.array(genotype_group['generation/founder'])) # retrieve
# [[[ 0.     0.     0.     0.     0.   ]
#   [ 0.     0.133  0.     0.     0.   ]
#   [ 0.     0.371  0.495  0.     0.   ]
#   [ 0.     0.     0.     0.     0.   ]
#   [ 0.     0.     0.     0.     0.   ]]
#
#  [[ 0.     0.     0.     0.     0.   ]
#   [ 0.     0.     0.     0.     0.   ]
#   [ 0.     0.     0.086  0.     0.   ]
#   [ 0.     0.     0.267  0.648  0.   ]
#   [ 0.     0.     0.     0.     0.   ]]
#
#  [[ 0.     0.     0.     0.     0.   ]
#   [ 0.     0.     0.     0.     0.   ]
#   [ 0.     0.     0.886  0.105  0.   ]
#   [ 0.     0.     0.     0.01   0.   ]
#   [ 0.     0.     0.     0.     0.   ]]
#   ...,
#  [[ 0.     0.     0.     0.     0.   ]
#   [ 0.     0.305  0.457  0.     0.   ]
#   [ 0.     0.     0.238  0.     0.   ]
#   [ 0.     0.     0.     0.     0.   ]
#   [ 0.     0.     0.     0.     0.   ]]
#
#  [[ 0.     0.     0.     0.     0.   ]
#   [ 0.     0.562  0.     0.352  0.   ]
#   [ 0.     0.     0.     0.     0.   ]
#   [ 0.     0.     0.     0.086  0.   ]
#   [ 0.     0.     0.     0.     0.   ]]
#
#  [[ 0.     0.     0.     0.     0.   ]
#   [ 0.     0.143  0.     0.     0.   ]
#   [ 0.     0.     0.     0.     0.   ]
#   [ 0.     0.248  0.     0.61   0.   ]
#   [ 0.     0.     0.     0.     0.   ]]]

Data from Multiple Generations

We will demonstrate how easy it is to generate and store multiple generations of data. We will store the allele frequencies and genotype frequencies from five generations of random mating. The initial population size of 105 will be increased to 1000.

Generation 1

Generation 1
example_pop.popSize() # pre-random mating
# 105
example_pop.evolve(
    matingScheme=sim.RandomMating(
        ops=[
        sim.IdTagger(),
        sim.PedigreeTagger(),
        sim.Recombinator(recom_map)],
        subPopSize=1000
    ),
    gen=1
)
# 1
example_pop.popSize() # post random mating
# 1000
allele_group['generation/1'] = analyze.gather_allele_frequencies(example_pop, allele_data)
genotype_group['generation/1'] = analyze.gather_genotype_frequencies(example_pop)

Generation 2

Generation 2
   example_pop.evolve(
    matingScheme=sim.RandomMating(
        ops=[
        sim.IdTagger(),
        sim.PedigreeTagger(),
        sim.Recombinator(recom_map)],
        subPopSize=1000
    ),
    gen=1
)
# 1
allele_group['generation/2'] = analyze.gather_allele_frequencies(example_pop, allele_data)
genotype_group['generation/2'] = analyze.gather_genotype_frequencies(example_pop)

Generation 3

Generation 3
   example_pop.evolve(
    matingScheme=sim.RandomMating(
        ops=[
        sim.IdTagger(),
        sim.PedigreeTagger(),
        sim.Recombinator(recom_map)],
        subPopSize=1000
    ),
    gen=1
)
# 1
allele_group['generation/3'] = analyze.gather_allele_frequencies(example_pop, allele_data)
genotype_group['generation/3'] = analyze.gather_genotype_frequencies(example_pop)

Generation 4

Generation 4
   example_pop.evolve(
    matingScheme=sim.RandomMating(
        ops=[
        sim.IdTagger(),
        sim.PedigreeTagger(),
        sim.Recombinator(recom_map)],
        subPopSize=1000
    ),
    gen=1
)
# 1
allele_group['generation/4'] = analyze.gather_allele_frequencies(example_pop, allele_data)
genotype_group['generation/4'] = analyze.gather_genotype_frequencies(example_pop)

Generation 5

Generation 5
   example_pop.evolve(
    matingScheme=sim.RandomMating(
        ops=[
        sim.IdTagger(),
        sim.PedigreeTagger(),
        sim.Recombinator(recom_map)],
        subPopSize=1000
    ),
    gen=1
)
# 1
allele_group['generation/5'] = analyze.gather_allele_frequencies(example_pop, allele_data)
genotype_group['generation/5'] = analyze.gather_genotype_frequencies(example_pop)

After the final generation close the HDF5 file.

Close the HDF5 file
example_data.close()

Very Brief Example HDF5 in R

R is a very popular language for statistical computing in the biological sciences. This example shows how to use the h5 package to extract the data that we have just created. Examining the file object reveals our two groups: allele and genotype.

Using h5 to explore the file
> library(h5)
> r_example_data = h5file('example_data.hdf5')
> r_example_data
H5File 'example_data.hdf5' (mode 'a')
+ allele
+ genotype

We can look at the contents of each group the exact same way as we would in Python. The metadata that we stored as attributes is prefixed by A and a single dataset is prefixed by D. The + indicates that generation contains multiple objects.

Walking through the group
> r_example_data['allele']
H5Group '/allele'
+ generation
D states
A columns
A info
Looking at allele/generation
> r_example_data['allele/generation']
H5Group '/allele/generation'
D 1
D 2
D 3
D 4
D 5

Finally let’s look at the actual data.

Examining a dataset
> r_example_data['allele/generation/1']
DataSet '1' (44445 x 5)
type: numeric
chunksize: NA
maxdim: 44445 x 5
> r_example_data['allele/generation/1'][]
0    0.319047619     0.68095238      0.319047619     0.6809524
1    0.219047619     0.78095238      0.219047619     0.7809524
2    0.938095238     0.06190476      0.061904762     0.9380952
3    0.061904762     0.93809524      0.061904762     0.9380952
⋮    ⋮       ⋮       ⋮       ⋮
44443        0.73809524      0.26190476      0.26190476      0.7380952
44444        0.26666667      0.73333333      0.26666667      0.7333333