Analysis Details

File format

The lingua franca for this project is the multiple alignment format, MAF.

Sequence naming conventions

Sequences should be named as species.chromosome, e.g. hg18.chr2 or simHuman.chrA.
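
For example, an alignment block in a MAF file following this convention might look like the following (the score, coordinates, source sizes and bases are purely illustrative):

  a score=1024.0
  s hg18.chr2    1000 12 + 242951149 ACGTAC--GTACGT
  s panTro2.chr6 5000 14 + 173000000 ACGTACGTGTACGT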

Analysis pipeline

The analysis pipeline is controlled by `make' via the repository Makefile. An analysis may be launched using

$ make analysis location=/path/to/testPackage set=testSet

where location is the path to the package and set is one of the prefixes from the registries directory, i.e. flySet, mammalSet, primateSet, or testSet.

The Makefile can be invoked with -j [integer] (or --jobs=[integer]), which allows the evaluations in an analysis to be run in parallel.
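
For example, to run up to four evaluations concurrently:

$ make -j 4 analysis location=/path/to/testPackage set=testSet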

When an analysis is run, the Makefile checks for executables in the repository's evaluations/bin/ directory and forms a Cartesian product with the MAFs stored in the package/predictions/ directory, i.e. every evaluation is run on every prediction MAF.

The results of an analysis are written to package/analysis/, with one directory per evaluation-prediction pair named in the format evaluation-prediction.maf/.
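
For example, with two evaluations (evalA and evalB, hypothetical names) and two prediction MAFs (pred1.maf and pred2.maf), package/analysis/ would contain four result directories:

  package/analysis/
    evalA-pred1.maf/
    evalA-pred2.maf/
    evalB-pred1.maf/
    evalB-pred2.maf/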

Packages

A package in this document refers to a set of sequences making up an alignment problem. This project consists of four packages: test simulation, primate simulation, mammal simulation, and fly.

Registries

A registry in this document refers to a file that contains information about what data files exist and what evaluations should be run for one specific package. The registries live in the analysis repo at the root level under registries/. The registries included in the repo are templates (with the suffix .template). When the Makefile runs an analysis it checks for a version of the registry without the .template suffix and, if it does not find one, it makes a copy of the template. You may edit the non-template version to include or exclude data or evaluations at your discretion.
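
For example, to create and customize the non-template registry for the test set by hand (the template filename shown here is an assumption; check registries/ for the actual names):

$ cp registries/testSet.reg.tab.template registries/testSet.reg.tab
$ $EDITOR registries/testSet.reg.tab    # add or remove data files and evaluations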

Evaluations

An evaluation in this document refers to a program (or wrapper) that takes five inputs, in the following order:

  1. the path to the package directory
  2. the path to the predicted maf
  3. the path to the registry file
  4. the path to a temporary directory
  5. the path to the directory where output may be written

I.e.,

$ myEval path/to/package/ \
         path/to/prediction.maf \
         path/to/registry/registry.reg.tab \
         path/to/temp \
         path/to/out

The evaluation will be passed these arguments by the analysis Makefile. The output directory will be specific to one prediction-evaluation pair; i.e. if there are four prediction MAFs and two evaluations then eight different output directories will be created. If you have multiple parameters to pass to a custom evaluation, do so through a wrapper that accepts the five arguments and then performs the multi-parameter assessments, as in the sketch below.

An evaluation must be able to parse the registry file and to check whether or not it is included in the 'evaluations' field. Evaluations should only run when they are included in this field. If they are not included, then they should exit with status 0.
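
A minimal sketch of such a wrapper, written as a bash script, is shown below. The wrapper name my_eval, the underlying program someAnalysisProgram and its options, and the exact layout of the 'evaluations' field are all assumptions; adapt them to the real registry format.

  #!/usr/bin/env bash
  # my_eval: hypothetical evaluation wrapper, installed as evaluations/bin/my_eval.
  set -eu

  package=$1      # 1. path to the package directory
  prediction=$2   # 2. path to the predicted maf
  registry=$3     # 3. path to the registry file
  tmpDir=$4       # 4. path to a temporary directory
  outDir=$5       # 5. path to the directory where output may be written

  # Exit quietly (status 0) unless this evaluation is listed in the registry's
  # 'evaluations' field. The grep below assumes a registry line beginning with
  # "evaluations"; adjust the check to the real registry layout.
  if ! grep '^evaluations' "$registry" | grep -q 'my_eval'; then
    exit 0
  fi

  # Run the underlying analysis program with whatever extra parameters this
  # wrapper hard-codes; the program name and its options are placeholders.
  # $package would typically be used to locate reference data inside the package.
  someAnalysisProgram --prediction "$prediction" --tmp "$tmpDir" --threshold 0.5 \
    > "$outDir/my_eval.results.txt"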

An evaluation may be included in the src/ directory in its own self-contained subdirectory with a Makefile to perform the build. Each evaluation should be installed into the repository's evaluations/bin/ directory. Each file in evaluations/bin/ is considered an evaluation by the analysis rule of the Makefile and will be called on each prediction MAF.
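
For example, assuming an evaluation named my_eval whose source lives in src/my_eval/ and whose Makefile produces a my_eval executable (the names and layout are illustrative), the build and install could be:

$ cd src/my_eval
$ make                                # build the evaluation
$ cp my_eval ../../evaluations/bin/   # install it where the analysis rule looks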

An evaluation may NOT include a hyphen, '-', in its filename; doing so will gum up the Makefile. Use underscores instead.

Evaluations are allowed to have dependencies (i.e. analysis programs like StatSigMA-w or PSAR) that are not included in the GitHub repository. Indeed, this is the ideal: the repo should just contain wrappers that call analysis programs. The guiding rule is that it should be reasonably easy to install the dependency on a typical Linux server.

mafComparator

There are several wrappers that use the analysis program mafComparator. This program, part of the mafTools package, compares two MAF files against one another and performs "homology tests" between pairs of sequences from the two MAF files, as defined in the mafComparator README.md.

Since we have known truth for the simulated data we may use mafComparator to compare the true MAF file against a predicted MAF file.

Additionally, mafComparator can restrict its analysis to particular regions, specified by BED files. For the simulated data sets we have complete information about not only the alignments between species but also the positions of many annotated regions, including genic, neutral and repetitive regions. We use this information to see how aligners perform when only these regions are considered.
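
For example, a BED file restricting the comparison to a couple of annotated regions uses the same species.chromosome names as the MAF; the intervals and region names below are purely illustrative:

  simHuman.chrA    10000    12500    genicRegion1
  simHuman.chrA    40000    41000    repetitiveRegion1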

PSAR

This program is not included in our distributed evaluation code because it requires the use of a compute cluster. The details of how we sampled regions and then sharded them into chunks for independent PSAR analysis are in our paper. PSAR is capable of estimating alignment reliability by probabilistically sampling an alignment [1]. It does not require a true alignment or the use of an accurate phylogeny.

StatSigMA-w (planned)

This program is not capable of accepting arbitrary input data and we did not have the resources to rewrite it to do so. As a result we could not use it for the Alignathon project, though its premise is promising.

This program is not yet included in our evaluations but will be soon. StatSigMA-w is capable of estimating the accuracy of genome-size alignments without knowing the true alignment [2, 3]. The program does require an accurate phylogeny for the sequences involved, but such phylogenies are available for all the test packages in the Alignathon.


References

1. Kim and Ma. PSAR: measuring multiple sequence alignment reliability by probabilistic sampling. Nucleic Acids Res (2011) vol. 39 (15) pp. 6359-68 http://nar.oxfordjournals.org/content/39/15/6359.long

2. Prakash and Tompa. Measuring the accuracy of genome-size multiple alignments. Genome Biol (2007) vol. 8 (6) pp. R124 http://genomebiology.com/content/8/6/R124

3. Chen and Tompa. Comparative assessment of methods for aligning multiple genome sequences. Nature Biotechnology (2010) vol. 28 (6) pp. 567-72 http://www.nature.com/nbt/journal/v28/n6/full/nbt.1637.html