Description

Zoonomia Gene Annotations

Methods

Annotations were created in the following manner:
  1. Starting from the Ensembl 99 version of the human annotation, all protein-coding genes were selected.
  2. “zoonomia_broad” entries are the result of taking a human annotated transcript and lifting its coordinates through the HAL alignment via halLiftover. Liftover results are then smoothed by taking the earliest and latest resulting coordinate per contig, creating a single interval per contig. A 500 bp pad is then added to each end of these intervals. Genomic sequence covering the padded, lifted-over regions is extracted from the target species. Exonerate protein2genome is then used to search the liftover intervals for sequence matching the human protein sequence. Coordinates are parsed back from the Exonerate results and made into GTF entries. For each gene, this process is attempted first for the longest transcript; if that transcript yields no acceptable entry, the remaining transcripts are attempted from longest to shortest until an entry satisfying the requirements is found or no transcripts remain. Transcripts are reported with identity and similarity scores (from Exonerate) at both the exon and transcript level. Insertions/deletions (from Exonerate) are reported at the exon level.
  3. “zoonomia_pfenning” entries are the result of taking the protein ID from the longest human transcript per gene and using it as input to the “orthoGene_200M.2.1.4.py” script from the Pfenning lab. Briefly, the script chooses the best protein annotation from among the human, goat, and mouse annotations, based on the target species' position in the tree and sequence similarity to the human protein. It then lifts those coordinates through the HAL alignment via halLiftover and attempts to correct for liftover artifacts by shifting the resulting coordinates to maintain a reading frame with as few invalid codons as possible.
  4. To be included, an annotation from either source must start with methionine, have no coding/frame errors, and yield a predicted protein whose length is within 90-110% of the length of the human reference protein.
  5. Data are available for every species in the HAL alignment, regardless of whether other annotations exist for that species. The exceptions are human, goat, and mouse, because those genomes are used as source annotations.
  6. All entries are tagged with human gene, transcript, and protein Ensembl IDs, and with gene names as appropriate.
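
The interval smoothing and padding in step 2 and the inclusion criteria in step 4 can be illustrated with a minimal Python sketch. This is not the code used to build the track. The function names, the 0-based half-open coordinate convention, the clamping of padded intervals to contig bounds, and the reading of "no coding/frame errors" as "length divisible by three with no internal stop codon" are all assumptions made for illustration.

```python
PAD = 500  # pad size in bp, as stated in step 2
STOP_CODONS = {"TAA", "TAG", "TGA"}


def smooth_and_pad(hits, pad=PAD, contig_lengths=None):
    """Collapse liftover fragments into one padded interval per contig.

    hits: iterable of (contig, start, end) tuples; 0-based half-open
    coordinates are assumed here.
    contig_lengths: optional dict used to clamp the right-hand pad
    (clamping is an assumption, not described in the source).
    """
    bounds = {}
    for contig, start, end in hits:
        lo, hi = bounds.get(contig, (start, end))
        # Keep the earliest start and the latest end seen on this contig.
        bounds[contig] = (min(lo, start), max(hi, end))

    padded = {}
    for contig, (lo, hi) in bounds.items():
        lo = max(0, lo - pad)
        hi = hi + pad
        if contig_lengths and contig in contig_lengths:
            hi = min(hi, contig_lengths[contig])
        padded[contig] = (lo, hi)
    return padded


def passes_filters(cds, human_protein_len, low=0.90, high=1.10):
    """Apply the step 4 criteria to a candidate coding sequence.

    cds: nucleotide string, with or without a terminal stop codon.
    human_protein_len: length in residues of the human reference protein.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:  # assumed meaning of a frame error
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]  # drop the terminal stop before measuring length
    if not codons or codons[0] != "ATG":  # must start with methionine
        return False
    if any(c in STOP_CODONS for c in codons):  # internal stop codon
        return False
    ratio = len(codons) / human_protein_len
    return low <= ratio <= high
```

For example, two fragments on the same contig at 1000-1200 and 3000-3100 collapse to a single interval 1000-3100, which becomes 500-3600 after padding. That interval would then be the search space handed to Exonerate.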

Credits

This data was generated by Ross Swofford of the Broad Institute.