Scaffolder: Software for Reproducible Genome Scaffolding

Hazel Barton, University of Akron

Abstract

Background: Assembly of short-read sequencing data can result in a fragmented, non-contiguous series of genomic sequences. A common step in a genome project is therefore to join neighboring sequence regions together and fill gaps in the assembly using additional sequences. This scaffolding step is non-trivial, however, and requires manually editing large blocks of nucleotide sequence. Joining these sequences together also hides the source of each region in the final genome sequence. Taken together, these considerations can make reproducing or editing an existing genome build difficult.

Methods: The software outlined here, “Scaffolder,” is implemented in the Ruby programming language and can be installed via the RubyGems package management system. Genome scaffolds are defined using YAML, a data format that is both human- and machine-readable. Command-line binaries and extensive documentation are available.

Results: This software allows a genome build to be defined in terms of its constituent sequences, using a relatively simple syntax to define the scaffold. This syntax also allows unknown regions to be specified and additional sequences to be inserted to fill gaps in the scaffold. Defining the genome construction in a file makes the scaffolding process reproducible and easier to edit than a FASTA nucleotide sequence.

Conclusions: Scaffolder is easy-to-use genome scaffolding software that promotes reproducibility and continuous development in a genome project. Scaffolder can be found at http://next.gs.
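To make the idea of a file-defined scaffold concrete, the following is a minimal sketch in Ruby using only the standard YAML library. It does not use Scaffolder's own API. The key names (`sequence`, `unresolved`, `inserts`, `source`, `open`, `close`, `length`), the in-memory `SEQUENCES` table standing in for a FASTA file, and the `build_scaffold` helper are all assumptions made for illustration. Insert coordinates are assumed to be 1-based and inclusive.

```ruby
require 'yaml'

# Hypothetical scaffold description. The key names are illustrative
# assumptions, not a confirmed copy of Scaffolder's file format.
SCAFFOLD_TEXT = <<~SCAFFOLD
  - sequence:
      source: contig1
  - unresolved:
      length: 5
  - sequence:
      source: contig2
      inserts:
        - source: insert1
          open: 3
          close: 4
SCAFFOLD

# Stand-in for sequences that would normally be read from a FASTA file.
SEQUENCES = {
  'contig1' => 'ATGC',
  'contig2' => 'GGNNCC',
  'insert1' => 'TT'
}.freeze

# Builds one nucleotide string from the ordered scaffold entries.
def build_scaffold(entries, sequences)
  entries.map do |entry|
    type, options = entry.first
    case type
    when 'sequence'
      seq = sequences.fetch(options['source']).dup
      # Apply inserts from right to left so earlier coordinates stay valid.
      inserts = (options['inserts'] || []).sort_by { |ins| -ins['open'] }
      inserts.each do |ins|
        # Replace the 1-based inclusive region open..close with the insert.
        seq[(ins['open'] - 1)..(ins['close'] - 1)] = sequences.fetch(ins['source'])
      end
      seq
    when 'unresolved'
      # An unknown region is represented as a run of N characters.
      'N' * options['length']
    else
      raise ArgumentError, "unknown entry type: #{type}"
    end
  end.join
end

scaffold = build_scaffold(YAML.safe_load(SCAFFOLD_TEXT), SEQUENCES)
puts scaffold
```

Because the build is described by the YAML text rather than by an edited sequence, changing the scaffold means editing a few lines of the description and regenerating the output, and the origin of each region remains visible in the file.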