We have evaluated our implementation on an 875-million-read whole-genome dataset, on which LSG built the string graph using only 1 GB of main memory (reducing memory usage by a factor of 50 with respect to SGA), while taking slightly more than twice as long as SGA.
We have developed a disk-based algorithm for computing string graphs in external memory: the light string graph (LSG). LSG relies on a new representation of the FM-index that keeps the main memory requirement independent of the size of the data set. LSG is open-source software and is available online. Moreover, we have developed a pipeline for genome assembly from NGS data that integrates LSG with the assembly step of SGA (Simpson and Durbin, 2012), a state-of-the-art string graph-based assembler, and uses BEETL for indexing the input data.
Our article is also motivated by the open problem of designing a space-efficient algorithm to compute a string graph using an indexing procedure based on the Burrows-Wheeler transform (BWT).
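For readers unfamiliar with BWT-based indexing, the following is a minimal in-memory sketch of the Burrows-Wheeler transform and of FM-index backward search, which counts the occurrences of a pattern in a text. It is an illustration only and is not LSG's implementation: the function names (`bwt`, `build_fm_index`, `count_occurrences`) are hypothetical, the construction is the naive rotation sort, and everything is held in main memory, whereas LSG keeps its index on disk.

```python
from collections import Counter


def bwt(text):
    """Return the Burrows-Wheeler transform of text, with '$' as terminator."""
    s = text + "$"
    # Naive construction: sort all rotations (for illustration only;
    # real tools use far more space- and time-efficient constructions).
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(bwt_str):
    """Build the two FM-index tables C and occ from a BWT string."""
    counts = Counter(bwt_str)
    # C[c]: number of characters in the text strictly smaller than c.
    C = {}
    total = 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in bwt_str[:i].
    occ = {c: [0] * (len(bwt_str) + 1) for c in counts}
    for i, ch in enumerate(bwt_str):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ


def count_occurrences(pattern, C, occ, n):
    """Count occurrences of pattern by backward search over the FM-index."""
    lo, hi = 0, n  # half-open interval of rows in the sorted rotation matrix
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("ACGTACGA")
    C, occ = build_fm_index(transformed)
    print(count_occurrences("ACG", C, occ, len(transformed)))
```

The full `occ` table used here costs memory proportional to the text length times the alphabet size, which is exactly the kind of cost that a disk-based representation is meant to avoid for large read sets.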
Positive results in this direction stimulate the investigation of efficient external memory algorithms for de novo assembly from NGS data.
The large amount of short read data that has to be assembled in future applications, such as in metagenomics or cancer genomics, strongly motivates the investigation of disk-based approaches to index next-generation sequencing (NGS) data.