Abstract Details
Name
HostFinder: computational detection of virus-host interactions by large-scale data mining of the Sequence Read Archive
Presenter
Shyan Mascarenhas, University of Waterloo
Co-Author(s)
Shyan Mascarenhas (Department of Biology, University of Waterloo), Harold Hodgins (Department of Biology, University of Waterloo), Andrew C. Doxey (Department of Biology, University of Waterloo; David R. Cheriton School of Computer Science, University of Waterloo; Department of Medicine, McMaster University)
Abstract Category
Discovering & Evolving
Abstract
The Sequence Read Archive (SRA) is the largest public sequencing repository and represents an unparalleled record of global biodiversity. Despite the success of prior studies in discovering novel viruses through the SRA, the development of reliable methods for inferring virus-host interactions remains an important unmet need. Here, we present HostFinder, a computational framework that infers virus-host associations at scale by analyzing co-occurrence patterns across the entire SRA. HostFinder generates taxonomic profiles using STAT (SRA Taxonomy Analysis Tool), quantifies host and viral abundances per dataset, and calculates association strength through k-mer thresholds and log-odds co-occurrence scores. Using curated eukaryote-virus interactions from Virus-Host DB, we evaluated whether co-occurrence scores could distinguish validated biological associations from false ones. Based on the co-occurrence score alone, the method achieved 92% precision with an 8% false positive rate, correctly recovering 85% of known virus-host interactions. Scaling across the entire SRA, we analyzed 3.78 billion virus-host pairs for potential interactions. Our analysis identified 906,141 high-confidence interactions (log-odds > 3, FDR < 1%) involving 53,025 viral species and 15,143 host species. As a case study, we discovered novel Partitiviridae associations with multiple insect hosts that have not been reported in the existing literature. These predictions were independently validated through viral genome assembly and phylogenetic analysis. HostFinder demonstrates that large-scale co-occurrence analysis of public sequencing repositories can reveal the hidden structure of virus-host networks and accelerate discovery of novel viral associations.
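To illustrate the kind of scoring described above, the following is a minimal sketch of a log-odds co-occurrence score for a single virus-host pair. It is not HostFinder's implementation: the abstract does not give the exact formula, so the sketch assumes a standard log odds ratio over a 2x2 table of dataset counts, a natural logarithm, a 0.5 pseudocount to avoid division by zero, and a single k-mer cutoff for calling a taxon present. The function names, the `min_kmers` parameter, and the toy profiles are all hypothetical.

```python
import math

def count_table(profiles, virus, host, min_kmers=10):
    """Build a 2x2 co-occurrence table over datasets.

    profiles: list of dicts mapping taxon name -> k-mer hit count for one dataset.
    A taxon counts as present only if its k-mer count reaches min_kmers
    (an assumed cutoff, not a value taken from the abstract).
    Returns (both, virus_only, host_only, neither).
    """
    both = virus_only = host_only = neither = 0
    for profile in profiles:
        has_virus = profile.get(virus, 0) >= min_kmers
        has_host = profile.get(host, 0) >= min_kmers
        if has_virus and has_host:
            both += 1
        elif has_virus:
            virus_only += 1
        elif has_host:
            host_only += 1
        else:
            neither += 1
    return both, virus_only, host_only, neither

def log_odds(both, virus_only, host_only, neither, pseudocount=0.5):
    """Natural-log odds ratio with a pseudocount added to every cell (assumed)."""
    a = both + pseudocount
    b = virus_only + pseudocount
    c = host_only + pseudocount
    d = neither + pseudocount
    return math.log((a * d) / (b * c))

# Toy taxonomic profiles (hypothetical data, four datasets).
profiles = [
    {"virus_A": 120, "host_X": 5000},
    {"virus_A": 80, "host_X": 3000},
    {"host_Y": 4000},
    {"host_Y": 2500, "virus_A": 2},  # virus_A falls below the k-mer cutoff here
]

score_x = log_odds(*count_table(profiles, "virus_A", "host_X"))
score_y = log_odds(*count_table(profiles, "virus_A", "host_Y"))
print(f"virus_A / host_X: {score_x:.2f}")
print(f"virus_A / host_Y: {score_y:.2f}")
```

In this toy example the pair that appears together in the same datasets gets a positive score and the pair that never appears together gets a negative one. A real analysis at SRA scale would also need a significance estimate (the abstract reports an FDR threshold) and some handling of very rare taxa, neither of which is shown here.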