Kssd: sequence dimensionality reduction by k-mer substring space sampling enables real-time large-scale datasets analysis

Bibliographic Details
Main authors: Yi, Huiguang; Lin, Yanling; Lin, Chengqi; Jin, Wenfei
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7962209/
https://www.ncbi.nlm.nih.gov/pubmed/33726811
http://dx.doi.org/10.1186/s13059-021-02303-4
Description
Summary: Here, we develop k-mer substring space decomposition (Kssd), a sketching technique which is significantly faster and more accurate than current sketching methods. We show that it is the only method that can be used for large-scale dataset comparisons at population resolution on simulated and real data. Using Kssd, we prioritize references for all 1,019,179 bacteria whole genome sequencing (WGS) runs from the NCBI Sequence Read Archive and find misidentification or contamination in 6,164 of these. Additionally, we analyze WGS and exome runs of samples from the 1000 Genomes Project. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13059-021-02303-4.