Reconstruction of evolving gene variants and fitness from short sequencing reads

Bibliographic Details
Main authors: Shen, Max W., Zhao, Kevin T., Liu, David R.
Format: Online Article Text
Language: English
Published: 2021
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8551035/
https://www.ncbi.nlm.nih.gov/pubmed/34635842
http://dx.doi.org/10.1038/s41589-021-00876-6
Description
Summary: Directed evolution can generate proteins with tailor-made activities. However, full-length genotypes, their frequencies, and fitnesses are difficult to measure for evolving gene-length biomolecules using most high-throughput DNA sequencing methods, as short read lengths can lose mutation linkages in haplotypes. We present Evoracle, a machine learning method that accurately reconstructs full-length genotypes (R² = 0.94) and fitness using short-read data from directed evolution experiments, with substantial improvements over related methods. We validate Evoracle on phage-assisted continuous evolution (PACE), phage-assisted non-continuous evolution (PANCE) of adenine base editors, and OrthoRep evolution of drug-resistant enzymes. Evoracle retains strong performance (R² = 0.86) on data with complete linkage loss between neighboring nucleotides and large measurement noise such as pooled Sanger sequencing data (~$10/timepoint), and broadens the accessibility of training machine learning models on gene variant fitnesses. Evoracle can also identify high-fitness variants, including low-frequency ‘rising stars’, well before they are identifiable from consensus mutations.
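
The linkage problem described in the summary can be illustrated with a minimal sketch. It is not Evoracle's algorithm; it only shows why per-site mutation frequencies, which are all that fully unlinked reads provide, do not determine full-length genotype frequencies. The two toy populations and the function name `site_frequencies` are hypothetical and do not come from the paper.

```python
def site_frequencies(population):
    """Return the mutant-allele frequency at each site.

    `population` maps a genotype (a tuple of 0/1 alleles, one per site)
    to its frequency. The result is what unlinked reads would report:
    one frequency per site, with no information on which mutations
    occur together on the same molecule.
    """
    n_sites = len(next(iter(population)))
    freqs = [0.0] * n_sites
    for genotype, freq in population.items():
        for site, allele in enumerate(genotype):
            freqs[site] += allele * freq
    return freqs


# Hypothetical population A: wild type and a double mutant, half each.
linked = {(0, 0): 0.5, (1, 1): 0.5}

# Hypothetical population B: two different single mutants, half each.
separate = {(1, 0): 0.5, (0, 1): 0.5}

# Different genotype mixtures, identical per-site frequencies.
print(site_frequencies(linked))
print(site_frequencies(separate))
```

Both populations give the same per-site frequencies although they share no genotype, so a single snapshot of unlinked data cannot tell them apart.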