Investigating the mitochondrial genomic landscape of Arabidopsis thaliana by long-read sequencing

Plant mitochondrial genomes have distinctive features compared to those of animals; namely, they are large and divergent, with sizes ranging from hundreds of thousands to a few million bases. Recombination among repetitive regions is thought to produce similar structures that differ slightly, kno...

Bibliographic Details
Main Authors: Masutani, Bansho; Arimura, Shin-ichi; Morishita, Shinichi
Format: Online Article Text
Language: English
Published: Public Library of Science 2021
Subjects: Research Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7833223/
https://www.ncbi.nlm.nih.gov/pubmed/33434206
http://dx.doi.org/10.1371/journal.pcbi.1008597
collection PubMed
description Plant mitochondrial genomes have distinctive features compared to those of animals; namely, they are large and divergent, with sizes ranging from hundreds of thousands to a few million bases. Recombination among repetitive regions is thought to produce similar structures that differ slightly, known as “multipartite structures,” which contribute to different phenotypes. Although many reference plant mitochondrial genomes represent almost all the genes in mitochondria, the full spectrum of their structures remains largely unknown. The emergence of long-read sequencing technology is expected to reveal this landscape; however, many studies have aimed to assemble only one representative circular genome, because properly resolving multipartite structures with existing assemblers is not feasible. To elucidate multipartite structures, we leveraged the information in existing reference genomes and classified long reads according to their corresponding structures. We developed a method that exploits two classic algorithms, partial order alignment (POA) and the hidden Markov model (HMM), to construct a sensitive read classifier. This method enables us to represent a set of reads as a POA graph and analyze it using the HMM. We can then calculate the likelihood of a read occurring in a given cluster, resulting in an iterative clustering algorithm. For synthetic data, our proposed method reliably detected one variation site in 9,000-bp synthetic long reads with a 15% sequencing-error rate and produced accurate clustering. It was also capable of clustering long reads from six very similar sequences containing only slight differences. For real data, we assembled putative multipartite structures of mitochondrial genomes of Arabidopsis thaliana from nine accessions sequenced using PacBio Sequel. The results indicated that there are recurrent and strain-specific structures in A. thaliana mitochondrial genomes.
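The likelihood-driven iterative clustering described in the abstract can be illustrated with a deliberately simplified sketch. This is not the authors' implementation: it assumes equal-length reads with substitution errors only, and it swaps the POA graph and HMM for a per-column base-frequency profile, which is enough to show the loop of "build a model per cluster, score each read, reassign, repeat". The names `build_profile`, `log_likelihood`, and `refine_clusters` are hypothetical, and the initial assignment is assumed to come from some rough prior grouping (for example, alignment to reference structures).

```python
import math

ALPHABET = "ACGT"


def build_profile(reads, length, pseudocount=1.0):
    """Per-column base frequencies for a cluster (simplified stand-in for POA graph + HMM)."""
    profile = []
    for i in range(length):
        counts = {base: pseudocount for base in ALPHABET}
        for read in reads:
            counts[read[i]] += 1.0
        total = sum(counts.values())
        profile.append({base: n / total for base, n in counts.items()})
    return profile


def log_likelihood(read, profile):
    """Log-probability of a read under a cluster profile, columns treated as independent."""
    return sum(math.log(column[base]) for base, column in zip(read, profile))


def refine_clusters(reads, initial_assignment, k, max_iterations=50):
    """Reassign each read to its most likely cluster until the assignment stops changing."""
    length = len(reads[0])
    assignment = list(initial_assignment)
    for _ in range(max_iterations):
        # Rebuild one model per cluster from the reads currently assigned to it.
        profiles = [
            build_profile([r for r, a in zip(reads, assignment) if a == c], length)
            for c in range(k)
        ]
        # Move every read to the cluster whose model gives it the highest likelihood.
        updated = [
            max(range(k), key=lambda c: log_likelihood(read, profiles[c]))
            for read in reads
        ]
        if updated == assignment:
            break
        assignment = updated
    return assignment


# Toy data (made up for illustration): two structures differing at one site (index 4),
# one read carrying a sequencing error, and two reads initially placed in the wrong cluster.
reads = [
    "ACGTACGTAC", "ACGTACGTAC", "ACGAACGTAC", "ACGTACGTAC",
    "ACGTTCGTAC", "ACGTTCGTAC", "ACGTTCGTAC", "ACGTTCGTAC",
]
print(refine_clusters(reads, [0, 0, 0, 1, 1, 1, 1, 0], k=2))
# prints [0, 0, 0, 0, 1, 1, 1, 1]
```

A column profile cannot handle insertions and deletions, which dominate long-read errors; that is the gap a POA graph analyzed with an HMM is meant to fill in the method the abstract describes.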
id pubmed-7833223
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-7833223 2021-01-26 Investigating the mitochondrial genomic landscape of Arabidopsis thaliana by long-read sequencing Masutani, Bansho Arimura, Shin-ichi Morishita, Shinichi PLoS Comput Biol Research Article Public Library of Science 2021-01-12 /pmc/articles/PMC7833223/ /pubmed/33434206 http://dx.doi.org/10.1371/journal.pcbi.1008597 Text en © 2021 Masutani et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
topic Research Article