
Global-scale phylogenetic linguistic inference from lexical resources

Automatic phylogenetic inference plays an increasingly important role in computational historical linguistics. Most pertinent work is currently based on expert cognate judgments. This limits the scope of this approach to a small number of well-studied language families. We used machine learning tech...


Bibliographic Details
Main author: Jäger, Gerhard
Format: Online Article Text
Language: English
Published: Nature Publishing Group 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6176785/
https://www.ncbi.nlm.nih.gov/pubmed/30299438
http://dx.doi.org/10.1038/sdata.2018.189
_version_ 1783361755701837824
author Jäger, Gerhard
author_facet Jäger, Gerhard
author_sort Jäger, Gerhard
collection PubMed
description Automatic phylogenetic inference plays an increasingly important role in computational historical linguistics. Most pertinent work is currently based on expert cognate judgments. This limits the scope of this approach to a small number of well-studied language families. We used machine learning techniques to compile data suitable for phylogenetic inference from the ASJP database, a collection of almost 7,000 phonetically transcribed word lists over 40 concepts, covering two thirds of the extant world-wide linguistic diversity. First, we estimated Pointwise Mutual Information scores between sound classes using weighted sequence alignment and general-purpose optimization. From this we computed a dissimilarity matrix over all ASJP word lists. This matrix is suitable for distance-based phylogenetic inference. Second, we applied cognate clustering to the ASJP data, using supervised training of an SVM classifier on expert cognacy judgments. Third, we defined two types of binary characters, based on automatically inferred cognate classes and on sound-class occurrences. Several tests are reported demonstrating the suitability of these characters for character-based phylogenetic inference.
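
The first step in the description, estimating Pointwise Mutual Information (PMI) between sound classes, can be illustrated with a minimal sketch. It assumes the aligned sound-class pairs are already given, whereas the description says the scores were estimated with weighted sequence alignment and general-purpose optimization; the function name `pmi_scores` and the toy pairs are hypothetical and are not taken from the article or from ASJP.

```python
import math
from collections import Counter


def pmi_scores(aligned_pairs):
    """Return PMI(a, b) = log(p(a, b) / (p(a) * p(b))) for observed pairs.

    Assumption: `aligned_pairs` is a list of (a, b) sound-class pairs that
    some alignment step has already matched. Pairs are treated as unordered,
    so each one is counted in both directions.
    """
    joint = Counter()
    marginal = Counter()
    for a, b in aligned_pairs:
        joint[(a, b)] += 1
        joint[(b, a)] += 1
        marginal[a] += 1
        marginal[b] += 1

    total = 2 * len(aligned_pairs)  # number of ordered pairs after symmetrizing
    scores = {}
    for (a, b), count in joint.items():
        p_ab = count / total
        p_a = marginal[a] / total
        p_b = marginal[b] / total
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores


# Hypothetical toy alignments, not real ASJP data.
toy_pairs = [("p", "p"), ("p", "b"), ("t", "t"), ("t", "d"), ("p", "p"), ("t", "t")]
scores = pmi_scores(toy_pairs)
```

Pairs that never co-occur get no entry here; a real pipeline would need smoothing or a default score for them, and would turn the scores into word-level similarities before building the dissimilarity matrix the description mentions.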
format Online
Article
Text
id pubmed-6176785
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Nature Publishing Group
record_format MEDLINE/PubMed
spelling pubmed-61767852018-10-12 Global-scale phylogenetic linguistic inference from lexical resources Jäger, Gerhard Sci Data Data Descriptor Automatic phylogenetic inference plays an increasingly important role in computational historical linguistics. Most pertinent work is currently based on expert cognate judgments. This limits the scope of this approach to a small number of well-studied language families. We used machine learning techniques to compile data suitable for phylogenetic inference from the ASJP database, a collection of almost 7,000 phonetically transcribed word lists over 40 concepts, covering two thirds of the extant world-wide linguistic diversity. First, we estimated Pointwise Mutual Information scores between sound classes using weighted sequence alignment and general-purpose optimization. From this we computed a dissimilarity matrix over all ASJP word lists. This matrix is suitable for distance-based phylogenetic inference. Second, we applied cognate clustering to the ASJP data, using supervised training of an SVM classifier on expert cognacy judgments. Third, we defined two types of binary characters, based on automatically inferred cognate classes and on sound-class occurrences. Several tests are reported demonstrating the suitability of these characters for character-based phylogenetic inference. Nature Publishing Group 2018-10-09 /pmc/articles/PMC6176785/ /pubmed/30299438 http://dx.doi.org/10.1038/sdata.2018.189 Text en Copyright © 2018, The Author(s) http://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. 
The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the metadata files made available in this article.
spellingShingle Data Descriptor
Jäger, Gerhard
Global-scale phylogenetic linguistic inference from lexical resources
title Global-scale phylogenetic linguistic inference from lexical resources
title_full Global-scale phylogenetic linguistic inference from lexical resources
title_fullStr Global-scale phylogenetic linguistic inference from lexical resources
title_full_unstemmed Global-scale phylogenetic linguistic inference from lexical resources
title_short Global-scale phylogenetic linguistic inference from lexical resources
title_sort global-scale phylogenetic linguistic inference from lexical resources
topic Data Descriptor
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6176785/
https://www.ncbi.nlm.nih.gov/pubmed/30299438
http://dx.doi.org/10.1038/sdata.2018.189
work_keys_str_mv AT jagergerhard globalscalephylogeneticlinguisticinferencefromlexicalresources