Global-scale phylogenetic linguistic inference from lexical resources
Automatic phylogenetic inference plays an increasingly important role in computational historical linguistics. Most pertinent work is currently based on expert cognate judgments. This limits the scope of this approach to a small number of well-studied language families. We used machine learning tech...
Main author: | Jäger, Gerhard |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group 2018 |
Subjects: | Data Descriptor |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6176785/ https://www.ncbi.nlm.nih.gov/pubmed/30299438 http://dx.doi.org/10.1038/sdata.2018.189 |
_version_ | 1783361755701837824 |
---|---|
author | Jäger, Gerhard |
collection | PubMed |
description | Automatic phylogenetic inference plays an increasingly important role in computational historical linguistics. Most pertinent work is currently based on expert cognate judgments. This limits the scope of this approach to a small number of well-studied language families. We used machine learning techniques to compile data suitable for phylogenetic inference from the ASJP database, a collection of almost 7,000 phonetically transcribed word lists over 40 concepts, covering two thirds of the extant world-wide linguistic diversity. First, we estimated Pointwise Mutual Information scores between sound classes using weighted sequence alignment and general-purpose optimization. From this we computed a dissimilarity matrix over all ASJP word lists. This matrix is suitable for distance-based phylogenetic inference. Second, we applied cognate clustering to the ASJP data, using supervised training of an SVM classifier on expert cognacy judgments. Third, we defined two types of binary characters, based on automatically inferred cognate classes and on sound-class occurrences. Several tests are reported demonstrating the suitability of these characters for character-based phylogenetic inference. |
format | Online Article Text |
id | pubmed-6176785 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | Nature Publishing Group |
record_format | MEDLINE/PubMed |
spelling | pubmed-6176785 2018-10-12 Global-scale phylogenetic linguistic inference from lexical resources Jäger, Gerhard Sci Data Data Descriptor Nature Publishing Group 2018-10-09 /pmc/articles/PMC6176785/ /pubmed/30299438 http://dx.doi.org/10.1038/sdata.2018.189 Text en Copyright © 2018, The Author(s) http://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the metadata files made available in this article. |
spellingShingle | Data Descriptor Jäger, Gerhard Global-scale phylogenetic linguistic inference from lexical resources |
title | Global-scale phylogenetic linguistic inference from lexical resources |
topic | Data Descriptor |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6176785/ https://www.ncbi.nlm.nih.gov/pubmed/30299438 http://dx.doi.org/10.1038/sdata.2018.189 |
work_keys_str_mv | AT jagergerhard globalscalephylogeneticlinguisticinferencefromlexicalresources |
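
The description above mentions Pointwise Mutual Information (PMI) scores between sound classes. As a rough illustration only, the sketch below computes PMI from a fixed list of already-aligned segment pairs. It is not the article's procedure, which estimates the scores with weighted sequence alignment and general-purpose optimization. The function name `pmi_scores`, the input format, the toy symbols, and the use of the natural logarithm are all assumptions made for this example.

```python
import math
from collections import Counter


def pmi_scores(aligned_pairs):
    """Return symmetric PMI scores for sound-class pairs.

    aligned_pairs: list of (a, b) tuples, one per aligned segment pair.
    This input format is hypothetical; producing the alignments is out of scope.
    """
    pair_counts = Counter()
    symbol_counts = Counter()
    for a, b in aligned_pairs:
        # Count each pair in both orders so the joint distribution is symmetric.
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1
        symbol_counts[a] += 1
        symbol_counts[b] += 1

    total = sum(pair_counts.values())  # equals 2 * len(aligned_pairs)
    scores = {}
    for (a, b), count in pair_counts.items():
        joint = count / total                 # p(a, b)
        expected = (symbol_counts[a] / total) * (symbol_counts[b] / total)
        scores[(a, b)] = math.log(joint / expected)  # natural log (assumption)
    return scores


if __name__ == "__main__":
    # Toy data: "a" and "b" stand in for sound classes.
    toy = [("a", "a"), ("a", "a"), ("b", "b"), ("a", "b")]
    for pair, score in sorted(pmi_scores(toy).items()):
        print(pair, round(score, 3))
```

A positive score means two sound classes are aligned more often than chance would predict from their individual frequencies; a negative score means less often. Pairs that never co-occur receive no entry here, so a real application would need smoothing or a default value.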