Saturday, 17 January 2015

Studying the Nahuan family with Methods of Molecular Biology

In this blog post I will give a sneak peek at some ongoing work of mine: learning the techniques biologists use to build family trees (called phylogenies) for species or other groups of organisms, and applying those techniques to the relations between Nahuan dialects.

As we know, languages form "families", which means that a group of languages can trace its lineage back to a shared ancestor. Language families arise because most languages split into daughter languages, each of which diverges from the parent in a different way. We can observe this process when we look at the dialects of a single language: each dialect is a variety that is in the process of breaking away from the parent language and forming its own language. Divergence like this happens because all languages innovate, introducing new ways of speaking that didn't exist in the parent language.

The same thing happens, of course, in biological organisms, where new mutations accumulate and spread in the DNA of a population, causing daughter populations to diverge from the parent population. Biologists working with molecular data compare and trace these mutations, and reconstruct the genealogy of an entire family of related organisms by grouping together those organisms that share specific mutations.

Because each new way of speaking that arises and spreads in a community can be compared to a mutation in the DNA that spreads in a population, we can use the same methods that biologists use to trace biological phylogenies to understand the divergence of dialects.

Nahuan Dialects:

I am in the process of understanding how all the different regional varieties (dialects) of Nahuatl are related to each other. This would help us understand where the Nahuan parent language (called proto-Nahua by linguists) originated and how it spread. In order to do this I have to find out which shared innovations characterize each dialect area, and then trace them back to build a family tree. I am well under way with this project, and I presented a preliminary version of it at the Meeting of the Friends of Uto-Aztecan in Nayarit this summer (Here is a link to the paper).

But what I am doing now is mapping this information into the same kind of model that biologists use, in order to apply statistical methods to better understand the relations. Doing this allows me to generate nice graphic trees using a program called Mesquite, and this is what I want to let you peek at. 

Here is an example of how each dialect (identified with a three-letter code) is plotted into a matrix of shared innovations. These are only some of the characters I am using.

The data is entered in binary form: each innovation present in a dialect is entered as a 1, and the lack of a specific innovation (i.e. when the original form is retained) is entered as a 0. Each 1/0 position is called "a character". By listing all the characters that I am tracking, each dialect is identified with a string of ones and zeros (one row of the character matrix). And by analyzing all the characters together a phylogeny is formed. For example, the dialect area of Morelos is identified with the string 011110101 - 100000100001 - 1000000000, where each 1 is an innovation the dialects in Morelos have, and each 0 an innovation they don't have.
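To make the encoding concrete, here is a minimal Python sketch of such a character matrix. Only the Morelos string is taken from this post; the rows labelled XX1 and XX2 are invented placeholders, and the two helper functions (shared_innovations and hamming) are names made up for this illustration. They are not part of Mesquite or of the actual analysis.

```python
from itertools import combinations

# Morelos string as quoted in the post (its three blocks joined together).
MOR = "011110101" + "100000100001" + "1000000000"

# Character matrix: dialect code -> string of 0/1 character states.
# "XX1" and "XX2" are invented placeholder rows, NOT real dialect data.
matrix = {
    "MOR": MOR,
    "XX1": "011110100" + "100000100000" + "1000000000",
    "XX2": "000000001" + "000000000001" + "0000000001",
}


def shared_innovations(a, b):
    """Return the indices of characters where both rows have a 1."""
    if len(a) != len(b):
        raise ValueError("rows must have the same number of characters")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x == "1" and y == "1"]


def hamming(a, b):
    """Return the number of characters on which two rows differ."""
    if len(a) != len(b):
        raise ValueError("rows must have the same number of characters")
    return sum(x != y for x, y in zip(a, b))


# Compare every pair of rows: shared innovations and number of differences.
for (name_a, row_a), (name_b, row_b) in combinations(matrix.items(), 2):
    print(name_a, name_b,
          len(shared_innovations(row_a, row_b)),
          hamming(row_a, row_b))
```

Counting shared 1s is only the raw signal. A real phylogenetic analysis weighs all the characters together rather than comparing rows pair by pair.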

I am working with 22 distinct dialects; most of them are regional, though a few are limited to a single community. The dialects and the codes I use to identify them are: Durango/Mexicanero (DUR), Michoacan (around Pomaro) (MCH), Mexico State (around Toluca) (MEX), North Guerrero (around Coatepec Costales), Federal District (mainly Texcoco) (DFE), North Puebla (around Huauhchinango) (NPU), Morelos (MOR), Tetelcingo, Morelos (TET), South Puebla (around Tehuacan) (SPU), Zongolica, Veracruz (ZON), Tlaxcala (TLX), Western Huasteca (Hidalgo/San Luis), Eastern Huasteca (Hidalgo/Veracruz), Sierra de Puebla (SDP), Isthmus (area north of Coatzacoalcos, Veracruz) (IST), Pajapan, Veracruz (a town in the Isthmus area), Tabasco Nawat (TAB), Chiapas Nahuat (extinct) (CHS), Pipil Nawat of El Salvador (PIP), Central Guerrero Nahuatl (GRO), Oapan (a town in Central Guerrero) and Southern Guerrero Nahuatl (CGR). Each of these has been assigned a specific binary string, although I still need to double check some of them in the literature to be sure I am assigning the right values, and I may add more characters to the strings.
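Mesquite works with files in the NEXUS format, so a matrix like this can be written out as plain text. The following is a rough sketch of one way to do that in Python, not a description of how the matrix for this project was actually entered. The function name to_nexus is made up for the example, the XX1 row is an all-zero placeholder rather than real data, and the FORMAT line is only one reasonable choice for binary characters.

```python
def to_nexus(matrix):
    """Render a dict {taxon_code: '0101...'} as a NEXUS DATA block."""
    if not matrix:
        raise ValueError("matrix is empty")
    lengths = {len(row) for row in matrix.values()}
    if len(lengths) != 1:
        raise ValueError("all rows must have the same number of characters")
    nchar = lengths.pop()
    width = max(len(name) for name in matrix)  # pad codes so rows line up
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(matrix)} NCHAR={nchar};",
        '  FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=? GAP=-;',
        "  MATRIX",
    ]
    for name, row in matrix.items():
        lines.append(f"    {name.ljust(width)}  {row}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


example = {
    # Morelos string as quoted in the post.
    "MOR": "011110101" + "100000100001" + "1000000000",
    # Placeholder row (all characters retained), NOT real dialect data.
    "XX1": "0" * 31,
}

nexus_text = to_nexus(example)
print(nexus_text)
```

Saving nexus_text to a file with a .nex extension should give something that can be opened in a tree-building program. The length check catches rows that are missing a character.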

The preliminary result of my analysis is a tree that looks like this, which is interesting because it has a much more treelike branching structure than most of the previous classifications. Otherwise it does not diverge in major ways from the recent classifications by Canger (1988) or Kaufman (2001). That is partly because I have used their analyses to identify the innovations that have relevance for classification.

The bushiness comes from the fact that I have been able to identify chains of innovations in the eastern branch, where the development of the negation in particular provides a signal that I believe can be traced along several steps. Previous classifications have been shallow bushes because they have not tried to the same extent to trace independent innovations in different areas. Canger and Dakin introduced the basic split between eastern and western branches, based on a single innovation. I have been able to add a few further innovations to the basic split. But this is about the only deep isogloss in the standard classification; the rest of it is basically a synchronic classification of the distinct areas, assigning them to one or the other of the two branches. By looking at the relations between the areas within the two branches I have been able to get a bushier tree.

The tree has two main branches: an Eastern and a Western branch. The Western branch is divided into a Periphery (including the dialects of Durango, Michoacan, Mexico State and North Guerrero), and a Center (including the dialects of the D.F., Morelos, North Puebla, South Puebla, Zongolica and Tlaxcala). The Eastern branch is divided into the Eastern Periphery (including the dialects of the Huasteca, the Isthmus, and Central America), and Guerrero (Central and South). The Eastern Periphery is subdivided into a Huastecan and an Isthmian branch, with the Sierra de Puebla dialects in a kind of intermediate position between the two.
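The grouping just described can also be written in Newick notation, the bracketed text format that most tree programs read. The sketch below rests on a few assumptions of its own. Sierra de Puebla is shown as a third branch next to Huastecan and Isthmian, which is only one way to draw an "intermediate" position. The Central American varieties are left out because their placement within the Eastern Periphery is not spelled out above. The label Western_Periphery and the helpers leaf and to_newick are names invented for the example.

```python
def leaf(name):
    """A tip of the tree: a label with no children."""
    return (name, [])


def to_newick(node):
    """Serialize a (label, children) tree to Newick, without the final ';'."""
    label, children = node
    if not children:
        return label
    return "(" + ",".join(to_newick(child) for child in children) + ")" + label


# Top-level grouping as described in the text (see the assumptions above).
tree = ("Nahuan", [
    ("Western", [
        ("Western_Periphery", [
            leaf("Durango"), leaf("Michoacan"),
            leaf("Mexico_State"), leaf("North_Guerrero"),
        ]),
        ("Center", [
            leaf("DF"), leaf("Morelos"), leaf("North_Puebla"),
            leaf("South_Puebla"), leaf("Zongolica"), leaf("Tlaxcala"),
        ]),
    ]),
    ("Eastern", [
        ("Eastern_Periphery", [
            leaf("Huastecan"), leaf("Sierra_de_Puebla"), leaf("Isthmian"),
        ]),
        ("Guerrero", [
            leaf("Central_Guerrero"), leaf("Southern_Guerrero"),
        ]),
    ]),
])

newick = to_newick(tree) + ";"
print(newick)
```

This only records which groups nest inside which. It carries no branch lengths and says nothing about how well each grouping is supported.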

The next step, after double and triple checking the data for the tree, is to map each dialect to its geographical correlates and use a phylogeographic mapping program to calculate the probable paths that each community took to arrive at its current location.

Meanwhile, here is the pretty tree for you to take a look at:
