Abstract
Revue des données et des solutions informatiques libre d’accès pour croiser les aspects culturels (mêmes) et génétiques (gènes) dans le cadre de la diffusion du Néolithique en EuropeGene-culture coevolution (GC-coev) is one of the main challenge of evolutionary archaeology. It requires the cross study of aDNA (‘Who ?’) and the different aspects of the culture (‘What ?’).
This HTML web page presentation is hosted on this GitHub repository. It results from a R report generation. The source document is a Rmarkdown file using mainly packages like Leaflet and Plotly.
This presentation attempts to illustrate the reuse of existing aDNA data – specifically the mtDNA haplogroups (hg) – and cross-analysis methods with radiocarbon dates and culture data. To simplify the lecture of the presented methods, part of R code used to generate the graphs is shown and the analysis is restricted to the transition between hunter-gatherers (HG) and early farmers (EF) in Central and West Europe
The mitochondrial DNA (mtDNA) makes it possible to trace the maternal line. It passes from the mother to her children (of both sexes). Published mitochondrial sequences coming from the ancient DNA samples (aDNA) can be download from the Ancient mtDNA database (AmtDB). The dataset is composed by the data and the metadata
The data file mtdb_1511.fasta is downloaded from the AmtDB. The current metadata format is .fasta, a text-based format for representing either nucleotide sequences or peptide sequences, in which base pairs or amino acids are represented using single-letter codes. FASTA formats can be read with the phylotools package. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data:
fasta <- read.fasta("mtdb_1511.fasta")
fasta$seq.name[1]
## [1] "RISE509"
substr(fasta$seq.text[1], 1, 250)
## [1] "GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAAT"
Each of these identifiers is a unique key for a sample. The data file counts 1511 different identifiers. These identifiers which allow to join the second dataset, metadata for cross-analysis
The metadata file metadata is a .csv file recording the site name, culture, epoch, geographical coordinates, etc.:
identifier | alternative_identifiers | country | continent | region | culture | epoch | group | comment | latitude | longitude | sex | site | site_detail | mt_hg | ychr_hg | year_from | year_to | date_detail | bp | c14_lab_code | reference_names | reference_links | reference_data_links | c14_sample_tag | c14_layer_tag | avg_coverage | sequence_source | ychr_snps |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HUGO_169sk2 | 169 SK 2 | Germany | Europe | central Europe | Bell Beaker | Bronze Age | BBC | 48.33041 | 10.899122 | F | Augsburg, Hugo-Eckener-Straße | U5a2b3 | -2461 | -2211 | cal BC 2461-2211 | 3863±25 | MAMS-18915 | Knipper et al. 2017 | https://doi.org/10.1073/pnas.1706355114 | https://www.ncbi.nlm.nih.gov/popset/1240339410 | 1 | 0 | 35.100 | reconstructed | ||||
I5023 | RISE565 | Germany | Europe | central Europe | Bell Beaker | Copper Age | BBC | 48.69100 | 13.016000 | F | Osterhofen-Altenmarkt, Bavaria | U2e1a1 | -2500 | -2000 | 2500-2000 BCE | Olalde et al. 2018 | https://dx.doi.org/10.1038/nature25738 | http://www.ebi.ac.uk/ena/data/view/PRJEB23635 | 0 | 0 | 530.454 | bam | ||||||
CL87 | Italy | Europe | southern Europe | Longobards | Middle Ages | LONG | 45.08333 | 7.583333 | F | Collegno | H1 | 570 | 880 | 570-880 CE (Phase 1: 570/590–630/640, Phase 2: 640–700, Phase 3: 8th century) | Amorim et al. 2018 | https://doi.org/10.1038/s41467-018-06024-4 | https://www.ncbi.nlm.nih.gov/biosample/8513217 | 0 | 1 | 18.300 | reconstructed |
It counts 2426 samples and 29 columns
The mtDNA data are joined to their associated radiocarbon dates (mtdb_metadata.csv .csv file). The dates have a unique laboratory number (LabCode) which can be easily be connected to existing databases
The dataset needs to be cleaned by keeping interesting fields and removing samples with typo errors (like longitude < - 90) and data without radiocarbon information. In the AmDB database, the radiocarbon field ‘bp’ is not strictly formatted, it needs to be edited with regular expressions (regex) by splitting the values from ‘bp’ field (split on ‘±’) to copy them into two columns, ‘c14bp’ and ‘c14sd.’ Finally, to simplify the output readings, the analysis focus on Neolithic in Spain
selected.fields <- c("identifier", "site", "country", "culture", "epoch", "mt_hg", "c14_lab_code", "bp", "longitude", "latitude", "reference_names", "reference_links")
df <- read.csv(metadata, encoding = "UTF-8")
df <- df[, selected.fields]
df <- df %>%
filter(mt_hg != "" & !is.na(mt_hg)) %>%
filter(bp != "" & !is.na(bp)) %>%
filter(c14_lab_code != "" & !is.na(c14_lab_code)) %>%
filter(str_detect(bp, "±")) %>%
filter(!str_detect(bp, "BP|and")) %>%
filter(country == "Spain") %>%
filter(epoch == 'Neolithic')
bps <- unlist(sapply(df$bp,
function(x) strsplit(x, "±")),
use.names = FALSE)
mat <- matrix(bps, ncol=2, byrow=TRUE)
df$c14bp <- as.integer(mat[, 1])
df$c14sd <- as.integer(mat[, 2])
df$bp <- NULL
Now the new metadata dataset has 10 samples and 13 columns:
identifier | site | country | culture | epoch | mt_hg | c14_lab_code | longitude | latitude | reference_names | reference_links | c14bp | c14sd |
---|---|---|---|---|---|---|---|---|---|---|---|---|
I0413 | Els Trocs | Spain | Iberia_EN | Neolithic | V | MAMS-16166 | 0.500000 | 42.5000 | Haak et al. 2015 | https://dx.doi.org/10.1038/nature14317 | 6234 | 28 |
Eiros2 | Eiros, Triacastela, Lugo | Spain | Iberia_MN/CA | Neolithic | U5b3 | Beta-356716 | -7.213395 | 42.7680 | Gonzalez-Fortes et al. 2019 | https://doi.org/10.1098/rspb.2018.2288 | 4750 | 30 |
I0409 | Els Trocs | Spain | Iberia_EN | Neolithic | J1c3 | MAMS-16159 | 0.500000 | 42.5000 | Haak et al. 2015 | https://dx.doi.org/10.1038/nature14317 | 6280 | 25 |
I0412 | Els Trocs | Spain | Iberia_EN | Neolithic | N1a1a1 | MAMS-16164 | 0.500000 | 42.5000 | Haak et al. 2015 | https://dx.doi.org/10.1038/nature14317 | 6249 | 28 |
I0410 | Els Trocs | Spain | Iberia_EN | Neolithic | T2c1d or T2c1d2 | MAMS-16161 | 0.500000 | 42.5000 | Haak et al. 2015 | https://dx.doi.org/10.1038/nature14317 | 6217 | 25 |
Val05 | Valdavara 1, Becerrea, Lugo | Spain | Iberian_MN | Neolithic | H1ay | Beta-235729, Beta-235730 | -7.231459 | 42.8571 | Gonzalez-Fortes et al. 2019 | https://doi.org/10.1098/rspb.2018.2288 | 4490 | 40 |
The dataset can be spatialized
Figure 1: mtDNA samples for the Neolithic of Spain
R offers packages to calibrate dates like Bchron or rcarbon. Using this latter, we plot the dates and add the identifiers of mtDNA samples for each summed probability distributions
idfs <- paste0(df.MesoNeo.b$c14_lab_code, "\n", df.MesoNeo.b$identifier)
c14dates <- calibrate(x = df.MesoNeo.b$c14bp,
errors = df.MesoNeo.b$c14sd,
calCurves = 'intcal20',
ids = idfs,
verbose = FALSE)
multiplot(c14dates,
decreasing = TRUE,
rescale = TRUE,
col.fill = 'lightgrey')
Figure 2: Calibration of radiocarbon dates associated with mtDNA data for the Neolithic of Spain
If one wants to study contemporaneous mtDNA samples, a statistical test has be performed, like a Mann-Whitney (function wilcox.test) on the ageGrid (results of the function BchronCalibrate) for each radiocarbon date pairwise. Each pairwise having a p-value superior to the conventional statistical threshold (0.05) we considered a homogeneity of distribution: the two dates are the same (i.e., contemporaneous)
l <- list()
for(i in 1:nrow(df.MesoNeo.b)){
ids <- df.MesoNeo.b[i, "identifier"]
a.date <- BchronCalibrate(ages=df.MesoNeo.b[i, "c14bp"],
ageSds=df.MesoNeo.b[i, "c14sd"],
calCurves='intcal20',
ids=df.MesoNeo.b[i, "identifier"])
lst <- list(eval(parse(text=paste0("a.date$",ids,"$ageGrid"))))
names(lst) <- ids
l[length(l)+1] <- lst
}
n.obs <- sapply(l, length)
seq.max <- seq_len(max(n.obs))
mat <- t(sapply(l, "[", i = seq.max))
rownames(mat) <- df.MesoNeo.b[, "identifier"]
mat.t <- t(mat)
n.col <- 1:length(rownames(mat))
lcompar <- t(combn(n.col, 2))
df.test <- data.frame(a = character(0),
b = character(0),
pval = integer(0))
for(i in 1:nrow(lcompar)){
p1 <- lcompar[i, 1]
p2 <- lcompar[i, 2]
a <- colnames(mat.t)[p1]
b <- colnames(mat.t)[p2]
pval <- round(wilcox.test(mat.t[, p1], mat.t[, p2])$p.value, 3)
df.test <- rbind(df.test, c(a, b, pval))
df.test <- rbind(df.test, c(b, a, pval))
}
colnames(df.test) <- c("Var1", "Var2", "value")
df.test$value <- as.numeric(df.test$value)
ggplot(data = df.test, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(na.rm = T) +
scale_fill_gradient2(low = "yellow", high = "red", space = "Lab",
na.value = 'white')+
geom_text(aes(Var1, Var2, label = value),
color = "black", size = 2.5)+
theme_minimal()+
theme_bw()+
theme(axis.title = element_blank())+
theme(axis.text=element_text(size=7))+
theme(axis.ticks = element_blank())+
theme(axis.ticks.length=unit(.1, "cm"))+
theme(legend.position = "none")
Figure 3: Square matrix of Mann-Whitney test results on grid ages extents for all pairwise of radiocarbon dates
Only two mtDNA/radiocarbon dates have homogenized grid ages extents and can be considerated has contemporaneous, I0413 and I0410 from the Els Trocs cave (Huesca, Spain) (Haak et al. 2015).
site | country | culture | identifier | mt_hg | c14_lab_code | c14bp | c14sd | reference_names | reference_links |
---|---|---|---|---|---|---|---|---|---|
Els Trocs | Spain | Iberia_EN | I0413 | V | MAMS-16166 | 6234 | 28 | Haak et al. 2015 | https://dx.doi.org/10.1038/nature14317 |
Els Trocs | Spain | Iberia_EN | I0410 | T2c1d or T2c1d2 | MAMS-16161 | 6217 | 25 | Haak et al. 2015 | https://dx.doi.org/10.1038/nature14317 |
Within the AmtDB database, one (1) field concerns the mtDNA hg (‘mt_hg’) and two (2) other fields give insights on the cultural membership of the sample:
To illustrate GC-coev, we will focus on the culture tags (i.e., ‘culture’ column) of the Mesolithic/Neolithic transition (i.e., ‘epoch’ == ‘Mesolithic’ or ‘Neolithic’). Our region of interest is the Central and Western Mediterranean (i.e., from Greece to Spain). Samples with empty values in ‘mt_hg’ are removed. To have an easily visibly of the dataset we remove also the little represented cultures (Freq == 1)
df <- read.csv(metadata, encoding = "UTF-8")
df.MesoNeo <- df %>%
filter(mt_hg != "" & !is.na(mt_hg)) %>%
filter(epoch == 'Mesolithic' | epoch == 'Neolithic') %>%
filter(latitude < 46 & latitude > 35) %>%
filter(longitude > -6 & longitude < 28)
df.MesoNeo <- df.MesoNeo %>%
group_by(culture) %>%
filter(n() > 1)
MNcultures <- as.data.frame(table(df.MesoNeo$culture),
stringsAsFactors = F)
After processing, the new dataset counts 144 samples and 144 distinct cultures. The Plotly package allow to enriched the graphic output by offering hover tools, navigation and data selection button
Figure 4: Counts of mtDNA samples for the main cultures
The selected mtDNA samples can be mapped. R and Leaflet allow to create interactive web maps and many ways to manage maps’ layout (colors, hyperlinks, etc.): mtDNA samples have the same color as the pie chart (Figure 4). Inside their popup (column ‘lbl’), the hg appears always colored in red, culture color depends on the culture, the mtDNA identifier and the bibliographical reference are hyperlinks
df.MesoNeo.a <- df.MesoNeo
df.MesoNeo.a <- merge(df.MesoNeo.a, df.colors, by.x="culture", by.y="Var1", all.x=T)
df.MesoNeo.a$lbl <- paste0("<b>", df.MesoNeo.a$site, "</b> <br>",
"hg: <b> <span style='color:red'>",
df.MesoNeo.a$mt_hg, "</span> </b>, ",
'identifier: <a href=',
shQuote(paste0(amtdb,df.MesoNeo.a[,'identifier'])),
"\ target=\"_blank\"", ">",
df.MesoNeo.a[,'identifier'], "</a> <br>",
"culture: <span style='color:",df.MesoNeo.a$color,";'>",
df.MesoNeo.a$culture, "</span> <i>",
df.MesoNeo.a$epoch, "</i><br>",
'ref: <a href=',
shQuote(paste0(df.MesoNeo.a[,'reference_links'])),
"\ target=\"_blank\"",
">", df.MesoNeo.a[,'reference_names'], "</a>")
leaflet(data = df.MesoNeo.a, width = "100%", height = "500px") %>%
addTiles(group = 'OSM') %>%
addCircleMarkers(layerId = ~identifier,
lng = ~longitude,
lat = ~latitude,
weight = 1,
radius = 3,
popup = ~lbl,
color = ~color,
opacity = 0.7,
fillOpacity = 0.7) %>%
addLegend("bottomleft",
colors = unique(df.MesoNeo.a$color),
labels= unique(df.MesoNeo.a$culture),
title = "cultures",
opacity = 1)
Figure 5: Sample and cleaned dataset
Graph drawing is a well-known heuristic to model relationships between discrete entities. With the R packages igraph and plotly, we can trace an enriched interactive graph to model the dataset structure. The default layout is a force-directed algorithm (e.g. Fruchterman-Reingold layout) allowing to bring closer or move away vertices depending on the edges they share. The graph is a multi-partite one with three (3) classes of vertices: cultures, hg and identifiers. The mtDNA identifiers (leafs of the graph) are URLs hyperlinks pointing to the same samples in the AmtDB database
Figure 6: AmtBD dataset structure modeled with graph
To go further with the GC-coev analysis, we can focus on co-occurences’ between hg and cultures. Here, we simply use the igraph package. The new graph is a bipartite one (i.e, 2-mode graph, two classes of vertices). Its spatialization is different from the (Graph 6) but the relations between hg and cultures are strictly the same
par(mar = c(0, 0, 0, 0))
edges <- df.MesoNeo.a[, c("culture", "mt_hg")]
nds.culture <- as.character(data$Var1)
nds.culture.color <- as.character(data$color)
nds.mt_hg<- unique(df.MesoNeo.a$mt_hg)
nds.mt_hg.color <- rep("grey", length(nds.mt_hg))
thm.ths.nds <- data.frame(name=c(nds.culture, nds.mt_hg),
color=c(nds.culture.color, nds.mt_hg.color),
stringsAsFactors = F)
g <- graph_from_data_frame(edges,
directed=F,
vertices= thm.ths.nds)
g <- simplify(g)
set.seed(123)
layout.fr <- layout_with_fr(g)
plot(g, vertex.label.cex= 0.7, layout = layout.fr,
vertex.frame.color=NA,
vertex.label.color = "black",
vertex.label.family="Helvetica",
vertex.size=9)
Figure 7: Bipartite network of cultures and haplogroups (hg)
A first reading of the graph shows that:
The dataset variability can be reduced with agglomeration techniques like the the community detection algorithm
To to detect communities of each vertex (hg and culture), one can uses the ‘fast greedy’ algorithm (fastgreedy.community) from the igraph package. This algorithm is a hierarchical ranking algorithm where initially each vertex belongs to a distinct community, and the communities are merged iteratively, so that each merge is locally optimal. The algorithm stops when it is no longer possible to increase the modularity, it will be thus gives a grouping as well as a dendrogram. This algorithm is close to the agglomeration of Ward used in hierarchical clustering (HC)
Figure 8: Community detetction on the bipartite network of cultures and haplogroups (hg)
A first reading of the clusters show a clear separation (i.e. edge distance) between three groups:
Surely, this GC-coev attempt needs to be refined. The cultural membership of mtDNA sample can be discussed upstream specifying the cultures they belong to, or downstream, choosing a different community detection algorithm, etc.
The time period covered by the Late Mesolithic-Early Neolithic (LM/EN) transition to the Late Neolithic (LN) represents three major demic diffusions:
Between these major population diffusions, smaller demic changes occurred. LM societies (ca. 7000-6000 BC) are considered to form the background of the early farmers (EF) arrivals. LM societies are composed by hunter-gatherers (HG) with a low net reproduction rate (Diamond 1998), generally a semi-sedentary residence (i.e. foragers), living in small groups with few storage capacities (Bradley 2003) and long-distance exchanges are sparse (Karamitrou-Mentessidi et al. 2013). Direct contacts between these HG and EF, or close settlements, are very few. In Eastern and Northern Europe, despite various material exchanges, few admixtures between EF farmers and indigenous HG have been observed: the hg U5 is typical of HG and the hg N1a is typical of EF. Mesolithics could have backed up to West Europe, the westernmost part of the Eurasian continent, and admixture could be more frequent in the West. In the middle Rhine valley, last HG populations have probably played a role in the Michelsberg cultural affirmation (Beau et al. 2017) and such interrogations also exist for the development of Megalithism along the Atlantic shore (Vincent and Emmanuel 2018).
Indeed, during this period and from Greece to Spain, we observed an absence of occupations (the so-called "hiatus" or "gap"). This "hiatus" is maybe due to a major mobility at the end of the 7th millenium BC linked to a major climatic event (8.2 Ky BP event) with drying and cooling in Europe (Guilaine 2011). The earliest Neolithic culture in Europe concerns the South-eastern part of the continent occupied by the PPC, ca 6500-6000 BC. After PPC diffusion in Greece, Thrace and the Balkan peninsula, a first stop in the progression is observed on its western and northern frontiers, respectively in western Greece with the PPC/ICC transition, and in the western Hungarian with the PPC/LBK transition.
After that, the two main currents for the beginning of farming in Central and Western Europe are the ICC (ca 5800-4900 BC) in the Mediterranean area, and the LBK (ca 5600-4900 BC) in the continent area. These currents where relatively fast even if other delays can also be observed in the diffusion of farming (Guilaine 2000-2001).
ICC current of farming is the fastest but shows spatial discontinuities (leapfrog movements) (Binder et al. 2017). The first phase of ICC, Impressa pioneer front, is short (ca 5800-5600 BC). The material culture of Impressa (e.g. lithics, subsistence system) shows significant differences with the Mesolithic indigenous. But contacts between Mesolithics and Early Neolithics exist: Impressa layers of Arene Candide (Finale Ligure, Liguria) show flints coming from a region controlled by late Mesolithic groups (Monti Lessini in Appenins)(Binder and Maggi 2001). The second phase of ICC, Cardial (ca 5600-5000 BC), shows more connections and admixture with the Mesolithic way of living with, for example, a part of the hunting in the subsistence system and adoption of lithic tools. This latter phase could be interpreted as an acculturation phase between Neolithics and Mesolithics.
LBK current of neolithisation shows a homogeneous culture (settlement, ceramic and lithic productions, subsistence system) recognized for a long time by archaeologists thanks to a favorable taphonomy of the settlements. As said, its diffusion seems to correspond to the IDD model where the demic factor is the most important one. Few genetic admixture has been recognized between Recent LBK farmers and latest HG even in Western Europe (Rivollat et al. 2016).
In Western Europe, first evidence of contacts between Recent LBK and Late Cardial (ca. 5200-4700 BC) takes place on both sides of an almost East-West middle-ground, extended from the middle part of the Rhone valley to the north of the Aquitaine basin. The Middle Neolithic (4500-3500 BC), except for the margins of Europe, is the achievement period of the farming diffusion process. This is a period of relative homogeneity and long-distance trades (green-stones, obsidian, bedoulian flint, etc.). It is not clear if this networks of distance trades were also the place of significant demic diffusion or if they were only traits exchanges. Megalithism starts as a response to a demographic stress (Hamon and Manen 2018). The mining phenomena also starts (e.g. in Chassean and Michelsberg contexts). Mining activities were probably the most complex activities of the period, they requires a complex organization all along the CO. At the westernmost part of the continent, the English Channel between the agro-pastoral continent and the British Islands, is crosses around 4000 BC by early farmers (Collard et al. 2010).