method has been applied in the well-known software Arlequin to estimate
pairwise population differentiation (FST) and gene flow (Nm) through
pairwise comparisons among populations. However, under some
circumstances this method yields FST values below zero. Such results
confuse many beginners, because the FST estimators taught in most
courses are based on Wright's model, in which FST < 0 is impossible.
Here I give a simple description of the method developed by Hudson et
al. and of how it differs from the methods developed by Wright and by
Nei (1982). Two examples are included to show the conditions under
which the method may return FST values below zero: the two populations
differ greatly in size and have very similar genetic compositions.
Since DNA sequences came into use in population genetics, estimates of
differentiation among populations have extended from comparisons
between alleles to comparisons between sequences. Unlike earlier
methods, the newer methods aim to use all the information contained in
the differences among sequences, namely the differences (polymorphisms)
at every nucleotide position.
For example, Hudson et al. (1992) estimate population differentiation
with the following formula:
FST = 1 - Hw / Hb, where
FST is the population differentiation parameter;
Hw is the sum of the differences found in all pairwise comparisons of
sequences within the same population;
Hb is the sum of the differences found in all pairwise comparisons of
sequences from different populations.
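As a rough illustration of these definitions, the following R sketch computes Hw and Hb by brute force from aligned sequences. The function names (hamming, sum_within, sum_between, fst_from_sequences) and the toy sequences are assumptions made for this example; they are not taken from Hudson et al. (1992) or from Arlequin.

```r
# Number of differing positions between two aligned sequences of equal length
hamming <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Sum of differences over all unordered pairs of sequences within one population
sum_within <- function(seqs) {
  if (length(seqs) < 2) return(0)
  pairs <- combn(length(seqs), 2)
  sum(apply(pairs, 2, function(p) hamming(seqs[p[1]], seqs[p[2]])))
}

# Sum of differences over all pairs made of one sequence from each population
sum_between <- function(seqs1, seqs2) {
  sum(outer(seq_along(seqs1), seq_along(seqs2),
            Vectorize(function(i, j) hamming(seqs1[i], seqs2[j]))))
}

# FST = 1 - Hw / Hb, with Hw and Hb defined as sums (as in the text above)
fst_from_sequences <- function(seqs1, seqs2) {
  hw <- sum_within(seqs1) + sum_within(seqs2)
  hb <- sum_between(seqs1, seqs2)
  list(Hw = hw, Hb = hb, FST = 1 - hw / hb)
}

# Hypothetical toy data: two populations of three aligned 4-bp sequences
pop1 <- c("ACGT", "ACGT", "ACGA")
pop2 <- c("TCGT", "TCGT", "TCGA")
res <- fst_from_sequences(pop1, pop2)
print(res)
```

For these toy sequences the within-population sum is 4 and the between-population sum is 13, so the sketch returns FST = 1 - 4/13.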
This method differs from Wright's (FST = 1 - Hs / Ht), the one found in
most textbooks, in two ways. First, Hudson et al. treat the sequences
of a subpopulation as a whole and take the differences between every
pair of sequences within it as the numerator Hw, whereas Wright's
method considers the frequency of heterozygotes in a diploid
population. Second, their denominator Hb contains only the differences
between pairs of sequences 'from different populations', whereas
Wright's Ht is the heterozygote frequency expected from the allele
frequencies of the total population. Because Wright's method does not
consider haploid populations made up of haplotypes, it cannot in theory
be applied directly to sequence data from organelles (e.g. chloroplasts
or mitochondria). Moreover, Wright's model only registers whether two
alleles differ, not by how much, while the method of Hudson et al.
weights the difference between two sequences by the number of
nucleotide positions at which they differ, which makes it better suited
to estimating differentiation among haplotype populations.
However, some of the assumptions made in developing this method lead to
surprising results under certain conditions.
The following two examples illustrate this for the method of Hudson et
al. (1992). In both examples, the difference (number of nucleotide
positions) between any two haplotypes is assumed to be 1.
Case 1:
Haplotype      A   B   C   Total
Population 1   4   3   1       8
Population 2   3   1   0       4
Hw = Hw1 + Hw2 = (12 + 3 + 4) + 3 = 22
Hb = Hb(AB) + Hb(AC) + Hb(BC) = (4 + 9) + 3 + 1 = 17
FST = 1 - Hw / Hb = 1 - 22 / 17 = -0.29411... ≈ -0.294
Case 2:
Haplotype       A   B   C   D   E   F   G   Total
Population 1   16   1   1   1   0   0   0      19
Population 2   29   2   0   0   1   1   1      34
Hw = Hw1 + Hw2 = (16 * 3 + 3) + (29 * 5 + 2 * 3 + 3) = 51 + 154 = 205
Hb = Hb(AB) + Hb(AC) + Hb(AD) + Hb(AE) + Hb(AF) + Hb(AG)
   + Hb(BC) + Hb(BD) + Hb(BE) + Hb(BF) + Hb(BG)
   + Hb(CE) + Hb(CF) + Hb(CG) + Hb(DE) + Hb(DF) + Hb(DG)
   = (32 + 29) + 29 * 2 + 16 * 3 + 2 * 2 + 9 = 180
FST = 1 - Hw / Hb = 1 - 205 / 180 = -0.138888... ≈ -0.139
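The arithmetic in Cases 1 and 2 can be checked with a short R sketch. It assumes, as the examples do, that any two different haplotypes differ at exactly one position, so Hw and Hb reduce to counts of non-identical pairs. The function names are hypothetical.

```r
# Non-identical pairs within one population, given its haplotype counts:
# (all ordered pairs minus ordered pairs of identical haplotypes) / 2
hw_counts <- function(counts) {
  n <- sum(counts)
  (n^2 - sum(counts^2)) / 2
}

# Non-identical pairs with one sequence drawn from each population
hb_counts <- function(c1, c2) {
  sum(c1) * sum(c2) - sum(c1 * c2)
}

# FST = 1 - Hw / Hb for two populations described by haplotype counts
fst_counts <- function(c1, c2) {
  hw <- hw_counts(c1) + hw_counts(c2)
  hb <- hb_counts(c1, c2)
  c(Hw = hw, Hb = hb, FST = 1 - hw / hb)
}

# Case 1: haplotypes A, B, C
case1 <- fst_counts(c(4, 3, 1), c(3, 1, 0))
# Case 2: haplotypes A to G
case2 <- fst_counts(c(16, 1, 1, 1, 0, 0, 0), c(29, 2, 0, 0, 1, 1, 1))

print(round(case1, 3))  # Hw = 22, Hb = 17, FST = -0.294
print(round(case2, 3))  # Hw = 205, Hb = 180, FST = -0.139
```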
Both examples show a pair of populations that differ greatly in size
(a ratio close to 1:2) and have very similar haplotype classes and
frequencies. Under these circumstances, the method of Hudson et al. may
give an Hw larger than Hb, and therefore FST < 0.
In the original article of Hudson et al. (1992), the size of every
subpopulation was set to 16 (sequences) in the simulations. With equal
sizes, FST does not fall below zero even when the shared haplotypes
have similar frequencies in both populations, as Case 3 shows.
Case 3, in which the frequency of each haplotype within each
population is the same as in Case 1:
Haplotype       A   B   C   Total
Population 1    8   6   2      16
Population 2   12   4   0      16
Hw = Hw1 + Hw2 = (48 + 16 + 12) + 48 = 124
Hb = Hb(AB) + Hb(AC) + Hb(BC) = (32 + 72) + 24 + 8 = 136
FST = 1 - Hw / Hb = 1 - 124 / 136 = 0.088235... ≈ 0.088
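To see how the result depends on the size ratio alone, the following R sketch keeps the haplotype frequencies of Case 1 fixed and varies only the size of population 2. The function name fst_sum and the chosen sizes are assumptions made for this illustration, and the distance between any two different haplotypes is again taken to be 1.

```r
# FST = 1 - Hw / Hb from haplotype counts, assuming a distance of 1
# between any two different haplotypes (as in the cases above)
fst_sum <- function(c1, c2) {
  hw <- (sum(c1)^2 - sum(c1^2)) / 2 + (sum(c2)^2 - sum(c2^2)) / 2
  hb <- sum(c1) * sum(c2) - sum(c1 * c2)
  1 - hw / hb
}

pop1 <- c(A = 4, B = 3, C = 1)          # population 1 of Case 1, size 8
freq2 <- c(A = 0.75, B = 0.25, C = 0)   # haplotype frequencies of population 2

# Hypothetical sizes for population 2; multiples of 4 keep the counts integer
sizes2 <- c(4, 8, 12, 16, 32)
fst <- sapply(sizes2, function(n2) fst_sum(pop1, freq2 * n2))

print(data.frame(size1 = sum(pop1), size2 = sizes2, FST = round(fst, 3)))
```

With a population 2 of size 4 the sketch reproduces the negative value of Case 1, and with equal sizes (8 and 8) it gives the same value as Case 3, since only the frequencies and the size ratio enter the calculation.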
An FST below zero can be interpreted as within-population variation
exceeding between-population variation. Fundamentally, however, it
arises because the method of Hudson et al. neither samples with
replacement (so a sampled sequence is never compared with itself) nor
treats all subpopulations as one total population (so comparisons
between sequences from the same subpopulation are left out of the
denominator). Both kinds of comparison are included in Nei's (1982)
estimate of δst:
δst = πT - πS, where
πT is the nucleotide diversity of the total population
(which does not exclude the nucleotide diversity contributed within
each subpopulation);
πS is the average of the nucleotide diversities of the
subpopulations.
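The following R sketch applies δst = πT - πS to the haplotype counts of Case 1. It is only an illustration under stated assumptions, not Nei's estimator as implemented in any software: diversity is computed with replacement as 1 - sum(p^2), the distance between any two different haplotypes is 1, the total population is represented by the unweighted mean of the subpopulation frequencies, and no sample-size correction is applied. The function name is hypothetical.

```r
# Nucleotide diversity under sampling with replacement, assuming a distance
# of 1 between any two different haplotypes: pi = 1 - sum(p^2)
pi_with_replacement <- function(freq) 1 - sum(freq^2)

# Haplotype counts from Case 1
counts <- rbind(pop1 = c(A = 4, B = 3, C = 1),
                pop2 = c(A = 3, B = 1, C = 0))
freqs <- counts / rowSums(counts)   # per-population haplotype frequencies

# piS: average of the within-subpopulation diversities
pi_s <- mean(apply(freqs, 1, pi_with_replacement))

# piT: diversity of the total population; ASSUMPTION: the total population
# is represented by the unweighted mean of the subpopulation frequencies
pi_t <- pi_with_replacement(colMeans(freqs))

delta_st <- pi_t - pi_s
print(c(piT = pi_t, piS = pi_s, delta_st = delta_st))
```

Under these assumptions πT includes the within-subpopulation comparisons, so it cannot be smaller than πS and δst stays at or above zero for these data.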
In practice, therefore, the simplest way to deal with estimates of
FST < 0 is to use Nei's (1982) method instead of the one from Hudson et
al. (1992). This method is available in the software DnaSP.
Further reading:
Hudson RR, Slatkin M, Maddison WP (1992). Estimation of levels of
gene flow from DNA sequence data. Genetics 132, 583-589.
Nei M (1982). Evolution of human races at the gene level, pp.
167-181. In B. Bonne-Tamir, T. Cohen, and R. M. Goodman (eds.),
Human genetics, part A: The unfolding genome. Alan R. Liss, New York.
Wright S (1951). The genetical structure of populations. Annals of
Eugenics 15, 323-354.
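Based on Hudson's Fst, suppose a site has two alleles and allele A
occurs at frequency x in one group and y in the other. Then
fst = (x^2 + y^2 - 2*x*y) / (x + y - 2*x*y). A simulation: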
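library(car)  # provides scatter3d(); it needs the rgl package installed

# Frequency of allele A in group 1 (x) and group 2 (y), on a 0.01 grid
freqs <- seq(0.01, 1, 0.01)
grid <- expand.grid(x = freqs, y = freqs)

# Fst for one biallelic site: (x^2 + y^2 - 2xy) / (x + y - 2xy)
grid$fst <- with(grid, (x^2 + y^2 - 2 * x * y) / (x + y - 2 * x * y))

# x = y = 1 gives 0/0 (NaN); drop that point before plotting
grid <- grid[is.finite(grid$fst), ]

# Same two plots as before: with and without the fitted surface
scatter3d(grid$x, grid$y, grid$fst)
scatter3d(grid$x, grid$y, grid$fst, surface = FALSE, point.col = 2)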
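[Figure: 3D scatter plot of fst against x and y]
Conclusion: when the two subgroups are of similar size, the more their
frequency compositions differ, the larger the Fst; identical frequency
compositions give the smallest Fst.
The case in which the two groups differ in size by a factor of C is
left for later discussion.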
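The expression used in the simulation can be recovered from FST = 1 - Hw / Hb for a single site with two alleles. The derivation below assumes sampling with replacement and equal weight for the two groups, so that Hw is the mean within-group probability that two sequences differ and Hb is the corresponding between-group probability. These assumptions are inferred here; the comment itself does not state them.

```latex
\begin{aligned}
H_w &= \tfrac{1}{2}\bigl[2x(1-x) + 2y(1-y)\bigr] = x(1-x) + y(1-y),\\
H_b &= x(1-y) + y(1-x) = x + y - 2xy,\\
F_{ST} &= 1 - \frac{H_w}{H_b}
        = \frac{(x + y - 2xy) - (x - x^2) - (y - y^2)}{x + y - 2xy}
        = \frac{x^2 + y^2 - 2xy}{x + y - 2xy}
        = \frac{(x - y)^2}{x + y - 2xy}.
\end{aligned}
```

Because the numerator is a square, this form is never negative wherever the denominator is positive, which matches the equal-size behaviour described in the conclusion.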