美文网首页js css html
分析 | kaks计算

分析 | kaks计算

作者: shwzhao | 来源:发表于2022-07-05 16:44 被阅读0次

准备文件:test.cds.fatest.pep.fa

$ grep ">" test.pep.fa
>AT3G47090
>AT3G47110
>AT3G47570
>AT3G47580
>AT5G20480
>AT5G39390

1. 分步计算

1.1 序列全局比对

注意:这里我将同源基因两两配对,在实际分析中选择的是共线性块的基因对。

$ # 这里用 mafft
$ mafft --auto --thread 10 test.AT3G47110-AT3G47090.pep.fa > test.AT3G47110-AT3G47090.pep.fa.aln

1.2 pal2nal.pl 回译

  • -output: 输出格式(clustal|paml|fasta|codon),默认clustal
  • -nogap: 删除 gap
$ pal2nal.pl test.AT3G47110-AT3G47090.pep.fa.aln test.AT3G47110-AT3G47090.cds.fa -output clustal -nogap > test.AT3G47110-AT3G47090.fa.aln.clustal 

1.3 KaKs_Calculator2

  • AXTConvertor
    Usage: AXTConvertor [Clustal/Msf/Nexus/Phylip/Pir] [AXT]
AXTConvertor test.AT3G47110-AT3G47090.fa.aln.clustal test.AT3G47110-AT3G47090.fa.aln.axt
  • KaKs_Calculator
    -i: axt 文件,将各基因对axt文件cat到一起,生成test.pep.fa.axt
    -o: 结果文件
    -c: 密码子表
    -m: 计算方法(NG|YN|......),默认MA
$ KaKs_Calculator -i test.pep.fa.axt -o test.pep.fa.kaks
Method(s): MA
Genetic code: 1-Standard Code
Please wait while reading sequences and calculating...
1 AT3G47110&AT3G47090   [OK]
2 AT3G47570&AT3G47090   [OK]
3 AT3G47570&AT3G47110   [OK]
4 AT3G47580&AT3G47090   [OK]
5 AT3G47580&AT3G47110   [OK]
6 AT3G47580&AT3G47570   [OK]
7 AT5G20480&AT3G47090   [OK]
8 AT5G20480&AT3G47110   [OK]
9 AT5G20480&AT3G47570   [OK]
10 AT5G20480&AT3G47580  [OK]
11 AT5G39390&AT3G47090  [OK]
12 AT5G39390&AT3G47110  [OK]
13 AT5G39390&AT3G47570  [OK]
14 AT5G39390&AT3G47580  [OK]
15 AT5G39390&AT5G20480  [OK]
Mission accomplished. (Time elapsed: 0:25)

1.4 yn00

还没看,之后补充

yn00

2. 一步计算

  • ParaAT.pl
    -h: 同源基因对文件
    -n: cds.fa
    -a: pep.fa
    -p: 指定多线程文件
    -m: 指定比对工具(clustalw2|t_coffee|mafft|muscle)
    -g: 去除比对有gap的密码子
    -k: 用KaKs_Calculator 计算kaks值
    -o: 输出结果的目录
    -f: 输出比对文件的格式(axt|fasta|paml|codon|clustal)
$ grep ">" test.pep.fa | sed 's/>//' | awk '{a[NR]=$1;for(i=1;i<NR;i++){print $1"\t"a[i]}}' > geneid.pairs.txt
AT3G47110       AT3G47090
AT3G47570       AT3G47090
AT3G47570       AT3G47110
AT3G47580       AT3G47090
AT3G47580       AT3G47110
AT3G47580       AT3G47570
AT5G20480       AT3G47090
AT5G20480       AT3G47110
AT5G20480       AT3G47570
AT5G20480       AT3G47580
AT5G39390       AT3G47090
AT5G39390       AT3G47110
AT5G39390       AT3G47570
AT5G39390       AT3G47580
AT5G39390       AT5G20480
$ echo 12 > proc # 设置线程数
$ ParaAT.pl -h geneid.pairs.txt -n test.cds.fa -a test.pep.fa -o paraat -m mafft -f axt -g -p proc -kaks
$ ls paraat | head -4
AT3G47110-AT3G47090.cds_aln.axt
AT3G47110-AT3G47090.cds_aln.axt.kaks
AT3G47570-AT3G47090.cds_aln.axt
AT3G47570-AT3G47090.cds_aln.axt.kaks

参考:
简书 | Kaks_calculator计算ka/ks 值
简书 | kaks calculator批量计算多个基因的选择压力kaks值
公众号 | 基因学苑 | 生物信息百Jia软件(三):Muscle
简书 | Mr_我爱读文献 | 学会正确选择多序列比对(coding-sequences)软件

补充:

多序列比对的软件:musclemafftclustalw......
现在常用的是前两者,上面用的mafft,这里看一下muscle

什么时候要用到多序列比对,它的结果能用于什么呢?

1. 用于构建基因树:
1.1 用trimAl 修剪比对结果,用iqtreefasttree等进行建pep树;
1.2 用pal2nal.plcds序列回帖到比对结果,用于构建cds树。

2. 用于构建物种树:将单拷贝基因家族的比对结果串联建树。

3. 共线性块上的基因对进行全局比对,回帖cds序列,用yn00等计算ka、ks值。

如果不正确,希望批评指正!


$ conda search muscle
Loading channels: done
# Name                       Version           Build  Channel
muscle                        3.8.31               0  bioconda
muscle                      3.8.1551               1  bioconda
muscle                      3.8.1551               2  bioconda
muscle                      3.8.1551      h2d50403_3  bioconda
muscle                      3.8.1551      h6bb024c_4  bioconda
muscle                      3.8.1551      h7d875b9_6  bioconda
muscle                      3.8.1551      hc9558a2_5  bioconda
muscle                           5.1      h7d875b9_0  bioconda
muscle                           5.1      h9f5acd7_1  bioconda
  • 运行
$ cat test.fa
>gene1
MRLFLLLAFNALMQLEAYGFTDESDRQALLEIKSQVSESKRDALSAWNNSFP
>gene2
MGVPCIVMRLILVSALLVSVSLEHSDMVCAQTIRLTEETDKQALLEFKETSRVVLG
>gene3
MRLFLLLAFNALMLLETHGFTDETDRQALLQFKSQVSEDKRVVLSSWNHSFPLCNWKGVT
>gene4
MKLFLLLSFSAHLLLGETDRQALLEFKSQVSEGKRDVLSSWNNSFPLCNWKWVT
>gene5
MKLSFSLVFNALTLLLQVCIFAQARFSNETDMQALLEFKSQVSENNKREVLASWNHSSPF
>gene6
MKVCILVFAQARFSNETDMQALLEFKSQVTENKREVLASWNHSFPL
$ muscle -in test.fa -quiet | seqkit seq -w 0
>gene2
MGVPCIVMRLILVSALLVSVSLEHSDMVCAQTIRLTEETDKQALLEFKE-----TSRVVLG---------------
>gene5
-------MKLSFS--LVFNALTLLLQVCIFAQARFSNETDMQALLEFKSQVSENNKREVLASWNHSSPF-------
>gene6
-------MKVCIL---------------VFAQARFSNETDMQALLEFKSQVTE-NKREVLASWNHSFPL-------
>gene4
-------MKLFLL--LSFSAHL------LL------GETDRQALLEFKSQVSE-GKRDVLSSWNNSFPLCNWKWVT
>gene1
-------MRLFLL--LAFNALM------QLEAYGFTDESDRQALLEIKSQVSE-SKRDALSAWNNSFP--------
>gene3
-------MRLFLL--LAFNALM------LLETHGFTDETDRQALLQFKSQVSE-DKRVVLSSWNHSFPLCNWKGVT
  • 其他
$ muscle

MUSCLE v3.8.1551 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.


Basic usage

    muscle -in <inputfile> -out <outputfile>

Common options (for a complete list please see the User Guide):

    -in <inputfile>    Input file in FASTA format (default stdin)
    -out <outputfile>  Output alignment in FASTA format (default stdout)
    -diags             Find diagonals (faster for similar sequences)
    -maxiters <n>      Maximum number of iterations (integer, default 16)
    -maxhours <h>      Maximum time to iterate in hours (default no limit)
    -html              Write output in HTML format (default FASTA)
    -msf               Write output in GCG MSF format (default FASTA)
    -clw               Write output in CLUSTALW format (default FASTA)
    -clwstrict         As -clw, with 'CLUSTAL W (1.81)' header
    -log[a] <logfile>  Log to file (append if -loga, overwrite if -log)
    -quiet             Do not write progress messages to stderr
    -version           Display version information and exit

Without refinement (very fast, avg accuracy similar to T-Coffee): -maxiters 2
Fastest possible (amino acids): -maxiters 1 -diags -sv -distance1 kbit20_3
Fastest possible (nucleotides): -maxiters 1 -diags

看来我的版本比较老了

相关文章

网友评论

    本文标题:分析 | kaks计算

    本文链接:https://www.haomeiwen.com/subject/mmpsbrtx.html