fel
案例路径
- 23 /home/databank/zzy/project/usp18/MD_whole/100ns_1025/analysis/fel/noN_rms_gyrate
前言
以下图为例,以本人的理解,FEL本质上是一张二维散点图。这张图中点密集的地方在物理意义上代表低能(下图蓝色),反之,点稀疏处代表高能(下图红色)。FEL中的低能区域往往代表一个蛋白的优势构象,即它在溶液中倾向于以何种样子存在。
那么跑完了MD后怎么做一个FEL呢?这里先要说一下,CMD也是能做FEL的。只是局限于采样能力,往往无法拿到完整的自由能面,比如只能拿到下图的右侧,一个能量洼点。那么我们自然是想拿到尽可能完整的自由能面,那么我们就需要增强采样(remd,umbrella sampling,SMD等等)。
以REMD为例,我们跑很多条在不同温度下的轨迹。统计的时候,我们把这些高温轨迹都reweighting他们到300k,因为我们一般感兴趣常温(实验条件)下的蛋白自由能面。怎么得到二维散点图中的数据点呢?还是看下图,实际上就是计算轨迹中每一帧构象的2个具体的cv(Collective Variables, 用来描述体系的变化,比如蛋白构象转变)。下图选中了蛋白去掉N端的rmsd和蛋白去掉N端的回旋半径作为cv。选择合适的,能尽可能体现体系变化的cv对于画出一个能够展示出真正蛋白构象转变的FEL很重要。
综上所述,我印象里的FEL是这回事,如果以后发现有错误要告诉我:)
、
计算cv(数据点的x,y轴数据)
- 23 /home/databank/zzy/project/usp18/MD_whole/100ns_1025/analysis/fel/noN_rms_gyrate
for 循环就是,你vsREMD跑了几条轨迹,你当然就要循环几次。我这里是26条轨迹
- mkdir xvg
rmsd计算(交后台,要算挺久)
- ##### 给定index.ndx 来选择特定部分计算rmsd
- nohup sh -c 'for i in {1..26};do echo 18 18|gmx_mpi rms -s ../../../tpr/md${i}.tpr -f ../../align/${i}.xtc -n noN.ndx -o xvg/x_${i}.xvg;done' &
# echo 18 18 是选择rmsd计算的用哪些原子叠合,哪些原子计算。这里我的体系,我用的是自己创建的index
gyrate(回旋半径)计算
- nohup sh -c 'for i in {1..26};do echo 18|gmx_mpi gyrate -s ../../../tpr/md${i}.tpr -f ../../align/${i}.xtc -n noN.ndx -o xvg/y_${i}.xvg;done' &
整理x,y轴数据
- cd xvg
对rmsd和gyrate的输出文件xvg进行文本处理。整理为FEL 数据点的 x,y 轴数据(此时距离这些数据成为绘制FEL的输入数据还差根据玻尔兹曼分布reweighting 源自高温副本的数据到300k)
- # 删除#/@开头的行
- for i in {1..26};do sed -i '/^#/d' x_${i}.xvg;done
- for i in {1..26};do sed -i '/^@/d' x_${i}.xvg;done
- for i in {1..26};do sed -i '/^#/d' y_${i}.xvg;done
- for i in {1..26};do sed -i '/^@/d' y_${i}.xvg;done
- for i in {1..26};do cat y_${i}.xvg | cut -b 4-23 > cuty_${i}
.xvg; done - cd ../input
- for i in {1..26};do python xvg_combine.py ../xvg/x_${i}.xvg ../xvg/cuty_${i}.xvg output${i}.xvg;done
- for i in {1..26};do cat output${i}.xvg | cut -b 16-100 > input_md${i}.xvg;done
- for i in {1..26};do sed -i '1 i aax y' input_md${i}.xvg;done
- for i in {1..26};do sed -i 's/aa/ /g' input_md${i}.xvg;done
- mkdir csv
- cd csv
- for i in {1..26};do awk '{print($1,",",$2)}' ../input_md${i}.xvg > input_md${i}.csv;done
- for i in {1..26};do mv input_md${i}.csv input-md${i}.csv;done
- for i in {1..26};do sed -i 's/x , y/x,y/g' input-md${i}.csv;done
- 3 编号从md0开始
- for a in {1..26};do b=
`expr ${a} - 1`;mv input-md${a}.csv input-md${b}.csv;done
看一下 input-md${i}.csv 该长啥样(以 input-md25.csv 为例)
reweighting
根据玻尔兹曼分布,不同的温度可能reweighting系数不一样,那么你就还要制作一份“副本-温度”的列表,这里我的命名是 input_list.csv
脚本如下:
- import numpy as np
- import pandas as pd
- input_list = 'input_list.csv'
- reference_temp = 300.00
- x_grid_num = 50
- y_grid_num = 50
- output_ = 'ref_noN_rms_gyrate.csv'
- def location(m,m_grid):
- if m % m_grid == 0:
- m_loc = m // m_grid - 1
- else:
- m_loc = m // m_grid
- if m_loc == -1:
- m_loc = 0
- return int(m_loc)
- def count(data,re_factor=1):
- table = [[0 for i in range(x_grid_num)] for i in range(y_grid_num)]
- table_re = [[0 for i in range(x_grid_num)] for i in range(y_grid_num)]
- total_num =len(data)
- x= data.loc[:,'x']
- y= data.loc[:,'y']
- for i in range(total_num):
- x_loc = location((x[i] - x_min),x_grid)
- y_loc = location((y[i] - y_min),y_grid)
- table[x_loc][y_loc] += 1
- for i in range(x_grid_num):
- for j in range(y_grid_num):
- table_re[i][j] = pow((table[i][j] / total_num),re_factor) * total_num
- return table_re
- input_ = pd.read_csv('input_list.csv',sep=',')
- xmin = []
- ymin = []
- xmax = []
- ymax = []
- for file in input_.loc[:,'file']:
- data_=pd.read_csv(file,sep=',')
- xmin.append(min(data_.loc[:,'x']))
- ymin.append(min(data_.loc[:,'y']))
- xmax.append(max(data_.loc[:,'x']))
- ymax.append(max(data_.loc[:,'y']))
- x_min, x_max = min(xmin), max(xmax)
- y_min, y_max = min(ymin), max(ymax)
- x_grid = (x_max - x_min) / x_grid_num + 0.1
- y_grid = (y_max - y_min) / y_grid_num + 0.1
- count_n = [[0 for i in range(x_grid_num)] for i in range(y_grid_num)]
- E = [[0 for i in range(x_grid_num)] for i in range(y_grid_num)]
- for i in range(len(input_)):
- data_=pd.read_csv(input_.loc[:,'file'][i],sep=',')
- temp = input_.loc[:,'temp'][i]
- re_factor = temp / reference_temp
- count_ = count(data_,re_factor)
- for m in range(x_grid_num):
- for n in range(y_grid_num):
- count_n[m][n] += count_[m][n]
- if i == int(len(input_)-1):
- N_sum = 0
- for m in range(x_grid_num):
- for n in range(y_grid_num):
- N_sum += count_n[m][n]
- maxE = 0
- for m in range(x_grid_num):
- for n in range(y_grid_num):
- if count_n[m][n] != 0:
- E[m][n] = -1.3717*np.log10(count_n[m][n]/N_sum) # kcal/mol
- if E[m][n] > maxE:
- maxE = E[m][n]
- count_out = ''
- for m in range(x_grid_num):
- for n in range(y_grid_num):
- x_loc = m * x_grid + x_min
- y_loc = n * y_grid + y_min
- if E[m][n] == 0:
- count_out += '%.4f,%.4f,%.8f\n' % (x_loc, y_loc, maxE)
- else:
- count_out += '%.4f,%.4f,%.8f\n' % (x_loc, y_loc, E[m][n])
- with open(output_,'w') as f:
- f.write('x,y,e\n' + count_out)
- print('Job is done!')
- else:
- print('md' + str(i) + ' is done!')
- continue
运行后,得到reweighting后的数据ref_noN_rms_gyrate.csv,x,y轴单位都是nm
我习惯用埃米做单位
- awk -F ',' '{print $1*10,$2*10,$3}' noN_rms_gyrate.csv > noN_rms_gyrate_A.csv
作图
但是不重要,都可以作图,脚本如下
- import pandas as pd
- import matplotlib.pyplot as plt
- import matplotlib as mpl
- import numpy as np
- import pyemma
- data_re=np.array(pd.read_csv('./ref_noN_rms_gyrate_A.csv',sep=','))
- fig, ax = plt.subplots(figsize=(6, 4))
- pyemma.plots.plot_contour(*data_re[:, :3].T,cmap='rainbow',method='cubic',alpha=1,ncontours=15,ax=ax)
- ax.set_xticks(np.linspace(0,25,11))
- ax.set_xlim([2.729,25])
- ax.set_yticks(np.linspace(10,27,16))
- ax.set_ylim([19.726,27])
- ax.set_xlabel('RMSD of protein without N_loop (Å)')
- ax.set_ylabel('Gyrate of protein without N_loop (Å)')
- ax.set_title('300K-reweighting (20-120ns)',fontweight='bold')
- fig.tight_layout()
- ### fig.savefig('noN_rms_gyrate.png')
大概长下图这样,有点丑,我当时这个体系cv选的就不好。
但是流程大概是这个流程。
debug
上述流程做完了,图看着就不对,怎么回事呢?从以下几个地方排查
原始轨迹
用于计算x,y轴数据的轨迹,周期性处理了吗?每一条轨迹有没有间隔一些帧后输出为pdb检查?
- 23 /home/dddc/xhl/chenyuanjieout/zzy-check
- gmx_mpi trjconv -s ../md13/md.tpr -f ../md13/dry.xtc -o look.pdb -skip 1000
