fel

案例路径
  • 23 /home/databank/zzy/project/usp18/MD_whole/100ns_1025/analysis/fel/noN_rms_gyrate

前言

以下图为例,以本人的理解,FEL本质上是一张二维散点图。这张图中点密集的地方在物理意义上代表低能(下图蓝色),反之,点稀疏处代表高能(下图红色)。FEL中的低能区域往往代表一个蛋白的优势构象,即它在溶液中倾向于以何种样子存在。
那么跑完了MD后怎么做一个FEL呢?这里先要说一下,CMD也是能做FEL的。只是局限于采样能力,往往无法拿到完整的自由能面,比如只能拿到下图的右侧,一个能量洼点。那么我们自然是想拿到尽可能完整的自由能面,那么我们就需要增强采样(remd,umbrella sampling,SMD等等)。
以REMD为例,我们跑很多条在不同温度下的轨迹。统计的时候,我们把这些高温轨迹都reweighting他们到300k,因为我们一般感兴趣常温(实验条件)下的蛋白自由能面。怎么得到二维散点图中的数据点呢?还是看下图,实际上就是计算轨迹中每一帧构象的2个具体的cv(Collective Variables, 用来描述体系的变化,比如蛋白构象转变)。下图选中了蛋白去掉N端的rmsd和蛋白去掉N端的回旋半径作为cv。选择合适的,能尽可能体现体系变化的cv对于画出一个能够展示出真正蛋白构象转变的FEL很重要。
综上所述,我印象里的FEL是这回事,如果以后发现有错误要告诉我:)

计算cv(数据点的x,y轴数据)

  • 23 /home/databank/zzy/project/usp18/MD_whole/100ns_1025/analysis/fel/noN_rms_gyrate
for 循环就是,你vsREMD跑了几条轨迹,你当然就要循环几次。我这里是26条轨迹
  • mkdir xvg
rmsd计算(交后台,要算挺久)
  • ##### 给定index.ndx 来选择特定部分计算rmsd
  • nohup sh -c 'for i in {1..26};do echo 18 18|gmx_mpi rms -s ../../../tpr/md${i}.tpr -f ../../align/${i}.xtc -n noN.ndx -o xvg/x_${i}.xvg;done' &
  • # echo 18 18 是选择rmsd计算的用哪些原子叠合,哪些原子计算。这里我的体系,我用的是自己创建的index
gyrate(回旋半径)计算
  • nohup sh -c 'for i in {1..26};do echo 18|gmx_mpi gyrate -s ../../../tpr/md${i}.tpr -f ../../align/${i}.xtc -n noN.ndx -o xvg/y_${i}.xvg;done' &

整理x,y轴数据

  • cd xvg
对rmsd和gyrate的输出文件xvg进行文本处理。整理为FEL 数据点的 x,y 轴数据(此时距离这些数据成为绘制FEL的输入数据还差根据玻尔兹曼分布reweighting 源自高温副本的数据到300k)
  • # 删除#/@开头的行
  • for i in {1..26};do sed -i '/^#/d' x_${i}.xvg;done
  • for i in {1..26};do sed -i '/^@/d' x_${i}.xvg;done
  • for i in {1..26};do sed -i '/^#/d' y_${i}.xvg;done
  • for i in {1..26};do sed -i '/^@/d' y_${i}.xvg;done

  • for i in {1..26};do cat y_${i}.xvg | cut -b 4-23 > cuty_${i}.xvg;done

  • cd ../input

  • for i in {1..26};do python xvg_combine.py ../xvg/x_${i}.xvg ../xvg/cuty_${i}.xvg output${i}.xvg;done

  • for i in {1..26};do cat output${i}.xvg | cut -b 16-100 > input_md${i}.xvg;done
  • for i in {1..26};do sed -i '1 i aax                y' input_md${i}.xvg;done
  • for i in {1..26};do sed -i 's/aa/  /g' input_md${i}.xvg;done

  • mkdir csv
  • cd csv

  • for i in {1..26};do awk '{print($1,",",$2)}' ../input_md${i}.xvg > input_md${i}.csv;done
  • for i in {1..26};do mv input_md${i}.csv input-md${i}.csv;done
  • for i in {1..26};do sed -i 's/x , y/x,y/g' input-md${i}.csv;done

  • 3 编号从md0开始
  • for a in {1..26};do b=`expr ${a} - 1`;mv input-md${a}.csv input-md${b}.csv;done
看一下 input-md${i}.csv 该长啥样(以 input-md25.csv 为例)

reweighting

根据玻尔兹曼分布,不同的温度可能reweighting系数不一样,那么你就还要制作一份“副本-温度”的列表,这里我的命名是 input_list.csv 
 
 脚本如下:
  • import numpy as np
  • import pandas as pd

  • input_list = 'input_list.csv'
  • reference_temp = 300.00
  • x_grid_num = 50
  • y_grid_num = 50
  • output_ = 'ref_noN_rms_gyrate.csv'

  • def location(m,m_grid):
  •     if m % m_grid == 0:
  •         m_loc = m // m_grid - 1
  •     else:
  •         m_loc = m // m_grid   
  •     if m_loc == -1:
  •         m_loc = 0   
  •     return int(m_loc)

  • def count(data,re_factor=1):
  •     table = [[0 for i in range(x_grid_num)] for i in range(y_grid_num)]
  •     table_re = [[0 for i in range(x_grid_num)] for i in range(y_grid_num)]    
  •     total_num =len(data)
  •     x= data.loc[:,'x']
  •     y= data.loc[:,'y']
  •     for i in range(total_num):
  •         x_loc = location((x[i] - x_min),x_grid)
  •         y_loc = location((y[i] - y_min),y_grid)
  •         table[x_loc][y_loc] += 1
  •     for i in range(x_grid_num):
  •         for j in range(y_grid_num):
  •             table_re[i][j] = pow((table[i][j] / total_num),re_factor) * total_num
  •     return table_re

  • input_ = pd.read_csv('input_list.csv',sep=',')
  • xmin = []
  • ymin = []
  • xmax = []
  • ymax = []

  • for file in input_.loc[:,'file']:
  •     data_=pd.read_csv(file,sep=',')
  •     xmin.append(min(data_.loc[:,'x']))
  •     ymin.append(min(data_.loc[:,'y']))
  •     xmax.append(max(data_.loc[:,'x']))
  •     ymax.append(max(data_.loc[:,'y']))

  • x_min, x_max = min(xmin), max(xmax)
  • y_min, y_max = min(ymin), max(ymax)
  • x_grid = (x_max - x_min) / x_grid_num + 0.1
  • y_grid = (y_max - y_min) / y_grid_num + 0.1

  • count_n = [[0 for i in range(x_grid_num)] for i in range(y_grid_num)]
  • E =  [[0 for i in range(x_grid_num)] for i in range(y_grid_num)]

  • for i in range(len(input_)):
  •     data_=pd.read_csv(input_.loc[:,'file'][i],sep=',')
  •     temp = input_.loc[:,'temp'][i]
  •     re_factor = temp / reference_temp
  •     count_ = count(data_,re_factor)
  •     for m in range(x_grid_num):
  •         for n in range(y_grid_num):
  •             count_n[m][n] += count_[m][n]

  •     if i == int(len(input_)-1):
  •         N_sum = 0
  •         for m in range(x_grid_num):
  •             for n in range(y_grid_num):
  •                 N_sum += count_n[m][n]

  •         maxE = 0
  •         for m in range(x_grid_num):
  •             for n in range(y_grid_num): 
  •                 if count_n[m][n] != 0:
  •                     E[m][n] = -1.3717*np.log10(count_n[m][n]/N_sum) # kcal/mol
  •                     if E[m][n] > maxE:
  •                         maxE = E[m][n]
  •         
  •         count_out = ''
  •         for m in range(x_grid_num):
  •             for n in range(y_grid_num):
  •                 x_loc = m * x_grid + x_min
  •                 y_loc = n * y_grid + y_min
  •                 if E[m][n] == 0:    
  •                     count_out += '%.4f,%.4f,%.8f\n' % (x_loc, y_loc, maxE)
  •                 else:
  •                     count_out += '%.4f,%.4f,%.8f\n' % (x_loc, y_loc, E[m][n])
  •         with open(output_,'w') as f:
  •             f.write('x,y,e\n' + count_out)
  •         print('Job is done!')
  •     else:
  •         print('md' + str(i) + ' is done!')
  •         continue
 运行后,得到reweighting后的数据ref_noN_rms_gyrate.csv,x,y轴单位都是nm
 我习惯用埃米做单位
  • awk -F ',' '{print $1*10,$2*10,$3}' noN_rms_gyrate.csv > noN_rms_gyrate_A.csv

作图

但是不重要,都可以作图,脚本如下
  • import pandas as pd
  • import matplotlib.pyplot as plt
  • import matplotlib as mpl
  • import numpy as np
  • import pyemma

  • data_re=np.array(pd.read_csv('./ref_noN_rms_gyrate_A.csv',sep=','))
  • fig, ax = plt.subplots(figsize=(6, 4))
  • pyemma.plots.plot_contour(*data_re[:, :3].T,cmap='rainbow',method='cubic',alpha=1,ncontours=15,ax=ax)
  • ax.set_xticks(np.linspace(0,25,11))
  • ax.set_xlim([2.729,25])
  • ax.set_yticks(np.linspace(10,27,16))
  • ax.set_ylim([19.726,27])
  • ax.set_xlabel('RMSD of protein without N_loop (Å)')
  • ax.set_ylabel('Gyrate of protein without N_loop (Å)')
  • ax.set_title('300K-reweighting (20-120ns)',fontweight='bold')
  • fig.tight_layout()
  • ### fig.savefig('noN_rms_gyrate.png')
大概长下图这样,有点丑,我当时这个体系cv选的就不好。
但是流程大概是这个流程。

debug

上述流程做完了,图看着就不对,怎么回事呢?从以下几个地方排查

原始轨迹

用于计算x,y轴数据的轨迹,周期性处理了吗?每一条轨迹有没有间隔一些帧后输出为pdb检查?
  • 23 /home/dddc/xhl/chenyuanjieout/zzy-check
  • gmx_mpi trjconv -s ../md13/md.tpr -f ../md13/dry.xtc -o look.pdb -skip 1000