Sampling exhaustiveness and precision
Run the sampling performance analysis with the imp-sampcon tool (described by Viswanath et al. 2017).
Enter the output modelling directory
Prepare the
density.txt
file:
create_density_file.py --project_dir <path to the original project dir> config.json --by_rigid_body
For complexes containing multiple copies of the same subunit, prepare the
symm_groups.txt
file, which stores the information necessary to properly align homo-oligomeric structures:
create_symm_groups_file.py --project_dir <path to the original project dir> config.json params.py
By default all molecule copies of the same subunits are grouped together, and this should be sufficient in most cases.
In special cases where subunits are directed to specific series, group by series with the
--by series
option:
create_symm_groups_file.py --project_dir <path to the original project dir> --by series config.json params.py
To additionally group series into bigger groups use
--extra_series_groups
option, e.g.:
create_symm_groups_file.py \
    --project_dir <path to the original project dir> \
    --by series \
    --extra_series_groups "NR_1,NR2;CR_1,CR_2" \
    config.json params.py
This groups by series and additionally takes into account the ambiguity between the selected series.
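Judging from the example, the value passed to --extra_series_groups uses semicolons to separate groups and commas to separate the series within a group. The Python sketch below only illustrates that reading; the helper name parse_extra_series_groups is hypothetical and is not part of create_symm_groups_file.py.

```python
def parse_extra_series_groups(value):
    """Split 'A,B;C,D' into [['A', 'B'], ['C', 'D']].

    Assumption: ';' separates groups and ',' separates series names,
    as inferred from the example command. This is not the tool's parser.
    """
    groups = []
    for chunk in value.split(";"):
        # Drop empty names caused by stray separators or whitespace.
        names = [name.strip() for name in chunk.split(",") if name.strip()]
        if names:
            groups.append(names)
    return groups


if __name__ == "__main__":
    print(parse_extra_series_groups("NR_1,NR2;CR_1,CR_2"))
```

Because ; is also a command separator in the shell, the value should be quoted when typed on the command line.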
Run
setup_analysis.py
script to prepare input files for the sampling exhaustiveness analysis based on your resulting models:
setup_analysis.py -s <abs path to csv scores file produced by extract_all_scores.py> \
    -o <specified output dir> \
    -d <density.txt file generated in the previous step> \
    -n <number of top scoring models to be analyzed, default is all models> \
    -k <restraint score based on which to perform the analysis, default is total score>
Example:
setup_analysis.py -s all_scores.csv -o analysis -d density.txt -n 20000
To see available options and default values run:
setup_analysis.py -h
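To make the -n and -k options more concrete, the sketch below shows what "take the n top scoring models according to one score column" could look like on a small CSV. It is not the implementation of setup_analysis.py. The column names (model, total_score) are hypothetical, and the sketch assumes that a lower score is better.

```python
import csv
import io


def top_scoring_rows(csv_text, n=None, key="total_score"):
    """Return the n rows with the lowest value in column `key`.

    Assumptions: the column name 'total_score' is hypothetical and
    lower scores are treated as better. n=None keeps all rows.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row[key]))
    return rows if n is None else rows[:n]


if __name__ == "__main__":
    example = "model,total_score\nm1,12.5\nm2,3.0\nm3,7.25\n"
    for row in top_scoring_rows(example, n=2):
        print(row["model"], row["total_score"])
```

Passing a different column name as key mirrors the idea behind -k: the ranking is done on a single chosen restraint score instead of the total score.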
Run
imp_sampcon exhaust
tool (a command-line tool provided with IMP) to perform the actual analysis:
cd <output dir created by the above setup_analysis.py script>
imp_sampcon exhaust -n <prefix for output files> \
    --rmfA sample_A/sample_A_models.rmf3 \
    --rmfB sample_B/sample_B_models.rmf3 \
    --scoreA scoresA.txt --scoreB scoresB.txt \
    -d <path to density.txt file>/density.txt \
    -m <calculator selection> \
    -c <int for cores to process> \
    -gp \
    -g <float with clustering threshold step>
To see available options and default values for imp-sampcon exhaust analysis run:
imp_sampcon exhaust -h
If the analysis will be run on a Slurm-based cluster, write a bash script like the following and submit it with sbatch:
#!/bin/bash
#SBATCH --job-name=master_sampling_20000.job
#SBATCH --output=./master_sampling_20000.out
#SBATCH --error=./master_sampling_20000.err
#SBATCH --nodes=1
#SBATCH --time=10:00:00
#SBATCH --qos=highest
#SBATCH --cpus-per-task=15
#SBATCH --mem-per-cpu=4000

imp_sampcon exhaust -n CR_Y_test --rmfA sample_A/sample_A_models.rmf3 --rmfB sample_B/sample_B_models.rmf3 --scoreA scoresA.txt --scoreB scoresB.txt -d density.txt -m cpu_omp -c 15 -gp -g 5.0
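In the batch script the number of cores appears twice, once as --cpus-per-task and once as the -c option of imp_sampcon exhaust. A small generator can keep the two in sync. The sketch below is one possible way to do this: the function name, its parameters and the default job name are hypothetical, the imp_sampcon options are copied from the example script, and the site-specific --qos line is left out on purpose.

```python
def render_sbatch_script(prefix, cores, threshold_step, job_name="sampcon",
                         time_limit="10:00:00", mem_per_cpu=4000):
    """Return the text of a Slurm batch script that runs imp_sampcon exhaust.

    `cores` is written to both --cpus-per-task and -c so that the two
    values cannot drift apart. All names here are illustrative only.
    """
    command = (
        f"imp_sampcon exhaust -n {prefix} "
        "--rmfA sample_A/sample_A_models.rmf3 "
        "--rmfB sample_B/sample_B_models.rmf3 "
        "--scoreA scoresA.txt --scoreB scoresB.txt "
        f"-d density.txt -m cpu_omp -c {cores} -gp -g {threshold_step}"
    )
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --output=./{job_name}.out",
        f"#SBATCH --error=./{job_name}.err",
        "#SBATCH --nodes=1",
        f"#SBATCH --time={time_limit}",
        f"#SBATCH --cpus-per-task={cores}",
        f"#SBATCH --mem-per-cpu={mem_per_cpu}",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_sbatch_script("CR_Y_test", cores=15, threshold_step=5.0))
```

The printed text can be saved to a file and submitted with sbatch as described above.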
In the output you will get, among other files:
<prefix for output files>.Sampling_Precision_Stats.txt
with an estimate of the sampling precision
Clusters obtained after clustering at the above sampling precision, in directories and files whose names start with cluster, containing information about the models in the clusters and the cluster localization densities
<prefix for output files>.Cluster_Precision.txt
listing the precision for each cluster
PDF files with plots of the results of the exhaustiveness tests
See Viswanath et al. 2017 for detailed explanation of these concepts.
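After a long run it can be handy to confirm that the expected files were written. The sketch below checks for the file names mentioned in this section; the helper names are hypothetical, and the list is not a complete inventory of what imp_sampcon exhaust produces.

```python
import os
import tempfile

# Suffixes taken from the output file names mentioned in this section.
EXPECTED_SUFFIXES = [
    "Sampling_Precision_Stats.txt",
    "Cluster_Precision.txt",
    "Cluster_Population.pdf",
    "ChiSquare.pdf",
    "Score_Dist.pdf",
    "Top_Score_Conv.pdf",
]


def expected_outputs(prefix):
    """Return the expected file names for a given output prefix."""
    return [f"{prefix}.{suffix}" for suffix in EXPECTED_SUFFIXES]


def missing_outputs(directory, prefix):
    """Return the expected files that are absent from `directory`."""
    return [
        name for name in expected_outputs(prefix)
        if not os.path.isfile(os.path.join(directory, name))
    ]


if __name__ == "__main__":
    # Demonstration on a temporary directory holding a single output file.
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, "elongator.Cluster_Precision.txt"), "w").close()
        for name in missing_outputs(tmp, "elongator"):
            print("missing:", name)
```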
Optimize the plots
The fonts and value ranges in X and Y axes in the default plots from
imp_sampcon exhaust
are frequently not optimal, so you have to adjust them manually.
Copy the original gnuplot scripts to the current analysis directory by executing:
copy_sampcon_gnuplot_scripts.py
This will copy four scripts to the current directory:
Plot_Cluster_Population.plt for the <prefix for output files>.Cluster_Population.pdf plot
Plot_Convergence_NM.plt for the <prefix for output files>.ChiSquare.pdf plot
Plot_Convergence_SD.plt for the <prefix for output files>.Score_Dist.pdf plot
Plot_Convergence_TS.plt for the <prefix for output files>.Top_Score_Conv.pdf plot
Edit the scripts to suit your needs.
Run the scripts again:
gnuplot -e "sysname='<prefix for output files>'" Plot_Cluster_Population.plt
gnuplot -e "sysname='<prefix for output files>'" Plot_Convergence_NM.plt
gnuplot -e "sysname='<prefix for output files>'" Plot_Convergence_SD.plt
gnuplot -e "sysname='<prefix for output files>'" Plot_Convergence_TS.plt
For example:
gnuplot -e "sysname='elongator'" Plot_Cluster_Population.plt
gnuplot -e "sysname='elongator'" Plot_Convergence_NM.plt
gnuplot -e "sysname='elongator'" Plot_Convergence_SD.plt
gnuplot -e "sysname='elongator'" Plot_Convergence_TS.plt
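Since the four gnuplot calls differ only in the script name, they can be generated in a loop while the plots are being tuned. The sketch below is one way to do that from Python; the function names are hypothetical, and replot_all assumes that gnuplot is on the PATH and that the scripts are in the current directory.

```python
import subprocess

# The four scripts copied by copy_sampcon_gnuplot_scripts.py.
PLOT_SCRIPTS = [
    "Plot_Cluster_Population.plt",
    "Plot_Convergence_NM.plt",
    "Plot_Convergence_SD.plt",
    "Plot_Convergence_TS.plt",
]


def gnuplot_commands(sysname, scripts=PLOT_SCRIPTS):
    """Build one gnuplot argument list per plotting script."""
    return [
        ["gnuplot", "-e", f"sysname='{sysname}'", script]
        for script in scripts
    ]


def replot_all(sysname):
    """Run every plotting script; stops at the first failing command."""
    for command in gnuplot_commands(sysname):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only print the commands here; call replot_all("elongator") to run them.
    for command in gnuplot_commands("elongator"):
        print(" ".join(command))
```

Passing the arguments as a list avoids a second layer of shell quoting around the sysname assignment.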
Extract cluster models
For example, to extract the 5 top scoring models:
extract_cluster_models.py \
    --project_dir <full path to the original project directory> \
    --outdir cluster.0/ \
    --ntop 5 \
    --scores ../all_scores.csv \
    Identities_A.txt Identities_B.txt cluster.0.all.txt ../config.json
If the exhaustiveness is not met, run more jobs: