Sampling exhaustiveness and precision
Run the sampling performance analysis with the imp-sampcon tool (described by Viswanath et al. 2017).
Enter the output modelling directory
Prepare the density.txt file:
    create_density_file.py --project_dir <path to the original project dir> config.json --by_rigid_body
For complexes containing multiple copies of the same subunit, prepare the symm_groups.txt file, which stores the information necessary to properly align homo-oligomeric structures:
    create_symm_groups_file.py --project_dir <path to the original project dir> config.json params.py
By default, all copies of the same subunit are grouped together, which should be sufficient in most cases.
In special cases where subunits are directed to specific series, use the --by series option to group by series:
    create_symm_groups_file.py --project_dir <path to the original project dir> --by series config.json params.py
To additionally combine series into bigger groups, use the --extra_series_groups option, e.g.:
    create_symm_groups_file.py \
        --project_dir <path to the original project dir> \
        --by series \
        --extra_series_groups 'NR_1,NR2;CR_1,CR_2' \
        config.json params.py
This groups by series and, on top of that, takes into account the ambiguity between the selected series.
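The value passed to --extra_series_groups contains a semicolon, which most shells treat as a command separator, so it is safest to quote it. The Python sketch below shows one way to build and quote such a value. It assumes that semicolons separate groups and commas separate the series within a group, as the example above suggests; the helper name format_series_groups is hypothetical and not part of the tool.

```python
import shlex


def format_series_groups(groups):
    """Join series names with ',' and groups with ';'.

    Assumption: ';' separates groups and ',' separates series within a
    group, as the NR_1,NR2;CR_1,CR_2 example suggests.
    """
    return ";".join(",".join(series) for series in groups)


# Series names copied from the example above.
value = format_series_groups([["NR_1", "NR2"], ["CR_1", "CR_2"]])

# shlex.quote adds quotes because ';' is special to the shell.
print("--extra_series_groups", shlex.quote(value))
```

Building the value from nested lists keeps the grouping explicit and leaves the quoting to shlex.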
Run the setup_analysis.py script to prepare input files for the sampling exhaustiveness analysis based on your resulting models:
    setup_analysis.py -s <abs path to csv scores file produced by extract_all_scores.py> \
        -o <specified output dir> \
        -d <density.txt file generated in the previous step> \
        -n <number of top scoring models to be analyzed, default is all models> \
        -k <restraint score based on which to perform the analysis, default is total score>
Example:
    setup_analysis.py -s all_scores.csv -o analysis -d density.txt -n 20000
To see available options and default values run:
    setup_analysis.py -h
Run the imp_sampcon exhaust tool (a command-line tool provided with IMP) to perform the actual analysis:
    cd <output dir created by the above setup_analysis.py script>
    imp_sampcon exhaust -n <prefix for output files> \
        --rmfA sample_A/sample_A_models.rmf3 \
        --rmfB sample_B/sample_B_models.rmf3 \
        --scoreA scoresA.txt --scoreB scoresB.txt \
        -d <path to density.txt file>/density.txt \
        -m <calculator selection> \
        -c <int for cores to process> \
        -gp \
        -g <float with clustering threshold step>
To see available options and default values for the imp_sampcon exhaust analysis run:
    imp_sampcon exhaust -h
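When the analysis is launched from a script, it can help to assemble the command as an argument list instead of a single string. The following is a minimal sketch that uses only the flags shown in the command above. The function name build_exhaust_command is hypothetical, and the example values (CR_Y_test, cpu_omp, 15 cores, step 5.0) are taken from the Slurm example in this section.

```python
import shlex


def build_exhaust_command(prefix, density, mode, cores, grid_step):
    """Assemble the argument list for `imp_sampcon exhaust`.

    Only flags that appear in the command above are used; the RMF and
    score file names follow the layout shown there.
    """
    return [
        "imp_sampcon", "exhaust",
        "-n", prefix,
        "--rmfA", "sample_A/sample_A_models.rmf3",
        "--rmfB", "sample_B/sample_B_models.rmf3",
        "--scoreA", "scoresA.txt",
        "--scoreB", "scoresB.txt",
        "-d", density,
        "-m", mode,
        "-c", str(cores),
        "-gp",
        "-g", str(grid_step),
    ]


cmd = build_exhaust_command("CR_Y_test", "density.txt", "cpu_omp", 15, 5.0)
print(shlex.join(cmd))

# To run it from the analysis directory (requires IMP to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run avoids shell quoting issues with paths that contain spaces.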
If the analysis is to be run on a Slurm-based cluster, write a bash script like the following and submit it with sbatch:
    #!/bin/bash
    #SBATCH --job-name=master_sampling_20000.job
    #SBATCH --output=./master_sampling_20000.out
    #SBATCH --error=./master_sampling_20000.err
    #SBATCH --nodes=1
    #SBATCH --time=10:00:00
    #SBATCH --qos=highest
    #SBATCH --cpus-per-task=15
    #SBATCH --mem-per-cpu=4000
    imp_sampcon exhaust -n CR_Y_test \
        --rmfA sample_A/sample_A_models.rmf3 --rmfB sample_B/sample_B_models.rmf3 \
        --scoreA scoresA.txt --scoreB scoresB.txt \
        -d density.txt -m cpu_omp -c 15 -gp -g 5.0
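In the script above, the number of cores appears twice: in --cpus-per-task and in the -c option of imp_sampcon exhaust. The sketch below generates the batch script from a single cores value so the two stay in sync. The function name render_sbatch is hypothetical, and the --qos line is left out on the assumption that QOS names are site-specific.

```python
def render_sbatch(job_name, command, cores, time="10:00:00", mem_per_cpu=4000):
    """Return the text of a Slurm batch script that runs `command`."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}.job",
        f"#SBATCH --output=./{job_name}.out",
        f"#SBATCH --error=./{job_name}.err",
        "#SBATCH --nodes=1",
        f"#SBATCH --time={time}",
        # --qos is omitted here; add the line that your cluster requires.
        f"#SBATCH --cpus-per-task={cores}",
        f"#SBATCH --mem-per-cpu={mem_per_cpu}",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


cores = 15
# The same command as in the example script, with -c filled in from `cores`.
command = (
    "imp_sampcon exhaust -n CR_Y_test "
    "--rmfA sample_A/sample_A_models.rmf3 --rmfB sample_B/sample_B_models.rmf3 "
    "--scoreA scoresA.txt --scoreB scoresB.txt "
    f"-d density.txt -m cpu_omp -c {cores} -gp -g 5.0"
)
script = render_sbatch("master_sampling_20000", command, cores)
print(script)
```

The returned text can be written to a file and submitted with sbatch in the usual way.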
In the output you will get, among other files:
- <prefix for output files>.Sampling_Precision_Stats.txt with an estimate of the sampling precision
- Clusters obtained after clustering at the above sampling precision, in directories and files whose names start with cluster, containing information about the models in the clusters and the cluster localization densities
- <prefix for output files>.Cluster_Precision.txt listing the precision for each cluster
- PDF files with plots of the results of the exhaustiveness tests
See Viswanath et al. 2017 for a detailed explanation of these concepts.
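To check quickly which of the outputs listed above were produced, a small script can look for them by name. This is a minimal sketch that relies only on the file-name patterns given above and does not parse file contents. The function name summarize_outputs and the prefix CR_Y_test used in the example are illustrative.

```python
from pathlib import Path


def summarize_outputs(analysis_dir, prefix):
    """Collect the outputs named in the list above, if present."""
    root = Path(analysis_dir)
    return {
        "sampling_precision": root / f"{prefix}.Sampling_Precision_Stats.txt",
        "cluster_precision": root / f"{prefix}.Cluster_Precision.txt",
        # Directories and files whose names start with "cluster"
        "cluster_entries": sorted(p.name for p in root.glob("cluster*")),
        "pdf_plots": sorted(p.name for p in root.glob("*.pdf")),
    }


# Example: inspect the current directory for outputs with prefix CR_Y_test.
found = summarize_outputs(".", "CR_Y_test")
for key in ("sampling_precision", "cluster_precision"):
    path = found[key]
    print(key, path, "found" if path.exists() else "missing")
print("cluster entries:", found["cluster_entries"])
print("PDF plots:", found["pdf_plots"])
```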
Optimize the plots
The fonts and the value ranges of the X and Y axes in the default plots from imp_sampcon exhaust are frequently not optimal, so you have to adjust them manually. Copy the original gnuplot scripts to the current analysis directory by executing:
    copy_sampcon_gnuplot_scripts.py
This will copy four scripts to the current directory:
- Plot_Cluster_Population.plt for the <prefix for output files>.Cluster_Population.pdf plot
- Plot_Convergence_NM.plt for the <prefix for output files>.ChiSquare.pdf plot
- Plot_Convergence_SD.plt for the <prefix for output files>.Score_Dist.pdf plot
- Plot_Convergence_TS.plt for the <prefix for output files>.Top_Score_Conv.pdf plot
Edit the scripts to suit your needs.
Run the scripts again:
    gnuplot -e "sysname='<prefix for output files>'" Plot_Cluster_Population.plt
    gnuplot -e "sysname='<prefix for output files>'" Plot_Convergence_NM.plt
    gnuplot -e "sysname='<prefix for output files>'" Plot_Convergence_SD.plt
    gnuplot -e "sysname='<prefix for output files>'" Plot_Convergence_TS.plt
For example:
    gnuplot -e "sysname='elongator'" Plot_Cluster_Population.plt
    gnuplot -e "sysname='elongator'" Plot_Convergence_NM.plt
    gnuplot -e "sysname='elongator'" Plot_Convergence_SD.plt
    gnuplot -e "sysname='elongator'" Plot_Convergence_TS.plt
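Because the four commands differ only in the script name, they can also be generated in a loop. The sketch below builds one gnuplot argument list per script and offers an optional helper to run them. The function names gnuplot_commands and replot are hypothetical, and gnuplot is assumed to be on the PATH when replot is called.

```python
import subprocess

# The four scripts copied by copy_sampcon_gnuplot_scripts.py
PLOT_SCRIPTS = [
    "Plot_Cluster_Population.plt",
    "Plot_Convergence_NM.plt",
    "Plot_Convergence_SD.plt",
    "Plot_Convergence_TS.plt",
]


def gnuplot_commands(sysname):
    """Return one gnuplot argument list per plot script."""
    return [
        ["gnuplot", "-e", f"sysname='{sysname}'", script]
        for script in PLOT_SCRIPTS
    ]


def replot(sysname):
    """Run all four scripts; stops at the first gnuplot failure."""
    for cmd in gnuplot_commands(sysname):
        subprocess.run(cmd, check=True)


for cmd in gnuplot_commands("elongator"):
    print(cmd)

# To regenerate the plots, run from the analysis directory:
# replot("elongator")
```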
Extract cluster models
For example, to extract the 5 top scoring models:
    extract_cluster_models.py \
        --project_dir <full path to the original project directory> \
        --outdir cluster.0/ \
        --ntop 5 \
        --scores ../all_scores.csv \
        Identities_A.txt Identities_B.txt cluster.0.all.txt ../config.json
If the exhaustiveness is not met, run more jobs: