Gaussian mixture model based atomic model refinement (2024)

Refine atomic models to better fit into a given map, and get better PDB validation score. For detailed method (and to cite), see Preprint here.

Dataset

In this tutorial, we use TRPV1 as an example. Starting from a relatively old PDB model (PDB: 3J5R ), we will fit it into a newer, higher resolution EMDB structure at a different conformation ( EMD-8117 ). This is quite similar to the typical modeling work, when we have an existing homolog or predicted model and a newer, better structure.

In this case, the starting model was built before the PDB validation system came out, so the validation score is not pretty.

validation_pdb

First, roughly fit the model into the map. This can be done in ChimeraX. They have slightly different conformation, so the fit won't be very precise. It is ok to leave it as is.

Fitting model to map

Compile model parameters

First, we compute some parameters from the input model that will be used in the following refinement.

e2gmm_model_compile.py --model 3j5q.pdb

This should create a gmm_model_xx folder, which contains information on the stereochemical constraints of the input model. It is critical to always refer to this folder using the --path option for the later e2gmm_model_xxx runs, so the metadata can be accessed.

All output will be in PDB format if the input model here is PDB. If the input is CIF, the output will be in CIF. Otherwise, specifying --writecif will force output to be CIF.

For your own data, it is worth noting that some non-protein structures are not supported at the moment. It is recommended to remove them from input and add them back after refinement. It is also good to notify me about those so I can add them in the future.

Fit to density map

e2gmm_model_fit.py --path gmm_model_00 --map emd_8117.mrc --resolution 3 --writetxt --rebuild_rotamer

Here gmm_model_00 is the folder generated by the e2gmm_model_comple.py command. This step can be skipped if there is no input map and you only want to refine the stereochemical score of the model.

Make sure the model does fit in the map in the first place. In this case, the loss from command line output should start from around -0.4 and gradually decrease to -0.6 or lower throughout iterations.

The --rebuild_rotamer option will select the rotamer for each residue that fits best into the given map (by map-model FSC). The process can be very slow. It is necessary for this case because there are many sidechain outliers, but can be skipped if the starting model is more reasonable.

The option --writetxt is only useful if you intend to do the multi-model refinement later. It saves the amplitude and width of Gaussian functions in a txt file in addition to the PDB/CIF. It also can be reproduced later with another e2gmm_model_fit.py run.

Take a look at fit_03.pdb and the model should fit into the map.

model fit in map

Full model refinement

e2gmm_model_refine.py --path gmm_model_00 --model gmm_model_00/fit_03.pdb

gmm_model_00/fit_03.pdb is produced by e2gmm_model_fit.py. If the fitting step is skipped, simply remove the --model option.

Normally, the default option should produce a good enough model. But in this case, since we started from a quite bad position, it would be necessary to run the command multiple times. For the second run, specify --niter 0,20 to skip the initial iterations, and --fixrotamer to force good rotamers for the bad residues.

e2gmm_model_refine.py --path gmm_model_00 --model gmm_model_00/model_02.pdb --fixrotamer --niter 0,20

There is some randomness in the process, and the command can be run multiple times. In the end, this should give a relatively good PDB validation score.

pdb validation output

The Q-score should also improve compared to the input model.

q score improve

Additonally, if you use MolProbity and have concern about the C-beta deviation it reports, adding the --cbeta option should solve that. If you don't want the PDB validation server to complain about cis residues, you can enable the --nocis option.

Multi-model refinement

To refine a series of models, we will need to process the dataset from EMPIAR-10059. The dataset only contains particles, so one still needs to run the refinement and heterogeneity analysis. A brief and outdated tutorial can be found here. Hopefully, I will update that later...

At the end of the heterogeneity analysis, a stack of volumes is generated by e2gmm_eval.py that represents one mode of movement within the system. In this tutorial, we will build models for a continuous conformational change from there.

movement volume trpv1

We can actually work within the same gmm_model_00 folder, although it may be safer to make a copy just in case things go wrong. Here we simply copy gmm_model_00 to gmm_model_01.

To start, run

e2gmm_model_multi.py --path gmm_model_01 --maps gmm_00/ptcls_00_cls_0.hdf --modeltxt gmm_model_01/fit_03.txt --modelpdb gmm_model_01/model_02.pdb --resolution 5 --ndim 1

Here, gmm_00/ptcls_00_cls_0.hdf is the volume stack input, gmm_model_01/fit_03.txt is produced by e2gmm_model_fit.py, and gmm_model_01/model_02.pdb is produced by e2gmm_model_refine.py. The --ndim option here corresponds to the --axis options in e2gmm_eval.py. When a volume stack of a linear trajectory is used, i.e. e2gmm_eval.py --axis 0, set --ndim=1. The e2gmm_eval.py program also supports circular trajectories (--axis 0,1 will generate a circular trajectory in the 2D space formed by eigen-axis 0 and 1). In this case, set --ndim=2 for e2gmm_model_multi.py.

This will optimize the large-scale morphing and C-alpha model according to the input volume series. Next, we need to refine the full atom models. The same program is used but it has to be split into two separate runs because of the GPU memory limitation.

e2gmm_model_multi.py --path gmm_model_01 --maps gmm_00/ptcls_00_cls_0.hdf --modelpdb gmm_model_01/model_02.pdb --modeltxt gmm_model_01/fit_03.txt --resolution 5 --load --niter 0,0,2 --batchsz 8 --ndim 1

Note here we use --load and --niter 0,0,2, meaning we skip the first two refinements (morphing and C-alpha), and load the neural networks from the last run. We also use a smaller --batchsz to compensate for the GPU memory.

Finally, we can generate a series of models at a finer sampling.

e2gmm_model_multi.py --path gmm_model_01 --maps gmm_00/ptcls_00_cls_0.hdf --modelpdb gmm_model_01/model_02.pdb --modeltxt gmm_model_01/fit_00.txt --resolution 5 --load --eval 50 --ndim 1

And visualize the results here. Here every frame should have a relatively good stereochemistry score, and minimal clashing atoms.

movement movie trpv1

This can also work on much larger system, like the spliceosome from the tutorial here. It takes more tweaking to get it to work, again, under the memory limitation of the current GPUs...

movement movie spliceosome