Question: I have heard that EMAN can process heterogeneous data sets and produce multiple models, for example, to study dynamics or ligand binding.
Answer: Yes, EMAN does have software to do this. The software itself is no more difficult to use than the normal 'refine' command (in fact, the syntax is almost identical). However, the overall process you normally follow to produce reliable results can be somewhat involved, and it has not been well documented. Also note that it is much more computationally intensive than a single-model refinement. We have applied this methodology to several systems, including SR398 (a single-ring GroEL mutant) and fatty acid synthase. In any case, here is an outline of how to tackle this. It assumes some basic familiarity with normal single-model refinement in EMAN:
Using multirefine:
- set up a directory containing start.hed/img (the raw data) and, if you are using Wiener-filtered CTF correction, the 1D structure factor file as well
- make a set of numbered subdirectories in this directory for the models you will refine, i.e., if you wish to split the data into 4 structures, you would 'mkdir 0 1 2 3'
- each subdirectory must contain a 'threed.0a.mrc' file, the starting model for that subpopulation. In most cases these starting models do not need to look at all like the final structure; in fact, refining from something like a Gaussian blob can produce the most convincing results (though this does not always work). The sole absolute requirement is that the models cannot be numerically identical, i.e., if you start with a Gaussian blob as a starting model, you must add a small amount of perturbative noise to each copy. It is also quite acceptable, especially for a first effort, to use 'good' models from a single-model refinement as starting models for the multi-model refinement. Typically this is done by taking the last n iterative models from a normal refine and using them as starting models, i.e., if you ran refine for 10 cycles, use threed.7a.mrc as threed.0a.mrc in subdirectory 0, threed.8a.mrc as threed.0a.mrc in subdirectory 1, etc.
- if you have a structure factor file, a copy must also be in each of these subdirectories.
- run 'multirefine'. The options are virtually identical to 'refine', except that the second parameter is the number of models to produce, e.g. 'multirefine 8 4 mask=27 ...' will run 8 rounds of refinement on 4 models (directories 0 - 3); a command-line sketch of the whole setup is given below
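For concreteness, here is a rough sketch of the setup above in csh syntax, assuming 4 models and starting models taken from a previous single-model refinement that ran for 10 cycles. The directory name 'oldrefine', the structure factor file name 'sf.dat', and all of the options after the first two multirefine arguments are placeholders; use whatever you would normally pass to 'refine' for this data set:

  mkdir 0 1 2 3
  cp oldrefine/threed.7a.mrc 0/threed.0a.mrc
  cp oldrefine/threed.8a.mrc 1/threed.0a.mrc
  cp oldrefine/threed.9a.mrc 2/threed.0a.mrc
  cp oldrefine/threed.10a.mrc 3/threed.0a.mrc
  foreach i (0 1 2 3)
    cp sf.dat $i/          # only needed if using Wiener-filtered CTF correction
  end
  # trailing options are illustrative; substitute your usual refine options
  multirefine 8 4 mask=27 ang=9 pad=100 hard=25 classkeep=1 classiter=8 ctfcw=sf.dat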
If you don't have any experience with 'refine', I would strongly suggest learning how to do single model refinements first, then moving on to multirefine.
The overall process:
- Monitor the convergence of the overall process by stepping through the iterative 3D models in each directory. When the models in all of the subdirectories seem to have stabilized (i.e., they do not change much from one iteration to the next), you can stop. (It is also possible to run the convergence test in 'eman' in each of the subdirectories.)
- When the multirefine has completed and converged reasonably well, there is still more to do. It is very important that you allow multirefine to complete (i.e., do not kill it in the middle of a run) before this step. The cls*lst files in each subdirectory contain the particles associated with each of the final refined models (except for the first image in each file, which is a reference). The next step is to combine these images into several new start.hed files and run single-model refinements on each of the subpopulations. This will produce much better converged models and will act as an initial validation that the structures you got from multirefine are truly representative of the data:
- for example (csh syntax):
    mkdir r0 r1 r2 r3
    cd 0
    foreach i (cls*lst)
      proc2d $i ../r0/start.hed first=1
    end
    cd ../1
    foreach i (cls*lst)
      proc2d $i ../r1/start.hed first=1
    end
    (repeat for directories 2 and 3; first=1 skips the reference image at the start of each cls file)
  Then give each r? directory a threed.0a.mrc starting model (the final model from the corresponding numbered directory is a natural choice) and, if applicable, the structure factor file, and run a normal 'refine ...' in each.
- Now compare the final refinements in each r? directory. If the data is truly heterogeneous, you should see differences characteristic of the variability present in the data. This works very well for things like partial ligand or antibody binding. For things like large-scale dynamics it also works, but there can be multiple 'correct' sets of answers.
- In general, multirefine will produce a set of models as different from each other as the data permits. However, cryo-EM data is inherently noisy, meaning that even if you start with very homogeneous (but noisy) data, you will still get a set of different models out, with the differences constructed from variations in the low-frequency noise. These differences will generally be fairly small, but the true variations you are looking for could also be small, so you need to be cautious and perform some validation.
- So, the next step, assuming you saw interesting variations, is to confirm that they are real. There are several validations you can perform:
- you can repeat the entire procedure above using different starting models. If your system is discretely heterogeneous (ligand binding, etc.), you should get effectively the same results. If your system is continuously heterogeneous (dynamics), you should at least observe a consistent set of relative motions.
- you can perform a cross-validation test. Repeat the refinement in, for example, 'r0', but use the final refined model from 'r1' as the initial model, and vice versa (see the sketch below). That is, if you can start with a model strongly biased towards the 'wrong' structure and still produce the correct model, then you can be sure that at least this subset of the data represents the structure you determined.
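As a sketch of that swap (csh syntax again; 'r0x' and 'r1x' are arbitrary names for the cross-refinement directories, and 'threed.8a.mrc' stands in for whichever iteration was the last one in your r0/r1 refinements):

  mkdir r0x r1x
  cp r0/start.hed r0/start.img r0x/       # r0's particles...
  cp r1/threed.8a.mrc r0x/threed.0a.mrc   # ...but r1's final model as the start
  cp r1/start.hed r1/start.img r1x/
  cp r0/threed.8a.mrc r1x/threed.0a.mrc   # and the reverse for r1's particles

Copy the structure factor file as well if you use one, then run your usual 'refine ...' in r0x and r1x and compare the converged maps with the original r0 and r1 results.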