
e2refine2d

e2refine2d.py runs in much the same way as EMAN1's refine2d.py, though it has been improved in a number of subtle ways.

This program will take a set of boxed-out particle images and perform iterative reference-free classification to produce a set of representative class-averages. The point of this process is to reduce noise levels so that the overall shape of the particle views present in the data can be better observed. Cryo-EM single particles are generally noisy enough that it is difficult to distinguish subtle, or even not-so-subtle, differences between particle images. By aligning and averaging similar particles together, less noisy versions of representative views are created. The class-averages produced by this program are typically used for:

  • Direct observation to look for heterogeneity or discover symmetry
  • Building initial models for single particle reconstruction
  • Separating particles into subgroups for additional analysis

This last use makes it possible to produce 'population-dynamics' movies of a particle in very nearly the same orientation.

This program is quite fast for as many as a few thousand particles and ~100 classes. For most purposes, if your data set is large (>10,000 particles) you might consider using only a subset of the data for speed, though this clearly isn't appropriate for the third use above. For large numbers of classes, specify the --fastseed option, or you will wait a very long time.
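
For example, one simple way to pull out an evenly spaced subset of a large stack is through EMAN2's Python interface. This is only a sketch: the file names are placeholders, and it assumes the EMAN2 Python module (EMData, EMUtil) is available in your environment.
{{{
from EMAN2 import EMData, EMUtil

infile = "inputparticles.hdf"        # placeholder: your full particle stack
outfile = "particle_subset.hdf"      # placeholder: the reduced stack to create

n = EMUtil.get_image_count(infile)   # number of particles in the input stack
j = 0
for i in range(0, n, 4):             # keep every 4th particle (~25% of the data)
    im = EMData(infile, i)           # read particle i
    im.write_image(outfile, j)       # write it to the subset stack
    j += 1
}}}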

Options:

  • --path (string) : Path for the refinement, default=auto
  • --iter (int) : The total number of refinement iterations to perform
  • --automask (bool) : Perform a 2-D automask on class-averages to help with centering. May be particularly useful for negative stain data.
  • --input (string) : The name of the file containing the particle data
  • --ncls (int) : Number of classes to generate
  • --maxshift (int) : Maximum particle translation in x and y
  • --naliref (int) : Number of alignment references to use when determining particle orientations
  • --exclude (string) : The named file should contain a set of integers, each representing an image from the input file to exclude.
  • --resume (bool) : Check the files in the current directory and resume the refinement after the last completed iteration. It's ok to alter other parameters.
  • --initial (string) : File containing starting class-averages. If not specified, starting averages will be generated automatically.
  • --nbasisfp (int) : Number of MSA basis vectors to use when classifying particles
  • --minchange (int) : Minimum number of particles that change group before deciding to terminate. Default = -1 (auto)
  • --fastseed (bool) : Seed the k-means loop quickly, but possibly produce less consistent results.
  • --simalign (string) : The name of an 'aligner' to use prior to comparing the images (default=rotate_translate_flip)
  • --simaligncmp (string) : Name of the comparitor used with the first stage aligner, along with its construction arguments (default=frc)
  • --simralign (string) : The name and parameters of the second stage aligner, which refines the results of the first alignment
  • --simraligncmp (string) : The name and parameters of the comparitor used by the second stage aligner (default=dot)
  • --simcmp (string) : The name of a 'cmp' to be used in comparing the aligned images (default=frc:nweight=1)
  • --shrink (int) : Optionally shrink the input particles by an integer amount prior to computing similarity scores, for speed.
  • --classkeep (float) : The fraction of particles to keep in each class, based on the similarity score generated by the --cmp argument (default=0.85).
  • --classkeepsig (bool) : Change the keep ('--classkeep') criterion from fraction-based to sigma-based.
  • --classiter (int) : Number of iterations to use when making class-averages (default=5)
  • --classalign (string) : If doing more than one iteration, the name and parameters of the 'aligner' used to align particles to the previous class average.
  • --classaligncmp (string) : The name and parameters of the comparitor used by the first stage aligner. Default is dot.
  • --classralign (string) : The second stage aligner which refines the results of the first alignment in class averaging. Default is None.
  • --classraligncmp (string) : The comparitor used by the second stage aligner in class averaging. Default is dot:normalize=1.
  • --classaverager (string) : The averager used to generate the class averages. Default is 'mean'.
  • --classcmp (string) : The name and parameters of the comparitor used to generate similarity scores when class averaging. Default is frc.
  • --classnormproc (string) : Normalization applied during class averaging
  • --classrefsf (bool) : Use the setsfref option in class averaging to produce better filtered averages.
  • --normproj (bool) : Normalize each projected vector in the MSA subspace. Note that this is different from normalizing the input images, since the subspace is not expected to fully span the image.
  • -P, --parallel (string) : Run in parallel; specify type:<option>=<value>:<option>=<value>
  • --dbls (string) : Database list storage, used by the workflow. You can ignore this argument.
  • -v, --verbose (int) : Verbose level [0-9]; a higher number means more verbose output.
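
To make the --normproj option above more concrete, here is a minimal NumPy sketch (illustration only, not EMAN2 code; the image size, basis size and variable names are made up). The point is that the coefficient vector obtained by projecting an image into the truncated MSA subspace is normalized, which is not the same as normalizing the image itself, because the subspace captures only part of the image.
{{{
import numpy as np

rng = np.random.default_rng(0)
npix, nbasisfp = 4096, 8                   # hypothetical image size and basis count
B = np.linalg.qr(rng.normal(size=(npix, nbasisfp)))[0].T   # orthonormal basis, one vector per row
v = rng.normal(size=npix)                  # one flattened (noisy) particle image

proj = B @ v                               # coordinates of the image in the subspace
proj /= np.linalg.norm(proj)               # what --normproj normalizes
# Normalizing v directly would behave differently, since |B @ v| != |v| in general.
}}}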

Typical usage:
{{{
e2refine2d.py --iter=6 --naliref=7 --nbasisfp=5 --path=r2d_01 --input=inputparticles.hdf --ncls=48 --simcmp=ccc --simalign=rotate_translate_flip --simaligncmp=ccc --simralign=refine --simraligncmp=ccc --classcmp=ccc --classalign=rotate_translate_flip --classaligncmp=ccc --classralign=refine --classraligncmp=ccc --classiter=2 --classkeep=1.5 --classnormproc=normalize.edgemean --classaverager=mean --normproj --classkeepsig --parallel=thread:4
}}}

Primary options to consider changing:

  • input : A stack file containing the raw particle data (any format is fine)
  • iter : Number of iterations of the overall 2-D refinement process to run. For high contrast data, 4-5 iterations may be more than enough, but for low contrast data it could take 10-12 iterations to converge well.
  • ncls : The number of class-averages to generate. Normally you would want a minimum of ~10-20 particles per class on average, but it's fine to have 100-200 per class for a large data set. If you plan on making a large number (>100) of classes, you should use the --fastseed option. Note that these averages are not used for final 3-D refinement, so generating a very large number is not useful in most situations.
  • naliref : The number of alignment references to use in each iteration. You can think of this as the number of highly distinct views your particle presents. For something like GroEL, with mostly side and top views, 3-4 is sufficient; for something like a ribosome, 10-15 would be more appropriate.
  • nbasisfp : The number of MSA/SVD/PCA eigenimages to use for classification. If you don't know what this means, you're probably fine with values of ~5-10.

Of course, all of the other parameters are meaningful as well, but the ones listed above are the ones you would change first if you don't get the results you expect.
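
As a quick worked example of the ncls guideline above (the particle count here is hypothetical):
{{{
# Aim for very roughly ~100 particles per class for a moderately large data set.
nptcl = 4800          # hypothetical number of particles in the stack
ncls = nptcl // 100   # -> 48, which matches --ncls=48 in the usage example above
}}}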


This program uses an iterative MSA-based reference-free classification algorithm. The names in parentheses below are the filenames produced by each step; the files will be found in bdb:r2d_XX (XX is incremented each time e2refine2d.py is run). A brief outline of the process follows, with a simplified sketch of the core loop after the outline:

  1. Initialize the iterative process by making some initial guesses at class-averages. These are based on rotational/translational invariants, so even with MSA this initial classification is not especially good.
    1. Generate rotational/translational invariants for each particle (input_fp)
    2. Perform MSA on the invariants to define an orthogonal subspace representing the most important differences among the classes (input_fp_basis)
    3. Reproject the particles into the MSA subspace using --nbasisfp vectors (input_fp_basis_proj)
    4. Classify the particles into --ncls classes using K-means (classmx_00)
    5. Iteratively class-average the particles in each class to produce a set of initial averages (classes_init)
  2. Align the current class-averages to each other and sort them by similarity, keeping them centered (allrefs_YY). (Note that YY starts at 01 and is incremented after each iteration.)
  3. Perform MSA on the (aligned) class-averages. Again, this captures the largest differences, but it is now performed on images rather than invariants. (basis_YY)
  4. Select a subset of --naliref averages to use as alignment references for this iteration (aliref_YY)
  5. Align each particle to each of the reference averages from the previous step, and keep the orientation corresponding to the best-matching reference (simmx_YY)
  6. Project the aligned particles using the reference MSA vectors from basis_YY (input_YY_proj)
  7. Classify input_YY_proj with K-means (classmx_YY)
  8. Generate new iterative class-averages (classes_YY)
  9. Loop back to step 2 until --iter loops are complete
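
The following NumPy-only sketch may help make the loop above concrete. It is not EMAN2 code: it ignores alignment, the invariants used for the initial seeding, --classkeep filtering and parallelism, and all function and variable names are invented for illustration. It simply iterates MSA (via SVD), subspace projection, k-means classification and class-averaging on an (nptcl x npix) array of flattened particle images.
{{{
import numpy as np

def msa_basis(data, nbasis):
    """PCA/SVD basis of the mean-subtracted data (stand-in for basis_YY)."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:nbasis]                       # (nbasis, npix), orthonormal rows

def kmeans(proj, ncls, niter=20, seed=0):
    """Plain k-means on the subspace coordinates (stand-in for classmx_YY)."""
    rng = np.random.default_rng(seed)
    centers = proj[rng.choice(len(proj), ncls, replace=False)]
    for _ in range(niter):
        labels = ((proj[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for k in range(ncls):
            if np.any(labels == k):
                centers[k] = proj[labels == k].mean(axis=0)
    return labels

def refine2d_sketch(particles, ncls=8, nbasisfp=5, niter=4):
    """Iterate basis -> projection -> classification -> averaging, as in steps 2-9."""
    averages = None
    for it in range(niter):
        # MSA on the class-averages (or on the particles in the very first pass)
        source = particles if averages is None else averages
        basis = msa_basis(source, nbasisfp)
        proj = particles @ basis.T                                    # like input_YY_proj
        proj /= np.linalg.norm(proj, axis=1, keepdims=True) + 1e-12   # roughly --normproj
        labels = kmeans(proj, ncls)                                   # like classmx_YY
        # New class-averages (simple mean; the real program aligns and filters particles)
        averages = np.stack([particles[labels == k].mean(axis=0)
                             if np.any(labels == k)
                             else np.zeros(particles.shape[1])
                             for k in range(ncls)])
    return averages, labels
}}}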

The primary files you would normally look at after a run is complete are:

  • classes_YY - the highest numbered file. This contains the final unaligned class-averages
  • allrefs_YY - Class-averages which have been sorted and aligned
  • basis_YY - This contains the MSA basis vectors, which may be useful if looking for signs of specific symmetries, etc.
  • aliref_YY - May be useful to look at which averages were used as 2-D alignment references
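
For a quick programmatic check of these outputs, the EMAN2 Python API can be used to count and inspect the averages. This is just an example: the directory name (r2d_01) and iteration number (06) below are placeholders that depend on your run, and it assumes the EMAN2 module is installed.
{{{
from EMAN2 import EMData, EMUtil

fname = "bdb:r2d_01#classes_06"              # placeholder: use your own r2d_XX / YY values
n = EMUtil.get_image_count(fname)            # number of class-averages produced
avgs = [EMData(fname, i) for i in range(n)]
print(n, "class-averages of size", avgs[0].get_xsize(), "x", avgs[0].get_ysize())
}}}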
