Question: I'm running EMAN in parallel on a cluster, and it seems to lock up when it reaches the class-averaging step (classalign2).

Answer: This is known to occur on a few clusters, though not predictably. It appears to be related to a bug in older versions of OpenSSH, or possibly in the kernel's networking code. To check, wait until refine is locked up and run 'netstat' on the node where the refine command was started; if you see a number of ssh processes in the CLOSE_WAIT state, this is the problem you are having. Unfortunately, there is no universal fix other than running a newer version of Linux on your cluster. Upgrading to a newer version of OpenSSH (4.2 or higher) may also solve the problem. As a stopgap, you can simply run 'killall ssh' on the node where refine is running, and the job will start running again. However, you may be forced to do this once in every iteration, assuming it continues to
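
The netstat check described above can be scripted. What follows is only a minimal sketch, not part of EMAN: the function names are made up for illustration, and it assumes a Linux node where 'netstat -tnp' prints the usual net-tools columns (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State, PID/Program name). Other netstat versions may format their output differently.

```python
import subprocess


def stuck_ssh_connections(netstat_output):
    """Return the netstat lines that show an ssh client in CLOSE_WAIT.

    Assumes the column layout printed by Linux net-tools 'netstat -tnp':
    Proto Recv-Q Send-Q Local-Address Foreign-Address State PID/Program
    """
    stuck = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP connection line.
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        state, program = fields[5], fields[6]
        # The last column looks like '1234/ssh'; match the ssh client only,
        # not sshd.
        if state == "CLOSE_WAIT" and program.split("/")[-1] == "ssh":
            stuck.append(line.strip())
    return stuck


def check_node():
    """Hypothetical helper: run netstat on this node and report stuck ssh."""
    result = subprocess.run(
        ["netstat", "-tnp"], capture_output=True, text=True, check=False
    )
    return stuck_ssh_connections(result.stdout)
```

Calling check_node() on the node where refine was started, while it is locked up, returns one line per ssh connection in CLOSE_WAIT. An empty list suggests the hang has some other cause.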

FAQ_EMAN_USING_16 (last edited 2008-11-26 04:42:28 by localhost)