Debugging Distributed Parallelism in EMAN2

This page is for debugging parallelism in a setup where the nodes of a Linux cluster run the clients and the server runs on a separate workstation, called frodo.bu.edu in this example.

If you cannot get parallelism working at all

  1. When you run the server you should see a message like "server running on port 9990". If clients (nodes on the cluster) connect, you will see a spinning '-/|\' sequence appear, with a count of the number of nodes that it has seen so far. If you don't see 'server running on port 9990', you'll have to contact me.
  2. The first thing to do is to make sure that the workstation can talk to itself. While the server is running, try running e2parallel.py dcclient --server=frodo.bu.edu --port=9990 --verbose=1 in another window on the workstation itself.

    1. If this produces a slowly spinning response and a (1) on the server, then that's working, at least, and you can kill the client and proceed to 3.
    2. If THAT doesn't work, you can try e2parallel.py dcclient --server=localhost --port=9990 --verbose=1

      1. If THAT works, then you likely have a firewall configuration problem on your machine preventing outside connections to port 9990, which needs to be resolved.
      2. If even that doesn't work, you'll have to contact me again, because something mysterious is going on.
  3. Ok, we've established that your machine can accept connections on 9990 from itself. The next thing to try is to log into the head-node on the cluster and try e2parallel.py dcclient --server=frodo.bu.edu --port=9990 --verbose=1

    1. If that works, proceed to 4
    2. If that doesn't work, then either your workstation firewall is configured to block 9990 from external sources, or something in the network between the machines is blocking the connection.
  4. Now, manually log into one of the cluster nodes and run e2parallel.py dcclient --server=frodo.bu.edu --port=9990 --verbose=1

    1. If that works, then you should be all set, and you can run clients on all of your cluster nodes.
    2. If that fails, most likely your cluster head-node isn't configured to forward network connections from the nodes to the outside world. The head node probably needs something like:  iptables -t nat -A POSTROUTING -o eth1 -j MASQUERADE  to be added to the /etc/rc.local file, and the compute nodes need something like route add default gw 192.168.0.132 in their rc.local files. Note that 'eth1' and 192.168.0.132 will need to be adapted to your specific configuration.
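
Before suspecting EMAN2 itself, it can help to confirm that a plain TCP connection to the server port succeeds from each machine in the sequence above. The following is a minimal sketch using only Python's standard library. It is not part of EMAN2, and the function name port_reachable is made up for illustration. A successful connection only shows that the port is open and reachable, not that the server behind it is healthy.

```python
import socket


def port_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; not an EMAN2 function.
    """
    try:
        # create_connection resolves the host name and attempts a TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False
```

Calling port_reachable("localhost", 9990) and port_reachable("frodo.bu.edu", 9990) on the workstation, then the second call again on the head node and on a compute node, mirrors steps 2 through 4. If the call fails at some step, the problem is likely in the firewall or network at that step rather than in e2parallel.py.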

If you have parallelism working, but are experiencing odd failures

This sort of problem can be difficult to debug. If something like this happens to me (due to some sort of bug; it used to happen all the time when I was first developing the system), I start with the following sequence:

  1. kill any running refinements using the dcserver
  2. exit (^c) the server process
  3. e2parallel.py dckillclients
  4. leave this running long enough to ensure all the clients are dead
  5. on each client computer:
    • e2bdb.py -c
    • rm -rf <path to cache directory>

  6. on the server computer
    • e2bdb.py -c
    • rm -rf <path to directory where server runs>

  7. relaunch server (with -v 1 or -v 2)
  8. relaunch all clients (with -v 2), and wait for the count displayed on the server to match the number of launched clients
  9. run your refinement job again

If you had some sort of transitory failure, that will normally fix the problem. If you experience an error again, then there is a real problem to debug. Look on all of the clients and the server for any error messages displayed. Don't just look at the last error when the job failed. The first error is probably the most important.
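
If the output of the server and each client has been captured to files (for example by redirecting each process's output, which is an assumption and not something this page prescribes), a short script can pull out the first error-like line from each file. This is only a sketch: the function names and the keyword list are illustrative guesses, not EMAN2 conventions, so adjust the keywords to whatever your output actually contains.

```python
def first_error_line(lines, keywords=("traceback", "error", "exception")):
    """Return (line_number, text) for the first line containing a keyword.

    Matching is case-insensitive. Returns None if no line matches.
    The keyword list is an assumption, not an EMAN2 convention.
    """
    for number, text in enumerate(lines, start=1):
        lowered = text.lower()
        if any(word in lowered for word in keywords):
            return number, text.rstrip("\n")
    return None


def first_errors(paths):
    """Map each log file path to its first error-like line (or None)."""
    results = {}
    for path in paths:
        # errors="replace" keeps reading even if the log contains odd bytes
        with open(path, "r", errors="replace") as handle:
            results[path] = first_error_line(handle)
    return results
```

Running first_errors over the server log and every client log gives one candidate line per machine, which makes it easier to see which machine failed first instead of reading only the final error.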

If you find yourself in this situation, emailing sludtke@bcm.edu is probably the best approach :^)

EMAN2/Parallel/Debug (last edited 2012-03-20 16:45:00 by JohnFlanagan)