Pacemaker is known to wreak havoc when it gets angry. The usual path to quick recovery when the cluster goes crazy like this is to make really sure which replica is the most up to date, shut down Pacemaker completely, assign the VIP manually to a healthy replica and promote it manually. Then, once you're back in business, figure out how to rebuild the cluster.
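For the "make really sure which replica is the most up to date" step, here is a minimal sketch of one way to compare replicas, assuming you have already collected a replay LSN by hand from each one (for example with `SELECT pg_last_wal_replay_lsn()` on PostgreSQL 10 or later). The replica names and LSN values are made up for illustration, and the sketch ignores timelines, so treat it as a sanity check rather than a full procedure.

```python
from typing import Mapping


def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an integer byte position."""
    high, low = lsn.split("/")
    # An LSN is a 64-bit position printed as two hex numbers: high 32 bits / low 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def most_advanced(replay_lsns: Mapping[str, str]) -> str:
    """Return the name of the replica with the highest replayed LSN."""
    if not replay_lsns:
        raise ValueError("no replicas reported an LSN")
    return max(replay_lsns, key=lambda name: parse_lsn(replay_lsns[name]))


# Hypothetical values, collected manually from each replica.
reported = {
    "replica-a": "16/B374D848",
    "replica-b": "16/B3750000",
    "replica-c": "F/FFFFFFF0",
}

candidate = most_advanced(reported)
print(candidate)
```

The LSNs are parsed to integers because comparing them as plain strings gives the wrong answer when the high parts differ in length: "F/FFFFFFF0" sorts after "16/B3750000" as text even though it is the older position.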
If this is indeed true, doesn't it negate the purpose of Pacemaker to begin with? It's like anti-software. When you run it in your environment, to recover from a failure (which seems to me to be what HA software should be about) you have to turn it off first, or else it will destroy your recovery attempts.
It's like a perverse version of Chaos Monkey, except you want it to destroy you when you are most vulnerable.
It's great when it works as expected. When it doesn't... then the fun begins. I've found it quite fragile: sensitive to component versions, sensitive to configuration, etc. Most of the times I've seen Pacemaker go crazy, Postgres itself was happy to cooperate once Pacemaker was out of the way. The unknown/weird Pacemaker failure modes were a real (and scary) problem.
I guess the lesson here is not to rely entirely on some HA black magic, and to always have procedures in place for the 'HA black magic failed us' moments. And a team trained to deal with situations like this. It's only software, so it will break sooner or later.