CLOUDSTACK-8952 - The redundant routers are facing a race condition due to several KeepaliveD/ConntrackD restarts #940
asfgit merged 12 commits into apache:master from artificially-ai:fix/rvr__keepalived_restart
…ess constructor call - There is no such process, so CsProcess.find returns false and keepalived is restarted every time.
…'s needed
- With the new logic, the file will be replaced when the router starts, because the default
conntrackd config file will be different.
- With keepalived fixed, they should not be needed anymore, so first reducing them drastically
- I am now making a backup of the template file, writing to the template file and comparing it with the existing configuration
- The template file is recovered after the process
- I also check if the process is running
- I fixed a bug in the compare method
- I am now updating the configuration variable once the file content is flushed to disk
…outer - There were too many places trying to bring the public interface UP. I centralised it now.
…ile changes
- It was working before because the Routers were restarting about 10 times for each operation
e.g. adding a VM to a network or acquiring a new IP.
- Adding stat_rules of internal LB to iptables
We needed one extra rule in the INPUT chain
…commit/is_changed methods - We now have to check if the file changed before committing. It doesn't make sense to write to disk if there was no change.
- Do not use the API call because it will read what is in the database, that might not have been updated yet
* Check the status in the router directly instead
- Remove all the sleeps
… report back to ACS
- If we stop/start a router, the state in the file will still say MASTER, when it is actually not
- Checking the state based on the interface (eth1) state
- Once master.py is called by keepalived, save the state in the json file to BACKUP just to make sure it's also written there
- We do not need to retry that much
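The idea of deriving the router state from the interface instead of the JSON file can be sketched roughly as follows. This is only an illustration in Python, not the code from this PR: the function names `router_state_from_ip_output` and `read_interface_state` are hypothetical, and it assumes the usual `ip link show` output in which the line for the interface contains `state UP` or `state DOWN`.

```python
import re
import subprocess


def router_state_from_ip_output(ip_output):
    """Map `ip link show` output to MASTER/BACKUP (hypothetical helper).

    Assumption: the router whose public interface reports `state UP`
    is the master; anything else is treated as backup.
    """
    match = re.search(r"\bstate\s+(\w+)", ip_output)
    if match and match.group(1) == "UP":
        return "MASTER"
    return "BACKUP"


def read_interface_state(iface="eth1"):
    """Run `ip link show dev <iface>` and classify the result.

    The default of eth1 follows the commit message above; an RVR would
    use eth2 according to the PR description.
    """
    result = subprocess.run(
        ["ip", "link", "show", "dev", iface],
        capture_output=True,
        text=True,
        check=False,
    )
    return router_state_from_ip_output(result.stdout)


# Sample lines in the format printed by `ip link show` (illustrative data).
SAMPLE_UP = ("3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 "
             "qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000")
SAMPLE_DOWN = ("3: eth1: <BROADCAST,MULTICAST> mtu 1500 "
               "qdisc noop state DOWN mode DEFAULT group default qlen 1000")
```

Because the interface is queried at the time of the check, a router that was stopped and started cannot keep reporting a stale MASTER value the way a file on disk can.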
Hi @remibergsma @karuturi @miguelaferreira @wido @borisroman @bhaisaab @bvbharat Please have a look at this PR.
The three exceptions are related to the network cleanup issue ==> https://issues.apache.org/jira/browse/CLOUDSTACK-8935
== Hardware required tests ==
== No Hardware required tests ==
This probably also fixes CLOUDSTACK-8927 but we need to confirm.
I'm testing this over the weekend. Reporting back when automated tests are done and when I've played around with it.
Performed the following tests:
Result:
Next:
Results:
The failures are a known cleaning-up issue and not related.
Next:
Result:
Then built two VPCs, with one tier each in which I deployed one VM. A VPN between each other allowed for them to ping each other on their internal IP addresses:
Based on the above: LGTM. Thanks again @wilderrodrigues!
Ping @karuturi @bhaisaab @wido @bvbharat @DaanHoogland @miguelaferreira Anyone else with some time to test this PR? Please, have a look at the test reports already shared here. Cheers,
nosetests --with-marvin --log-folder-path=/tmp/marvin/ --marvin-config=../../../mct-zone1-kvm1.cfg -a tags=advanced test_internal_lb.py
Test to verify access to loadbalancer haproxy admin stats page ... === TestName: test02_internallb_haproxy_stats_on_all_interfaces | Status : SUCCESS ===
Ran 2 tests in 1026.033s
OK
nosetests --with-marvin --log-folder-path=/tmp/marvin/ --marvin-config=../../../mct-zone1-kvm1.cfg -a tags=advanced test_vpc_vpn.py
Test Remote Access VPN in VPC ... === TestName: test_vpc_remote_access_vpn | Status : SUCCESS ===
Ran 2 tests in 728.939s
OK
LGTM!
CLOUDSTACK-8952 - The redundant routers are facing a race condition due to several KeepaliveD/ConntrackD restarts
This PR fixes the following issues:
* KeepAliveD being restarted for each action performed on the routers
* ConntrackD configuration being copied for each action performed on the routers, causing several restarts
* ACS Management Server relying on the JSON file to report which router is Master/Backup
* Public interface on both routers being in UP state due to several places checking if the interface is UP/DOWN and trying to do KeepAliveD
* Removing all the sleeps from the test_vpc_redundant.py - those are no longer needed
* When KeepAliveD calls master.py during the election, update the cmdline.json to set the router in Backup mode: the election will take care of changing it afterwards.
* Add LB stats_rules to iptables INPUT chain
* The RVR public interface is set to eth2 instead of eth1 - as in the rVPC. Make sure the check works in both cases
Those fixes make all the routers very stable, with ACL, FW, PF and LB working just fine!
* pr/940:
CLOUDSTACK-8952 - Make the checkrouter.sh compatible with RVR as well
CLOUDSTACK-8952 - Make the tests rely on the interface state rather than the json file
CLOUDSTACK-8952 - Reduce retries from 20 to 5
CLOUDSTACK-8952 - Do not rely on the router state in the json file to report back to ACS
CLOUDSTACK-8952 - Make the check for master more reliable
CLOUDSTACK-8952 - Restart dnsmasq every time the configure.py runs
CLOUDSTACK-8952 - Make sure the calls to CsFile use the new logic of commit/is_changed methods
CLOUDSTACK-8952 - Make sure we restart dnsmasq if the configuration file changes
CLOUDSTACK-8952 - The public interface was coming UP in the Backup router
CLOUDSTACK-8952 - Do not restart conntrackd unless it's needed
CLOUDSTACK-8952 - Do not replace the conntrackd config file unless it's needed
CLOUDSTACK-8952 - Remove the '--vrrp' search criteria from the CsProcess constructor call
Signed-off-by: Remi Bergsma <github@remi.nl>
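The "do not replace the config file unless it's needed" commits boil down to a write-only-if-changed pattern. Below is a minimal Python sketch of that pattern, offered as an illustration only: `ConfigFile` and `demo` are hypothetical names, not the real CsFile class, and only the method names `is_changed` and `commit` are borrowed from the commit titles.

```python
import os
import tempfile


class ConfigFile:
    """Hypothetical stand-in for the commit/is_changed idea; not CsFile itself."""

    def __init__(self, path):
        self.path = path
        self.original = self._read()       # content currently on disk
        self.lines = list(self.original)   # content we intend to write

    def _read(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as handle:
            return handle.read().splitlines()

    def set_lines(self, lines):
        self.lines = list(lines)

    def is_changed(self):
        # True only when the intended content differs from what is on disk.
        return self.lines != self.original

    def commit(self):
        """Write to disk only when something changed; report whether we wrote."""
        if not self.is_changed():
            return False
        with open(self.path, "w") as handle:
            handle.write("".join(line + "\n" for line in self.lines))
            handle.flush()
            os.fsync(handle.fileno())
        # Update the in-memory view only after the content is flushed to disk.
        self.original = list(self.lines)
        return True


def demo():
    """Commit the same content twice; only the first commit should write."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "conntrackd.conf")
        first = ConfigFile(path)
        first.set_lines(["Sync {", "}"])
        wrote_first = first.commit()
        second = ConfigFile(path)
        second.set_lines(["Sync {", "}"])
        wrote_second = second.commit()
        return wrote_first, wrote_second
```

A caller can use the return value of `commit()` to decide whether a service restart is warranted, so an unchanged file never triggers one.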
