Disaster Recovery Exercise of 13/06


Intro

On Saturday the 13th of June (luckily not a Friday), we held our first Disaster Recovery (DR) exercise together with our IT infrastructure partner SAVACO.

So what is DR? Disaster Recovery is a set of policies, tools and procedures that enable the recovery or continuation of vital technology infrastructure, systems and operations following a natural or human-induced disaster.

Since our founding in 2008, we have had detailed Disaster Recovery procedures and playbooks in place for our 24/7 dispatching desk – evaluated and refined on a yearly basis. As we mentioned in our latest newsletter, these have proven useful and have been strengthened throughout the Covid-19 crisis – with every dispatcher able to work from home. Only in the extreme scenario of a catastrophic event that wipes Belgium off the face of the earth would EGSSIS’ dispatching become unavailable. So no worries there.

In line with our ISO27001 certification project, we have to simulate a disaster that severely disrupts our main data centre in Kortrijk, Belgium. The goal of this DR exercise was to simulate the ‘fail-over’ to our back-up data centre in Brussels: how long would it take us to get all IT systems and databases up and running again? This is very important since we are a provider of reliable Software-as-a-Service. Of course our data centres are ISO27001 certified and have many redundancies and fall-backs in place to stay operational during power outages and the like. Nevertheless, we have to test the worst case: losing a full data centre. This calls for a team of experts!

THE A-TEAM of BUTLERS & whizzkids

There are many ‘moving parts’ working behind the scenes to plan and execute such a comprehensive DR exercise. The planning took a few weeks, and we had to inform our customers well in advance, as well as all the market parties we communicate with (TSOs, SSOs, etc.). Because of Covid-19, our A-Team for the day was spread across different locations:

  • IT team (Jan Corluy, Dirk Van Laere): EGSSIS HQ

  • IT team (Jean-Francois Van Snick): Home office

  • Operations team (Tina Elias, Yannick Van Boven, Wim Allart): EGSSIS HQ

  • Business Analysts (Dieter Juwet, Jonas Lichtert): Home office

  • CEO (Tom Dufraing): EGSSIS HQ

  • SAVACO team (Bert D’Hont, Thijs Deschepper): Home office

Standby team:

  • IT Team (Tom Coppens)

  • SAVACO team (Jen Chiers)

The exercise

The exercise started at 5:00 am (yes, it was early!) with the fail-over of the EGSSIS software application servers. The complete fail-over of our application servers took 15 minutes and went smoothly! Once these machines were up and running in the back-up data centre, some extra configuration needed to be applied:

  • Internal/private DNS changes (2 minutes)

  • EGSSIS GAS/POWER application servers steps (10 minutes)

    • Configure reverse proxy with the correct backup data center application servers

    • Database fail-over

  • EGSSIS COSMOS application servers steps (120 minutes)

    • Configure reverse proxy with the correct backup data center application servers

    • Change the IP address according to the backup data center network

    • Change routing configurations according to the backup data center
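The configuration steps above are essentially an ordered checklist that must stop at the first failure. A minimal sketch of such a fail-over runner, purely for illustration – the step names mirror the list above, but the no-op actions are placeholders; in reality each step would call out to DNS, reverse-proxy and database tooling:

```python
# Ordered fail-over steps as listed above. Each entry pairs a step name with
# an action; the lambdas here are placeholders for the real tooling calls.
FAILOVER_PLAN = [
    ("internal DNS switch to backup data centre", lambda: None),
    ("GAS/POWER: point reverse proxy at backup app servers", lambda: None),
    ("GAS/POWER: database fail-over", lambda: None),
    ("COSMOS: point reverse proxy at backup app servers", lambda: None),
    ("COSMOS: re-address servers for the backup network", lambda: None),
    ("COSMOS: update routing for the backup data centre", lambda: None),
]

def run_failover(plan):
    """Execute each step in order, stopping at the first failure.

    Returns (completed_step_names, failed_step_name); the second element
    is None when every step succeeded.
    """
    completed = []
    for name, action in plan:
        try:
            action()
        except Exception:
            return completed, name
        completed.append(name)
    return completed, None
```

Stopping at the first failed step matters in practice: applying routing changes on top of a half-completed proxy switch would leave the platform in an inconsistent state that is harder to roll back.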

At 7:30 am everything was configured and up and running, so our Operations and Business teams started internal testing according to the testing plan. This testing is necessary to verify that all communications were set up correctly and all applications work as expected. We quickly noticed that two market operators (TSOs) had not correctly implemented/whitelisted our DR IP address (the IP address of our second data centre). In line with our DR playbook, we decided to complete all other necessary tests and then fall back to our main data centre, in order to ensure trading/business continuity for all our clients active on these two markets.

We started the fail-over back to the main data centre at 8:25 am, after sending out communications to our customers and market communication counterparts. By 9:00 am everything was up and running again in our primary data centre. Below you see a graph with the availability of our software platforms during the DR exercise.

Unfortunately, after being up and running in the main data centre, one of our application servers didn’t boot correctly, leading to logon problems for external users. This was solved within 30 minutes by SAVACO. After evaluation, we concluded that the reboot issue was not caused by the fail-over exercise.

NEXT STEPS

The main goal of a DR exercise is to move beyond the ‘theoretical what-if’ and to identify points of failure under real-life circumstances.

Following the evaluation of our DR exercise we’ve added some major improvements to our backlog:

  • Change our DNS provider to lower the ‘Time To Live’ of DNS records

  • Improved network configuration for application servers

  • Fail-over configuration updates for application servers

  • Availability tests for external AS2/AS4 communication parties, so we know with 100% certainty that each TSO/SSO counterpart accepts data/messages from our backup data centre
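One way to implement those availability tests is a periodic connectivity probe, run from the backup data centre, against every counterpart endpoint. A minimal sketch – the endpoint names below are made up, not real TSO/SSO addresses, and the probe function is injectable so the logic can be tested offline:

```python
import socket

def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_counterparts(endpoints, probe=tcp_reachable):
    """Probe every (name, host, port) endpoint; return the unreachable names.

    Run from the backup data centre, a name in the returned list suggests
    that counterpart has not whitelisted the DR IP address.
    """
    return [name for name, host, port in endpoints if not probe(host, port)]

# Hypothetical counterpart list -- real TSO/SSO endpoints would go here.
ENDPOINTS = [
    ("tso-example-1", "as2.tso-one.example", 443),
    ("sso-example", "as4.sso.example", 443),
]
```

A TCP-level probe only proves the firewall accepts our DR address; a fuller test would also exchange a signed AS2/AS4 test message, which is what the whitelisting issue during the exercise would actually have surfaced.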

As an overall conclusion, we can say that all of our core functions were up and running correctly in the backup data centre within the hour. For the tasks that did not run as expected, we commit to solving these items in a timely manner.

Once again, many thanks to everyone who contributed to this test, driving further improvement and continuity of our Software-as-a-Service!

I can confirm we are on track with our ISO27001 roadmap as mentioned during E-World 2020.

Kind regards,

Jan Corluy

CTO – EGSSIS