Part 2: Recovery Time Objective - Plan of Action

In our previous post, Part 1: Recovery Time Objective - Plan of Action, we saw why RTO is important and the various factors around it. As promised, we will now explore some of the common recovery methods that our users rely on for our application. Each method has its ups and downs. Some of them we specifically recommend because we designed them, and we consider them complete and viable solutions for recovery. Every other method is a workaround for quicker or easier recovery and is not necessarily endorsed by us.

1) Snapshots All Day-Everyday!

If your database (MSSQL/PGSQL) resides on the same server as the ServiceDesk Plus application, then a VM clone of that server taken during a period of normal operation can be a viable backup/recovery method, as long as the cloning process was quick and effectively instantaneous.

However, if they are on two different servers, this approach works only if there are VM snapshots of both servers and both clones/backups were taken at timestamps close to each other, ideally at the same time and with the application services stopped.
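The timestamp requirement can be checked with a few lines of script before a pair of snapshots is trusted for recovery. The following is a minimal sketch, not part of ServiceDesk Plus: the function name and the 5-minute tolerance are assumptions chosen for illustration, and the acceptable gap for your environment should be decided by testing.

```python
from datetime import datetime, timedelta


def snapshots_are_consistent(app_snapshot_time, db_snapshot_time,
                             max_skew=timedelta(minutes=5)):
    """Return True if the two snapshot times differ by at most max_skew.

    max_skew is an assumed tolerance, not a vendor-defined value.
    """
    return abs(app_snapshot_time - db_snapshot_time) <= max_skew


# Hypothetical snapshot times for the application server and the DB server.
app_time = datetime(2024, 1, 15, 2, 0, 0)
db_time = datetime(2024, 1, 15, 2, 3, 30)
print(snapshots_are_consistent(app_time, db_time))
```

A pair that fails this check is better treated as two unrelated backups than as one recovery point.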

In our experience, this is the fastest way to get services back online after a critical issue, as long as there are viable backups from a period of perfect operation.

The known limitation is that this is not our recommended best practice: the method is not built by us, so we do not control its effectiveness. The viability of the recovery depends solely on the fidelity of the backup systems.

Any method that relies on tape backups, virtual machine snapshots, file system backups, Veeam backups or similar tools falls under this category. The number of variables in play is too large for us to control completely, and hence this is not the method we recommend most.

2) App-side Backup/Restore:

The application can perform manual or scheduled backups, dumping the data to a collection of .data files. The restore time depends solely on the size of the compressed data and the number of files compressed.

We do have an option to exclude non-critical data during backups (known as Trimmed Backups) to ensure the least possible backup/restore time, but it must be pre-planned. This significantly reduces the backup/restore time and resources.

The primary limitation is that, in our experience, once the backup grows beyond several gigabytes, recovery can take several hours, depending on the number of files.
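Since both total size and file count drive the restore time, it helps to record these two numbers for every backup. Below is a minimal sketch of how that could be done; the function name is hypothetical and the script only walks a folder on disk, it does not call any ServiceDesk Plus interface.

```python
from pathlib import Path


def summarize_backup(backup_dir):
    """Return (file_count, total_bytes) for all regular files under backup_dir."""
    files = [p for p in Path(backup_dir).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)
```

Calling summarize_backup on the folder that holds one backup returns the pair to log next to the measured restore time, so that growth in either number can be compared against growth in recovery time.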

This is the best practice and the method we recommend, because the backup and restore operations are controlled entirely by our systems. As long as the backup has completed correctly, the viability of the recovery depends only on the safe storage of the backup file itself.

This can also be scheduled to run periodically during off-hours to reduce the impact on the server and remove the possibility of human error.

One recommendation: store the backups on a separate server or a NAS, to avoid losing them if there are hardware-related issues on the primary server.
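Copying a backup to another machine is only useful if the copy is intact. The sketch below shows one way to copy a backup file to a second location (for example a mounted NAS share) and verify it with a SHA-256 checksum. The function names and any paths you pass in are illustrative assumptions; this is not a built-in feature of the application.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_backup_verified(source_file, target_dir):
    """Copy source_file into target_dir and confirm the copy is byte-identical."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = Path(shutil.copy2(source_file, target_dir))
    if sha256_of(source_file) != sha256_of(copied):
        raise IOError(f"Checksum mismatch for {copied}")
    return copied
```

Running this right after each scheduled backup, with target_dir pointing at the separate server or NAS, gives an early warning if a copy is corrupted in transit.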

3) Failover Service (FOS) - Our In-house Disaster Recovery Method:

Lastly, we have our own disaster recovery method, called “Failover Service”, or FOS for short. In this setup we create two application servers, a master and a slave, side by side. The slave sits dormant on a separate server but is connected to the same DB, which runs on a third server. If the master dies for some reason, the slave kicks in and becomes operational.

More information on that can be found here - Fail Over Service

The primary advantage is that this system is tried and tested and, importantly, built in-house. However, it requires a separate FOS license, which must be procured before implementing it. The time taken for the system to become operational also depends on the startup time of your application (which can be tested during a planned downtime). Lastly, it will not work if the database is not on a separate server or is a Postgres DB.

General Best Practices:

Now that we have established some of the methods by which we can tackle recovery, let us look at a few best practices that apply irrespective of the method involved.

Always perform an intentional recovery operation on a Test/Dev environment that is created from the current/latest production.

This will tell you exactly how long it takes to bring the system online irrespective of the method used (Recovery Time Calculation).

The factors involved are not the same for individual clients, so it must be done individually to know where you stand.
Test Recovery must be done periodically and not just once.

This will tell you if the time taken for recovery has increased due to data growth or any other factor.

The authority to call for a recovery must rest with someone who can declare a disaster.
But the operations required to complete the recovery, and the privileges they need, should be available to a dedicated team tasked with this operation.

Someone from this team should ideally be available around the clock, and the authority who can declare a disaster must be reachable.

The steps taken for a test recovery must be documented so that they can be followed even by inexperienced operators in case of an unavoidable disaster.
Factors to document during each Test Recovery Operation:

Total Recovery Time.
Time split for each sub-process involved.
Application startup time: the time from when the start action is invoked to when the app becomes available.
Checklist to ensure everything is as intended after the recovery is completed.
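The total time and the per-step split can be captured with a small timing helper wrapped around each stage of a test recovery. This is only a sketch: the class name, the step names and the sleep calls standing in for real work are all assumptions made for illustration.

```python
import time
from contextlib import contextmanager


class RecoveryTimer:
    """Collect the duration of each named step of a test recovery."""

    def __init__(self):
        self.steps = []  # list of (step name, seconds)

    @contextmanager
    def step(self, name):
        started = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the step raised an error.
            self.steps.append((name, time.perf_counter() - started))

    def total(self):
        return sum(seconds for _, seconds in self.steps)

    def report(self):
        lines = [f"{name}: {seconds:.1f} s" for name, seconds in self.steps]
        lines.append(f"Total recovery time: {self.total():.1f} s")
        return "\n".join(lines)


timer = RecoveryTimer()
with timer.step("restore backup"):
    time.sleep(0.1)  # placeholder for the real restore action
with timer.step("application startup"):
    time.sleep(0.1)  # placeholder for waiting until the app is available
print(timer.report())
```

Keeping the printed report from every test run makes it easy to see which sub-process is responsible when the total recovery time grows.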

The 3-2-1 backup rule should be followed whenever possible.

3 backups should be present, made with 2 methods if possible, and at least 1 stored off-site.
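A backup inventory can be checked against this rule mechanically. The sketch below assumes a hypothetical inventory format (location, method, off-site flag) that you would maintain yourself; it treats the "2 methods" part as a reported gap even though the rule above asks for it only where possible.

```python
from dataclasses import dataclass


@dataclass
class BackupCopy:
    location: str   # where the copy is stored (free text)
    method: str     # how it was made, e.g. "app backup" or "vm snapshot"
    offsite: bool   # True if stored away from the primary site


def check_3_2_1(copies):
    """Return a list of unmet parts of the 3-2-1 rule (empty means satisfied)."""
    problems = []
    if len(copies) < 3:
        problems.append(f"only {len(copies)} backup(s), need at least 3")
    if len({c.method for c in copies}) < 2:
        problems.append("fewer than 2 backup methods in use")
    if not any(c.offsite for c in copies):
        problems.append("no off-site backup")
    return problems


# Hypothetical inventory used only to demonstrate the check.
inventory = [
    BackupCopy("application server", "app backup", False),
    BackupCopy("NAS", "app backup", False),
    BackupCopy("remote site", "vm snapshot", True),
]
print(check_3_2_1(inventory))
```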

In the event of system failure, the vendor (ServiceDesk Plus Support Team) must be approached to declare non-repairability.
But the operations to complete the recovery should ideally be achievable by in-house technicians, to avoid vendor dependencies for an on-premises application.

If application owners follow the above recommendations regularly, they will have a clear document and chart describing the Recovery Time Objective for ServiceDesk Plus. This minimizes the risks to the organization and saves time, money and other valuable resources.

I hope this document answers questions that you might have had but not asked, and describes the various factors involved in detail. We are constantly working on making the application more efficient in the aspects that we can control. However, there are many factors and variables involved, from the data size and quantity to the server specifications. So it is always advisable to set up periodic testing to know your individual RTO.

We thank you for choosing our application. For any clarifications, feel free to get in touch with one of our experts in the field and in our application!