I had a great opportunity to present two sessions at #SQLSatJax (SQL Saturday Jacksonville for those who don’t hashtag) on Saturday, May 14. Both of my sessions talked about high availability as it relates to both on-premises SQL Server and all varieties of Azure SQL. Jeff Taylor (t) and his team did a fantastic job and it was wonderful to be at an event with so many familiar faces but, even better, many new ones! I enjoyed both of my sessions, received some complimentary and constructive feedback, and learned things from the hallway/outdoor courtyard conversations as well.
In fact, it is conversations like those at past events that led me to put a slight twist in at the end of both of my sessions. Make no mistake, both of these sessions (“HA/DR Fails and Fun: I Broke It So You Don’t Have To” and “This Is Fine: Firefighting for the DBA”) are technical. Topics covered run the gamut from Always On Availability Groups to replication, from fault domains in Azure to other gaps in the cloud’s “magic” and how to fill those to keep your data and applications highly available. We delved into rev.io‘s cloud migration journey and how making the decision to migrate to the cloud (as we have) can play into the evolving high availability and disaster recovery needs of an organization.
The twist I added, though, was some slides and discussion at the end of each session about the mental toll that being on-call and responsible for highly available environments can take on the individuals and teams involved in supporting them. I speak from personal experience here. I have triggered personal medical issues because of this. I have left jobs because of this. I have made mistakes that could have cost a company money had our customer decided to punish us (thankfully, they did not).
As I was preparing these sessions in late 2021, it occurred to me that, in all of the Azure SQL/SQL Server high availability sessions I have presented over the years, it never occurred to me to counsel people about the personal toll that it can take. I am glad I started doing this in Jacksonville and look forward to presenting these sessions at other events. They triggered some wonderful post-session discussions about ways to deal with the stress of positions like this and I look forward to having those in-person and virtually as the year goes on.
I write this sitting at the airport waiting to leave on a much-needed vacation. I’m grateful to work at an organization like rev.io with leaders in technology who understand that people need to rest and recharge and I look forward to doing just that. If you find that your current role (and the technical leadership there) isn’t prioritizing your wellbeing, my DMs are open @sqlatspeed or you can email me at email@example.com. I’m happy to chat about some tips and tricks that have helped me through the years or make connections via #sqlfamily to help find you a new opportunity. Be well – remember that you are not highly available even if you think you are.
Last week, IDERA was kind enough to invite me to host a Geek Sync Webinar for them. I presented this particular Geek Sync as a webinar version of my Top 5 Tips to Keep Always On Always Humming session that I’ve done at a few SQL Saturdays in the past. Although brevity forces the slightly cutesy title, the session is specific to Always On Availability Groups and my top tips to keep them running after you’ve set them up. Essentially, it’s a list of mistakes I’ve made (or colleagues of mine I’ve made), and I don’t want you to trip over the same things that we did!
It was a good, interactive audience and I received a few questions that I simply couldn’t handle in the allotted hour for the Geek Sync. While I plan an additional blog early in the New Year to discuss those, I did get a few specific questions on the basics of setting up a read-only routing list and I committed to those people that I would get a blog out before my vacation began that covered the basics of setting up a read-only routing list.
Before we dive in, this blog assumes that you have setup a functional Always On Availability Group with a listener. If you are struggling with that, feel free to reach out to me via Twitter and we’ll take a look at it and see what the best way is for me to help you. If that’s all working for you, though, all that is left is to follow the instructions below and you’ll have a functional read-only routing list.
We need to first identify the read-only routing URL that we will be using in our script. That URL follows this format: ‘TCP://<FQDN of your server>:port # used by SQL instance’. For example, a read-only routing URL for one of my test servers may look like ‘TCP://TESTSERVER01.MGTEST.DEMOS:1433’. If your server is setup according to defaults (or you know that the instance is only listening on one port and you know that port number), these instructions will be sufficient. If you have an instance whose setup is more complicated than this, I have always found the script contained in this Microsoft blog handy. That script, in very detailed fashion, walks you through how to calculate your read-only routing URL in a wide variety of configurations.
Now that we’ve determined our read-only routing URL, we’re ready to create our read-only routing list. To keep this blog simple and readable, I’ll create a read-only routing list for a two-node Always On Availability Group (meaning the read-only connections will be pointed at the secondary under normal operations). Since, in this blog, we’re doing all of this via T-SQL, I’ve pasted a sample script below that 1) ensures that the secondary replica allows read-only connections, 2) establishes the read-only routing URL, and 3) creates the read-only routing list. My sample script (which sets up routing lists for scenarios when either 01 or 02 is the primary) is pasted below. Following the script, I have a short note about setting up round-robin load balancing via a read-only routing list (this is a SQL 2016(+) only feature) and a holiday sendoff!
ALTER AVAILABILITY GROUP [TESTAG01] MODIFY REPLICA ON N’TESTSERVER01′ WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY)); ALTER AVAILABILITY GROUP [TESTAG01] MODIFY REPLICA ON N’TESTSERVER01′ WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N’TCP://TESTSERVER01.MGTEST.DEMOS:1433′)); GO
ALTER AVAILABILITY GROUP [TESTAG01] MODIFY REPLICA ON N’TESTSERVER02′ WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY)); ALTER AVAILABILITY GROUP [TESTAG01] MODIFY REPLICA ON N’TESTSERVER02′ WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N’TCP://TESTSERVER02.MGTEST.DEMOS:1433′)); GO
ALTER AVAILABILITY GROUP [TESTAG01] MODIFY REPLICA ON N’TESTSERVER01′ WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST=((‘TESTSERVER02’, ‘TESTSERVER01’)))); GO
ALTER AVAILABILITY GROUP [TESTAG01] MODIFY REPLICA ON N’TESTSERVER02′ WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST=((‘TESTSERVER01’, ‘TESTSERVER02’)))); GO
If you’ve made it this far, thanks for reading! As I mentioned, the examples above are simple and meant to provide a basic understanding of how to setup a read-only routing list. That said, SQL Server 2016 and later gives you the ability to provide round-robin load balancing via a read-only routing list. If I were to extend my sample AG above to four nodes and I wanted three of them to be load-balanced, it might look something like this: READ_ONLY_ROUTING_LIST = ((‘TESTSERVER01′,’TESTSERVER02’, ‘TESTSERVER03’), ‘TESTSERVER04’). If you note the extra parentheses to the left of TESTSERVER01 and to the right of TESTSERVER03, that is all it takes for the listener to understand that it is supposed to load balance read-only connections among those three nodes.
As I speak about, Always On Availability Groups can be a very complex topic to write about and discuss. This post is, by no means, meant to be an exhaustive exploration of read-only routing lists or availability groups. It is meant to be, as requested by the Geek Sync attendees, a quick view into setting up a basic read-only routing list. Hopefully it achieved that goal!
My blog will be quiet over the next couple weeks as I celebrate the holidays with family and friends. I sincerely hope all of you are able to relax and recharge as well this holiday season. We’ll be speaking a lot more in 2018, I believe! Cheers and Happy New Year!
If you work with AlwaysOn Availability Groups in a production environment long enough, you will inevitably encounter a situation that forces you to failover your AG. If you’ve just started working with AGs, that failover is likely followed by a developer or application administrator calling you in the middle of the night cheerfully telling you “something is not right with the database”.
I know this because it happened to me in my early days of supporting AlwaysOn AGs in production, and the reason was never because the database itself was not alive and available. Thankfully, while AlwaysOn is absolutely a marketing term, it’s also an accurate description of a properly configured environment. Fortunately, Microsoft provides us a lot of helpful information covering basic setup and configuration of an AlwaysOn AG. That information can be found starting here.
What isn’t centrally provided and what would have prevented a few late night phone calls to my phone over the years was a list of tips to configure the other things an application needs from an AlwaysOn AG configuration to smoothly operate after a failover.
Tip #1 Quorum Votes
While this is more of a server administrator than DBA tip, since it’s likely the DBA’s phone that will ring if this is configured incorrectly, we need to take an interest in what’s happening across our operations center. Correctly establishing server quorum votes, in most environments, is the responsibility of a server administrator. Rest assured, though, when the on-call phone rings at 4 am you’re the one waking up if this is not configured correctly.
While a deep dive into this topic is a blog in and of itself, the short version of the tip is this: if your AG spans datacenters, make sure that each datacenter has an even number of quorum votes in the cluster and that the cluster has a file share witness configured at a third, separate site. Basically, a typical Windows cluster needs a majority of voting cluster nodes online to maintain normal cluster operations. If the votes are evenly spread between two datacenters and you have a witness at a third site breaking the tie, losing just one datacenter will not bring your cluster down. While you can force start a cluster after it loses quorum, that is going to create extra, manual work for you both you and your server administrators and that work will be done under the utmost pressure because your database will be inaccessible to the application via the AG listener even though the servers themselves may be healthy. Trust me when I say it’s better to learn this lesson reading a blog than being woken up in the middle of the night with somebody yelling “we’re down!”
Tip #2 Read Only Routing Lists
An important, but often overlooked tip is to configure a read-only routing list for your AG listener. I typically do this as part of the initial build-out of the AG because, without this, you lose one of the coolest features of AlwaysOn availability groups: the ability to route read-only connections to replicas. That feature can free up the primary to provide much more stable performance to the application(s) it serves. It also greatly decreases the possibility that your phone rings with somebody complaining about read-only queries slowing down (or worse yet, blocking) read/write queries.
Tip #3 Copy User Logins
This tip continues our trip up the stack and (hopefully) solves a problem that is easy to solve but can be very difficult to figure out under pressure. Many of the applications I’ve worked on over the years had a variety of database logins (using both Windows authentication and SQL Server authentication). While this is not ideal, there are often technical and/or business reasons why there are multiple logins (using both types of authentication) that need access to the database. In an AlwaysOn AG, particularly one where you have added replicas after the initial establishment of the AG, the replica addition may not have included setting up all the logins the application needs. In that scenario the failover will appear to have worked fine and led to happy application admins and DBAs…until components of the application begin malfunctioning because they don’t have access to the replica (now primary) database. Trying to fix this under a lot of pressure may lead you on a wild goose chase or three (as the application errors or complaining users may not always point you in the right direction), but if you follow the steps found here it will walk you through the process of synchronizing logins between two servers. This process should be one of the first things you do when setting up a new AG replica instance unless you really and truly enjoy off-hours phone calls about broken applications!
Tip #4 Regular Transaction Log Backups
The last two tips are more focused on keeping your AG happy, healthy, and running smoothly which should help keep you, the DBA, happy and healthy as well. Tip #4 is a general SQL Server maintenance best practice but can take on a bit of added importance when supporting AlwaysOn availability groups. Make sure you have a transaction log backup strategy in place and being actively monitored on all databases, but especially the databases participating in the AG. While the discussion of whether or not you should perform these backups on your primary or replica(s) (AlwaysOn gives you the option to choose the location of these backups) is an in-depth topic best left for another blog entry, regular transaction log backups help make maintenance and recovery of an AG-participating database much easier. While shrinking a transaction log is rarely, if ever, recommended, it’s a much more complicated operation in an AlwaysOn AG. Beyond that, a transaction log whose size is out of control can make database recovery a much longer process simply because of the size of the files you are moving/restoring – which means you are spending many more hours with your phone ringing off the hook with unhappy users. That is something that we’d all like to avoid.
Tip #5 Stagger Index Maintenance
My final tip, #5, should keep your network administrators off your back and your AGs as synchronized as possible. It’s also a fairly simple thing to do. If you have a large database within your AG, make sure you stagger the index maintenance that is performed during your regular index maintenance. Whether you rely on a maintenance plan within SQL Server, homebuilt index maintenance scripts, or one of the many solid index maintenance solutions available within the SQL Server community, ensure that the transaction log traffic your index maintenance generates (which can be surprisingly large) is not causing your remote replicas to fall behind your primary database. If your index maintenance is causing those remote replicas to fall behind by seconds or, worse yet, minutes, simply stagger your index maintenance jobs throughout the off-hours or at least make sure the index maintenance on your large tables is spread throughout the time allotted to you for index maintenance. This means that not only are the users who are using the replicas for ad-hoc querying and/or reports likely receiving up-to-the-second data but, in the event of a failover, your exposure to data loss is as minimal as your network and server infrastructure will allow.
I hope these five tips do indeed help keep your AlwaysOn availability groups, application administrators, and users happy. That should mean some happier DBAs as well! I look forward to taking deeper dives into a couple of these topics here on the blog, but if you enjoyed this post I encourage you to sign up for my December 1 webinar here. It will go deeper into some of these topics and I’ll be able to take questions as well.