I’ve seen this particular Always On issue occur at two different customers. Both would occasionally have issues where one (or more) of the replicas would fail to restart their communication within the availability group. The error message was usually something like this: The Database Mirroring endpoint cannot listen on port 5022 because it is in use by another process.
In a large, complex corporate environment, the DBAs may not always have permission to run the netstat commands necessary to figure out what application is “stealing” the port. Even if they’re able to figure out which application is causing the port conflict, it may take a fair amount of time to change that application’s behavior.
Since a communication failure within an AG is usually a “hair on fire” kind of event, a quick fix may be desired. The quickest fix I’ve found is to change the port on which the AG endpoint is listening. While the Microsoft help pages on this contain enough information to put together the right script eventually, the first time this happened to me I really would have liked a blog post specific to this issue that gave me the right script to use.
In keeping with this T-SQL Tuesday’s theme, below is my SQL Server present. Please note that I used 5023 as an example but your choice can be flexible depending on what else is consuming ports on your machine. Hopefully this helps somebody (or me if I manage to travel back in time and encounter this same issue):
ALTER ENDPOINT [<endpoint name>]
STATE = STARTED
AS TCP (LISTENER_PORT = 5023)
ALTER AVAILABILITY GROUP [<AG name>]
MODIFY REPLICA ON ‘<replica name>’ WITH (ENDPOINT_URL = ‘<server name>:5023’)
If this post helped you and you live in the Louisville, KY area, please consider registering for SQL Saturday 531 on August 6, 2016 and attend my Always On session there – you’re guaranteed to get what you pay for since SQL Saturdays are free!
Thanks for reading! This weekend I’m competing in rounds 3 and 4 of the The Hoosier Tire US Formula First Championship Series at Mid-Ohio Sports Car Course so I’ll be interrupting the SQL Server stuff usually posted here with some news about my first event of the 2016 racing season.
I’m honored to have been selected to speak at SQL Saturday #531 in Louisville, KY on August 6. I’ll be presenting the session that accompanies the first blog post on this site (http://bit.ly/290Oamz), “Top 5 Tips to Keep Always On Always Humming and Users Happy”.
If you work with AlwaysOn Availability Groups in a production environment long enough, you will inevitably encounter a situation that forces you to failover your AG. If you’ve just started working with AGs, that failover is likely followed by a developer or application administrator calling you in the middle of the night cheerfully telling you “something is not right with the database”.
I know this because it happened to me in my early days of supporting AlwaysOn AGs in production, and the reason was never because the database itself was not alive and available. Thankfully, while AlwaysOn is absolutely a marketing term, it’s also an accurate description of a properly configured environment. Fortunately, Microsoft provides us a lot of helpful information covering basic setup and configuration of an AlwaysOn AG. That information can be found starting here.
What isn’t centrally provided and what would have prevented a few late night phone calls to my phone over the years was a list of tips to configure the other things an application needs from an AlwaysOn AG configuration to smoothly operate after a failover.
Tip #1 Quorum Votes
While this is more of a server administrator than DBA tip, since it’s likely the DBA’s phone that will ring if this is configured incorrectly, we need to take an interest in what’s happening across our operations center. Correctly establishing server quorum votes, in most environments, is the responsibility of a server administrator. Rest assured, though, when the on-call phone rings at 4 am you’re the one waking up if this is not configured correctly.
While a deep dive into this topic is a blog in and of itself, the short version of the tip is this: if your AG spans datacenters, make sure that each datacenter has an even number of quorum votes in the cluster and that the cluster has a file share witness configured at a third, separate site. Basically, a typical Windows cluster needs a majority of voting cluster nodes online to maintain normal cluster operations. If the votes are evenly spread between two datacenters and you have a witness at a third site breaking the tie, losing just one datacenter will not bring your cluster down. While you can force start a cluster after it loses quorum, that is going to create extra, manual work for you both you and your server administrators and that work will be done under the utmost pressure because your database will be inaccessible to the application via the AG listener even though the servers themselves may be healthy. Trust me when I say it’s better to learn this lesson reading a blog than being woken up in the middle of the night with somebody yelling “we’re down!”
Tip #2 Read Only Routing Lists
An important, but often overlooked tip is to configure a read-only routing list for your AG listener. I typically do this as part of the initial build-out of the AG because, without this, you lose one of the coolest features of AlwaysOn availability groups: the ability to route read-only connections to replicas. That feature can free up the primary to provide much more stable performance to the application(s) it serves. It also greatly decreases the possibility that your phone rings with somebody complaining about read-only queries slowing down (or worse yet, blocking) read/write queries.
Tip #3 Copy User Logins
This tip continues our trip up the stack and (hopefully) solves a problem that is easy to solve but can be very difficult to figure out under pressure. Many of the applications I’ve worked on over the years had a variety of database logins (using both Windows authentication and SQL Server authentication). While this is not ideal, there are often technical and/or business reasons why there are multiple logins (using both types of authentication) that need access to the database. In an AlwaysOn AG, particularly one where you have added replicas after the initial establishment of the AG, the replica addition may not have included setting up all the logins the application needs. In that scenario the failover will appear to have worked fine and led to happy application admins and DBAs…until components of the application begin malfunctioning because they don’t have access to the replica (now primary) database. Trying to fix this under a lot of pressure may lead you on a wild goose chase or three (as the application errors or complaining users may not always point you in the right direction), but if you follow the steps found here it will walk you through the process of synchronizing logins between two servers. This process should be one of the first things you do when setting up a new AG replica instance unless you really and truly enjoy off-hours phone calls about broken applications!
Tip #4 Regular Transaction Log Backups
The last two tips are more focused on keeping your AG happy, healthy, and running smoothly which should help keep you, the DBA, happy and healthy as well. Tip #4 is a general SQL Server maintenance best practice but can take on a bit of added importance when supporting AlwaysOn availability groups. Make sure you have a transaction log backup strategy in place and being actively monitored on all databases, but especially the databases participating in the AG. While the discussion of whether or not you should perform these backups on your primary or replica(s) (AlwaysOn gives you the option to choose the location of these backups) is an in-depth topic best left for another blog entry, regular transaction log backups help make maintenance and recovery of an AG-participating database much easier. While shrinking a transaction log is rarely, if ever, recommended, it’s a much more complicated operation in an AlwaysOn AG. Beyond that, a transaction log whose size is out of control can make database recovery a much longer process simply because of the size of the files you are moving/restoring – which means you are spending many more hours with your phone ringing off the hook with unhappy users. That is something that we’d all like to avoid.
Tip #5 Stagger Index Maintenance
My final tip, #5, should keep your network administrators off your back and your AGs as synchronized as possible. It’s also a fairly simple thing to do. If you have a large database within your AG, make sure you stagger the index maintenance that is performed during your regular index maintenance. Whether you rely on a maintenance plan within SQL Server, homebuilt index maintenance scripts, or one of the many solid index maintenance solutions available within the SQL Server community, ensure that the transaction log traffic your index maintenance generates (which can be surprisingly large) is not causing your remote replicas to fall behind your primary database. If your index maintenance is causing those remote replicas to fall behind by seconds or, worse yet, minutes, simply stagger your index maintenance jobs throughout the off-hours or at least make sure the index maintenance on your large tables is spread throughout the time allotted to you for index maintenance. This means that not only are the users who are using the replicas for ad-hoc querying and/or reports likely receiving up-to-the-second data but, in the event of a failover, your exposure to data loss is as minimal as your network and server infrastructure will allow.
I hope these five tips do indeed help keep your AlwaysOn availability groups, application administrators, and users happy. That should mean some happier DBAs as well! I look forward to taking deeper dives into a couple of these topics here on the blog, but if you enjoyed this post I encourage you to sign up for my December 1 webinar here. It will go deeper into some of these topics and I’ll be able to take questions as well.