I have been involved in search and rescue for more than three years. In that time, I never thought that what I do during a search, and the techniques we use, could be directly related to my day job and how we troubleshoot a database issue. As I sat in a classroom at the Sheriff’s department recently, I started to see the correlation between search methods and troubleshooting methods, and it made me think: if I applied these same techniques to troubleshooting, I could organize and plan my troubleshooting better and possibly get to a solution faster.
In Search and Rescue (SAR) we learn that there are four types of searching: Hasty, Area, Grid and Evidence (I won’t get into that last one). So you might wonder how we can apply these techniques to our troubleshooting. As I listened to our highly trained SAR instructors speak, I couldn’t help but see the similarities.
The Hasty Search
The objective of a hasty search is to get into the field as quickly as possible and search the high-probability areas where a subject might be injured or lost. This is exactly what we as DBAs do when we encounter an issue or get a call from a frantic end user. First, what is the error and what normally causes it? We have a mental list that we begin to review and check off. What was the issue the last time this happened? If it is a space issue, let’s look at the drives. If it is a locking or waiting issue, let’s query the dynamic management views or check the monitoring software. If we don’t find the issue quickly, we regroup and rethink what could possibly cause an issue like the one we have. In other words, we debrief our Incident Commander, relay our findings or lack of findings, and wait for further instructions. Unfortunately, in our line of work, we fulfill every role in the search, including that of Incident Commander. We slowly begin to narrow down a new search area, whether it be memory, space, network, etc. At this point, we begin to create a more focused approach.
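To make the mental checklist concrete, here is a rough Python sketch of a hasty search: run the quick checks in order of how likely each one is to be the culprit, and stop at the first finding. Everything in it is an assumption for illustration: the check names, the likelihood weights, the thresholds and the `snapshot` dictionary are all made up, and in real life each check would look at the drives, the dynamic management views or your monitoring software.

```python
from typing import Callable, List, Optional, Tuple

# A check looks at a snapshot of server facts and returns a finding, or None.
Check = Callable[[dict], Optional[str]]


def check_disk_space(snapshot: dict) -> Optional[str]:
    # Hypothetical threshold: less than 10% free counts as a space issue.
    if snapshot.get("disk_free_pct", 100) < 10:
        return "low disk space"
    return None


def check_blocking(snapshot: dict) -> Optional[str]:
    # Hypothetical fact: number of sessions currently blocked by another session.
    if snapshot.get("blocked_sessions", 0) > 0:
        return "blocking detected"
    return None


def check_memory(snapshot: dict) -> Optional[str]:
    # Hypothetical threshold: less than 100 MB free counts as memory pressure.
    if snapshot.get("memory_free_mb", 1024) < 100:
        return "memory pressure"
    return None


# (likelihood, check) pairs. The weights are guesses based on "what was it last time?"
HASTY_CHECKS: List[Tuple[float, Check]] = [
    (0.2, check_memory),
    (0.5, check_disk_space),
    (0.3, check_blocking),
]


def hasty_search(
    snapshot: dict, checks: List[Tuple[float, Check]] = HASTY_CHECKS
) -> Tuple[Optional[str], List[str]]:
    """Run checks from most to least likely; stop at the first finding.

    Returns the finding (or None) and the names of the checks that came up empty,
    which is what you would relay back to the Incident Commander.
    """
    cleared: List[str] = []
    for _, check in sorted(checks, key=lambda pair: pair[0], reverse=True):
        finding = check(snapshot)
        if finding is not None:
            return finding, cleared
        cleared.append(check.__name__)
    return None, cleared
```

Calling `hasty_search({"disk_free_pct": 40, "blocked_sessions": 3})` checks disk space first, clears it, and then stops at the blocking check. If every check comes back empty, the list of cleared checks is the starting point for deciding where to search next.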
The Area Search
The area search covers a smaller, bounded section of the entire search area, which a group of 2-4 searchers will walk fairly rapidly in hopes of finding the subject. In other words, it is a smaller, more defined search area. Based on the information gathered from the hasty search, the planners are able to make educated decisions on where a subject *might* be located. Whether they are right or not, you still need to search the area just to ensure that the subject is not there. With troubleshooting, it is exactly the same. Based on the results from the quick hasty search, you make an educated guess on what to do next. It will entail diving deeper into a certain area of the database to determine whether or not the issue resides in that area. If the hasty search returned hints that there might be locking issues, you might want to focus on the indexes or on reducing blocking. Perhaps just a tweak to an index could fix the problem. If the search did not produce anything, we try to come up with a Probability of Detection (POD); in other words, how well do we think we covered that area of the search? This will help decide if we need to go back and look in this area again. In our case, creating a POD might be overkill, but it is important to note how well you covered an area in case you need to come back to it.
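If you do want to keep a rough number, one common way to combine several passes over the same area is a cumulative POD: the chance that at least one pass would have found the problem. Below is a minimal Python sketch, assuming each pass is independent of the others (a simplification) and that the per-pass PODs are your own honest guesses. The area names, the values and the `areas_to_revisit` helper are hypothetical.

```python
from math import prod
from typing import Dict, List


def cumulative_pod(pass_pods: List[float]) -> float:
    """Combine per-pass PODs for one area, assuming the passes are independent.

    The chance that every pass missed is the product of (1 - POD) for each pass,
    so the cumulative POD is one minus that product. No passes means a POD of 0.
    """
    for pod in pass_pods:
        if not 0.0 <= pod <= 1.0:
            raise ValueError(f"POD must be between 0 and 1, got {pod}")
    return 1.0 - prod(1.0 - pod for pod in pass_pods)


def areas_to_revisit(coverage_log: Dict[str, List[float]], target: float) -> List[str]:
    """Return the areas whose cumulative POD is still below the target, worst first."""
    below = [
        (cumulative_pod(pods), area)
        for area, pods in coverage_log.items()
        if cumulative_pod(pods) < target
    ]
    return [area for _, area in sorted(below)]


# Hypothetical troubleshooting notes: two looks at the indexes, one quick look at blocking.
coverage_log = {"indexes": [0.5, 0.6], "blocking": [0.3]}
```

With these made-up numbers, the two index passes combine to a higher POD than either pass alone, while the single quick look at blocking stays low, so `areas_to_revisit(coverage_log, target=0.75)` points back at blocking.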
The Grid Search
A Grid Search is exactly what it sounds like. Searchers get in a line, and what they are looking for determines the spacing between them. They then make sweeps through a defined area to get the POD up to the point where we can say the subject is not in there. This applies the same way to troubleshooting databases, only at a very granular level. For instance, you discover that you have PAGEIOLATCH issues. With the problem narrowed down that far, you can now focus on the details: the OS type (64-bit vs. 32-bit), whether you need data compression, or maybe the speed of the disks. In any case, you are able to focus on a smaller piece of the puzzle. The grid search allows you to take a much more precise look at a specific issue.
So the similarities are there. If we can prioritize our troubleshooting the same way a Search and Rescue group does, perhaps we could improve our organization and reduce the time it takes to resolve issues.