A Most Intense Debugging Experience

The Setup

I’d been contracting with the company for 3 weeks.

My onboarding was a bit rough. I was tasked with developing an API middleware to help bridge the gap between the company’s legacy system and a newer medical billing platform. The principal engineer leading the project had set me up with a separate barebones git repository to work in, which didn’t have any of the actual product’s code. We paired and reviewed for a couple of sessions initially, but he became busy with other work and I was left to my own devices.

With this time, I made some headway on the API setup, but with minimal implementation details and no working knowledge of the product’s codebase, there wasn’t a whole lot more I could do.

I began exploring other repositories and studied the primary product’s codebase. I took it upon myself to attempt to spin it up, and I ran into numerous issues along the way, mainly due to outdated documentation and system incompatibilities. With some time and effort, I was able to get it running, but I was still missing a lot of the details.

All this to say that I was left with a lot of questions, and there wasn’t much communication from the lead engineer.

The Call

One day, while I was plugging along, exploring the codebase and learning about the product, I received a Zoom call from the product manager. He mentioned that they were having some issues with the app and wondered if I could help debug it. The lead engineer was out of the office, so they were checking in with me to see if I could help.

“Sure!” I said. I like to help out, and this seemed like a good opportunity to get to know the product better.

At first, I thought it might be a simple issue, nothing too serious. I thought perhaps they were giving me a chance to learn more as well as to understand how I approach problems.

I quickly realized that this was not the case.

Soon, the Chief Technology Officer joined the call. Then the Chief Operating Officer. Then others… people I hadn’t really met yet.

I learned more about the issue: users were unable to sign in.

At that point in time, I wasn’t too familiar with the company’s user base. I didn’t know how many users were struggling to sign in, and I didn’t know how many companies or businesses the issue impacted… and honestly, I didn’t want to know.

A useful strategy for remaining calm and collected in these situations is this: do not let the scale of the situation get to you. Whether the bug is affecting one customer, one thousand, or one million, the root cause is the same. The complexity of the fix does not increase with the number of users affected! This should be good news.

Spoiler: I was able to help them debug the issue, and we were able to get the users back up and running… but wow, what an experience!

The Debugging Process

To debug, I was told where the login page was located in the repository, so I could start there.

I was guided to a line in the mid-500s where the product manager thought the issue might be occurring.

I began by adding some server-side logs to see if I could get any more information. I checked out a hotfix branch, committed the log statements and pushed, and the PM was able to deploy this branch to a specific environment for testing. Thank goodness he was able to do that, because I was so fresh in the system that there was no way I’d be deploying anything!
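Those first diagnostic logs were nothing fancy. Here’s a minimal sketch of what they might have looked like, assuming a Python backend; the handler name, fields, and lookup function are hypothetical, since the actual codebase isn’t shown here:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("login-debug")

def handle_login(username, db_lookup):
    # Hypothetical login handler: log each step so the deployed
    # environment reveals how far execution actually gets.
    log.info("login attempt received for user=%s", username)
    record = db_lookup(username)
    log.info("db lookup returned: %r", record)
    return record is not None
```

The point of logging the raw return value (`%r`) is to expose surprises like an empty or `None` result, rather than only logging the happy path.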

Learning Opportunity #1: Awareness. When onboarding engineers, review the app from top to bottom. Keep it to a high level if necessary, but always cover all the points from local setup through deployment and hosting. There’s no need to hide facets of an application from your team! Sharing abundantly could save you in the future. If I understood how this application was normally deployed, I could have more quickly understood the fastest path to debugging it.

After deploying, the CTO was able to see the logs on production (which I didn’t have access to yet), and would paste the log results into our Zoom chat.

Learning Opportunity #2: Transparency. Every engineer on your team should have immediate access to the logging system, and it should be easy to find.

I realized we weren’t seeing anything meaningful, and most of the logs weren’t even showing up, so the problem must be higher up in the file.

I added more logs starting at the very beginning of the file, and we deployed again.

There! Data. We started to see some details.

The login page itself was lacking in error handling, and there were several sequential SQL queries made to the database when the page initially loaded and again when users attempted to log in.

One of the log results after a query showed us that the query was returning an empty result set for a specific field… this empty set wasn’t accounted for via error handling, and an empty value broke everything else down the line!
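The root-cause pattern, an unguarded empty result set, can be sketched like this. The query shape and the `plan` field are invented for illustration; the original code presumably indexed into the result without checking it first:

```python
def get_billing_plan(rows):
    # rows: the result set of a hypothetical SQL query,
    # e.g. a list of dicts, one per returned row.
    if not rows:
        # Guarding the empty case turns a cascading failure deep
        # in the stack into one handled, clearly-reported error.
        raise ValueError("no billing plan found for account")
    return rows[0]["plan"]
```

Without that guard, the empty result flows onward as a missing value and breaks something far from the actual cause, which is exactly why the early log statements showed nothing useful.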

The COO was able to pinpoint the specific data issue in the database, and we were able to fix it.

Voila!

Learning Opportunity #3: Test, test, test… automatically. Proper error handling and automated tests would have caught this error long before it became a production issue. Defensive programming is a must… always expect the worst-case scenario, and account for it in code. Then, automate some tests to replicate those scenarios and ensure they pass regularly.
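As a sketch of what such tests might look like, here’s a defensive accessor with tests that replicate the worst-case scenarios from this outage. The function and field names are hypothetical, and the tests use plain assertions rather than any particular framework:

```python
def fetch_required_field(rows, field):
    # Defensive accessor: returns None instead of crashing
    # when the result set is empty or the field is missing.
    if not rows or field not in rows[0]:
        return None
    return rows[0][field]

# Tests that replicate the worst-case scenarios:
def test_empty_result_set():
    assert fetch_required_field([], "plan") is None

def test_missing_field():
    assert fetch_required_field([{"other": 1}], "plan") is None

def test_happy_path():
    assert fetch_required_field([{"plan": "pro"}], "plan") == "pro"
```

Run regularly in CI, tests like these would have flagged the empty-result case long before a user ever hit it.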

Overall, this debugging process took about an hour. It was intense, with about a dozen people on the call at one point… but my blinders were on. I was focused on the code, getting answers, and finding a fix.

The Aftermath

After the fix, I was able to get a better understanding of the codebase, and I learned more about the actual impact of the outage that I helped troubleshoot. Apparently, this issue was a major problem, and the fix was met with relief. The company was grateful for my help.

I found out the next day why the lead engineer had been out of the office: he had quit, without notice.

The next chapter is to be written…