
Breaking Up with On-Call

Alexey Karandashev
Edited on March 14th, 2025: expanded the "AI" section

Introduction
#

This article is about why on-call, in its current state in big tech, is flawed - and about how to develop software properly. Surprisingly, in my experience, smaller companies get this right, whereas big corporations tend to converge on a hand-holding solution.

tl;dr
#

  • Startups cannot afford to have engineers babysit software; big tech can.
  • Wrong incentives in big tech create a culture of unreliable software development, which leads to a decline in software quality and, eventually, to "temporary" on-calls that run indefinitely.
  • On-calls are here to stay, though not in their current form. "AI" and ML are already changing the scope of on-call and the responsibility of the engineer in the loop.
  • Contact me for a consultation.

Guard tower
A guard tower, via Pinterest

During my military service, one of the things I disliked the most was guard duty: sitting in a tower overlooking an empty field, being jolted awake from an unintentional doze by sneezing cows.

I disliked it, and so did every one of my peers. However, we all agreed that it was a necessary task. We were the first responders, there to alert our friends and colleagues in case of an intrusion - a fatal responsibility.

I finished my military service and joined the commercial IT workforce. I truly believed that I left the emergency response obligations behind. But lo and behold, I encountered a plethora of tasks, called on-calls, masquerading as fatally important duties - usually established in order to reassure the customer while the technology or software service broke down.

This on-call manifested itself in different ways: a pager app, a ticketing system, and emails.

Startups
#

In one of the startups where I worked as a software engineer, the on-call was during work hours. An assignee would answer customer questions, follow up on previous conversations with the customer, and file a bug internally if necessary. This duty was created due to limited startup resources - poor managerial judgement, in my opinion - and it did affect the morale of the engineers. Nevertheless, at no point was it about holding the software by the hand: asking the customer to restart services, or doing it remotely ourselves - true cloud computing.

IT Crowd
Have you? via The IT Crowd©

You would imagine that big, leading tech companies would possess the resources to solve, or avoid, the issue altogether. Instead, they chose to scale up the scope of on-call duty, to normalize it and entrench it in the software engineering culture, inside and outside the company.

FAANG, MAANG, SCHMANG
#

My experience
#

If I could define my experience in big tech companies with one word, I would choose friction. Filing for a new feature implementation would require thorough documentation - rightfully so - followed by a political campaign to convince the party of principal engineers and managers to accept the new feature. These stakeholders carry incentives and principles of their own - ones that do not always align with the true engineering spirit of solving the problem, nor with satisfying the customer.

The same friction applied to fixing a bug or a flawed process. I would reproduce the bug: spin up the entire environment, the appropriate binary artifact, and the reproduced application state; create the test cases; and pinpoint the exact problem for the stakeholders, as well as present possible solutions. I would then get sent to a dozen meetings, bouncing my ideas back and forth until receiving the dire verdict - a rejection to fix the bug altogether.

For a flawed process, it would usually start with writing a document and finding agreement with a small percentage of the developers and their managers, followed by the rest of the engineers either ignoring or rejecting the new process outright. In some cases, I had to resort to developing a script in Jenkins that blocked change requests that did not adhere to the agreement, citing and linking to it - a different kind of working backwards, I guess. But then again, some teams would leverage politics to unblock their change requests, at the cost of the progress of our entire organization and its code base.
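A minimal sketch of what such a gate can look like, assuming a hypothetical check run from CI that rejects change requests whose description does not link to the agreed process (the agreement URL, the "emergency" escape hatch, and the message format below are illustrative, not the actual Jenkins script):

```python
import re

# Hypothetical link to the team's process agreement; replace with yours.
AGREEMENT_URL = "https://wiki.example.com/team/process-agreement"

def check_change_request(description: str) -> tuple[bool, str]:
    """Return (accepted, message) for a change-request description."""
    if AGREEMENT_URL in description:
        return True, "OK"
    if re.search(r"\bemergency\b", description, re.IGNORECASE):
        # A labeled escape hatch, rather than silently blocking everyone.
        return True, "Emergency override accepted"
    return False, f"Blocked: change request must link to {AGREEMENT_URL}"

# Example: a request with no link to the agreement gets blocked.
ok, message = check_change_request("Quick fix for the flaky test")
```

The point of returning the message alongside the verdict is exactly the "citing and linking" part: the rejected engineer sees the agreement, not just a red X.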

Boy scouts
Leave the code better than you found it. Photo by William ‘Bud’ Hardy / Neuse News

The promotion incentive
#

I think on-call in big tech is the product of the incentives. These incentives do not include:

  • Timely maintenance and improvement of the project.
  • Nurturing a culture of perpetual ownership.

When I worked for big tech companies, the trend was that people would join a project for a year, introduce the design for a new feature or improvement, (half) finish it, and jump ship to the next project. That was not the case in startups: code I had written in Java years earlier would still follow me while I worked on my new responsibilities as a C++ developer. Luckily, the software I wrote was robust, so that was a rare occasion.

There’s no incentive in big tech to write software with no bugs. It sounds far-fetched and arrogant, but it is possible. Most of the software I wrote requires almost no intervention or fix-ups, unless of course the requirements have changed. It comes at a cost: a boy-scout mentality (or drive-by fixes) and a proactive approach are necessary. In the long term, this approach pays dividends.

Leave the code in a better shape than you found it.
#

While working on one change, I note down the parts that will need improvement in the future. With this approach, I have in many instances caught bugs and vulnerabilities ahead of the customer and my colleagues. I will cover more of my procedure for developing robust software in a different post.

The management’s silver bullet
#

Let’s look at the bigger picture: software engineers follow orders that originate somewhere, right? Throughout my career at various startups, there was no hierarchy between employees and leadership. We practically worked with the tech lead and the CTO. The objective was obvious: make life as simple as possible for the customer and for us, the developers. I would strive to write code that would make myself redundant - the next day, a new employee could take over my chair and continue from the same ongoing commit with minimal effort. The management’s incentives in these small companies were aligned as well: motivate the developers to deliver fully baked code.

In big tech, I felt the story was different. Hopeless managers were crushed by the sheer number of issues and feature requests, in addition to the social concerns of their employees: burnout, family and health issues, as well as promotion requests. Some managers relayed the pressure from upper management down to the employees, making things worse. Better managers acted as a buffer - a barrier - pushed back when the team’s capacity was full, and reallocated team members to different projects when necessary.

Both good and bad managers in big tech succumbed to running around putting out fires. Managers in big tech are measured by the number of targets they hit on time, as well as the number of initiatives they start. Delivering a feature is more valuable than fixing a technical problem in the project, as it is directly measurable. Creating an on-call process to manually inspect errors in test suites is more valuable than making the project more reliable, because you can directly measure the number of tests that failed each week. It is measurable and presentable to upper management.

Baked cake
Fully baked. kingarthurbaking.com©

AI
#

We are promised daily that AI is just around the corner, that our work as software engineers is obsolete, and that AI will write code for us as well as fix all our technical debt. Google mentioned that 25% of all new code in the company is generated by AI, then reviewed and accepted by engineers(!). The paragraph in Google’s article is immediately followed by the product they want to sell - Gemini. How much of this code is below-average boilerplate? How much have Google engineers intervened in the final generated code? What is the impact of the tiny generated chunks of code on the entire system you develop? Luckily, you can test Gemini yourself. In my experience, LLMs for coding act as a rubber duck, or as a strong motivational tool to actually read the documentation of software I’m not experienced in, after constantly being led astray by the so-called intelligence.

But what does it have to do with on-call ref?

AI, what is it actually good for
#

LLMs are very capable at translation - the task they were originally designed for, via the attention mechanism. There are great uses for these systems that exist today:

  • Translating back and forth between English and SQL queries.
  • Simplifying logs and errors into human-readable language.
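As a sketch of the second use case: wrap the raw error in a prompt and hand it to whatever completion function your LLM provider exposes. The prompt wording and the `your_llm_client.complete` call below are illustrative placeholders, not a specific vendor API:

```python
def build_log_summary_prompt(raw_log: str, max_chars: int = 4000) -> str:
    """Build a prompt asking an LLM to restate a raw log in plain English.

    Truncates the log so oversized stack traces fit in the context window.
    """
    return (
        "Explain the following error log to an on-call engineer in two "
        "plain-English sentences, naming the failing component and the "
        "most likely cause:\n\n" + raw_log[:max_chars]
    )

# The call itself depends on your provider; conceptually:
#   summary = your_llm_client.complete(build_log_summary_prompt(log_text))
```

Keeping the prompt construction in one tested function, separate from the vendor call, makes it easy to swap providers later.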

Moreover, for a more robust solution, you can use the embedding models themselves - they stand at the heart of LLMs. They can be used to search for identical stack traces and/or issues. A very simplified example of a solution that would save you time searching for identical customer-reported issues would look the following way:

(The footnotes unfortunately could not be applied to the graph. You can find them below¹ ²)

flowchart TD
    A[Tokenize and embed the title and content of your tickets/issues/bug reports] --> B[Store the embedding vectors in a vector database¹]
    B -->|New ticket arises| C[Tokenize and embed it using the same model]
    C --> D[Search for similar issues, using the vector distance function²]
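The flow above can be sketched end to end in a few dozen lines. To keep this runnable without any model or database, the "embedding" below is a stand-in hashed bag-of-words vector, and the "vector database" is a plain in-memory list - in a real system you would substitute a sentence-embedding model and pgvector/Annoy, while the surrounding flow stays the same:

```python
import hashlib
import math

DIM = 256  # dimensionality of the toy embedding space

def embed(text: str) -> list[float]:
    """Stand-in embedding: hash each token into a bucket, then normalize.

    A real system would call a sentence-embedding model here instead.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

class TicketIndex:
    """In-memory stand-in for a vector database (pgvector, Annoy, ...)."""

    def __init__(self) -> None:
        self.tickets: list[tuple[str, list[float]]] = []

    def add(self, title: str, body: str) -> None:
        self.tickets.append((title, embed(title + " " + body)))

    def similar(self, title: str, body: str, threshold: float = 0.3):
        """Return (score, title) pairs above the cutoff, best first."""
        query = embed(title + " " + body)
        hits = [(cosine(query, vec), t) for t, vec in self.tickets]
        return sorted((h for h in hits if h[0] >= threshold), reverse=True)

index = TicketIndex()
index.add("Service crash on startup", "NullPointerException in ConfigLoader")
index.add("Billing report is slow", "Monthly report takes 4 hours")

# A new customer report should surface the existing crash ticket.
matches = index.similar("Crash at startup",
                        "NullPointerException when loading config")
```

The `threshold` parameter is the cutoff distance discussed in footnote 2: without it, every stored ticket would come back, just with lower scores.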

But what does it have to do with on-call ref?

Implications
#

Well, these tools automate the mundane tasks of an on-call engineer: searching for issues related to a customer report, tracking related software (or hardware) crashes, verifying whether an issue that arose during on-call is a regression or a known bug, and so on. You could even go a step further and automate assignment to the responsible engineer - the author or recent modifier of the code that might have caused the regression (hit me up for a solution for your business).
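One hedged sketch of that routing step: map the files mentioned in a crash's stack trace to whoever touched them most recently. The authorship table is hard-coded here for illustration - in practice you would populate it from your VCS (e.g. the author of the last commit touching each file), and the frame pattern below assumes Python-style tracebacks:

```python
import re

# Illustrative stand-in for data you would pull from `git log`/`git blame`:
# file path -> most recent modifier.
RECENT_AUTHORS = {
    "billing/invoice.py": "dana",
    "auth/session.py": "lee",
}

# Matches Python-style stack frames: File "path", line N
STACK_FRAME = re.compile(r'File "([^"]+)", line \d+')

def suggest_assignee(stack_trace: str, fallback: str = "on-call") -> str:
    """Assign the ticket to the most recent modifier of the first frame we
    recognize; otherwise fall back to the on-call rotation."""
    for path in STACK_FRAME.findall(stack_trace):
        if path in RECENT_AUTHORS:
            return RECENT_AUTHORS[path]
    return fallback

trace = ('Traceback (most recent call last):\n'
         '  File "billing/invoice.py", line 88, in total\n'
         'ZeroDivisionError: division by zero')
```

The fallback matters: when no frame maps to a known author, the ticket should land on the rotation rather than disappear.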

The exception
#

I understand that in some cases on-call is unavoidable, especially if your product is bound to an SLA with a very tight margin of error in quality and availability. However, you must nurture an engineering culture where on-call is the exception, not the norm. Your engineers will join the fight for the cause, as it will free their time (and your money) to concentrate on meaningful tasks and will make your customers happier.

Contact me
#

If you’re interested in improving your company’s engineering structure or exploring options for automating your on-call workflow, I’m here to help - just contact me for a consultation.


  1. Use pgvector, or approximate nearest-neighbor libraries like Annoy.↩︎

  2. Many tickets might appear relevant to the new one. You need to decide on the threshold (the distance) at which you cut off results. Consider a re-ranking method to get the issues most relevant to the queried one. ↩︎