Live Site Issues - Root Cause

Root Causes of Live Site Issues

First, thank you for taking the time to complete this survey, which should take approximately 5 minutes.

Here on the tools team at Microsoft, we are researching how developers like you tackle live site issues. That is, issues that crop up and impact your live service, be it a bug or an outage: anything that wasn't caught in earlier testing. For this survey, we'd like you to think about one service you've worked on in the last year and answer the questions in the context of that service.

Microsoft respects your privacy. To learn more, please read our privacy statement.

1. Think about all the live site incidents that have occurred on a service you've worked on in the last year. Could you please rate how common the following probable root causes were?

We're interested in learning which components are the source of issues, regardless of whether the issue occurred because of a change in the service (e.g. a new deployment), a change in the environment, or any other reason.
Extremely unlikely
Somewhat unlikely
Neither likely nor unlikely
Somewhat likely
Extremely likely
Network - Issues that were caused by a problem in the network stack. For example: a routing rule was incorrect and misdirected traffic, a DNS server was misconfigured, or a firewall was blocking traffic.
Physical - Issues that were caused by a problem with the physical infrastructure. For example: misconfigured hardware, a power outage, or the destruction or damage of resources the service depended on.
Database - Issues that were caused by how the service interacts with a database. For example: data in the production database differed from the data in testing environments, which led to an unforeseen condition (e.g. a null record that resulted in an exception in the service), or a configuration on the database server caused an issue (e.g. the limit for connections was too low).
Capacity - Issues that were caused by your service running out of compute resources. For example: unexpected load that exceeded its capacity (CPU/Memory/Network) and was not corrected by autoscaling.
Software Updates - Issues that were caused by an update to components the service depends on. For example: a patch to the OS that was incompatible with your service, an update to a 3rd party package (npm, NuGet, gem, etc.) that contained a breaking change, or an update to a runtime (e.g. Java or .NET) that changed the behaviour of your service.
Environment Configuration - Issues that were caused by the service running in an environment that wasn't configured correctly. For example: a connection string or environment variable that was incorrect, or configuration settings for components such as web servers being wrong.
Broken Service Dependencies - Issues that were caused by dependencies that your service relies on failing or returning unexpected results. For example: a REST or web service returning empty results instead of expected values, or a REST/Web service that your service depends on being unavailable.
Orchestration Issues - Issues caused by how the service is managed/orchestrated when deployed. For example: issues caused by incorrect Docker Compose files or by bugs in orchestrators like Kubernetes.
App Code - Issues caused by bugs in app code, outside of issues caused by interacting with databases or other services (which are covered in the categories above).
2. Thinking about the last 12 months, what would you say was the most painful live site issue that you worked on? And what made it so hard?
3. In the last 12 months, approximately how many live site incidents have impacted the service?
0
1 - 5
6 - 15
16 - 50
51+
4. Approximately how many consumers (users, devices, etc.) are there of the service?
< 1,000
1,000 - 10,000
10,000 - 100,000
100,000 - 1,000,000
> 1,000,000
5. Where is the production service deployed? (Check all that apply)
On Premises
Amazon Web Services (AWS)
Google Cloud Platform (GCP)
Microsoft Azure
Other
Serverless architecture (e.g. Azure Functions, GCP Cloud Functions, AWS Lambda)
Microservice architecture (e.g. Azure Kubernetes, Google Kubernetes Engine, AWS Elastic Container Service for Kubernetes)
Platform as a Service architecture (e.g. Azure App Service, GCP App Engine, AWS Elastic Beanstalk)
Infrastructure as a Service architecture (e.g. Azure VM, GCP Compute Engine, AWS EC2)
6. What languages and frameworks is the backend of the service built with?
7. On which platform(s) is the backend (server) of the service hosted?
8. Which persistent storage technologies does the service use?
9. Which tools, if any, do you use to monitor/diagnose issues with your live site? (Check all that apply)
10. Our engineering team is interested in speaking with customers about how they diagnose live site issues. If you would be interested in technical follow-up conversations with our project engineering team, please leave your email below (optional):