Live Site Issues - Root Cause Survey

1. Think about all the live site incidents that have occurred on a service you've worked on in the last year. Could you please rate how common the following probable root causes were?

We're interested in learning which components are the source of issues, regardless of whether the issue occurred because of a change in the service (e.g. a new deployment), a change in the environment, or any other reason.

Extremely unlikely

Somewhat unlikely

Neither likely nor unlikely

Somewhat likely

Extremely likely

Network - Issues that were caused by a problem in the network stack. For example: a routing rule was incorrect and misdirected traffic, a DNS server was misconfigured, or a firewall was blocking traffic.

Extremely unlikely

Somewhat unlikely

Neither likely nor unlikely

Somewhat likely

Extremely likely

Physical - Issues that were caused by a problem with the physical infrastructure. For example: misconfigured hardware, a power outage, or the destruction of or damage to resources the service depended on.

Extremely unlikely

Somewhat unlikely

Neither likely nor unlikely

Somewhat likely

Extremely likely

Database - Issues that were caused by how the service interacts with a database. For example: data in the production database differed from the data in testing environments, which led to an unforeseen condition (e.g. a null record that resulted in an exception in the service), or a configuration setting on the database server caused an issue (e.g. the connection limit was too low).

Extremely unlikely

Somewhat unlikely

Neither likely nor unlikely

Somewhat likely

Extremely likely

Capacity - Issues that were caused by your service running out of compute resources. For example: unexpected load that exceeded its capacity (CPU/memory/network) and was not corrected by autoscaling.

Extremely unlikely

Somewhat unlikely

Neither likely nor unlikely

Somewhat likely

Extremely likely

Software Updates - Issues that were caused by an update to components the service depends on. For example: a patch to the OS that was incompatible with your service, an update to a third-party package (npm, NuGet, gem, etc.) that contained a breaking change, or an update to a runtime (e.g. Java or .NET) that changed the behaviour of your service.

Extremely unlikely

Somewhat unlikely

Neither likely nor unlikely

Somewhat likely

Extremely likely

Environment Configuration - Issues that were caused by the service running in an environment that wasn't configured correctly. For example: an incorrect connection string or environment variable, or wrong configuration settings for components such as web servers.

Extremely unlikely

Somewhat unlikely

Neither likely nor unlikely

Somewhat likely

Extremely likely

Broken Service Dependencies - Issues that were caused by dependencies that your service relies on failing or returning unexpected results. For example: a REST or web service returning empty results instead of expected values, or a REST or web service that your service depends on being unavailable.

Extremely unlikely

Somewhat unlikely

Neither likely nor unlikely

Somewhat likely

Extremely likely

Orchestration Issues - Issues caused by how the service is managed/orchestrated when deployed. For example: incorrect Docker Compose files or bugs in orchestrators such as Kubernetes.

Extremely unlikely

Somewhat unlikely

Neither likely nor unlikely

Somewhat likely

Extremely likely

App Code - Issues caused by bugs in app code, outside of issues caused by interacting with databases or other services (which are covered in the categories above).

Extremely unlikely

Somewhat unlikely

Neither likely nor unlikely

Somewhat likely

Extremely likely

Other (please specify)

2. Thinking about the last 12 months, what would you say was the most painful live site issue that you worked on? What made it so hard?

4. Approximately how many consumers (users, devices, etc.) does the service have?

7. On which platform(s) is the backend (server) of the service hosted?

Windows

Linux/Unix

Other (please specify)

8. Which persistent storage technologies does the service use?

None (The service has no persistent storage)

SQL-as-a-service offerings (e.g. Azure SQL Database, GCP Cloud SQL, AWS RDS)

SQL (e.g. Oracle, PostgreSQL, MS SQL)

NoSQL-as-a-service offerings (e.g. Azure Cosmos DB, Google Firebase, AWS DynamoDB)

NoSQL (e.g. Apache Cassandra, MongoDB)

Big-data-as-a-service offerings (e.g. Azure Data Lake, GCP Bigtable, AWS EMR)

'File' storage in the cloud (e.g. Azure Blob Storage, Google Filestore, AWS S3)

Other (please specify)

10. Our engineering team is interested in speaking with customers about how they diagnose live site issues. If you would be interested in technical follow-up conversations with our project engineering team, please leave your email address below (optional):

Email Address
