Bridging the Gap Between Business Analytics & DevOps

Shaun Chaudhary
The BetterCloud Tech Blog
Oct 29, 2018

Data has become indispensable to arming employees with the key metrics they need to make informed decisions. This operational data is produced every day, from the individual Salesforce activities logged by sales reps to the actions customers take within a product. Analytics teams need secure access to data across the entire organization, but this is easier said than done. The value of obtaining this data is clear: an analytics team can provide a measurable ROI in areas such as the following:

  • Customer engagement/renewals (Success)
  • Proofs of concept (Sales)
  • Marketing
  • Product

This data, including customers’ personally identifiable information, is protected by safeguards that prevent data leakage and malicious intrusions. This blog post discusses how the three-year-old analytics team at BetterCloud has worked with our DevOps team to ensure we can access the data we need without sacrificing security.

Access to Production Data

BetterCloud’s DevOps and security teams follow a “least privileged access” policy for our production data. In practice, this meant that the security team often wanted to give the analytics team access only to non-production environments. Like many companies, BetterCloud does not have data dictionaries with detailed schema/column explanations, and the dummy data found in non-production is not very useful. Analytics teams need access to production environments for data exploration, so they can understand what exists in each database/schema and build outputs appropriately.

Take this real-life example: your organization is planning to release a new product with brand-new features and functionality. The CEO asks the analytics team to start collecting data points on adoption and usage, to be delivered one month after the general availability release date.

It’s a fairly straightforward request, but how do we identify the key columns and data points without seeing the actual data? We could set up meetings with the engineers and architects who built the back end, but they are often more knowledgeable about the structure of the data than its actual substance. Dummy data in non-production environments won’t help us either, as it won’t allow us to trace real-life events through the data and see how they are stored or maintained. Data in non-production environments is often very misleading because automated tests and quality assurance teams exercise many edge cases to test the system, and tests are very rarely meant to model real-life user behavior.

For these reasons, access to production data is paramount for analytics teams, but how can we ensure that DevOps/Security will sleep soundly knowing the infrastructure is not being compromised? The guidelines below are the result of a cross-functional effort to create query patterns and rules that govern how this data will be handled in production environments.

Lightweight Exploration

These interactions can be done at any time, with little concern for performance impact on production systems.

  • Query a single datastore row (LIMIT 1)
  • Query a single datastore page (LIMIT < 25)
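
As a rough illustration, the two rules above could be checked mechanically before a query is sent. The sketch below is only an assumption about how such a guard might look: the function name is_lightweight and the constant MAX_LIGHTWEIGHT_ROWS are hypothetical, and the check looks only at a trailing LIMIT clause in SQL text.

```python
import re

# Hypothetical ceiling taken from the "LIMIT < 25" guideline above.
MAX_LIGHTWEIGHT_ROWS = 25

# Matches a trailing "LIMIT n", optionally followed by a semicolon.
_LIMIT_RE = re.compile(r"\bLIMIT\s+(\d+)\s*;?\s*$", re.IGNORECASE)


def is_lightweight(sql: str) -> bool:
    """Return True if the query ends with LIMIT n where 1 <= n < 25."""
    match = _LIMIT_RE.search(sql.strip())
    if match is None:
        # No trailing LIMIT clause: be conservative and reject.
        return False
    limit = int(match.group(1))
    return 1 <= limit < MAX_LIGHTWEIGHT_ROWS


if __name__ == "__main__":
    print(is_lightweight("SELECT * FROM users LIMIT 1"))   # single row
    print(is_lightweight("SELECT * FROM users LIMIT 25"))  # exceeds the page rule
```

This is deliberately conservative: queries using OFFSET or the "LIMIT offset, count" form are rejected, and a LIMIT on an aggregate query does not make that query cheap, so a check like this complements human judgment rather than replacing it.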

Data Integrity Exploration

These interactions should be done with extreme care, since they could have adverse performance implications. These queries should (1) be performed during off-peak hours defined by DevOps and (2) be scheduled with DevOps at an agreed-upon time so that the specific server/cluster can be monitored.

Beware of the following queries:

  • Querying distinct values of a column
  • Querying counts, especially SELECT COUNT(index_column) FROM MY_DB for larger databases
  • Time-based histograms and aggregations
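
One way to make this list actionable is to flag such queries before they run so that they can be routed to a scheduled window. The following is a minimal sketch under stated assumptions: the pattern names and the function heavy_query_reasons are hypothetical, the matching is a coarse keyword check on SQL text, and GROUP BY is used as a stand-in for histograms and aggregations.

```python
import re

# Hypothetical, deliberately coarse patterns for the query shapes listed above.
HEAVY_PATTERNS = {
    "distinct values": re.compile(
        r"\bSELECT\s+DISTINCT\b|\bCOUNT\s*\(\s*DISTINCT\b", re.IGNORECASE
    ),
    "count": re.compile(r"\bCOUNT\s*\(", re.IGNORECASE),
    "aggregation": re.compile(r"\bGROUP\s+BY\b", re.IGNORECASE),
}


def heavy_query_reasons(sql: str) -> list[str]:
    """Return the names of every heavy pattern found in the query text."""
    return [name for name, pattern in HEAVY_PATTERNS.items() if pattern.search(sql)]


if __name__ == "__main__":
    query = "SELECT COUNT(index_column) FROM MY_DB"
    reasons = heavy_query_reasons(query)
    if reasons:
        print("Schedule with DevOps before running:", ", ".join(reasons))
```

A keyword check like this will miss expensive queries that use none of these constructs, so an empty result does not mean a query is safe to run at any time.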

Note: These queries are run from a box inside the environment, not from a local laptop/desktop. This ensures that the data stays within our security perimeter rather than being pulled down to a local computer.

These are excellent rules to follow when interacting with production data, but there are still ways to optimize this workflow utilizing non-production data:

Development of Query

Developing a query inherently involves running similar queries many times in succession. This should NOT be done against live production datastores that are serving customer requests.

  • MySQL: These queries MUST be run against read replicas, or non-production databases.
  • Elasticsearch: These queries MUST be run against non-production clusters.
  • Google Cloud Datastore: Since each query to Google Cloud costs money incrementally, make sure you consider the COGS impact while running these queries.
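
The MySQL and Elasticsearch rules above amount to a small allow-list, which could be encoded as in the sketch below. The target labels ("read_replica", "non_production", "production") and the function dev_query_allowed are hypothetical names chosen for illustration. Google Cloud Datastore is left out on purpose, because its guideline concerns cost rather than which target is permitted.

```python
# Hypothetical labels; the guidelines name the datastores, not these strings.
ALLOWED_DEV_TARGETS = {
    "mysql": {"read_replica", "non_production"},
    "elasticsearch": {"non_production"},
}


def dev_query_allowed(datastore: str, target: str) -> bool:
    """Return True if development queries may run against this target."""
    allowed = ALLOWED_DEV_TARGETS.get(datastore.lower())
    if allowed is None:
        # Unknown datastore: fail loudly instead of guessing a policy.
        raise KeyError(f"no development-query policy for {datastore!r}")
    return target in allowed


if __name__ == "__main__":
    print(dev_query_allowed("mysql", "read_replica"))
    print(dev_query_allowed("elasticsearch", "read_replica"))
```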

Note: Running dev queries against a live production datastore is allowed only as an exception.

The analytics developer should also consider the impact on each environment of running these queries against non-critical datastores. Doing so could have performance implications for non-production environments, or could delay replication in production, which could cause unrecoverable loss of production data during an outage.

These concerns can be reduced by adding a delay between runs of expensive queries. The best practice is a delay of 10 times the average response time (e.g., a query that takes five seconds to serve should be delayed 50 seconds between executions).
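
The delay rule is simple arithmetic, and a small helper can apply it automatically while a query is being iterated on. This is a minimal sketch, not a description of BetterCloud's tooling: the names delay_between_runs and run_with_backoff are hypothetical, and run_query stands in for whatever function actually executes the query.

```python
import time
from typing import Callable, Sequence

# From the guideline: delay = 10 * average response time.
DELAY_MULTIPLIER = 10


def delay_between_runs(response_times_s: Sequence[float]) -> float:
    """Return the pause in seconds for the measured response times."""
    if not response_times_s:
        raise ValueError("need at least one measured response time")
    average = sum(response_times_s) / len(response_times_s)
    return DELAY_MULTIPLIER * average


def run_with_backoff(
    run_query: Callable[[], object],
    runs: int,
    sleep: Callable[[float], object] = time.sleep,
) -> list[float]:
    """Run the query `runs` times, pausing 10x the running average between runs."""
    timings: list[float] = []
    for i in range(runs):
        start = time.monotonic()
        run_query()
        timings.append(time.monotonic() - start)
        if i < runs - 1:  # no pause needed after the final run
            sleep(delay_between_runs(timings))
    return timings


if __name__ == "__main__":
    # A query that takes five seconds to serve waits 50 seconds between runs.
    print(delay_between_runs([5.0]))
```

Passing the sleep function as a parameter is a design choice that keeps the helper easy to test without actually waiting.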

Before development of a query can be called complete, each query MUST be reviewed and approved by the architect responsible for design and maintenance of each datastore. Any query that has the potential to significantly impact cost and stability MUST also be reviewed and approved by Platform Architecture.

______

Overall, we believe this crucial issue is not talked about enough in the broader engineering and data communities. Instituting clear patterns and guidelines for interacting with production and non-production data can help companies extract the most value from analytics teams without sacrificing security or stability from a DevOps viewpoint. The second part of this piece will focus on where the business analytics database should sit and what data is allowed.

Note: If you are interested in reading more about production data, I recommend this excellent post by Michael Kaminsky titled What is Production?

