Why is every company a data company?
The next wave of dominant companies in every segment will be data companies. This requires a data platform that drives the decisions of every employee and, just as important, powers data products. What are data products? A financial instrument, such as a credit card with a credit limit, can become a data product. Its competitive edge comes from crunching enormous amounts of data. Genomic sequencing is a data product. Finding life on Mars is a data product.
To enable the massive data transformation we talk about, you need to bring all your users and all of your data together, and then give those users the tools and infrastructure they need to draw insights while following enterprise security protocols. You need an enterprise data platform that scales across every department and every team. So why is that getting harder, not easier?
Your data has become more sensitive — The scale of data is increasing exponentially, yet it’s siloed across different systems in different departments. How do you make sure the right users have access to the right data, and it’s all monitored and audited centrally? And, at the same time, how do you stay in compliance with international regulations?
Your costs are difficult to control – Every organization is under pressure to do more with less. Exponential growth in data does not justify exponential growth in data infrastructure costs. When you have no visibility into who is doing what with what data, it results in uncontrolled costs — infrastructure costs, data costs and labor costs.
Data projects are difficult to manage – How do you track an initiative from start to finish when disparate teams — business analysts, data scientists, and data engineering — deploy disparate technologies managed by IT, Security and DevOps? Which projects are in production? How are we monetizing them? What happens if an app goes down?
Executives need a holistic strategy to scale data across the organization
Enterprises new to these challenges may take an incremental approach or take on-premises solutions and move them to the cloud. But without a holistic approach, you are setting yourself up to replace one outdated architecture with another that is not up to the challenge long term. The following five steps can help ensure you are progressing toward a system that can stand the test of time.
Step 1: Bring all your data together
Data warehouses have been used for decades to aggregate structured business data and to support decisions through BI dashboards built on visualization tools. The arrival of data lakes, with their attractive scaling properties and suitability for unstructured data, was vital for enabling data science and machine learning. Today, the Data Lakehouse model combines the reliability of data warehouses with the scalability of data lakes using an open format such as Delta Lake. Regardless of your specific architecture choices, choose a structure that can store all of your data, structured and unstructured, in open formats for long-term control, suitable for processing by a rapidly evolving set of technologies.
Step 2: Enable users to securely access the data
Make sure every member of your data team (data engineers, data scientists, ML engineers, BI analysts, and business decision-makers), across various roles and business units, has access to the data they need and none of the data they are not authorized to access. This means complying with various regulations, including GDPR, CCPA, HIPAA, and PCI. It is important that all of your data, and all the people who interact with it, remain together in one place. If you fragment the data by copying it into a new system for a subset of users (e.g. a data warehouse for your BI users), you have data drift, which leads to issues in Step 3. It also means you have drift of “truth”, where some information in your organization is stale or of a different quality, leading to (at best) organizational mistrust and (more likely) bad business outcomes.
Step 3: Manage your data platform like you manage your business
When you onboard a new employee, you set them up for success. They get the right computer, access to the right systems, etc. Your data platform should be the same. Since all of your data is in one place, every employee can see a different facet of the data, according to their roles and responsibilities. And this data access needs to be aligned with how you manage other employee onboarding; everything must be tied to your onboarding systems, automated, and audited.
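As a rough illustration of tying data access to onboarding groups and auditing every decision, here is a minimal in-memory sketch. The group names, dataset names, and the AccessController class are hypothetical and do not correspond to any particular product's API; a real platform would delegate this to its identity provider and native access controls.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical mapping from onboarding groups to the datasets each may read.
GROUP_GRANTS = {
    "finance-analysts": {"sales", "billing"},
    "data-science": {"sales", "clickstream"},
}


@dataclass
class AccessController:
    user_groups: dict = field(default_factory=dict)  # user -> set of group names
    audit_log: list = field(default_factory=list)    # one record per access decision

    def onboard(self, user, groups):
        """Record group membership as it would arrive from the onboarding system."""
        self.user_groups[user] = set(groups)

    def can_read(self, user, dataset):
        """Decide access from group membership and audit the decision either way."""
        groups = self.user_groups.get(user, set())
        allowed = any(dataset in GROUP_GRANTS.get(g, set()) for g in groups)
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "dataset": dataset,
            "allowed": allowed,
        })
        return allowed


ctl = AccessController()
ctl.onboard("ana", ["finance-analysts"])
print(ctl.can_read("ana", "billing"))      # True
print(ctl.can_read("ana", "clickstream"))  # False
```

Because access is derived from group membership rather than granted per user, changing an employee's groups in the onboarding system is the only step needed to change what they can see, and denied requests are logged alongside granted ones.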
Step 4: Leverage cloud-native security
As cloud computing has become the de facto destination for massive data processing and ML, core security principles have been reformulated as cloud-native security. The DMZ and perimeter security of on-premises deployments are replaced with “zero-trust” and “software-defined networking.” Locks on physical doors have transformed into modern cryptography. So, you must ensure your data processing platform is designed for the cloud and leverages best-in-class cloud-native controls.
Moreover, cloud auditing and telemetry provide a record of data access and modification through the cloud-native tools, since every user accesses data with their own identity. This makes Step 3 possible: the groups that you manage your company with are enforced and auditable down to the cloud-native security primitives and tools.
Step 5: Automate for scale
Whether you are rolling out your platform to hundreds of business units or many thousands of customers, the rollout needs to be automated from the ground up. This requires that your data platform can be deployed with zero human intervention. Further, for each workspace (the environment for a business unit), data access, machine learning models, and other templates must be configured in an automated fashion to be ready for your business.
But powering this scale also demands powerful controls. With the compute of millions of machines at your fingertips, it is easy to run up a massive bill. To deploy to departments across the enterprise, the right spend policies and chargebacks need to be designed to ensure the power is being deployed as the business expects.
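To make the idea of spend policies and chargebacks concrete, the following is a small sketch that attributes compute cost to departments and flags those over budget. The department names, budgets, rates, and usage records are all made-up illustration data, and the functions are hypothetical helpers rather than part of any billing API.

```python
from collections import defaultdict

# Hypothetical monthly budgets per department, in dollars.
BUDGETS = {"marketing": 1000.0, "research": 5000.0}

# Hypothetical usage records: (department, compute_hours, dollars_per_hour).
USAGE = [
    ("marketing", 300, 2.0),
    ("marketing", 250, 2.0),
    ("research", 400, 4.0),
]


def chargeback(usage):
    """Sum the cost of each usage record under the department that incurred it."""
    totals = defaultdict(float)
    for dept, hours, rate in usage:
        totals[dept] += hours * rate
    return dict(totals)


def over_budget(totals, budgets):
    """List departments whose spend exceeds their budget.

    A department with no budget entry is treated as having a budget of zero,
    so any spend by it is flagged.
    """
    return sorted(d for d, cost in totals.items() if cost > budgets.get(d, 0.0))


totals = chargeback(USAGE)
print(totals)                        # {'marketing': 1100.0, 'research': 1600.0}
print(over_budget(totals, BUDGETS))  # ['marketing']
```

In practice the usage records would come from the platform's billing or telemetry export, and the over-budget list would feed an alert or an automatic policy such as capping cluster size.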
APIs can automate everything from provisioning users and team workspaces to automating production pipelines, controlling costs, and measuring business outcomes. A fully automatable platform is necessary to power your enterprise.
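One common way to reach zero-touch provisioning is to describe each workspace declaratively and let a script compute the actions needed to reach that state. The sketch below shows that pattern against an in-memory registry; WorkspaceSpec, plan, and apply are hypothetical names, and a real deployment would replace the in-memory dictionary with calls to the platform's own provisioning API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkspaceSpec:
    """Hypothetical desired state for one business unit's workspace."""
    name: str
    groups: tuple          # groups that should have access
    monthly_budget: float  # spend limit in dollars


def plan(desired, existing):
    """Return the actions needed to move `existing` to the `desired` state."""
    actions = []
    for spec in desired:
        current = existing.get(spec.name)
        if current is None:
            actions.append(("create", spec.name))
        elif current != spec:
            actions.append(("update", spec.name))
    desired_names = {spec.name for spec in desired}
    for name in sorted(existing):
        if name not in desired_names:
            actions.append(("delete", name))
    return actions


def apply(desired, existing):
    """Compute the plan and return the resulting registry with the actions taken."""
    actions = plan(desired, existing)
    new_state = {spec.name: spec for spec in desired}
    return new_state, actions


desired = [
    WorkspaceSpec("marketing", ("analysts",), 1000.0),
    WorkspaceSpec("research", ("data-science",), 5000.0),
]
state, first_run = apply(desired, {})
_, second_run = apply(desired, state)
print(first_run)   # [('create', 'marketing'), ('create', 'research')]
print(second_run)  # []
```

The second run produces no actions because the registry already matches the specification. That idempotence is what makes it safe to run the same automation repeatedly across hundreds of business units.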
Who is Databricks?
Databricks is an organization and big data processing platform founded by the creators of Apache Spark. The company was created to help data scientists, engineers and analysts integrate data science, engineering and the business behind them across the machine learning lifecycle. This integration eases the process from data preparation to experimentation and machine learning application deployment. By unifying the pipeline involved in developing machine learning tools, Databricks aims to accelerate development and innovation and to increase security. Data processing clusters can be configured and deployed with just a few clicks. The platform includes varied built-in data visualization features to graph data. Databricks is headquartered in San Francisco, California, and its founders include Ali Ghodsi, Reynold Xin and Matei Zaharia. Today the company has operations in Munich, Germany as well.