We recently looked at the idea of the data lake, so now it’s time to head downstream and look at data warehouses.
We’ll define data warehouses, look at the data types they comprise, the storage they need, and the products and services that have been available, on-premise but increasingly from the cloud.
Key to defining the data warehouse is to recap on the source of data that flows into it. That is, the data lake.
Data lakes are like the wild west, unsuited to access by users or even most IT staff. Data may be searchable and to some extent queryable by its metadata to determine its use downstream, but it is not the place where operational analysis takes place. It is where data resides before it is processed and presented for analytics work.
That’s what occurs in the data warehouse. Compared to the anarchy of the data lake, the data warehouse is an ordered environment, comprising structured data in databases.
As historically defined, data warehouses are almost always dedicated to analytics, and kept quite separate from transaction processing for performance reasons.
Data warehouse storage
The data lake, as we saw, is a largely unorganised environment and access does not need to be terribly fast. Data can reside in myriad forms and getting to grips with it will often involve schema-on-read tools such as Hadoop and Apache Spark, or Amazon Athena (in the cloud) to help with the ingest/analyse process.
By the time data gets to the data warehouse it will have been assessed, wrangled, and usually subject to an extract, transform, load (ETL) process and kept in one or more databases.
Access is for analytics purposes, so while it doesn’t need to be as rapid in access terms as for transactional databases, it should be expected that input/output (I/O) will comprise reasonable amounts of largely sequential traffic as datasets are accessed or copied for analytics processing.
Those requirements have often meant data warehouse storage has been reasonably performant (higher RPM, and SAS) spinning disk or flash. Today, if flash-like access speeds are needed, QLC flash could fit the bill with its suitability to sequential access.
Data warehouse appliances
It’s possible to build your own data warehouse, and specifying storage is a relatively easy part of the process. But hardware specification pales next to overall design, which can be very complex with implications that stretch far into the future.
To mitigate those challenges, numerous vendors have offered data warehouse appliances. These offer – or maybe offered – appliances tailored to data warehouse workloads that could often be scaled out, with preconfigured hardware, operating system, DBMS software, storage and connectivity.
The first came from Netezza in 2001. It was acquired by IBM in 2010 and by mid-decade was re-branded out of existence. That changed in 2019, when IBM bought Red Hat and revived the Netezza brand with flash storage and FPGA processing as well as the ability to run on-premise or in the cloud.
Teradata was a pioneer of the data warehouse appliance. Today it offers cloud and hardware-based data warehousing, business analytics, and consulting services. Teradata Everywhere allows users to submit queries to public and private databases using massively parallel processing (MPP) across on-premise data warehouses and multi- and hybrid-cloud storage. IntelliFlex is Teradata’s data warehouse platform which scales to hundreds of PB with flash drives, while intelliCloud is its secure managed cloud for data and analytics-as-a-service.
For a while EMC sold open source Greenplum software capability bundled with its hardware, but now Greenplum is software only, centred on its data warehousing platform and based on a highly parallelised PostgreSQL database. It competed with the big players and is heavily targeted at cloud use, although it will run on-premise and can be containerised.
Oracle used to sell data warehouse appliances, but that’s now in the past. Currently, Autonomous Data Warehouse is Oracle’s data warehouse offering, which is based on the company’s database of the same name. It is a cloud-based technology designed to automate many of the routine tasks required to manage Oracle databases.
Evolution to the cloud
Data warehouse appliances were the best solution to the challenges of running database-centric analytics on-premise in an era before the cloud really started to come of age.
But essentially they are big iron. That meant they were costly to acquire, run and maintain. When it comes to scaling, further challenges arise. Upgrades couldn’t be made in small increments so big chunks of capacity that could lie unused for quite a while needed to be bought. And they’re not just iron. As an appliance they are a complex bundle of software and connectivity out to other data sources.
In the past decade the provision of cloud services has matured to such an extent that data warehouse provision is a natural fit.
In place of costly Capex outlays and ongoing maintenance and running costs, running a data warehouse from the cloud allows the provider to take the strain.
All the big three – AWS, Azure and Google Cloud – provide data warehouse offerings that provide core functionality around a database, with added tools such as ETL and data viz and others.
Amazon Redshift is AWS’s managed data warehouse service in the cloud. You can start with a few hundred GB of data and scale to petabytes. To create a data warehouse you launch a set of nodes, called a Redshift cluster. Here you can upload data sets and perform data analysis queries using SQL-based tools and business intelligence applications. Redshift can be managed from a dedicated console or a CLI, with APIs to write into applications.
Amazon specifically targets customers that may want to migrate from Oracle, and also offers packages that come with Matillion ETL and Tableau data visualisation.
Redshift Spectrum also allows data stored in S3 to be analysed in place.
Azure SQL Data Warehouse
Azure SQL Data Warehouse is Microsoft’s managed petabyte-scale service that uses either symmetric multi-processing or MPP to deal with data, dependent on the volumes involved. Microsoft’s cloud offering makes a point of its ability to manage compute and storage independently and to pause the compute layer while persisting the data to reduce costs.
It is based on the Azure SQL database service. Data Warehouse abstracts away physical machines and represents compute in the form of data warehouse units that allow users to and easily scale compute resources at will.
ETL comes from Azure Data Factory.
BigQuery is Google Cloud Platform’s data warehouse offering. Like the rest, it offers petabyte-scale data warehousing, with querying by ANSI SQL.
Big Query has software modules that target machine learning, geographic information systems and business information use cases, and can even use Google Sheets as a substitute for a true database.
BigQuery access is via console or CLI and APIs in application code.
Google Cloud marketing materials specifically target customers that might want to migrate from on-premise Teradata deployments as well as those using Amazon Redshift.