One significant challenge with maintaining a large, enterprise data estate, leveraging multiple data sources, both structured and unstructured, is the inherent performance issues related to data flow and data movement. While traditional extract-transform-load (ETL) processes still have relevant business and technical use cases, the delays and performance degradation often encountered with these processes can be significant, time consuming, resource challenging and carry the potential for significant technical debt. While newer extract-load-transform (ELT) processes alleviate this in some instances, it can still present technical challenges in the quest to be nimble and responsive to the business in providing quality data across the enterprise. Where the use case for traditional ETL or ELT processes is not as strong or compelling, there may be other options.
Other issues inherent in traditional data warehousing, large enterprise data estates are managing sensitive data sets from multiple locations, storage costs from maintaining multiple large data sets in a variety of formats on various platforms, as well as significant data governance complexity.
What is another option for addressing these challenges? Enter data virtualization.
Data virtualization, conceptually, has been around since the early 1980s. In 1982, computer scientist, Edgar Codd, proposed, what he called a “data independence objective,” separating the logical and physical aspects of database management. (https://en.wikipedia.org/wiki/Edgar_F._Codd).
Unfortunately, data virtualization as a technology or technological process, did no grow beyond the conceptual for several decades. It has languished in the scrap heap of the enterprise data model until very recently. In the last decade or so, data virtualization has come of age and several products have entered the market to deliver on the early promise of what data virtualization can offer.
Industry analysts such as Gartner and Forrester, have noted that the recent development and early proliferation of commercial data virtualization products is evidence that data virtualization has grown beyond the conceptual.
So, what is data virtualization?
Data virtualization is any approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source, or where it is physically located, and can provide a single customer view (or single view of any other entity) of the overall data.
Unlike the traditional extract, transform, load (“ETL”) process, the data remains in place, and real-time access is given to the source system for the data. This reduces the risk of data errors, of the workload moving data around that may never be used, and it does not attempt to impose a single data model on the data. (https://en.wikipedia.org/wiki/Data_virtualization).
This allows for the integration of data from multiple disparate sources, locations and formats, without the need for data replication of complete ETL/ELT, to create a single “virtual” data,
Other advantages of this approach are provision of complete data lineage from the source to the presentation layer, the ability to quickly add new data sources without changing ETL, ELT or staging configurations, availability of all data through common SQL interface, provision of a single data access point for structured and unstructured data, unified data security model, among many others.
In the last 5 – 10 years, numerous of products have surfaced to address this need. Some of the more common tools that the reader may be aware of are: Denodo, Informatica, Tibco and others.
There are several common business or technical use cases for data virtualization, including prototyping for physical integration, semantic layer for analytics, logical EDW architecture, data prep, application system access, creating data stored for operational data reporting, among others.
If you are a Microsoft centric organization however, there has not been an option for data virtualization on the Microsoft data and analytics stack until very recently. If your enterprise data estate leveraged large data sets across various file formats and applications and you had a business need for data virtualization, you incurred additional cost, licensing and complexities to use a 3rd party product outside of Microsoft.
An eagerly anticipated new feature of Microsoft SQL Server in the new release of SQL Server 2019, is data virtualization.
SQL Server has provided features like linked servers, replication, and integration services to integrate data from other data sources. SQL Server 2016 added PolyBase that provides data virtualization capabilities against data stored in Hadoop and Azure Blob Storage. Extend the querying capability to data sources like SQL Server, Oracle, Teradata, MongoDB, and other data sources.
Data virtualization in SQL Server 2019 is accomplished using some significant enhancements made to PolyBase. PolyBase was a feature introduced in SQL Server 2016 to allow for querying tables and files using TSQL. Is used to query, import and export data. In the 2016 version, it was limited primarily to Hadoop, Azure Blob Storage and Azure Data Lake. In 2019, this functionality has been expanded to include SQL Server, Oracle, Teradata and MongoDB. PolyBase is used to connect to numerous data sources and file formats.
In addition to PolyBase, the other feature set related to data virtualization that allows for the combination of large volumes of relational and non-relational data, is Big Data clusters.
SQL Server 2019 big data clusters with the enhancements to PolyBase act as a virtual data layer to integrate structured and unstructured data from across the entire data estate (SQL Server, Azure SQL Database, Azure SQL Data Warehouse, Azure Cosmos DB, MySQL, PostgreSQL, MongoDB, Oracle, Teradata, HDFS, Blob Storage, Azure Data Lake Store) using familiar programming frameworks and data analysis tools: (James Serra)
If your organization already owns or is licensed for QL Server 2019 Enterprise Edition, you already own PolyBase. Configuring and enabling PolyBase on a SQL Server instance is a straightforward activity. When setting up the metadata to access the external data sources, you will need a master key is used to secure access. Other steps include creating a database scoped credential, to authenticate against sources, specifying external data sources, and optionally, create an external schema to keep the external table separate from internal tables. Once set up, start querying your data to create the virtualized tables.
As powerful as this feature is, like anything else in technology, it is not a panacea. There are still many use cases or situations where the more traditional data warehouse is a preferred approach, such as:
- Where speed and performance are paramount (Power BI Dashboard where sub-second response is desired)
- Ready access to historical data (not typically stored in production operational systems)
- Performance degradation on source systems
It is important to conduct a thorough data estate strategy: understand the business, technical and functional requirements and to establish appropriate use cases.
In summary, data virtualization is a powerful tool in your overall data and analytics arsenal, and a potential game changer in driving digital transformation. Leveraging virtualization appropriately can significantly enhance analytics operations within an organization and drive additional business value, while reducing costs, complexity and technical debt.
With SQL Server 2019, Microsoft has now provided a tool to deliver on the promise of virtualization without incurring additional costs for new licensing and hardware.