
Data Catalog, an essential repository for the data-driven company

  • Jan 30

Defining a "Data Catalog" is a tricky exercise, as leaders and experts generally envision different use cases and characteristics.


However, all Data Catalogs on the market seem to share two key characteristics: the management of business metadata, which describes data within a given context (business, solution, technology, etc.), and the management of technical metadata, which describes data within the solutions that implement it (operational data storage systems, digital applications, transactional systems, analytical systems, etc.).


What are the definitions of a Data Catalog?


A catalog, often called a "glossary" or "business dictionary," primarily describes the concepts, terms, and sometimes definitions associated with a business domain and/or with the initiative that intends to use the data. Introduced in the 1980s, "data dictionaries" were the first technologies built to collect, store, and manage basic information (type, length) from Database Management Systems (DBMS) such as Oracle. Numerous tools followed in the 1990s, notably from IBM (IBM's Repository Manager MVS) and from Platinum and Microsoft (Platinum Repository).


These data catalogs integrated with all or part of the vendor's product line and also provided a form of functional extension. This extension was tied to how the data is used (for example, access control based on a user profile) and consisted mainly of augmenting the product's initial metadata with metadata about its use; the term "repository" is often used for this type of data catalog. "Repositories" appeared very early in several leading products on the market. They were offered by vendors who had distinguished themselves by their ability to master metadata and who were seen as pioneers in fields such as business intelligence and data integration (e.g., Business Object Repository, Informatica Metadata Manager, etc.).


Their innovative approaches to metadata management focused more on how a product is used than on the product itself, recognizing that value lies in the use of data rather than in the data itself. This applied both to their own product innovation and, more importantly, to their customers' initiatives (360-degree view, personalization, regulatory compliance, etc.). Combining data dictionaries with these newer technologies broadened the definition of a data dictionary, turning it into a system that catalogs business, operational, and system metadata.


Example


Such a system catalogs, in particular:

  • The definitions and descriptions of business data,

  • The sources and origins of operational data,

  • The use of system data to understand how the organization's tools use data.


This largely constitutes the initial definition of today's Data Catalog. However, managing and updating metadata proved difficult: it required time, money, and a clear, organized process with centralized data management. Organizational culture was also not yet sufficiently data-driven to undertake such work. Only with newer technologies has it become possible to automate metadata updates and to discover new data sources automatically.
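As a rough illustration of what automated metadata collection and source discovery can look like, here is a minimal sketch in Python. It assumes SQLite as a stand-in for the DBMS; the function names `harvest_metadata` and `new_sources` and the sample tables are hypothetical and not taken from any catalog product.

```python
import sqlite3
from typing import Dict, List


def harvest_metadata(conn: sqlite3.Connection) -> Dict[str, List[Dict[str, str]]]:
    """Collect basic data-dictionary information (column name and declared type)
    for every user table. Hypothetical helper, using SQLite as a stand-in DBMS."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    metadata = {}
    for table in tables:
        # PRAGMA table_info returns one row per column:
        # (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        metadata[table] = [{"column": c[1], "type": c[2]} for c in columns]
    return metadata


def new_sources(previous: dict, current: dict) -> List[str]:
    """Names of tables present in the current scan but not in the previous one."""
    return sorted(set(current) - set(previous))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name VARCHAR(50))")
first_scan = harvest_metadata(conn)

# A new table appears between two scans; the second scan picks it up.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
second_scan = harvest_metadata(conn)
print(new_sources(first_scan, second_scan))
```

Running the scan on a schedule and comparing it with the previous result is one simple way to keep metadata current without manual updates; real catalogs typically rely on vendor-specific connectors rather than a loop like this one.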


Thanks to the solutions now available on the market and to a strong data culture within companies, we can speak of a new generation of Data Catalogs.


Market Definition


While the solutions currently on the market offer a similar core set of features, many features still differ from one solution to another.


Below are six proposed definitions from market vendors and experts that we have compiled:



Market leaders and experts generally envision different use cases and characteristics for Data Catalogs, making any attempt at a single definition risky. Nevertheless, an examination of their respective definitions reveals two recurring types of functionality.


The first concerns the collection of metadata associated with the use of data within a specific context (business, solution, technology, etc.), a use independent of any implementation method.


The second involves the collection of metadata associated with the use of data within its implementation context (operational data storage system, digital application, transactional system, analytical system, etc.).


Gabriel Greenfield's Definition


As we saw in previous sections, there are several definitions of what is now called a "Data Catalog."


We therefore propose our own definition of a Data Catalog, which we adopt for the remainder of this document.


A Data Catalog is a data repository that captures the business context for the company.


It can be an application or an application suite comprising, among other things, a data-centric business process modeling module, a data integration layer, and even a search engine.


It is used by companies to:


  • Inventory and organize the available data in their system,

  • Centralize and catalog business terms and technical data,

  • Track data and enable control of the data lifecycle,

  • Link different levels of data modeling,

  • Enable data searching using a business vocabulary,

  • Implement management rules tailored to different data categories.


Together, these features allow companies to maximize the value they derive from their data.
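To make the uses listed above more concrete, here is a minimal sketch in Python of a catalog that links business terms to technical data, supports search by business vocabulary, and attaches a management rule to a data category. Every name in it (`DataCatalog`, `BusinessTerm`, `TechnicalAsset`, the sample term and rule) is a hypothetical illustration, not the structure of any particular product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TechnicalAsset:
    name: str    # e.g. a table or column name
    system: str  # the system in which the data is implemented


@dataclass
class BusinessTerm:
    name: str
    definition: str
    category: str
    assets: List[TechnicalAsset] = field(default_factory=list)


class DataCatalog:
    """Hypothetical in-memory catalog linking business and technical metadata."""

    def __init__(self) -> None:
        self._terms: Dict[str, BusinessTerm] = {}
        self._rules: Dict[str, str] = {}

    def add_term(self, term: BusinessTerm) -> None:
        self._terms[term.name.lower()] = term

    def link(self, term_name: str, asset: TechnicalAsset) -> None:
        # Connect the business level to the technical level.
        self._terms[term_name.lower()].assets.append(asset)

    def set_rule(self, category: str, rule: str) -> None:
        self._rules[category] = rule

    def rule_for(self, term_name: str) -> Optional[str]:
        return self._rules.get(self._terms[term_name.lower()].category)

    def search(self, query: str) -> List[BusinessTerm]:
        # Case-insensitive match on term names and definitions.
        q = query.lower()
        return [
            t for t in self._terms.values()
            if q in t.name.lower() or q in t.definition.lower()
        ]

    def inventory(self) -> List[Tuple[str, str]]:
        # All linked technical assets as sorted (system, asset name) pairs.
        return sorted(
            (a.system, a.name) for t in self._terms.values() for a in t.assets
        )


catalog = DataCatalog()
catalog.add_term(
    BusinessTerm("Customer", "A person or company that buys our products", "personal data")
)
catalog.link("Customer", TechnicalAsset("customer", "CRM database"))
catalog.set_rule("personal data", "Mask in non-production environments")

for term in catalog.search("buys"):
    print(term.name, [a.name for a in term.assets], catalog.rule_for(term.name))
```

Keeping business terms and technical assets as separate objects joined by explicit links mirrors the two kinds of metadata discussed earlier; a production catalog would add persistence, lineage tracking, and access control on top of this.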


© Gabriel Greenfield

