Metadata in Data Warehousing
Metadata in data warehousing is different from metadata in other fields, such as photography. In data warehousing, metadata contains the definitions of data (the meaning and source of each column), the definition of the data warehouse itself (in other words, the data store structure, the ETL processes, and the data quality), the definition of the related systems (for example, the source systems), the audit information (what processes ran and when they ran), and the usage (which reports and cubes are used by whom and when). So, we have seven kinds of metadata:
• Data definition and mapping metadata contains the meaning of each fact and dimension column and where the data is coming from.
• Data structure metadata describes the structure of the tables in each data store.
• Source system metadata describes the data structure of source system databases.
• ETL process metadata describes each data flow in the ETL processes.
• Data quality metadata describes data quality rules, their risk levels, and their actions.
• Audit metadata contains a record of processes and activities in the data warehouse.
• Usage metadata contains an event log of application usage.
The data definition metadata contains the meaning of each fact, dimension key, and dimension attribute within the data warehouse. "Meaning" is the business description of the data in the context of the organization where the data warehouse is implemented or used.
Having the definitive meaning of a term such as title availability stored in a metadata database, viewable by the users, is useful to avoid confusion and misperception. The users can view the data definitions either via reports created using SSRS or via direct query using Enterprise Manager. Data definition metadata also contains the business description of each column in the NDS or ODS (operational data store).
Data Definition and Mapping Metadata
Data definition metadata is a list of all columns from every table in the DDS, ODS, and NDS, along with their meanings and sample values. Instead of mentioning the data store names, table names, and column names, data definition metadata uses the table key and column key defined in the data structure metadata.
• A normalized data store (NDS) is an internal master data store in the form of one or more normalized relational databases for the purpose of integrating data from various source systems captured in a stage, before the data is loaded to a user-facing data store.
• An operational data store (ODS) is a hybrid data store in the form of one or more normalized relational databases, containing the transaction data and the most recent
version of master data, for the purpose of supporting operational applications.
• A dimensional data store (DDS) is a user-facing data store, in the form of one or more relational databases, where the data is arranged in dimensional format (fact and dimension tables) for the purpose of supporting analytical queries.
Mapping metadata describes where each piece of data comes from in the source system.
Mapping metadata is also known as data lineage metadata. If the mapping metadata contains only the source system column name, it can be put on the same table as the data definition metadata. But if the mapping metadata contains the complete data lineage between the DDS, the NDS/ODS, the stage, and the source systems, then it is usually placed in a separate table. The benefit of putting the mapping metadata in the same table as the data definition metadata is that the structure is simpler. The benefit of putting mapping metadata in separate tables is that the structure is more flexible and has less data redundancy.
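As an illustration, the two tables might look like the following sketch. The table and column names here are assumptions for illustration, not a prescribed design; the column keys refer to the ds_column table described later in this chapter.

create table data_definition
( column_key       int not null primary key, -- refers to ds_column in the data structure metadata
  description      varchar(255),             -- the business meaning of the column
  sample_values    varchar(255),             -- a few representative values
  create_timestamp datetime not null,
  update_timestamp datetime not null
)

create table data_mapping
( mapping_key       int identity(1,1) primary key,
  column_key        int not null,            -- the target column (DDS, NDS/ODS, or stage)
  source_column_key int not null,            -- the column it is populated from, one hop upstream
  create_timestamp  datetime not null,
  update_timestamp  datetime not null
)

Each data_mapping row records only one hop, which is what makes the mapping table a linked list, as described below.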
Column Types
• Surrogate key (DDS dimension tables): A single not-null column that uniquely identifies a row in a dimension table.
• Natural key (DDS dimension tables): Uniquely identifies a dimension row in the source system.
• Dimensional attribute (DDS dimension tables): Describes a particular property of a dimension.
• Degenerate dimension (DDS fact tables): Identifies a transaction in the source system; a natural key of a dimension without any attributes.
• SCD support (DDS dimension tables): Columns that support slowly changing dimensions, such as is_active, effective_date, and expiry_date.
• Measure (DDS fact tables): Columns in the fact table that contain business measurements or transaction values.
• Fact key (DDS fact tables): A single not-null column that uniquely identifies a row in a fact table.
• System (all data stores): Auxiliary columns created by the system for system usage, such as create_timestamp and update_timestamp.
• Transaction (ODS and NDS tables): Columns in normalized tables containing business transaction values, such as order tables.
• Master (ODS and NDS tables): Columns in normalized tables that contain master data, such as stores, products, and campaigns.
• Stage (stage tables): Columns in stage tables containing business data.
DDS tables are populated from the ODS. The ODS tables are populated from the stage
tables. The stage tables are populated from the source systems' tables. Because of that, the data mapping table is a linked list.
It is better if the source system metadata is put in the same place as the data warehouse data structure metadata. This way, source_column_key in the data mapping table needs to refer to only one table. Source system metadata contains a list of tables and columns in all source systems, along with their data types. Data structure metadata contains a list of tables and columns in the data warehouse (for example, stage, ODS, DDS), along with their data types. If we put both of them in the same table, the data mapping table needs to refer to only this one table. But if we put them in two different tables, the data mapping needs to go to two different tables to get the data structure: if it is a source system data store, it goes to table 1, and if it is a data warehouse data store, it goes to table 2.
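To illustrate the benefit, here is a sketch of tracing the full lineage of one DDS column with a recursive query. It assumes the data_mapping table sketched earlier and a ds_column table that holds both warehouse and source system columns; the key value 123 is a hypothetical DDS column.

with lineage (column_key, source_column_key, hop)
as
( select m.column_key, m.source_column_key, 1
  from data_mapping m
  where m.column_key = 123          -- the DDS column we are tracing (hypothetical key)
  union all
  select m.column_key, m.source_column_key, l.hop + 1
  from data_mapping m
  join lineage l on m.column_key = l.source_column_key
)
select l.hop, c.column_name
from lineage l
join ds_column c on c.column_key = l.source_column_key
order by l.hop

Because all the columns live in one table, the final join is unconditional; with two column tables we would have to decide, at every hop, which table to join to.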
Data Structure Metadata
Data structure metadata contains the tables and columns in all data stores in the data warehouse (for example, DDS, NDS/ODS, and stage) along with their data types. Data structure metadata also contains table definitions, indexes, referential integrity (primary keys, foreign keys), constraints (null, unique), identity details, partition details, and views. Data structure metadata when combined with data definition metadata is also known as the data dictionary.
The main purpose of creating data structure metadata is to expose and explain the structure of the data warehouse to the users, and as such, there is no need to go into details such as full-text indexes, stored procedures, custom data types, and trigger definitions. Database and data warehouse administrators, developers, and architects have access to the SQL Server object catalog views, and therefore there is no need to include minute details such as the physical location of database files, the statistics settings of the database, the shrink settings, the database access mode, and the recovery mode. The system (the data warehouse) does not use data structure metadata, because the data structure of each database is available internally to the system.
We do need to describe, however, at the very least, the data store, tables, and columns. In
other words, we need to describe—with good business descriptions—how many data stores there are in the data warehouse; how many tables are in the DDS, NDS, and stage; and how many columns there are in each table. It is absolutely useless for the business users if we describe dim_communication as communication dimension and dim_channel as channel dimension. What the business users need, especially those who are new to CRM, is an explanation of what communication and channel are. So, descriptions such as these are much more useful: "A communication is printed or electronic media issued at a regular or irregular interval that customers can subscribe to, such as a newsletter or an RSS feed" and "A channel is a medium to deliver communications to customers, such as an e-mail, an RSS feed, a cell/mobile phone text message, and a telephone call."
The second purpose of creating data structure metadata is for a reference to other metadata. Data structure metadata contains the name, data type, and description of all the tables and columns in all data stores. Other metadata needs to refer to data structure metadata to get this information.
Data structure metadata consists of five tables: ds_data_store, ds_table, ds_table_type,
ds_column, and ds_column_type. (DS stands for data structure.)
Data Structure Metadata Tables
• ds_data_store: Lists all data stores in the data warehouse
• ds_table: Lists tables in each data store
• ds_table_type: Lists table types such as dimension table, fact table, and so on
• ds_column: Lists all columns in each table
• ds_column_type: Lists column types such as surrogate key, natural key, measure, and so on
It is, of course, possible to populate the data structure metadata manually, but it is handy to be able to populate it automatically using scripts. The output of such a script needs to be "upserted" into the column table, in other words, updated when it exists and inserted when it does not exist. But, regardless of the script, the table description columns still need to be filled in manually with meaningful and useful business descriptions. There are tools that can produce the data structure information of a database, such as table and column names, data types, and data lengths, but we still have to type the business description of each column and table manually.
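For example, a script along the following lines can upsert the column list from the SQL Server object catalog views. This is a sketch only; it assumes a simplified ds_column table keyed by table name and column name.

merge ds_column as target
using
( select t.name as table_name, c.name as column_name,
         ty.name as data_type, c.max_length
  from sys.columns c
  join sys.tables t on c.object_id = t.object_id
  join sys.types ty on c.user_type_id = ty.user_type_id
) as source
on  target.table_name  = source.table_name
and target.column_name = source.column_name
when matched then
  update set data_type  = source.data_type,
             max_length = source.max_length
when not matched then
  insert (table_name, column_name, data_type, max_length)
  values (source.table_name, source.column_name,
          source.data_type, source.max_length);

Note that the description column is deliberately left out of the merge; as mentioned, the business descriptions still have to be typed in manually.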
Source System Metadata
Source system metadata is similar to data structure metadata; in other words, it contains the data structure of the source systems. Like data structure metadata, at the least it needs to contain the database level, table level, and column level. This information is required because data mapping metadata stores only the column key. When the application needs to display the column names and data types, it gets this information from the source system metadata. Additionally, source system metadata can also contain other information such as the refresh frequency of particular tables. The refresh frequency is useful when we try to optimize the ETL process. Other information that is useful in source system metadata is a data profile, such as minimum and maximum values, row counts, and the number of NULLs.
The main purpose of creating the source system metadata is to expose and explain the structure of the source systems to the data warehouse users. This, in a way, acts as an extension to the data warehouse structure exposed by the data structure metadata.
The second purpose of creating source system metadata is to serve as a reference for the other metadata. For example, the source column on the data mapping metadata refers to the columns on the source systems, which exist in the source system metadata. ETL process metadata and data quality metadata also contain the source system columns, and therefore they need to refer to the source system metadata. This other metadata stores only the column key. When the application needs to display the column name and data types, it gets this information from the source system metadata.
It is best to place the source system metadata in the same place as the data structure metadata, rather than creating it in its own set of tables. They have the same structure, so they can be put in the same set of tables. This way, it is easier to query because we don't have to conditionally join two different sets of tables, which would make the query slower. For example, on the full data lineage/mapping table, if we place the source system metadata columns in the data structure metadata column table, we have only one table to link to, regardless of whether it is a stage, NDS, or source system column.
Data structure metadata consists of three core tables: ds_data_store contains a list of data stores, ds_table contains a list of tables, and ds_column contains a list of columns. In these tables we put the data stores, tables, and columns of both the data warehouse and the source systems.
How do we populate source system metadata? Unlike data structure metadata, which can be obtained automatically, in most cases we are not able to populate the source system metadata automatically, because the source systems are not under our control. We would need administrative rights to the source system databases, and this kind of access is usually possessed only by the DBA. So, we need to populate the source system metadata manually, usually with the information we gather when we do the source system analysis. Those of us who are lucky enough (or unlucky, depending on how you look at it) to be the DBA of the source systems can automate the population by accessing the system information schema or table definitions, creating a script in the source system to export its structure into a text file, which we can then import into the metadata tables.
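For those who do have that access, the export itself is straightforward. A sketch, using the standard INFORMATION_SCHEMA views; the file names below are placeholders:

select table_schema, table_name, column_name,
       data_type, character_maximum_length, is_nullable
from information_schema.columns
order by table_schema, table_name, ordinal_position

Running this with something like sqlcmd -S <server> -d <database> -i export_structure.sql -o source_structure.txt -s "," produces a delimited text file that can then be bulk-loaded into the ds_table and ds_column metadata tables.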
There is one thing that source system metadata has but that does not exist in data structure metadata: a source data profile. A source data profile contains statistics and volumetric information describing the source data such as minimum value, maximum value, average value, and number of distinct values for each column. It also contains the number of rows, row size, and number of columns for each table. A source data profile is useful when creating ETL processes to extract the data out of the source system. It is also useful when we need to modify the ETL processes after the data warehouse has been deployed in production.
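A profile for a single column can be gathered with a query like this sketch, where order_header and order_value are hypothetical source system names:

select min(order_value)            as min_value,
       max(order_value)            as max_value,
       avg(order_value)            as avg_value,
       count(distinct order_value) as distinct_values,
       count(*)                    as row_count,
       sum(case when order_value is null then 1 else 0 end) as null_count
from order_header

The query can be generated per column from the source system metadata itself, and the results written back into the data profile columns.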
ETL Process Metadata
ETL process metadata consists of a data flow table, a package table, and a status table. The data flow table contains the name and description of each SSIS data flow, the table it extracts from, the destination table, the transformations that are applied (only a descriptive text, not the exact formula), the parent package, the current status, the last successful extraction time (LSET), and the current extraction time (CET). This information is used by the ETL processes to determine which records to extract from the source system. Sometimes on the data flow table we have a column called order. This column contains the sequential order in which the data flow needs to run, relative to the other data flows. When using SSIS, we don't actually need this column, because we have precedence constraints defining the workflow between data flows. We need it when we use stored procedures for ETL.
The package table contains the name and description of each SSIS package, their schedule (only descriptive text; the actual schedule is set when creating the SQL Server Agent Job to execute the SSIS package), their status, and the last time they ran. The status table contains the status code and description. To be able to restart an ETL batch from the point of failure (not from the beginning), we need to know exactly where it failed, what was executed, and what was not executed. The status column on both the data flow table and the package table enables this. In its simplest form, the status values are success, failed, and in progress. In a more sophisticated implementation, other status values are "about to run" and "canceled." These two additional statuses enable us to understand the more precise state of each ETL process, which helps us troubleshoot when there is a problem with the ETL.
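A minimal sketch of the three tables follows; the table and column names are assumptions for illustration:

create table etl_status
( status_key  int not null primary key,
  status_code varchar(20) not null,       -- success, failed, in progress, ...
  description varchar(255)
)

create table etl_package
( package_key int not null primary key,
  name        varchar(100) not null,
  description varchar(255),
  schedule    varchar(255),               -- descriptive text only; the real schedule is the SQL Server Agent job
  status_key  int not null references etl_status(status_key),
  last_run    datetime
)

create table etl_data_flow
( data_flow_key  int not null primary key,
  name           varchar(100) not null,
  description    varchar(255),
  source_table   varchar(100),
  target_table   varchar(100),
  transformation varchar(255),            -- descriptive text, not the exact formula
  package_key    int not null references etl_package(package_key),
  status_key     int not null references etl_status(status_key),
  [order]        int,                     -- run order; needed when the ETL uses stored procedures
  lset           datetime,                -- last successful extraction time
  cet            datetime                 -- current extraction time
)

A typical incremental extraction then selects source rows with a timestamp between LSET and CET, and moves LSET forward to CET only when the data flow succeeds.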
The main purpose of having ETL metadata is to control the ETL packages in SSIS.
The status columns enable the SSIS packages or the whole ETL batch to restart from the point of failure. When we restart an ETL package, we can use the status column to determine which ETL processes have not been executed, and we can restart from that process onward, rather than from the beginning. To be able to restart from failure, the ETL processes need to be re-runnable. In other words, we need to be able to run the same process over and over without causing any problem.
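With the status column in place, the restart logic can be sketched like this, assuming the etl_data_flow table above and the status code 1 for success:

declare @package_key int
set @package_key = 7   -- the package whose batch failed (hypothetical)

select data_flow_key, name
from etl_data_flow
where package_key = @package_key
  and status_key <> 1  -- everything that has not yet succeeded
order by [order]

-- for each data flow returned: set its status to 'in progress', run it,
-- then set the status to 'success' or 'failed' accordingly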
The second purpose of having ETL metadata is to describe the ETL processes. It describes what the ETL process or package does, such as "Stage daily incremental: this SSIS package extracts the following data from the source system incrementally and loads it into the stage: customer, permission, product, store, product purchases, package subscription, and communication subscription." We can automate this a little bit, such as by querying the sysjobs, sysjobsteps, sysjobschedules, and sysschedules tables in the msdb database to determine the SQL Server Agent job steps and schedule. We can also query sysjobhistory to determine which jobs have been executed.
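For example, the job steps, schedules, and history can be read with queries like these; the msdb tables are standard SQL Server system tables:

select j.name as job_name, s.step_id, s.step_name, s.command
from msdb.dbo.sysjobs j
join msdb.dbo.sysjobsteps s on s.job_id = j.job_id
order by j.name, s.step_id

select j.name as job_name, sch.name as schedule_name
from msdb.dbo.sysjobs j
join msdb.dbo.sysjobschedules js on js.job_id = j.job_id
join msdb.dbo.sysschedules sch on sch.schedule_id = js.schedule_id

select j.name as job_name, h.run_date, h.run_time, h.run_status
from msdb.dbo.sysjobs j
join msdb.dbo.sysjobhistory h on h.job_id = j.job_id
where h.step_id = 0   -- step 0 is the job outcome row
order by h.run_date desc, h.run_time desc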
Data Quality Metadata
Data quality metadata contains the data quality rules that we discussed in the previous chapter, including the rule name and description, the rule type (for example, error or warning), the rule category (for example, incoming data validation, cross-reference validation, or internal data warehouse validation), the risk level (for example, the business or financial risk if this rule is violated, on a scale of 1 to 5), the status (active or not), the action (reject, allow, or fix), the notification, the create timestamp, and the update timestamp. The notification column links to the DQ notification table, which specifies who to notify and how to notify them (for example, an e-mail or a cell/mobile phone text message). We can have other features such as how many times to notify, at what interval, the acknowledgment status, the escalation timeframe, the escalation person/second level, and so on. We also need a third table containing the user details, in other words, the name, department/business area, role/position, e-mail address, cell phone number, group number, and so on. And potentially we need a fourth table describing the user groups, for when we have a lot of users and want to assign the data quality tasks to a group of users rather than to individual users. It is usually better to start with something simple, such as without escalation/second level. If it proves successful, we can enhance it further.
Data quality metadata consists of three main tables: DQ rules, DQ notification, and DW user. The DQ rules table contains the data quality rules. The DQ notification table specifies which DW users to notify for each rule. The DW user table describes the data warehouse users, including their e-mail addresses.
The rule_type column is linked to the rule_type table, which has two entries: E for errors and W for warnings. This enables us to determine whether it is an error or warning. The rule_category column is linked to the rule_category table, which has three entries: I for incoming data validation rules, C for cross-reference validation rules, and D for internal data warehouse validation rules. This enables us to determine the type of data quality issue we are facing.
The risk_level column is linked to the rule_risk_level table, which has five entries, ranging from level 1 for no business impact to level 5 for severe damage to the business's financial position.
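A sketch of the three main tables, with the lookup links described above; the names are illustrative, not prescriptive:

create table dw_user
( user_key   int not null primary key,
  name       varchar(100) not null,
  department varchar(50),
  role_name  varchar(50),
  email      varchar(100),
  phone      varchar(30),
  group_key  int                          -- optional link to a user group table
)

create table dq_rule
( rule_key         int not null primary key,
  name             varchar(100) not null,
  description      varchar(255),
  rule_type        char(1) not null,      -- E = error, W = warning
  rule_category    char(1) not null,      -- I, C, or D, as described above
  risk_level       int not null,          -- 1 (no business impact) to 5 (severe)
  status           char(1) not null,      -- active or not
  action           varchar(10),           -- reject, allow, or fix
  create_timestamp datetime not null,
  update_timestamp datetime not null
)

create table dq_notification
( notification_key int not null primary key,
  rule_key         int not null references dq_rule(rule_key),
  user_key         int not null references dw_user(user_key),
  method           varchar(20)            -- e-mail, cell/mobile phone text message, ...
)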
Audit Metadata
Audit metadata contains the results of every process and activity in the data warehouse,
including the data loading processes (ETL), the creation of specific-purpose DDSs (marts), manual data modifications (for example, updating the DQ user table), DQ rule validation, security logs/breaches (for example, web portal logins), system enhancements and modifications (for example, patches, schema changes, and upgrades), data purging and archiving, database tuning, and maintenance (indexing and partitioning activities). Audit metadata is used for administering the data warehouse, including monitoring ETL processes, monitoring data quality, supporting enhancement requests, installing patches, and monitoring security access. For example, the purpose of one of the daily operation reports (or a manual query using a SQL script) is to verify whether the ETL processes ran successfully. This report or script accesses the audit metadata to get the number of rows loaded for each table. It is also used to see when the last purging, indexing, or partitioning was done on a particular table. Another example of administrative usage is to support enhancement requests. By examining or querying the audit metadata, we know what patch was applied and when.
The secondary usage of audit metadata is for troubleshooting. When there is a problem
or issue, you can query audit metadata to know what happened in the last few hours. Say
you got a call from a user saying that a package subscription report or cube is missing some data. You know that this cube is fed from the Subscription Sales fact table in the DDS, and it turns out that today this fact table was not loaded. In this case, you can query how many records were loaded into the stage, how many into NDS, and how many into DDS. It helps you to pinpoint where the cause of the problem lies.
In its simplest form, audit metadata is an event log table containing every event that happens in the data warehouse, populated by every process and activity in the data warehouse. The processes use SQL INSERT statements to add rows to the event log table. It contains the event type, the event category, which data warehouse object is involved in the event (linked to data structure metadata), the data flow initiating the event (linked to the data flow table), the number of rows involved, the timestamp when the event occurred, and a note.
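In table form, the event log might look like this sketch (the names and values are illustrative), with each process appending its own rows:

create table audit_event_log
( event_key       int identity(1,1) primary key,
  event_type      varchar(50) not null,   -- ETL load, DQ validation, patch, purge, ...
  event_category  varchar(50),
  object_key      int,                    -- the object involved, linked to data structure metadata
  data_flow_key   int,                    -- the data flow initiating the event, linked to etl_data_flow
  rows_affected   int,
  event_timestamp datetime not null default getdate(),
  note            varchar(255)
)

insert into audit_event_log
  (event_type, event_category, object_key, data_flow_key, rows_affected, note)
values
  ('ETL load', 'DDS', 42, 7, 15260, 'daily load of the Subscription Sales fact table')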
Usage Metadata
Usage metadata is similar to the audit metadata in that it contains an event log. But this time, the events are the usage of the Reporting Services reports, Analysis Services cubes, and data mining models. In one implementation, the reports, cubes, and mining models could be accessed via a web portal. Before the link on the portal redirects to the report (or cube or mining model), it inserts a record into the usage metadata first. So, we know which reports or cubes were accessed and when. The access patterns are useful for performance tuning and capacity planning. For example, they enable us to avoid the busy times when processing the cubes.
Another implementation is to populate the usage metadata from the SQL Server stored
procedures, which are invoked by the Reporting Services reports. For this, the reports must not contain direct SQL statements; instead, they must call the stored procedures.
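A sketch of that pattern follows. The usage_log table, the procedure name, and the fact table are hypothetical:

create table usage_log
( usage_key       int identity(1,1) primary key,
  report_name     varchar(100) not null,
  user_name       varchar(100) not null,
  event_timestamp datetime not null default getdate()
)
go

create procedure usp_report_subscription_sales
as
begin
  -- record the usage event before returning the data
  insert into usage_log (report_name, user_name)
  values ('Subscription Sales', suser_sname())

  -- then return the report data
  select date_key, store_key, subscription_revenue
  from fact_subscription_sales
end
go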
Usage metadata is used to understand (for example) how many times a particular
report is accessed and by which users or departments. We can use an IIS log to record the
number of hits on the links on the data warehouse portal page, but usage metadata is more flexible. It could be reported using SSRS, so it would have a more consistent look with other data warehouse reports.
Maintaining Metadata
It is important to keep the metadata up-to-date. Out-of-date data structure or data mapping metadata, for example, may lead the users to a wrong understanding of the data warehouse. Also, the system may take an incorrect action if the metadata is incorrect. Conducting a change request impact analysis based on out-of-date data mapping metadata could lead to ETL problems, for example, because we misinterpreted where a piece of data comes from or because the data type was not correct.
Data definition and mapping metadata is usually defined and created at an early phase of a data warehousing project. Once created, it is then maintained through change request procedures. Rather than changing the metadata tables directly, a change needs to be implemented using a data change script. The data change script is a Transact-SQL script containing the change request details (which go to the event log table), a rollback section, and the roll-forward (implementation) section.
The DW administrator needs to prepare a script to change the metadata and submit this script to be executed under a change request process. The script contains three sections, illustrated in the sketch after this list: log, rollback, and implementation.
• The log section records the change request details into the event log table. The reason
for doing this is for traceability and troubleshooting if something goes wrong.
• The rollback section tries to "undo" the change. This is necessary to prove that we can
revert the changes.
• The implementation section applies the change to the metadata database.
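A skeleton of such a script might look like this; the change request number, key value, and new description are illustrative only, and the tables are the ones sketched earlier in this chapter:

-- log section: record the change request details in the event log table
insert into audit_event_log (event_type, event_category, note)
values ('metadata change', 'change request',
        'CR-1234: correct the business description of dim_channel.channel_type')

-- rollback section: proves the change can be undone (run against a test copy first)
-- update data_definition
-- set description = '<previous description>', update_timestamp = getdate()
-- where column_key = 123

-- implementation section: apply the change to the metadata database
update data_definition
set description = 'A channel is a medium to deliver communications to customers',
    update_timestamp = getdate()
where column_key = 123   -- hypothetical key for dim_channel.channel_type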
Some data structure metadata can be maintained automatically using SQL Server object
catalog views. A query gets the data from the object catalog views and inserts or updates the appropriate metadata tables.
Some data structure metadata still needs to be maintained manually, such as object business descriptions, dimensional hierarchy, and SCD details. Source system metadata is usually created and populated during the source system analysis phase and then maintained during the development phase by using a data change script to update the metadata tables (illustrated earlier), according to the change request process. After the data warehouse is released, the source system metadata is maintained using data change scripts, governed by a change request procedure.
ETL process metadata is created and populated during the development of the SSIS packages. We need to update the ETL process metadata every time we modify the SSIS packages. Similarly, data quality metadata (the DQ rule table, and so on) needs to be updated every time we modify the SSIS packages. Ideally, we need to use a standardized data change script to update the metadata tables, rather than using direct SQL statements to update the tables. All updates to metadata need to be recorded in the audit metadata event log table, so this needs to be included in the data change scripts.
Audit metadata and usage metadata are maintained as the data warehouse is used. Audit
metadata is populated by the ETL processes, patch release processes, and security processes. Usage metadata is populated by the stored procedure or the web portal.
Summary
This chapter defined what metadata in a data warehouse is and what it contains. It listed seven kinds of metadata: data definition and mapping metadata, data structure metadata, source system metadata, ETL process metadata, data quality metadata, audit metadata, and usage metadata.
Metadata helps us understand the structure and definition/meaning of the data in the data warehouse. It also helps us understand the structure and definition of the data warehouse itself. Metadata is also used for auditing and troubleshooting purposes, in other words, to understand what happened (and what is happening) in the data warehouse, where it happened, and when it happened.