Dimensional Model – Contains facts and dimensions. The level of detail in the fact table is called the grain
Types of Schemas
- Fact tables are joined to related dimensions, called a star schema
- Dimensions that relate to multiple fact tables, should be conformed dimensions
- Relating other dimension tables with lookup tables to the fact tables is a snowflake schema
Operational Data Store (ODS) -Works where we have to massage the data before it makes its way into the data warehouse. It takes data from the OLTP system and handles de-duplicating records.
Third Normal Form (3NF) – Way of breaking down the data to its lowest possible form without null value. In star schemas, we combine tables from the third normal form. This is de-normalizing the data. Kimball thought that we need star schemas to answer questions faster than 3NF. Inserting and updating schemas needs to take place through the ETL process. No direct inserts or updates into the Data Warehouse
Metadata – Data about data. We might have different definitions of Net Sales in the Data Warehouse. There is also technical metadata – DDL (Tables, Views, etc.). Larger companies start their data warehousing projects by looking at the metadata.
Include load times – This is how we know how long each step takes. Where is the data coming from? Flat file or database import? By tracking the stats, we can tell that staging took this long, ODS took this long, etc. Track each step of the process.
Latency of data – How often do we refresh the data and how long do we retain records?
When building a conceptional / logical model, first create a logical BUS model. For example, we need to look at customers by sales, by location, and by gender.
Sales | Location | Gender | |
Customers |
BUS Diagram
Conformed dimensions have been used to consistently label and constrain separate data sources. The idea behind conformed dimensions is very simple: two dimensions are conformed if they contain one or more common fields, whose contents are drawn from the same domains. That results in constraints and labels having the same content and meaning when applied against separate data sources.
Conformed facts are simply numeric measures that have the same business and mathematical interpretations so that they may be compared and computed against each other consistently. When you combine the list of EDW subject areas with the notion of conformed dimensions, a powerful diagram emerges, which we call the enterprise data warehouse bus matrix.
Conformed Dimensions |
||||||
Business Process | Locations | Customers | Policy Holder | Stores | Inventory | Vendors |
Qty (Sales) | X | X | X | X | X | |
Sales | X | X | X | X | X | |
Expenses | X | X | X | |||
Claims | X | X | X | X | ||
Policy | X | |||||
Campaigns | X | |||||
Returns | X | X | ||||
Loss | X | X | ||||
Policies Sold | X | X |
The business process subject areas are shown along the left side of the matrix and the dimensions are shown across the top.
Shared Dimensions – One dimension that is shared among all the fact tables. When we create our dimension model, we are creating something that looks like a constellation schema. Time maps to every dimension, so we can join everything through time.
Measures for School
- # of Sections
- # of Sections Cancelled
- # of Section Offered
- Capacity
- Enrollment
- Final Count
- Census Count
- <GRADES>
- # of Completers
- # of Degrees
- # of Applicants
- # of Students
- # Receiving FinAid
- Financial Aid $
- Contact Hours
- Credit Hours
- GPA
Time (Degenerate Dimension)
- Ship Dates
- Sale Dates
Data Access – Can restrict certain bits of data like phone numbers, SSNs, etc. IT should take care of governance and let the data owners access their own data
Star Schema – No subproducts. All product information is rolled into one table. Create just one date and tie to the DimDate by the OrderDateKey. The point of sale will always have the date on it, but the grain might be different. The US might reflect the sales to the minute, but Hawaii might just have the day. For reporting, we need to have them both at the same level.
– Grain – Business either upgrades Hawaii’s POS to hourly or tracks everything by day
Location | Latency | Order Date |
Hawaii | Day | 1/21/2012 |
Houston | Day, Hour | 1/21/2012 4:00pm |
Dealing with Time Zone – Can have both a Local Time and a Central Time (Greenwich Mean Time). Time is one of the most critical factors. Date Warehouse is worthless without a date
The more dimensions we define, the more context we have and the better we are able to query the info. EDW can be used to identify trends. TYPICAL QUERY: Need the colors that were sold in this sales region over the past five years. Can track customer behaviors – bright colors or plaid? How well do hats sell in Hawaii?
Snowflake Schema – Snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake in shape. The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions.
The snowflake schema is similar to the star schema. However, in the snowflake schema, dimensions are normalized into multiple related tables, whereas the star schema’s dimensions are normalized with each dimension represented by a single table. A complex snowflake shape emerges when the dimensions of a snowflake schema are elaborate, having multiple levels of relationships, and the child tables have multiple parent tables (“forks in the road”). The “snowflaking” effect only affects the dimension tables and NOT the fact tables.[
Expands product in two child tables. Kimball’s preference would be to create a Star Schema
Slowly Changing Dimensions (SCD)
When changes occur to a dimensional record, it is called a Slowly Changing Dimension. Should I delete and insert or preserve the old record? Do I want to track history? We can’t track everything. Have to decide which items warrant preservation.
Type I – Overwrites all changes to related dimensional records
Type II – Inserts a new record, preserving the old record
Maybe track changes on:
– Last name
– Marital status
– Location
Always track on the key. Never track on a slowly changing dimension
Surrogate key (usually the primary key) of the dimension is used to keep the record unique.
Business key (natural key) is usually an id from the source system. Can be used to trace back to the system of record.
BE SURE TO HAVE AN IMMUTABLE ID
Uses a CHANGE_IND to track the current version
Metadata – All databases have informational schema. These include changes in precisions of scale. (Converting from a decimal to an integer or vice versa)
Junk Dimensions – Used to store data collected after the fact. For example, customer filled out a survey on a transaction three weeks later.
Three Types of Facts
- Transactional – When the event occurs
- Periodic – Snapshot of events at a given point of time
- Accumulating Snapshot – Tracks the measures by various dates
Three Fact Table Types
Transaction | Periodic Snapshot | Accumulating Snapshot |
DateProductStoreCustomerCashier
Manager Promotion Weather Basket |
MonthAccountBranchHousehold | Order DateShip DateDelivery DatePayment DateReturn Date
Warehouse Customer Promotion Order Status |
PARTS OF A DIMENSION
- Foreign Keys – Links to Other Dimensions
- Measures
- Degenerative Dimension – Additional information stored with the record (i.e. Carrier Tracking Number)
- Revision Info
ETL (Extract, Transform, Load)
Audit Data Model
– Load Times on the various steps
– How many records were kicked out?
– What kinds of errors were recorded?
– Better have the metadata reports ready
Have to have a control table for process metadata
– Need to have thresholds for data collected
– Alerts on data counts – only 2 records received instead of 2000
– Why?
Business Transformation Rules
ETL Components
EXTRACT – Just dumping the data at this point. Only extract what you need, what changed.
– Record Counts
– Load Times
– Check Thresholds
CLEAN – All of this goes into an Admin database.
– Validate Missing Data
– Deduping
– Metadata Rules for Completeness
– Record Counts
TRANSFORM
– Aggregation
– Derived Values
– Check Completeness
– Validate Business Rules
– Record Counts
DELIVER
– Dimensional Maintenance
– Load Facts
– Record Counts
Loading Dimensions
Use an incremental value (sequence) as a Surrogate key for new records. Natural key is the key from the Source System
- SCD1 – Overwrite descriptive values (Lose Previous Tracking)
- SCD2 – Changes on attributes inserts new record, preserving history
- SCD3 – Involved schema changes (Alternate View)
Loading Facts
Conformed Dimension Surrogate Key
Joined by Natural Key
Late Arriving Facts – Know upfront how far to go back
Semi-Additive Facts – Different Aggregate Value (Min, Max, Account, Average)
Referential Integrity
Master Data Management (MDM)
Master Data Management (MDM) comprises a set of processes, governance, policies, standards and tools that consistently defines and manages the master data (i.e. non-transactional data entities) of an organization (which may include reference data).
An MDM tool can be used to support Master Data Management by removing duplicates, standardizing data (Mass Maintaining), incorporating rules to eliminate incorrect data from entering the system in order to create an authoritative source of master data. Master data are the products, accounts and parties for which the business transactions are completed.
– Serves as the authoritative source for applications and analytics
– Single Version of all Entities
– Must be owned by each Business Unit
– Must have versioning and auditing Capabilities –Data Lineage.
– Must be able to handle hierarchies
– Must have a governed process (cleaning process)