Dimensional Model – Contains facts and dimensions.  The level of detail in the fact table is called the grain

Types of Schemas

  • Fact tables are joined to related dimensions, called a star schema
  • Dimensions that relate to multiple fact tables, should be conformed dimensions
  • Relating other dimension tables with lookup tables to the fact tables is a snowflake schema

 

Operational Data Store (ODS) -Works where we have to massage the data before it makes its way into the data warehouse.  It takes data from the OLTP system and handles de-duplicating records.

Third Normal Form (3NF) – Way of breaking down the data to its lowest possible form without null value.  In star schemas, we combine tables from the third normal form.  This is de-normalizing the data.  Kimball thought that we need star schemas to answer questions faster than 3NF.  Inserting and updating schemas needs to take place through the ETL process.  No direct inserts or updates into the Data Warehouse

Metadata – Data about data.  We might have different definitions of Net Sales in the Data Warehouse.  There is also technical metadata – DDL (Tables, Views, etc.).  Larger companies start their data warehousing projects by looking at the metadata.

Include load times – This is how we know how long each step takes.  Where is the data coming from?  Flat file or database import?  By tracking the stats, we can tell that staging took this long, ODS took this long, etc.  Track each step of the process.

Latency of data – How often do we refresh the data and how long do we retain records?

When building a conceptional / logical model, first create a logical BUS model.  For example, we need to look at customers by sales, by location, and by gender.

 

Sales Location Gender
Customers


BUS Diagram

Conformed dimensions have been used to consistently label and constrain separate data sources. The idea behind conformed dimensions is very simple: two dimensions are conformed if they contain one or more common fields, whose contents are drawn from the same domains.  That results in constraints and labels having the same content and meaning when applied against separate data sources.

Conformed facts are simply numeric measures that have the same business and mathematical interpretations so that they may be compared and computed against each other consistently.   When you combine the list of EDW subject areas with the notion of conformed dimensions, a powerful diagram emerges, which we call the enterprise data warehouse bus matrix.

 

Conformed Dimensions

Business Process Locations Customers Policy Holder Stores Inventory Vendors
Qty (Sales) X X X X X
Sales X X X X X
Expenses X X X
Claims X X X X
Policy X
Campaigns X
Returns X X
Loss X X
Policies Sold X X

 

The business process subject areas are shown along the left side of the matrix and the dimensions are shown across the top.

 

Shared Dimensions – One dimension that is shared among all the fact tables.  When we create our dimension model, we are creating something that looks like a constellation schema.  Time maps to every dimension, so we can join everything through time.

 

Measures for School

  • # of Sections
  • # of Sections Cancelled
  • # of Section Offered
  • Capacity
  • Enrollment
  • Final Count
  • Census Count
  • <GRADES>
  • # of Completers
  • # of Degrees
  • # of Applicants
  • # of Students
  • # Receiving FinAid
  • Financial Aid $
  • Contact Hours
  • Credit Hours
  • GPA

 

Time  (Degenerate Dimension)

  • Ship Dates
  • Sale Dates

 

Data Access – Can restrict certain bits of data like phone numbers, SSNs, etc.  IT should take care of governance and let the data owners access their own data

Star Schema – No subproducts.  All product information is rolled into one table.  Create just one date and tie to the DimDate by the OrderDateKey.  The point of sale will always have the date on it, but the grain might be different.  The US might reflect the sales to the minute, but Hawaii might just have the day.  For reporting, we need to have them both at the same level.

–          Grain – Business either upgrades Hawaii’s POS to hourly or tracks everything by day

 

Location Latency Order Date
Hawaii Day 1/21/2012
Houston Day, Hour 1/21/2012 4:00pm

 

Dealing with Time Zone – Can have both a Local Time and a Central Time (Greenwich Mean Time).  Time is one of the most critical factors.  Date Warehouse is worthless without a date

The more dimensions we define, the more context we have and the better we are able to query the info.  EDW can be used to identify trends.  TYPICAL QUERY: Need the colors that were sold in this sales region over the past five years.  Can track customer behaviors – bright colors or plaid? How well do hats sell in Hawaii?

 

Snowflake Schema – Snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake in shape. The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions.

The snowflake schema is similar to the star schema.  However, in the snowflake schema, dimensions are normalized into multiple related tables, whereas the star schema’s dimensions are normalized with each dimension represented by a single table.  A complex snowflake shape emerges when the dimensions of a snowflake schema are elaborate, having multiple levels of relationships, and the child tables have multiple parent tables (“forks in the road”).  The “snowflaking” effect only affects the dimension tables and NOT the fact tables.[

Expands product in two child tables.  Kimball’s preference would be to create a Star Schema

 

Slowly Changing Dimensions (SCD)

When changes occur to a dimensional record, it is called a Slowly Changing Dimension.  Should I delete and insert or preserve the old record?  Do I want to track history?  We can’t track everything.  Have to decide which items warrant preservation.

Type I – Overwrites all changes to related dimensional records

Type II – Inserts a new record, preserving the old record

Maybe track changes on:
–          Last name
–          Marital status
–          Location

 

Always track on the key.  Never track on a slowly changing dimension
Surrogate key (usually the primary key) of the dimension is used to keep the record unique.
Business key (natural key) is usually an id from the source system.  Can be used to trace back to the system of record.

BE SURE TO HAVE AN IMMUTABLE ID
Uses a CHANGE_IND to track the current version

 

Metadata – All databases have informational schema.  These include changes in precisions of scale.  (Converting from a decimal to an integer or vice versa)

Junk Dimensions – Used to store data collected after the fact.  For example, customer filled out a survey on a transaction three weeks later.

Three Types of Facts

  1. Transactional – When the event occurs
  2. Periodic – Snapshot of events at a given point of time
  3. Accumulating Snapshot – Tracks the measures by various dates

Three Fact Table Types

Transaction Periodic Snapshot Accumulating Snapshot
DateProductStoreCustomerCashier

Manager

Promotion

Weather

Basket

MonthAccountBranchHousehold Order DateShip DateDelivery DatePayment DateReturn Date

Warehouse

Customer

Promotion

Order Status

 

PARTS OF A DIMENSION

  • Foreign Keys – Links to Other Dimensions
  • Measures
  • Degenerative Dimension – Additional information stored with the record  (i.e. Carrier Tracking Number)
  • Revision Info

 

ETL (Extract, Transform, Load)

Audit Data Model
–          Load Times on the various steps
–          How many records were kicked out?
–          What kinds of errors were recorded?
–          Better have the metadata reports ready

 

Have to have a control table for process metadata

–          Need to have thresholds for data collected
–          Alerts on data counts – only 2 records received instead of 2000
–          Why?

 

Business Transformation Rules


ETL Components

BI1.1

EXTRACT – Just dumping the data at this point.  Only extract what you need, what changed.

–          Record Counts
–          Load Times
–          Check Thresholds

 

CLEAN – All of this goes into an Admin database.

–          Validate Missing Data
–          Deduping
–          Metadata Rules for Completeness
–          Record Counts

 

TRANSFORM

–          Aggregation
–          Derived Values
–          Check Completeness
–          Validate Business Rules
–          Record Counts

 

DELIVER

–          Dimensional Maintenance
–          Load Facts
–          Record Counts

 

Loading Dimensions

Use an incremental value (sequence) as a Surrogate key for new records.  Natural key is the key from the Source System

  • SCD1 – Overwrite descriptive values (Lose Previous Tracking)
  • SCD2 – Changes on attributes inserts new record, preserving history
  • SCD3 – Involved schema changes (Alternate View)

 

Loading Facts

Conformed Dimension Surrogate Key
Joined by Natural Key
Late Arriving Facts – Know upfront how far to go back
Semi-Additive Facts – Different Aggregate Value (Min, Max, Account, Average)
Referential Integrity

 

Master Data Management (MDM)

Master Data Management (MDM) comprises a set of processes, governance, policies, standards and tools that consistently defines and manages the master data (i.e. non-transactional data entities) of an organization (which may include reference data).

An MDM tool can be used to support Master Data Management by removing duplicates, standardizing data (Mass Maintaining), incorporating rules to eliminate incorrect data from entering the system in order to create an authoritative source of master data. Master data are the products, accounts and parties for which the business transactions are completed.

–          Serves as the authoritative source for applications and analytics
–          Single Version of all Entities
–          Must be owned by each Business Unit
–          Must have versioning and auditing Capabilities –Data Lineage.
–          Must be able to handle hierarchies
–          Must have a governed process (cleaning process)