2.1 Dimensional Modeling Basics
Dimensional modeling gets its name from
the business dimensions we need to incorporate into the logical data model. It
is a logical design technique to structure the business dimensions and the
metrics that are analyzed along these dimensions. This modeling technique is
intuitive for that purpose. The model has also proved to provide high
performance for queries and analysis. The multidimensional information package
diagram we have discussed is the foundation for the dimensional model.
Therefore, the dimensional model consists of the specific data structures
needed to represent the business dimensions. These data structures also contain
the metrics or facts.
To build a
dimensional database, you start with a dimensional data model. The dimensional
data model provides a method for making databases simple and understandable.
You can conceive of a dimensional database as a database cube of
three or four dimensions where users can access a slice of the database along
any of its dimensions. To create a dimensional database, you need a model that
lets you visualize the data.
Suppose your
business sells products in different markets and evaluates the performance over
time. It is easy to conceive of this business process as a cube of data, which
contains dimensions for time, products, and markets. Shows this dimensional
model. The various intersections along the lines of the cube would contain the measures of
the business. The measures correspond to a particular combination of product,
market, and time data.
FIGURE – 2.1: (A Dimensional Model of a Business
That Has Time, Product and Market Dimensions)
Another
name for the dimensional model is the star-join
schema. The database designers use this name because the diagram for this
model looks like a star with one central table around which a set of other
tables are displayed. The central table is the only table in the schema with
multiple joins connecting it to all the other tables. This central table is
called the fact table and the other tables are called dimension
tables. The dimension tables all have only a single join that attaches them
to the fact table, regardless of the query. shows a
simple dimensional model of a business that sells products in different markets
and evaluates business performance over time.
FIGURE – 2.2: (A Typical Dimensional Model)
2.2 Data
Warehousing - Fact and Dimension Tables
Data
warehouses are built using dimensional data models which consist of fact and
dimension tables. Dimension tables are used to describe dimensions; they
contain dimension keys, values and attributes. For example, the time dimension
would contain every hour, day, week, month, quarter and year that has occurred
since you started your business operations. Product dimension could contain a
name and description of products you sell, their unit price, color, weight and
other attributes as applicable.
Dimension tables are typically small,
ranging from a few to several thousand rows. Occasionally dimensions can grow
fairly large, however. For example, a large credit card company could have a
customer dimension with millions of rows. Dimension table structure is
typically very lean, for example customer dimension could look like following:
Table – 1 (Customer Dimension)
Customer_key
|
Customer_city
|
Customer_state
|
Customer_country
|
Although there might be other attributes that
you store in the relational database, data warehouses might not need all of
those attributes. For example, customer telephone numbers, email addresses and
other contact information would not be necessary for the warehouse. Keep in mind
that data warehouses are used to make strategic decisions by analyzing trends.
It is not meant to be a tool for daily business operations. On the other hand,
you might have some Reports that do include data elements that isn’t necessary
for data analysis.
Most
data warehouses will have one or multiple time dimensions. Since the warehouse
will be used for finding and examining trends, data analysts will need to know
when each fact has occurred. The most common time dimension is calendar time.
However, your business might also need a fiscal time dimension in case your
fiscal year does not start on January 1st as the calendar year.
Most
data warehouses will also contain product or service dimensions since each
business typically operates by offering either products or services to others.
Geographically dispersed businesses are likely to have a location dimension.
Fact
tables contain keys to dimension tables as well as measurable facts that data
analysts would want to examine. For example, a store selling automotive parts
might have a fact table recording a sale of each item. The fact table of An
educational entity could track credit hours awarded to students. A bakery could
have a fact table that records manufacturing of various baked goods.
Fact
tables can grow very large, with millions or even billions of rows. It is
important to identify the lowest level of facts that makes sense to analyze for
your business this is often referred to as fact table "grain". For
instance, for a Healthcare billing company it might be sufficient to track
revenues by month; daily and hourly data might not exist or might not be relevant. On the other hand,
the assembly line warehouse analysts might be very concerned in number of
defective goods that were manufactured each hour. Similarly a marketing data
warehouse might be concerned by the activity of a consumer group with a
specific income-level rather than purchases made by each individual.
2.2.1 Fact Table
Facts: A fact
is a measurement captured from an event (transaction) in the marketplace. It is
the raw materials for knowledge – observations.
A customer buys a product at a certain location at a certain time. When the
intersection of these four dimensions occurs, a sale is made. The sale is
describable as amount of dollars received, number of items sold, weight of
goods shipped, etc. – a quantity that can be added to other sales similar in
definition. Thus, a meaningful and measurable event of significance to the business
occurs at the intersection point of business dimensions. It is the fact. We use
the fact to represent a business measure. A data warehouse fact is defined as an
intersection of the dimensions constituting the basic entities of the business
transaction. It is not easy to show the intersection of more than three
dimensions in a diagram, but facts in a data warehouse may originate from many
dimensions.
Fact Table: A fact
table is used in dimensional model in data warehouse design.
A fact table is found at the center of a star schema or snowflake schema surrounded by dimension tables. A fact
table consists of facts of a particular business process e.g., sales revenue by
month by product. Facts are also known as measurements or metrics. A fact table
record captures a measurement or a metric.
Example of fact table:
In the schema below, we have fact table
FACT_SALES
that has a grain
which gives us a number of units W sold by date, by store and by product. All
other tables such as DIM_DATE
, DIM_STORE
and DIM_PRODUCT
are
dimensions tables. This schema is known as star schema.
Here is overview
of four steps to design a fact table described by Kimball:
- Choosing business process to model – The first step
is to decide what business process to model by gathering and understanding
business needs and available data
- Declare the grain – by declaring a grain means describing exactly
what a fact table record represents
- Choose the dimensions – once grain of fact table is stated clearly, it
is time to determine dimensions for the fact table.
- Identify facts – identify carefully which facts will appear in
the fact table.
FIGURE – 2.3: (Fact Table with Star Scheme)
2.2.2 Dimensional Table
A dimension is a
structure, often composed of one or more hierarchies, that categorizes data.
Dimensional attributes help to describe the dimensional value. They are
normally descriptive, textual values. Dimension tables are generally small in
size as compared to fact table.
Hierarchies and categories are included in
the information packages for each dimension.
Let us examine the product dimension. Here, the product is the basic
automobile. Therefore, we include the data elements relevant to product as
hierarchies and categories. These would be model name, model year, package
styling, product line, product category, exterior color, interior color, and first
model year. Looking at the other business dimensions for the auto sales
analysis, we summarize the hierarchies and categories for each dimension as
follows:
1.
Product: Model name, model
year, package styling, product line, product category, exterior Color, interior
color, first model year.
2.
Dealer: Dealer name, city,
state, single brand flag, date first operation.
3.
Customer demographics: Age, gender, income range,
marital status, household size,
Vehicles owned, home value, own or
rent.
4.
Payment method: Finance type, term in months,
interest rate, agent.
5.
Time: Date, month, quarter, year, day of week, day of month, season,
holiday flag Let us go back to the hotel occupancy analysis. We have included
three business dimensions. Let us list the possible hierarchies and categories
for the three dimensions.
6.
Hotel: Hotel line, branch
name, branch code, region, address, city, state, Zip Code, Manager,
construction year, renovation year
7.
Room type: Room type, room
size, number of beds, type of bed, maximum occupants, Suite, refrigerator,
kitchenette
8.
Time: Date, day of month,
day of week, month, quarter, year, holiday flag.
Awesome blog. I enjoyed reading your articles. This is truly a great read for me. I have bookmarked it and I am looking forward to reading new articles. Keep up the good work!data science certification
ReplyDeleteWhat a really awesome post this is. Truly, one of the best posts I've ever witnessed to see in my whole life. Wow, just keep it up.
ReplyDeletefull stack developer course