Create a Star Schema with Power Query

In this blog post I show how to create a star schema from a flat data set. This data will also serve as the source data set for future blogs.

What is a Star Schema? A star schema is a type of database schema that is a key concept in dimensional modelling. It organizes data into one or more fact tables referencing any number of dimension tables, which helps in simplifying the database design and enhancing reporting performance.

Star schemas are the preferred choice for Power BI data models for several compelling reasons, primarily revolving around performance, simplicity, and efficiency in data analysis and reporting. Here are key factors that make star schemas particularly well-suited for Power BI data models:
Optimized Performance
Quicker Queries: The design simplifies queries, leading to faster execution due to fewer necessary joins than in normalized models, enhancing query speed.
Streamlined Data Updates: Incremental updates are easier and more efficient, improving data refresh rates and performance in applications like Power BI.
Simplified Data Model
Simplified Structure: Makes it easier for users to grasp and work with the data in Power BI, improving usability for both developers and end-users.
Improved Data Discovery: Exploration and insight generation in Power BI is easy and enables users to delve into specifics and/or aggregate data easily.
High Compatibility with DAX
Efficient Calculations: The Data Analysis Expressions (DAX) language used in Power BI for creating calculations and measures performs optimally with star schema data models. This is because DAX is designed to work well with columnar databases and tables with clear hierarchies and relationships, as found in star schemas.
Scalability
Supports Large Datasets: Star schemas can effectively support large datasets by optimizing how data is stored and accessed. This scalability is crucial in Power BI environments where the volume of data can grow rapidly as the business evolves.

Our example data set is generated via Power Query and consists of the following attributes:
Account Year, Policy Ref, Parent Business Unit, Child Business Unit, Policy Start Date, Policy End Date, Is Active and Premium Amount. Example Below:

M Code for synthetic data is below:

let
Source = List.Numbers(1,250),
#”Converted to Table” = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
#”Added Custom4″ = Table.AddColumn(#”Converted to Table”, “AccountYear”, each {2010..2023}),
#”Expanded Year” = Table.ExpandListColumn(#”Added Custom4″, “AccountYear”),
#”Add PolID” = Table.AddColumn(#”Expanded Year”, “PolicyRef”, each “PolID” & Number.ToText([Column1], “00”) & “_” & Number.ToText([AccountYear])),
#”Add Business Unit Parent” = Table.AddColumn(#”Add PolID”, “BusUnit_Parent”, each “BusUnitP” & Number.ToText(Number.RoundUp(Number.RandomBetween(0,5)))),
#”Add Business Unit Child” = Table.AddColumn(#”Add Business Unit Parent”, “BusUnit_Child”, each if Text.Contains([BusUnit_Parent], “P1”) then
“BusUnitC” & Number.ToText(Number.RoundUp(Number.RandomBetween(0,5))) else if Text.Contains([BusUnit_Parent], “P2”) then
“BusUnitC” & Number.ToText(Number.RoundUp(Number.RandomBetween(5,10))) else if Text.Contains([BusUnit_Parent], “P3”) then
“BusUnitC” & Number.ToText(Number.RoundUp(Number.RandomBetween(10,15))) else if Text.Contains([BusUnit_Parent], “P4”) then
“BusUnitC” & Number.ToText(Number.RoundUp(Number.RandomBetween(15,20))) else if Text.Contains([BusUnit_Parent], “P5”) then
“BusUnitC” & Number.ToText(Number.RoundUp(Number.RandomBetween(20,25))) else “None”),
#”Added Custom” = Table.AddColumn(#”Add Business Unit Child”, “Temp_CountryID”, each Number.ToText(Number.RoundUp(Number.RandomBetween(0,5)))),
#”Add Country” = Table.AddColumn(#”Added Custom”, “Country”, each if [Temp_CountryID] = “1” then “UK” else if [Temp_CountryID] = “2” then “France” else if [Temp_CountryID] = “3” then “Spain” else if [Temp_CountryID] = “4” then “Italy” else “Denmark”),
#”Removed Columns1″ = Table.RemoveColumns(#”Add Country”,{“Temp_CountryID”}),
#”Add Policy Start Date” = Table.AddColumn(#”Removed Columns1″, “PolicyStartDate”, each #date([AccountYear],Number.RoundUp(Number.RandomBetween(0,12)),Number.RoundUp(Number.RandomBetween(0,28)))),
#”Add Policy End Date” = Table.AddColumn(#”Add Policy Start Date”, “PolicyEndDate”, each Date.AddMonths([PolicyStartDate],
if Number.RoundUp(Number.RandomBetween(0,2)) = 1 then 6 else 12)),
#”Changed Type1″ = Table.TransformColumnTypes(#”Add Policy End Date”,{{“PolicyStartDate”, type date}, {“PolicyEndDate”, type date}}),
#”Add Is Active” = Table.AddColumn(#”Changed Type1″, “IsActive”, each if [PolicyStartDate] =
DateTime.Date(DateTime.LocalNow()) then “Y” else “N”),
#”Add Random Premium” = Table.AddColumn(#”Add Is Active”, “Premium_Rand”, each Number.RandomBetween(150,500)),
#”Removed Columns” = Table.RemoveColumns(#”Add Random Premium”,{“Column1”}),
#”Changed Type” = Table.TransformColumnTypes(#”Removed Columns”,{{“AccountYear”, Int64.Type}, {“PolicyRef”, type text}, {“BusUnit_Parent”, type text}, {“BusUnit_Child”, type text}, {“Premium_Rand”, type number}, {“Country”, type text}, {“IsActive”, type text}})
in
#”Changed Type”

We will be creating the following dimensions first:
DimAccountYear, DimPolicyRef, DimCountry, DimIsActive, DimBusUnit_Parent, DimBusUnit_Child

While working in Power Query, start by generating a new dataset query through referencing the PolicyYearData query. Next, follow these steps in Power Query to create the DimAccountYear dimension.

1. Highlight AccountYear, right-click and remove other columns.
2. Highlight AccountYear, right-click and remove duplicates. Rename column to AccountYearAsc.
3. Add index column from the Add Column tab. Rename column AccountYearAsc_OrderByAsc.
4. Add a custom column from AccountYearAsc. Rename column AccountYearDesc.
5. Sort by column AccountYearDesc. Add another index column and rename AccountYearDesc_OrderByDesc.
6. Sort by column AccountYearAsc.

Now let’s create the DimPolicyRef in a similar way.

1. Create a dataset query by referencing the PolicyYearData query.
2. Highlight PolicyRef, right-click and remove other columns.
3. Highlight PolicyRef, right-click and remove duplicates. Sort column.
4. Add index column from the Add Column tab. Rename column PolicyRefKey.
5. Reorder columns with PolicyRefKey first.

Repeat the process for the other dimensions

We now need to establish a date dimension. To do this, we will construct a date dimension that spans from the minimum policy start date to the maximum policy end date, ensuring that we only generate the necessary rows. Additionally, we will create other date-related columns that will be useful for the date dimension.

Following the method described previously, we’ll make a slight alteration by omitting the addition of an index and including a line to convert to a parameter. Begin by opening a new blank query, then copy and paste the following code into the advanced editor window for the Min Policy Start Date.

let
Source = PolicyYearData,
#”Removed Other Columns” = Table.SelectColumns(Source,{“PolicyStartDate”}),
#”Sorted Rows” = Table.Sort(#”Removed Other Columns”,{{“PolicyStartDate”, Order.Ascending}}),
#”Kept First Rows” = Table.FirstN(#”Sorted Rows”,1),
PolicyStartDate = #”Kept First Rows”{0}[PolicyStartDate]
in
PolicyStartDate

We do the same for the Max Policy End Date.

let
Source = PolicyYearData,
#”Removed Other Columns” = Table.SelectColumns(Source,{“PolicyEndDate”}),
#”Sorted Rows” = Table.Sort(#”Removed Other Columns”,{{“PolicyEndDate”, Order.Descending}}),
#”Kept First Rows” = Table.FirstN(#”Sorted Rows”,1),
PolicyEndDate = #”Kept First Rows”{0}[PolicyEndDate]
in
PolicyEndDate

In a new blank query window add the following code to create the date dimension.

let
Source = List.Generate(
() => qMinPolicyStartDate, // initial value
each _ <= qMaxPolicyEndDate, // run until EndDate
each Date.AddDays(_, 1 ) // increase 1 day for each step
),
#"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
#"Renamed Columns" = Table.RenameColumns(#"Converted to Table",{{"Column1", "DateKey"}}),
#"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns",{{"DateKey", type date}}),
#"Added Custom" = Table.AddColumn(#"Changed Type", "Year", each Date.Year( [DateKey] )),
#"Changed Type1" = Table.TransformColumnTypes(#"Added Custom",{{"Year", Int64.Type}}),
#"Added Custom1" = Table.AddColumn(#"Changed Type1", "Month", each Date.Month( [DateKey] )),
#"Changed Type2" = Table.TransformColumnTypes(#"Added Custom1",{{"Month", Int64.Type}}),
#"Added Custom2" = Table.AddColumn(#"Changed Type2", "MonthName", each Date.MonthName( [DateKey] )),
#"Changed Type3" = Table.TransformColumnTypes(#"Added Custom2",{{"MonthName", type text}}),
#"Added Custom3" = Table.AddColumn(#"Changed Type3", "MonthNameShort", each Text.Start(Date.MonthName( [DateKey] ),3)),
#"Changed Type4" = Table.TransformColumnTypes(#"Added Custom3",{{"MonthNameShort", type text}}),
#"Added Custom4" = Table.AddColumn(#"Changed Type4", "Quarter", each Date.QuarterOfYear( [DateKey] )),
#"Changed Type5" = Table.TransformColumnTypes(#"Added Custom4",{{"Quarter", Int64.Type}}),
#"Added Custom5" = Table.AddColumn(#"Changed Type5", "QuarterName", each "Q" & Number.ToText(Date.QuarterOfYear( [DateKey] )))
in
#"Added Custom5"

The date dimension table should now appear as follows:

Finally, let’s create our fact table, FactPolicyPremiums. Fact tables typically handle transactional data and include foreign keys that link back to dimension keys in a many-to-one relationship. They also house numeric measurements that result from transactions or events, such as premiums.

1. Start by creating a dataset query by referencing the PolicyYearData query. Rename this query to FactPolicyPremiums.
2. Begin with the PolicyRef column. From the Home tab, select Merge Queries and join using the PolicyRef column from both tables. Make sure all rows from both tables match and confirm by clicking OK.

3. In the resulting column, use the table expander to select only the PolicyRefKey attribute. Once the new PolicyRefKey column is formed, remove the original PolicyRef column—this newly formed column will now link to the DimPolicyRef dimension via the foreign key.

Repeat the same three steps: Merge, Expand, and Remove for all other dimensions, excluding the date. The final configuration of the fact dataset should appear as follows:

Once done, you can click on “Close and Apply” to finalize your changes.

Before creating our data model, we need to identify our date dimension and Mark as a Date Table.

We can now link our dimension tables in a 1 to many relationships joining on key attributes. Drag the key attribute from the dimension table over the equivalent foreign key in the fact table.

In our example, we’ve connected the date dimension to both the PolicyStartDate and PolicyEndDate. However, only one active relationship can exist between two tables at a time, so the other relationship is marked as inactive (indicated by a dashed line). This inactive relationship can be activated during the calculation of a measure by employing the DAX function USERRELATIONSHIP().

With these connections established, we are now well-prepared to start crafting measures and visualizations. Our data model architecture is robust and optimized for the Power BI query engine, also known as VertiPaq, ensuring efficient performance.

Leave a comment