Generate Synthetic Data with Power Query

In this blog post I show how to generate synthetic data using Power Query. This data will then serve as the source for the Power BI blog on Insurance Triangles.

Synthetic data offers remarkable benefits:
1. Eliminates the need for actual data.
2. Allows for the simulation of numerous scenarios akin to real-world data.

In this example the following guidelines will apply:
– Generate data spanning nine years, from 2015 to 2023, encompassing both premiums and claims.
– Simulate the initiation of 100 new policies annually, with premiums paid semi-annually.
– For simplicity, 3 claims will be randomly assigned to selected policies, distributed across various months up to the end of 2023.

Three data sets are needed:
1. A combination of Years and Months, aiming for at least 108 monthly intervals per year.

2. Premiums – detailing the year and month when premiums are received.

3. Claims – specifying the year and month when claims are paid.

Starting with the development month table, we will need 108 months (9 years x 12 months) per year to form the matrix visual essential for our triangles data. Example below:

Open Power BI Desktop and navigate to Power Query to initiate a new Blank Query.
1. Use = List.Numbers(2015,9) to generate a list covering 2015 to 2023.
2. Transform this list into a table and label the column as “Year” (we’ll adjust the data types later).

3. Introduce a custom column named “Month” with = {1..108} to produce a list of 108 entries per year.

4. Expand this column so each year breaks down into 108 distinct rows.

5. Insert an index column, rename it “YearMonthKey”, and rearrange the columns accordingly.

6. Finally, adjust the data types for Year and Month to whole numbers, renaming the table to “YearMonth.”

For Policy data:
Begin another Blank Query.
1. Enter = List.Numbers(1,100) to create a list of 100 rows and convert it into a table.
2. Add a custom “Year” column with = {2015..2023} to include 9 years’ data, expanding to new rows as necessary.

3. Introduce a derived policy ID (PolID) column.

4. Insert two custom columns: the first to duplicate each entry for the two halves of the year, and the second to allocate a random month for the payment within each respective half-year period.

5. Insert a final custom column called “Premium_Rand” to randomly assign premium, for instance, anywhere between 100 and 1000.

6. Finalize the dataset by adjusting data types and organizing columns, renaming the table to “PolicyYearMonth_Premiums.”

For Claims data:
1. Begin another blank query and start with the policy data table created earlier.
2. Narrow down to Year and PolID columns, remove duplicates.

3. Add a custom column to randomly generate 5 rows per policy. Reduce the number of policies, for instance, by filtering for 2 and 4.

4. Add a custom column to generate 3 rows per policy.

5. Generate random claim amounts, for instance, between 100 and 250.

6. Add a custom column named “MaxInterval” to calculate the maximum month interval starting from the year the policy was initiated.

7. Add a custom column named MonthInterval to randomly choose between 1 and MaxInterval, which will represent the month in which a claim was paid.

8. After cleaning and organizing, name the table “PolicyYearMonth_Claims.”

To finalize our data model, we merge the Premiums and Claims datasets with the initial YearMonth table to extract the YearMonthKey ensuring coherence across the model.

1. Select the Premiums dataset you’ve created and make sure the final action is highlighted in the Power Query steps pane. Then, navigate to the Home tab on the ribbon and choose Merge Queries.
2. Pick the Year and MonthInterval columns from the PolicyYearMonthPremiums dataset. Similarly, select Year and Month from the YearMonth dataset. Use a Left Outer join and ensure that rows align perfectly between both tables.

3. Generate a selection from the newly formed YearMonth column to extract the YearMonthKey.

With the policy dataset finalized and ready to be linked with the YearMonth in the data model, we can apply the same process to the Claims dataset.

You can now leave Power Query by selecting “Close & Apply.”

In the data model view, it’s possible to establish a one-to-many relationship linking YearMonth with both Premiums and Claims using YearMonthKey.

With this, our Power BI Triangles data model stands complete. For guidance on crafting measures and additional elements, please consult the triangles blog.

I hope you found this information valuable. Additionally, I’ve written a blog post on transforming a flat data set into a star schema that might interest you.

Generate Synthetic Data with Power Query

Published by Victor Mendes

Leave a comment Cancel reply

Share this:

Related

Published by Victor Mendes

Leave a comment Cancel reply