
Sunday, January 15, 2017

Business Intelligence Opportunities: Common Interview Questions

What steps would you take to ensure data integrity when introducing a new data set into a data warehouse (please explain with an example).

Data integrity requires handling uniqueness integrity, domain integrity, and referential integrity. Introducing a new data set into a data warehouse demands careful examination of the existing warehouse schema and data, as well as of the new data set's schema and data. If the task is to load a completely new data set, loading it will be relatively simple compared with a data set that is an increment to data already in the warehouse. An incremental data set may contain additional records or dimension changes for existing data. In such a scenario, we must identify new records as well as modifications of existing records. Furthermore, if the new data set contains changes to dimensions, we should use the mechanisms available for changing dimensions, known as slowly changing dimensions (SCD). Generally, data warehouse designers prefer to maintain the warehouse using time stamps, so that existing records never need to be updated except in the case of dimensions. This simplifies the addition of a new data set.
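The Type 2 slowly-changing-dimension handling mentioned above can be sketched in a few lines. This is a minimal illustration, not SSIS's actual SCD component: the dimension is held as a list of dicts, and the column names (`valid_from`, `valid_to`, `is_current`) are made up for the example.

```python
from datetime import date

def scd_type2_merge(dimension, incoming, key, tracked_attrs, today=None):
    """Apply a Type 2 slowly-changing-dimension merge: changed rows are
    closed out (expired) and a new current version is appended."""
    today = today or date.today()
    current = {row[key]: row for row in dimension if row["is_current"]}
    for rec in incoming:
        existing = current.get(rec[key])
        if existing is None:
            # Brand-new member: insert as the current version.
            dimension.append({**rec, "valid_from": today,
                              "valid_to": None, "is_current": True})
        elif any(existing[a] != rec[a] for a in tracked_attrs):
            # A tracked attribute changed: expire the old version,
            # then append a fresh current version.
            existing["valid_to"] = today
            existing["is_current"] = False
            dimension.append({**rec, "valid_from": today,
                              "valid_to": None, "is_current": True})
        # Unchanged records are left untouched.
    return dimension

dim = [{"product_id": 1, "category": "Bikes",
        "valid_from": date(2016, 1, 1), "valid_to": None, "is_current": True}]
scd_type2_merge(dim, [{"product_id": 1, "category": "Accessories"},
                      {"product_id": 2, "category": "Clothing"}],
                key="product_id", tracked_attrs=["category"])
# dim now holds the expired Bikes row plus two current rows.
```

Because history is kept by expiring rows rather than updating them in place, the warehouse stays append-mostly, which is exactly why the time-stamp design above simplifies incremental loads.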
Another important consideration for data integrity when loading a new data set into a data warehouse is whether the data set comes from a single source or from multiple sources. If the data comes from multiple sources, we must check uniqueness, domains, and references across sources and reconcile them into an integrated form. The loading process requires a schema mapping between the new data set's schema and the warehouse schema. During transformation, we must apply any necessary data conversions, such as date formats and the domains of discrete/categorical data. Common transformations include:
* Resolving domain conflicts: data type, length or size, uniqueness, nullability
* Parsing into standard formats
* Converting measurement units and scaling
* Date handling
* Converting a unique key into a composite key
It is also very important to handle auto-generated keys/columns appropriately. SQL Server Integration Services provides a handful of features that can be used to ensure data integrity while adding a new data set into a data warehouse.
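As a rough illustration of the checks described above, here is a pre-load validation sketch. The table and column names (`order_id`, `status`, `product_id`) are invented for the example; in practice these checks would map onto SSIS lookups and constraints.

```python
def validate_batch(rows, key, domains, fk_column, dim_keys):
    """Collect integrity violations in an incoming batch before loading.
    Returns a list of (row_index, problem) pairs; an empty list means clean."""
    problems, seen_keys = [], set()
    for i, row in enumerate(rows):
        # Uniqueness integrity: no duplicate business keys in the batch.
        if row[key] in seen_keys:
            problems.append((i, f"duplicate key {row[key]!r}"))
        seen_keys.add(row[key])
        # Domain integrity: categorical columns restricted to allowed values.
        for col, allowed in domains.items():
            if row[col] not in allowed:
                problems.append((i, f"{col}={row[col]!r} outside domain"))
        # Referential integrity: the foreign key must exist in the dimension.
        if row[fk_column] not in dim_keys:
            problems.append((i, f"unknown {fk_column} {row[fk_column]!r}"))
    return problems

batch = [
    {"order_id": 10, "status": "shipped", "product_id": 1},
    {"order_id": 10, "status": "pending", "product_id": 2},   # duplicate key
    {"order_id": 11, "status": "lost", "product_id": 99},     # bad domain + FK
]
issues = validate_batch(batch, key="order_id",
                        domains={"status": {"shipped", "pending", "cancelled"}},
                        fk_column="product_id", dim_keys={1, 2, 3})
# issues flags the duplicate key, the out-of-domain status, and the unknown FK.
```

Rejected rows would typically be routed to an error table for reconciliation rather than silently dropped.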

Scenario: One of the business units has requested a new dashboard. As the assigned project lead, please explain how you would go about delivering the business unit's dashboard.

A dashboard is an important tool for management to keep track of their business. For such a scenario, I would follow these steps:
* Requirement gathering
Meeting with the relevant stakeholders to understand the dashboard and the related business requirements is critical. It requires understanding what they need and validating that need through thorough discussion. Often stakeholders are themselves not clear about their requirements, and it is our responsibility to clarify any ambiguity they may have. Document the requirements and get approval from the stakeholders.
* Analysis and design
Once we have precise requirements, the next step is source schema analysis. We should identify whether the data needed for the dashboard is available in our data source. If we don't have the data, we must make it available from the relevant sources. Once the data is available, we must make sure it is in the required format; if it is not, we must identify the necessary transformations. We must also consider any optimization measures needed to render the dashboard in acceptable time, and we should identify the dashboard's update frequency, both at the data level and at the visualization level. The most important part of analysis and design is identifying the right number and types of visualizations to meet the business requirements.
* Prototyping and validation
I prefer prototyping to ensure that we are meeting the client's needs. For this purpose, I would extract a subset of the data and create an initial prototype of the dashboard on it. I would then have the client validate this working dashboard, to ensure that we and the client are on the same page in terms of the business and dashboard requirements. It is preferable to get explicit consent from the stakeholders.
* Implementation
Implement the dashboard on the production system.
* Testing
Test the dashboard to ensure it fulfills all requirements and works correctly for all known scenarios as well as random scenarios.
* End-user training
Train the end user/client/stakeholder in the usage of the dashboard.
* Acceptance trials
Allow the end user/client/stakeholder to use the dashboard for some time to fulfill their business needs. Once the end user is satisfied, get acceptance for the completion of the task.

Describe the system databases that are present with SQL Server.

master: The master database holds instance-wide metadata, server configuration, information about all databases in the instance, and initialization information.
Resource: The Resource database is a hidden, read-only database that holds the definitions of all system objects. When you query system objects in a database, they appear to reside in the sys schema of the local database, but their definitions actually reside in the Resource database.
model: The model database is used as a template for new databases. Every new database that you create is initially created as a copy of model. Alter it with any changes that you want for all newly created databases.
tempdb: SQL Server stores temporary data in tempdb. It is destroyed and created as a copy of the model database every time the SQL Server instance is restarted.

What is the purpose of the collation property at the database level?

You can define a property called collation at the database level that will determine language support, case sensitivity, and sort order for character data in that database. If you do not specify a collation for the database when you create it, the new database will use the default collation of the instance (chosen upon installation).
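The effect of case sensitivity on sort order can be sketched outside SQL Server. This Python snippet only mimics the behavior of a case-sensitive versus a case-insensitive collation (the `_CS_`/`_CI_` names in the comments are the usual SQL Server collation suffixes); real collations also govern accent sensitivity and language-specific ordering.

```python
names = ["apple", "Banana", "banana", "Apple"]

# Case-sensitive ordering (roughly like a *_CS_AS collation): plain
# code-point comparison puts all uppercase letters before lowercase.
case_sensitive = sorted(names)

# Case-insensitive ordering (roughly like a *_CI_AS collation): compare
# on a casefolded key, with the original string as a tie-breaker so the
# result stays deterministic.
case_insensitive = sorted(names, key=lambda s: (s.casefold(), s))

print(case_sensitive)    # ['Apple', 'Banana', 'apple', 'banana']
print(case_insensitive)  # ['Apple', 'apple', 'Banana', 'banana']
```

Under a case-sensitive collation, `WHERE name = 'apple'` would also fail to match `'Apple'`, so the property affects comparisons as well as sorting.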

Does SQL Server benefit from multiple log files?

Although SQL Server can write to multiple data files in parallel, it can write to only one log file at a time, in a sequential manner. Therefore, unlike with data files, having multiple log files does not result in a performance benefit. You might need to add log files if the disk drive where the log resides runs out of space.

References:
Answers taken from: Microsoft SQL Server 2012 T-SQL Fundamentals by Itzik Ben-Gan

Thursday, February 21, 2013

Building Data Cubes: An Introduction to SQL Server Analysis Services

SQL Server Analysis Services is a good option for experts working on data warehousing and business intelligence technologies, specifically OLAP and data mining. SQL Server Analysis Services provides features to design, create, and manage multi-dimensional data cubes using any of your favorite data sources. A Data Mining feature is also available, and it is very good, but my focus in this blog will be OLAP using SQL Server Analysis Services. I prefer RapidMiner for my data mining tasks :-). To start working with SQL Server Analysis Services, it must be installed on your computer along with the client components. This tutorial uses SQL Server 2008. The most important components for you will be SQL Server Business Intelligence Development Studio and SQL Server Analysis Services. You can find SQL Server Business Intelligence Development Studio in your Programs menu as shown in the figure below:


After starting the development studio, go to the File -> New -> Project menu. You will find many different project types, as shown in the figure below. For OLAP purposes, the Analysis Services Project is the one of interest. Please select an appropriate folder and name for your project.



Once you are done with project selection, you will end up at a screen similar to the one shown in the figure below:


On the right side of the studio, you can view the Solution Explorer. In Solution Explorer, there is a project named MyOLAP, which we have just created. The folders under the MyOLAP project are important; each of them has a significant meaning and purpose. For our OLAP task, we will start with Data Sources. The Data Sources folder contains the list of data sources that we add to our project and intend to use for analysis. Here you can add any data source for which a database connectivity driver is available on your computer. Just right-click on the Data Sources folder and click New Data Source. This brings up the Data Source Wizard.


As you move next, you can view the list of existing data connections. You can use an existing one, or create a new one using the New button.


To create a new database connection, you have to specify the provider that you want to use for database connectivity. For example, in the figure below, I am using SQL Server Native Client to connect to the local SQL Server on my computer. Provide the server name, user id, and password. If you have provided the correct connection information, you can view the list of existing databases in the "Select or enter a database name" combo box. You can select the appropriate database or type its name.


Once you are done creating your database connection, the newly created connection will appear in the list of data connections, as shown in the figure below.


As you move next, the wizard will ask you for Impersonation Information. This information defines the Windows credentials that Analysis Services will use to connect to the data source. More details can be found under Set Impersonation Options (SSAS - Multidimensional). I have used the credentials of the current user; however, you can also use a specific account for your project.


At the end, the wizard asks you to specify a data source name and finish.


As you finish the wizard, you can see the data source beneath the Data Sources folder.


The second step is to create a Data Source View. Data Source Views allow us to select the database objects of interest from the data source. A data source may contain many tables or views; using Data Source Views, we select the objects that we intend to use in our project. Right-click on Data Source Views and click New Data Source View. This brings up the Data Source View Wizard.


As you move next, you can view the data source that you created in the previous step.


On the next screen, you can view the objects available in your selected data source. This view allows you to select the objects of interest. For example, I have selected the FactInternetSales table from the AdventureWorksDW database. This screen provides a very useful feature to auto-select related tables: simply click Add Related Tables, and the wizard will automatically add the tables related to the FactInternetSales table. During the definition of dimensions and the cube in our project, only these tables will be visible/available to us.


On the next screen, provide a title for the view and finish the wizard.


This creates a data source view beneath the Data Source Views folder. The IDE also shows a diagram of the tables with their inter-connectivity, as shown in the figure below.


Once you are done creating the data source view, the next step is to create dimensions for your cube. To create a dimension, right-click on the Dimensions folder and click New Dimension.


As you move next, the wizard will ask you about the dimension creation method. As we are using the AdventureWorksDW database, we will be using the "Use an existing table" option. For more information on the other options, please refer to Select Creation Method (Dimension Wizard).


Now we have to select the main table for the dimension. For example, in the figure below, I have selected Product as the dimension. A key column is mandatory, as this key column will be used to connect/link the dimension with the fact table.


On the next screen, you can select the attributes that you want to use in your dimension, for example to display or to generate concept hierarchies. In the figure below, I have selected the English Product Name, which I will use to display this dimension during my OLAP analysis.



Once you are done, please specify a valid name for your dimension to finish.


All the dimensions that you create are available underneath the Dimensions folder in the solution explorer.


Please create all the dimensions that are required for your cube. For example, in this tutorial, we have created three dimensions, i.e., Product, Region, and Time. To create a data cube, right-click on the Cubes folder and click New Cube. This brings up the cube creation wizard.


As you move next, this wizard will also ask you for the cube creation method, as was the case with dimension creation. We will select the "Use existing tables" option. For other creation methods, please refer to Select Creation Method (Cube Wizard).


As we move next, we can view the tables that we selected in our data source view. We will use FactInternetSales as our fact table.


On the next screen, the wizard asks us to select the measures for our cube. We have selected only two, but you can select any of them according to your requirements.


On the next screen, the wizard will ask you to select the existing dimensions that you want to use in your cube. We have created three dimensions, and we will be using all three of them.


On the next screen, the wizard also allows us to create new dimensions, if required. We will skip this screen because we have already created the dimensions of interest.


On the next screen, name the cube and finish the wizard.


After finishing the wizard, the IDE will display the Data Source View of the cube in the middle of the screen, as well as the measures and dimensions on the left side of the window.


All you need to do now is process the cube. Just click the Process icon, second from the left on the toolbar, or use the Cube -> Process menu option. This performs the necessary processing of the cube using a wizard.


During processing, you can observe the progress. As long as nothing turns red, everything is fine.


Once the cube is processed, you can move to the Browser tab to browse your cube data. You only need to drag and drop the dimensions and measures into the white area in the middle.


As you drop each field, you will get a preview of your data. Once all fields are dropped, you can analyze the Internet Sales using multi-dimensional analysis, as shown in the figure below.
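Conceptually, what the cube browser does when you drop dimensions and a measure is a group-by over the fact table. A minimal sketch of that aggregation, with made-up fact rows standing in for FactInternetSales:

```python
from collections import defaultdict

# Tiny stand-in for FactInternetSales: each row links to dimension
# members (product, region, year) and carries a measure (sales_amount).
facts = [
    {"product": "Road Bike", "region": "Europe",  "year": 2013, "sales_amount": 1200.0},
    {"product": "Road Bike", "region": "Europe",  "year": 2013, "sales_amount": 800.0},
    {"product": "Road Bike", "region": "Pacific", "year": 2013, "sales_amount": 500.0},
    {"product": "Helmet",    "region": "Europe",  "year": 2013, "sales_amount": 90.0},
]

def aggregate(facts, dims, measure):
    """Roll the measure up to the requested dimensions, like dropping
    fields onto the cube browser."""
    cells = defaultdict(float)
    for row in facts:
        cells[tuple(row[d] for d in dims)] += row[measure]
    return dict(cells)

# Slice by product and region, summing the sales amount.
by_product_region = aggregate(facts, ["product", "region"], "sales_amount")
# {('Road Bike', 'Europe'): 2000.0, ('Road Bike', 'Pacific'): 500.0,
#  ('Helmet', 'Europe'): 90.0}
```

Analysis Services, of course, pre-computes and stores many of these aggregations during the processing step above, which is why browsing the cube is fast even over large fact tables.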