What steps would you take to ensure data integrity when introducing a new data set into a data warehouse (please explain with an example).
Data integrity requires handling of uniqueness integrity, domain integrity, and referential integrity. Introduction of new data-set into a data warehouse demands careful examination of existing data warehouse schema and data, as well as new data-set schema and data. If task it to load a completely new data-set, then loading it into a data warehouse will be relatively simple in comparison with a data-set that is an increment to existing data in a data warehouse. Incremental data-set may contain additional records or dimensions for existing data in a data warehouse. In such a scenario, we must identify new records as well as modification of existing records. Furthermore, if new data-set contains changes for dimensions then we should use mechanisms available for changing dimensions, what we call as slowly changing dimensions. Generally, data warehouse designers prefer to maintain data warehouse design using time-stamps where they don’t need to update changes for existing records in a data warehouse except for the case of dimensions. This simplifies the addition of new data-set.
Another important consideration for data integrity while loading new data-set into a data warehouse is either the data-set is from single source or multiple sources. If the data is from multiple sources then we much check for uniqueness, domain, and references for data across sources and reconcile them into integrated form. Loading process will require the creation of schema mapping of new data set schema with data warehouse schema. During the transformation, we must ensure the necessary data conversions that may be needed, such as date format and domain of discrete/categorical data. Common transformations include:
* Resolving domain conflicts: Data type, length or size, is unique, is null allowed
* Parsing into standard formats
* Measurement units and scaling
* Date handling
* Unique key into composite key
It is also very important to handle auto-generated keys/columns appropriately. SQL Server Integration Services provides us with a handful of features that we can be used to ensure data integrity, while adding new data-set into a data warehouse.
Scenario: One of the business
units has requested a new dashboard build. As the assigned project lead, please
explain how you would go about delivering on the business unit's dashboard?
Dashboard is an important tool
for management to keep track of their business. For similar scenario, I will
follow following steps:
* Requirement gathering
Meeting with relevant
stakeholders to understand the dashboard and related business requirements is
critical. It requires understanding what they need and validate their need with
a thorough discussion. Often stakeholders are themselves not clear about their
requirement and it is our responsibility to clarify any ambiguity they may
have. Document the requirements and get approval from stakeholders.
* Analysis and design
Once we have the precise
requirements, next step will be source schema analysis. We should identify
either we have the data available in our data source that is needed for our
dashboards. If we don’t have the data, then we must make it available from
relevant sources. If/once data is available, we must make sure that it is in
required format. If it is not in the required format, we must identify the
necessary transformations. We must also take into consideration any
optimization measures that may be needed to visualize dashboard in optimum
time. We should also identify the dashboard update frequency both at the data
level and at the visualization level. And the most important part of our
analysis and design will be identification of right number and types of
visualizations to meet the business requirements.
* Prototyping and validation
I prefer prototyping to ensure
that we are meeting the client needs. For this purpose, I will extract a subset
of data and I will create the initial prototype of the dashboard on it. Then I
will get this working dashboard validated from the client to ensure that we and
client both are at the same page in terms of business and dashboard requirements.
It is preferable to get the consent from the stakeholders.
* Implementation
Implementation of the dashboard
using production system
* Testing
Testing of the dashboard to
ensure it fulfills all requirements and is working correctly for all know
scenarios as well as random scenarios.
* End-user training
Train the end-user/client/stake
holder for the usage of dashboard.
* Acceptance trials
Allow end-user/client/stake
holder to make use of dashboard for some time to fulfill their business needs.
Once end-user satisfied, get the acceptance for the completion of the task.
master: The master database holds instance-wide metadata information, server configuration, information about all databases in the instance, and initialization information
You can define a property called collation at the database level that will determine language support, case sensitivity, and sort order for character data in that database. If you do not specify a collation for the database when you create it, the new database will use the default collation of the instance (chosen upon installation).
Do SQL Server benefit from multiple log files?
Although SQL Server can write to multiple data files in parallel, it can write to only one log file at a time, in a sequential manner. Therefore, unlike with data files, having multiple log files does not result in a performance benefit. You might need to add log files if the disk drive where the log resides runs out of space.
References:
Describe the system databases that are present with SQL Server.
master: The master database holds instance-wide metadata information, server configuration, information about all databases in the instance, and initialization information
Resource: The Resource database is a hidden, read-only database that holds the definitions of all system objects. When you query system objects in a database, they appear to reside in the sys schema of the local database, but actually their definition resides in the Resource database.
model: The model database is used as a template for new database. Every new database that you create is initially created as a copy of model. Alter it for changes that you want for all newly created databases.
tempdb: SQL Server stores temporary data in tempdb. It is destroyed and created as a copy of the model database every time an instance of SQL Server is restarted.
For what purpose is the collation property at the database-level?
Do SQL Server benefit from multiple log files?
Although SQL Server can write to multiple data files in parallel, it can write to only one log file at a time, in a sequential manner. Therefore, unlike with data files, having multiple log files does not result in a performance benefit. You might need to add log files if the disk drive where the log resides runs out of space.
References:
Answers taken from: MSSS12 T-SQL Fundadmentals from Itzik Ben-Gan