
Data warehouse – Test Data Management


Test Data Management

Hi All,

Hope you are all doing great… it’s been a very long time since I posted on my blog… my sincere apologies 😦

Happy Reading……:-)


Test data availability is one of the most significant issues leading to schedule slippage in DWH projects.

So testers and test managers should know exactly what the test data requirements are, and they need to define a test data strategy.

The Test Data Management should cover the below points:

  1. Subset of PROD data catered for Testing Requirements

If you are working on an enhancement project, you can use the existing data that is already loaded into the warehouse. Even then, you can’t use data pulled directly from Production as-is, because it is sensitive to the bank and its customers.

  2. Mask the data as per regulatory and data privacy requirements

All customer-related information should be masked. Masked data is difficult to test: suppose the Name column should accept CHAR(20); once the data is masked, we cannot do boundary value analysis (BVA) on that column.

  3. Fabricate test data in case of unavailability

Your source system cannot provide data for all of your testing scenarios, and lack of test data is a major issue in data warehouse projects. A tester should analyse the system and create the test data in the form of insert scripts, then verify the test data is picked up correctly by the extraction query before it is loaded into the target tables. A sketch follows below.
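For instance, a minimal insert script might look like this (TEST_DB and the table/column names here are made-up examples for illustration, not from any real project):

INSERT INTO TEST_DB.CUST_STG (CUST_NAME, CUST_ID, CUST_OP_DT, CUST_STATUS)
VALUES ('Test_Customer_01', '999001', DATE '2013-01-15', 'A');

-- A negative scenario the source cannot supply, e.g. a closed customer
INSERT INTO TEST_DB.CUST_STG (CUST_NAME, CUST_ID, CUST_OP_DT, CUST_STATUS)
VALUES ('Test_Customer_02', '999002', DATE '2009-06-30', 'C');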

Typical Challenges involved in Test Data Management:

  1. Delay in Schedule

The test data from upstream should be UAT-tested; otherwise our testing will not be effective. Suppose we are in SIT and our upstream has not completed UAT: our schedule will be postponed because UAT-tested data from upstream is unavailable.

  2. Issues to cover Business Cases

Test data for some business scenarios cannot be produced upstream or through data fabrication. Take reopening of an account as an example: a customer closed his account a few years ago, and his records are kept in ACCT_STATUS as closed records. When the customer comes back, the bank will identify his existing account and try to reopen it across all the tables, so that a duplicate record does not enter the warehouse. For this kind of scenario it is difficult to manufacture the data in the time available.
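If you do end up fabricating such data, it has to be kept consistent across every table that holds the account. A rough sketch, with purely illustrative table and column names:

-- Expire the old 'Closed' status row rather than creating a new account
UPDATE TEST_DB.ACCT_STATUS
SET EXPY_D = CURRENT_DATE
WHERE ACCT_ID = '999001'
  AND ACCT_STATUS_C = 'CLOSED';

-- Reactivate the account itself so no duplicate record enters the warehouse
UPDATE TEST_DB.ACCT
SET ACCT_STATUS_C = 'OPEN'
WHERE ACCT_ID = '999001';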

  3. Data Privacy & Compliance

Banks are very protective of customer data because of the risk of data theft. So when we get data from PROD for our regression testing, most of the customer-related data will be masked as per the compliance policy, which means we cannot reproduce that data or test those scenarios in our environment.

  4. Dependency on SMEs

SME availability is one of the biggest issues in the data warehouse world.

System Architect – accountable for the system design based on the business requirements. He works closely with the Data Modeler to design the Source-to-Target (S2T) mapping, and needs either to fit the current requirement into the existing application landscape or to create a new landscape for it.

Data Modeler – accountable for designing the S2T. The Data Modeler should know the physical and logical design of the system and the database. For any data warehouse project, the S2T is the key document.

ETL Developers – accountable for turning the high-level and low-level design of the requirements into code. An ETL developer should be capable of designing the ETL code without compromising the performance of the extraction and loading mechanism, and should know what kind of transformation mechanism to design for each particular requirement.

Test Analysts / Test Managers – test managers should foresee whether all the technical and business requirements can be tested in the given time frame, because the test data and system behaviour may change, which can cause schedule slippage.

  5. Availability of Adequate Data

We should start our system testing with the source system’s UAT-tested data. If, due to some issue, the source system has not completed UAT and can provide only system-tested data, that data is not sufficient for our system testing, so we need to fabricate test data in order to kick off our testing.

  6. Volume of Test Data

The data volume threshold will be declared by the system owners. However, we cannot create that declared volume of data in our environment and test performance above and below the threshold, so there might be spool space issues when we go to Production.

  7. Multiple Sources & Formats of Data

When multiple source systems participate in the ETL, it is difficult to get the data from all of them at the same time to begin our testing; in this case we again need to create mock test files to begin the testing.

  8. Version Control of Data

Versioning of data is very difficult in the data warehouse world, although we can see the history of the data using a few housekeeping columns (EFFT_D, EXPY_D). Still, the data loads and extracts should be versioned to avoid confusion.
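For example, a point-in-time query over those housekeeping columns might look like this (assuming EFFT_D/EXPY_D mark each row’s effective and expiry dates, with a high expiry date on the current row; the table name is illustrative):

SELECT *
FROM TEST_DB.CUST
WHERE EFFT_D <= DATE '2013-10-01'
  AND EXPY_D >  DATE '2013-10-01';  -- rows effective as of 01-Oct-2013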

Cheers,

Asik

Chapter 14 – A Sample ETL Project – Customer Management in a Hospital


Hi All,

Have a pleasant week ahead!!!

In this post let me explain a Project Structure.


Suppose I run a multi-specialty hospital with branches across the world, famous for vaccination. Patients who come to any of my hospitals around the globe are given a user card with which they can access any of my hospitals in the world.

Business Need

Today we maintain all customers in one database, and there are heaps and heaps of customers related to my hospital. So I decided to split the customers up by country and load them into corresponding country tables. Elaborated details of this requirement will be given in the Business Requirement Specification.

Design Need 

To pull the customers per country, a designer should know all the places where the customer data is available. The data extraction will be done by our source system: they will pull all the relevant customer data and give us a data file.

In Design Documents you will have

Source File Specification 

1)   File Name Specification – name string, extension of the files
2)   Date and Time Format of the File – YYYYMMDD, HHMMSSTT or any other format
3)   Control File Specification – name string, extension of the files
4)   Header Record Layout – |H|Customer_Records|20131012|1245872|
5)   Detail Record Layout – |D|Asik|123456|20101012|20121013|MVD|Sanjay|NSW|AU|06031987|A
6)   Trailer Record Layout – |T|10|

The detail record layout tells you what data you are getting from the source, its data type, whether each column is mandatory, and the length of each column.

File Position | Column Name         | Field Length | Data Type | Mandatory (Y/N) | Key Column
1             | Customer Name       | 255          | VARCHAR   | Y               | Y
2             | Customer ID         | 18           | VARCHAR   | Y               | N
3             | Customer Open Date  | 8            | DATE      | Y               | N
4             | Last Consulted Date | 8            | DATE      | N               | N
5             | Vaccination Type    | 5            | CHAR      | N               | N
6             | Doctor Consulted    | 255          | CHAR      | N               | N
7             | State               | 5            | CHAR      | N               | N
8             | Country             | 5            | CHAR      | N               | N
9             | Post Code           | 5            | INT       | N               | N
10            | Date of Birth       | 8            | DATE      | N               | N
11            | Active Customer     | 1            | CHAR      | N               | N
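From a spec like this, the staging table DDL follows almost mechanically. A minimal Teradata sketch (TEST_DB.CUST_STG is grounded in the S2T below, but the column names here are my own illustration):

CREATE TABLE TEST_DB.CUST_STG (
  Name      VARCHAR(255) NOT NULL,  -- key column per the spec
  Cust_ID   VARCHAR(18)  NOT NULL,
  Open_Dt   DATE         NOT NULL,
  Consul_Dt DATE,
  VAC_ID    CHAR(5),
  DR_Name   CHAR(255),
  State     CHAR(5),
  Country   CHAR(5),
  Post_Code INTEGER,
  DOB       DATE,
  FLAG      CHAR(1)
) PRIMARY INDEX (Cust_ID);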


Please refer to my post Source File Validation to see how to validate the above details.

The sample file will look like:

|H|Customer_Records|20131012|1245872|
|D|Asik|123456|20101012|20121013|MVD|Sanjay|NSW|AU|06031987|A
|D|Alex|123457|20101012|20121013|MVD|Sanjay1|SA|USA|06031987|A
|D|John|123458|20101012|20121013|MVD|Sanjay|TN|IND|06031987|A
|D|Mathew|123459|20101012|20121013|MVD|Sanjay|WAS|PHIL|06031987|A
|D|Matt|12345|20101012|20121013|MVD|Sanjay|BOS|NYC|06031987|A
|D|Jacob|1256|20101012|20121013|MVD|Sanjay|VIC|AU|06031987|A
|D|Arun|1456|20101012|20121013|MVD|Sanjay|QA|AU|06031987|A
|D|Wague|23456|20101012|20121013|MVD|Sanjay|AOL|CAN|06031987|A
|D|Steve|126|20101012|20121013|MVD|Sanjay|NSW|AU|06031987|A
|D|Neil|156|20101012|20121013|MVD|Sanjay|NSW|AU|06031987|A
|T|10|

Now, using the ETL process, we load the data into the staging tables. The intermediate table will look like below.

Staging Table Load 


Please check my ETL Process post for how the data is loaded and what we need to verify at this step:

Name   | Cust_ID | Open_Dt  | Consul_Dt | VAC_ID | DR_Name | State | Country | DOB      | FLAG
Asik   | 123456  | 20101012 | 20121013  | MVD    | Sanjay  | NSW   | AU      | 06031987 | A
Alex   | 123457  | 20101012 | 20121013  | MVD    | Sanjay1 | SA    | USA     | 06031987 | A
John   | 123458  | 20101012 | 20121013  | MVD    | Sanjay  | TN    | IND     | 06031987 | A
Mathew | 123459  | 20101012 | 20121013  | MVD    | Sanjay  | WAS   | PHIL    | 06031987 | A
Matt   | 12345   | 20101012 | 20121013  | MVD    | Sanjay  | BOS   | NYC     | 06031987 | A
Jacob  | 1256    | 20101012 | 20121013  | MVD    | Sanjay  | VIC   | AU      | 06031987 | A
Arun   | 1456    | 20101012 | 20121013  | MVD    | Sanjay  | QA    | AU      | 06031987 | A
Wague  | 23456   | 20101012 | 20121013  | MVD    | Sanjay  | AOL   | CAN     | 06031987 | A
Steve  | 126     | 20101012 | 20121013  | MVD    | Sanjay  | NSW   | AU      | 06031987 | A
Neil   | 156     | 20101012 | 20121013  | MVD    | Sanjay  | NSW   | AU      | 06031987 | A

Now my staging data is ready, and we need to load it into the corresponding target tables.

As a project, we will test the data load from file to staging as Phase I, and from staging to the target tables as Phase II.

In this project we do not have any transformation rules, so we will simply insert the records from the staging table into the corresponding target tables.

For example, all customers related to India will go to Table_India, and so on. In my next blog let me come up with a few more complex examples for target table loads.

The Source-to-Target document will look like:

Customers from India will be loaded by the below logic:

S.No | Source Column       | Source Table | Transformation Rule                            | Target Table | Key/Value | Target Column
1    | Customer Name       | CUST_STG     | If source value is null do not load, else load | CUST_IND     | Key       | CUST_NAME
2    | Customer ID         | CUST_STG     | If source value is null do not load, else load | CUST_IND     | Key       | CUST_ID
3    | Customer Open Date  | CUST_STG     | Assign source value                            | CUST_IND     | Value     | CUST_OP_DT
4    | Last Consulted Date | CUST_STG     | Assign source value                            | CUST_IND     | Value     | CUST_CONS_DT
5    | Vaccination Type    | CUST_STG     | Assign source value                            | CUST_IND     |           | CUST_VAC_TYPE
6    | Doctor Consulted    | CUST_STG     | Assign source value                            | CUST_IND     |           | CUST_DR_CONS
7    | State               | CUST_STG     | If source value is null do not load, else load | CUST_IND     | Value     | CUST_STATE
8    | Country             | CUST_STG     | If source value is null do not load, else load | CUST_IND     | Value     | CUST_COUNTRY
9    | Post Code           | CUST_STG     | If source value is null do not load, else load | CUST_IND     |           | CUST_POS_C
10   | Date of Birth       | CUST_STG     | Assign source value                            | CUST_IND     |           | CUST_DOB
11   | Active Customer     | CUST_STG     | Assign source value                            | CUST_IND     |           | CUST_STATUS

So, based on the above rules, we will load the data into the CUST_IND table.
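A rough sketch of that load as a plain INSERT-SELECT (my own illustration of the rules above, not the project’s actual ETL code; the per-column null rules are applied as row filters for simplicity, and the staging column names are assumptions):

INSERT INTO TEST_DB.CUST_IND
  (CUST_NAME, CUST_ID, CUST_OP_DT, CUST_CONS_DT, CUST_VAC_TYPE,
   CUST_DR_CONS, CUST_STATE, CUST_COUNTRY, CUST_POS_C, CUST_DOB, CUST_STATUS)
SELECT
  Name, Cust_ID, Open_Dt, Consul_Dt, VAC_ID,
  DR_Name, State, Country, Post_Code, DOB, FLAG
FROM TEST_DB.CUST_STG
WHERE Country = 'IND'
  AND Name      IS NOT NULL  -- "if source value is null, do not load"
  AND Cust_ID   IS NOT NULL
  AND State     IS NOT NULL
  AND Post_Code IS NOT NULL;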

Cheers

Asik

Chapter 6 – Test Data Fabricating in Business Intelligence


Hi All,

Thanks for viewing my blog so far! I want to keep sharing my knowledge as much as I can – Happy Reading 🙂

In this blog I am going to explain about test data. In other testing practices you don’t need predefined test data, since you are the user who enters the values, either in a front-end application or some other application. But in the data warehouse world we are completely dependent on our upstream (front-end) applications: we test with whatever data the upstream guys used.


We testers are the intermediate guys who buy the data from one team, then validate, transform and sell it to the other guys 🙂


Let me explain in simple terms: you are running a business; you need to buy goods from your supplier and sell them to your customer!

What do you need to do?

– Buy quality products

– Check the count against what you ordered

– Check the goods meet the specification you asked for

– Get the supplier’s support in fixing any issues

– If they can’t fix them, repair the goods yourself and then sell them to your customer

Now come to our Business Intelligence scenario.

As a tester, what do you need to do with the data?

– Check the quality of the data (is it as per the specifications?)

– Check the count is correct using the trailer records (refer to my Source Data Validation blog)

– If the data is not as per the requirement, ask the upstream guys (the people who gave you the data) to fix the data issues

– Check you have got all the data you require; if not, ask your upstream to provide more data, and if they decline, manufacture your own data 😦

As I have already said in my earlier blogs, we have the below types of DWH testing:

1. Loads from file to staging tables to IM tables
2. Loads from table to staging tables to IM tables
3. DIMN/FACT projects
4. Extract projects
5. Reports

Test data fabrication for staging-to-target loads is quite easy, but extract testing and report testing are a bit difficult 😦

For most of the positive scenarios, upstream will provide the data for us!! But we testers want to break the code, so we will have a few negative scenarios for which we will not get data from upstream. And for the scenarios which happen only once in a while, we will not get the data either, so we need to create it ourselves.

For extracts, the Teradata SQL code is written by joining more than 10 tables, each with its own WHERE clauses. If, after extracting the records using the SQL, you find that a few scenarios are missing, then you need to set up data that flows across all those tables. (This will be explained in my Know about Extracts blog.)

As I said in my Test Environment blog about access, testers should have Create Table access in the test environment and Select access to the view database.

So using Create Table access we can create our own test data as follows:

Please read the steps below very carefully; if you don’t understand, please send me an email: aashiqamrita@gmail.com


1. Create a TEMP table in your test environment using a CREATE TABLE query.

2. Insert sample records into your TEMP table using an INSERT INTO query.

3. Modify the data as per your requirement.

4. Insert the modified data back into your actual table.

5. Your test data is ready to use 🙂
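A minimal Teradata sketch of those five steps (TEST_DB, VIEW_DB and the CUST table are illustrative names, not from any real project):

-- 1. Create a TEMP table with the same structure as the actual table
CREATE TABLE TEST_DB.CUST_TEMP AS TEST_DB.CUST WITH NO DATA;

-- 2. Insert sample records into the TEMP table
INSERT INTO TEST_DB.CUST_TEMP
SELECT * FROM VIEW_DB.CUST SAMPLE 100;

-- 3. Modify the data as per your scenario
UPDATE TEST_DB.CUST_TEMP
SET CUST_STATUS = 'C'
WHERE CUST_ID = '123456';

-- 4. Insert the modified data back into your actual table
INSERT INTO TEST_DB.CUST
SELECT * FROM TEST_DB.CUST_TEMP;

-- 5. Your test data is ready to use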

Hope you guys understood the test data manufacturing technique. If you need more support, please comment on this blog 🙂

Thanks –

Happy Reading

Asik

Chapter 3 – Setting up Test Environment for BI Projects


Test Environment:


In general, BI projects have 3 environments:

Production – where our tested code runs in real time

Development – where developers develop and unit test the code

Test Environment (System Test / SIT / E2E / UAT) – where the developed code is deployed for testing

Setting up a test environment for a Business Intelligence project is critical because the data needed for each level of testing is different. Developers develop the ETL code in the DEV (development) environment, and when the testing phase kicks off they point their code to the test environment.

Why do we need a separate environment?

1. ACCESS and Test Environment

Developers are the ones who design the code, and they keep changing it until it works. We don’t have version control in DataStage or in Teradata, the development environment is an open space for all the developers involved, and there is a high possibility of irregular updates to the code, so the DEV environment is loosely controlled. Developers have INSERT / UPDATE / DELETE access to all the objects in the DEV and test environments. But testers have only SELECT access, and only on views (I will explain what tables and views are in my next blog). Testers can only verify that the data is as per the ETL code; they can’t update any record to make it correct as per the requirement. And developers should also agree that whatever defects are found in our environment get fixed back in the DEV environment.

2. Data and Test Environments 

The data used for developing the code should be different from the data used for testing. Why? Because developers create their own test data for unit testing: they are happy when the functionality works as expected, so they are not interested in source data quality. If any data causes them trouble, they simply delete it and load the rest. But testers should be very conscious about the data, because data is the key for us to proceed with testing. So testers should have their own data in their own environment, and should not depend on the source data that lives in the development environment.

What components does a tester need access to?

1. File Landing Directory – If your project is a file-to-table load, the source system will send the files to a specified location. This location is different for Development and Testing, and testers should get access to this directory.

2. Select access to Source Views – Testers can easily check whether they have SELECT access to the tables by simply querying them. If you do not have access, then please create a GRANT statement and send it to your DBA.

[GRANT SEL ON DB_Name to User_ID;]

Good practices for availing access:

  1. A tester should run his queries against views, not tables, so leads should make sure testers do not have Select access to the tables before execution starts.
  2. Developers create views over the tables, and when they deploy the code into PRODUCTION their code will point to the VIEWs, not the TABLEs.
  3. Developers might have introduced a few filter statements in the views in order to stop duplicates that would otherwise cause errors. If testers verified the target data using the tables, errors related to the views could not be identified.
  4. Testers should not have Insert, Update, or Delete access to the source and target tables.
  5. Testers should have Create Table access on the test database for manufacturing data (explained in my Test Data Management blog).

Availability of the Source Data in Test Environment:

If the data load is into existing warehouse tables, check that historical data is available in the test environment by simply selecting from the table. If there are no records in the test environment, request a developer to copy sample records from PRODUCTION into the test environment.

[INSERT INTO TEST_DB.TABLE SEL * FROM PROD_DB.TABLE SAMPLE 1000;]

Q: Why should we have existing data in the test environment if the load is into an existing table?

A: Because our load should bring only delta records and must not disturb the existing records. This can be verified using the record count before and after the load – the existing count should not change.
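One way to verify this is to snapshot the table before the load and compare afterwards; a sketch (the snapshot and target table names are illustrative):

-- Before the load: keep a snapshot of the existing rows
CREATE TABLE TEST_DB.TGT_BEFORE AS (SELECT * FROM TEST_DB.TGT_TABLE) WITH DATA;

-- After the load: this should return zero rows if the existing data is untouched
SELECT * FROM TEST_DB.TGT_BEFORE
MINUS
SELECT * FROM TEST_DB.TGT_TABLE;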
If the data load is into new warehouse tables, then the tables should not contain any data in the test environment.

Q: Why should we not have any data in new target tables?

A: Because these tables do not exist in PRODUCTION and we are the ones going to load data into them, they should not contain any data yet.

Reference Tables in Test Environment 

MAP and TYPE tables are used for referential integrity. A project can use the existing MAP or TYPE tables, or it can create its own based on the project requirement.

  1. Testers should verify that all the MAP and TYPE tables mentioned in the S2T (even those not used in any S2T transformation) are created in the test environment.
  2. If a MAP or TYPE table already exists, testers should verify that the data in the test environment and the PROD environment is the same (see the sketch after this list).
  3. If the MAP or TYPE tables are created project-specific, testers should verify the data is inserted correctly as per the insert statements provided in the S2T.
  4. Testers should cross-check that the values inserted into the MAP and TYPE tables match the DDS (refer to the Appendix section for reference data). If you find any data missing from or added to a table, please raise a defect with the data designer.
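For point 2, a simple two-way MINUS comparison does the job; a sketch with an illustrative table name:

-- Rows present in PROD but missing in Test (should return zero rows)
SELECT * FROM PROD_DB.VAC_TYPE
MINUS
SELECT * FROM TEST_DB.VAC_TYPE;

-- Rows present in Test but missing in PROD (should also return zero rows)
SELECT * FROM TEST_DB.VAC_TYPE
MINUS
SELECT * FROM PROD_DB.VAC_TYPE;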

Now we have everything in our environment 🙂 Next comes validation of the specification, in my next blog.

See you @ my next Blog.

Regards – Asik