Test Data Management
Hope you are all doing great… it has been a very long time since I posted on my blog, my sincere apologies 😦
Test data availability is one of the most significant issues leading to schedule slippage in DWH projects.
So testers and Test Managers should know the test data requirements and define a test data strategy.
The Test Data Management should cover the below points:
- Subset of PROD data catered for Testing Requirements
If you are working on an enhancement project, you can use the existing data that is already loaded into the warehouse. However, you cannot use data pulled directly from Production as-is, because it is sensitive to the bank and its customers.
- Mask the data as per regulatory and data privacy requirements
All customer-related information should be masked. Masked data is harder to test: for example, if the Name column should accept CHAR(20) and the data is masked, we cannot do boundary value analysis (BVA) on that column.
- Fabricate Test Data in case of unavailability
Your source system cannot provide data for all of your testing scenarios; lack of test data is a major issue in data warehouse projects. A tester should analyze the system and create the missing test data in the form of insert scripts. The test data should then be extracted correctly by the query before it is loaded into the target tables.
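As a minimal sketch of such an insert script (using SQLite in place of Teradata; the table and column names are illustrative, not from a real project):

```python
import sqlite3

# Hypothetical staging table; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE CUST_STG (
    cust_id INTEGER, cust_name TEXT, cust_country TEXT)""")

# Fabricated rows covering scenarios the source system cannot supply,
# e.g. a customer record with a missing mandatory name.
insert_script = [
    (1001, "Asik", "IN"),
    (1002, None, "AU"),  # negative scenario: NULL in a mandatory column
]
conn.executemany("INSERT INTO CUST_STG VALUES (?, ?, ?)", insert_script)

# Verify the fabricated scenario is actually picked up by a query
# before the target load runs.
rows = conn.execute(
    "SELECT COUNT(*) FROM CUST_STG WHERE cust_name IS NULL").fetchone()
print(rows[0])  # 1
```

The point is that each fabricated row exists to exercise one specific scenario, and you confirm the extraction sees it before the load.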
Typical Challenges involved in Test Data Management:
- Delay in Schedule
The test data from upstream should be UAT-tested; otherwise our testing will not be effective. If we are in SIT and our upstream has not completed UAT, our schedule will be postponed because UAT-tested data is unavailable from upstream.
- Issues covering Business Cases
Test data for some business scenarios cannot be produced upstream or through data fabrication. Take account reopening as an example: a customer closed an account a few years ago, and the records are kept with ACCT_STATUS marked Closed. When the customer returns, the bank must somehow identify the existing account and reopen it across all the tables, to avoid creating one more record in the warehouse. For scenarios like this it is difficult to manufacture the data in the given time.
- Data Privacy & Compliance
Banks are cautious about customer data because of the risk of data theft. So when we get data from PROD for our regression testing, most of the customer-related data will be masked as per compliance policy, and we cannot produce that data or test those scenarios in our environment.
- Dependency on SMEs
SME availability is one of the biggest issues in the data warehouse world.
System Architect – Accountable for the system design based on the business requirements. The System Architect works closely with the Data Modeler to design the Source to Target (S2T) mapping, and needs to fit the current requirement into the existing application landscape or create a new landscape for it.
Data Modeler – Accountable for designing the S2T. The Data Modeler should know the physical and logical design of the system and the database. For any data warehouse project, the S2T is the most important document.
ETL Developers – Accountable for turning the high-level and low-level design of the requirements into code. An ETL developer should be able to design the ETL code without compromising the performance of the extraction and loading mechanisms, and should know what kind of transformation mechanism to design for a particular requirement.
Test Analysts / Test Managers – Test Managers should foresee whether all the technical and business requirements can be tested in the given time frame, because the test data and system behaviour may change, which can cause schedule slippage.
- Availability of Adequate Data
We should start our system testing with the source system's UAT-tested data. If, due to some issue, the source system has not completed its UAT and can provide only system-tested data, that data is not sufficient for our system testing, so we need to fabricate test data in order to kick off our testing.
- Volume of Test Data
The data volume threshold will be declared by the system owners. However, we usually cannot create that declared volume of data in our environment to test performance both above and below the threshold limit, so there might be spool space issues when we go to production.
- Multiple sources & formats of data
When multiple source systems participate in the ETL, it is difficult to get data from the different source systems at the same time, so in this case we again need to create mock test files to begin our testing.
- Version Control of Data
Versioning data is very difficult in the data warehouse world, although we can see the history of the data using a few housekeeping columns (EFFT_D, EXPY_D). The data loads and extracts should still be versioned to avoid confusion.
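A minimal sketch of how those housekeeping columns expose history, assuming SQLite in place of Teradata; the table name, dates, and state values are illustrative:

```python
import sqlite3

# History table with EFFT_D (effective date) / EXPY_D (expiry date)
# housekeeping columns; content is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE CUST_HIST (
    cust_id INTEGER, cust_state TEXT, EFFT_D TEXT, EXPY_D TEXT)""")
conn.executemany("INSERT INTO CUST_HIST VALUES (?, ?, ?, ?)", [
    (1, "VIC", "2012-01-01", "2013-06-30"),  # old version of the record
    (1, "NSW", "2013-07-01", "9999-12-31"),  # current version
])

# Point-in-time query: which version was effective on 2013-01-15?
row = conn.execute("""
    SELECT cust_state FROM CUST_HIST
    WHERE ? BETWEEN EFFT_D AND EXPY_D""", ("2013-01-15",)).fetchone()
print(row[0])  # VIC
```

This shows how row-level history is recoverable from the data itself, while versioning of the load and extract code still has to be managed separately.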
Have a pleasant week ahead !!!
In this post let me explain a Project Structure.
I have a multi-specialty hospital with branches across the world. My hospital is famous for vaccination. Patients who come to my hospital anywhere in the globe are given a User Card with which they can access any of my hospitals in the world.
Now we maintain all customers in one database, and there are heaps and heaps of customers related to my hospital. So I decided to split up the customers based on country and load them into the corresponding country tables. Elaborate details of this requirement will be mentioned in the Business Requirement Specification.
To pull the customers per country, a designer should know all the places where the customer data is available. The data extraction will be done by our source system: they will pull all the relevant customer data and give us a data file.
In Design Documents you will have
Source File Specification
1) File Name Specification – name string, extension of the files
2) Date and Time Format of the File – YYYYMMDD, HHMMSSTT or any other format
3) Control File Specification – name string, extension of the files
4) Header Record Layout – |H|Customer_Records|20131012|1245872|
5) Detail Record Layout – |D|Asik|123456|20101012|20121013|MVD|Sanjay|NSW|AU|06031987|A
6) Trailer Record Layout – |T|10|
Detail records tell you what data you are getting from the source, its data type, whether it is mandatory or not, and the length of each column.
| File Position | Column_Name | Field Length | Data Type | Mandatory (Y/N) | Key Column |
|---|---|---|---|---|---|
| 3 | Customer Open Date | 8 | DATE | Y | N |
| 4 | Last Consulted Date | 8 | DATE | N | N |
| 10 | Date of Birth | 8 | DATE | N | N |
Please refer to my post – Source File Validation – to learn how to validate the above details.
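The header and trailer checks can be sketched in Python; the record content below is illustrative (a one-detail-record file, so the trailer count is 1):

```python
# Sketch of source-file validation against the layouts above;
# the records are illustrative sample content.
sample = [
    "|H|Customer_Records|20131012|1245872|",
    "|D|Asik|123456|20101012|20121013|MVD|Sanjay|NSW|AU|06031987|A|",
    "|T|1|",
]

header = sample[0].strip("|").split("|")
details = [ln for ln in sample if ln.startswith("|D|")]
trailer = sample[-1].strip("|").split("|")

# First record must be the header, last must be the trailer.
assert header[0] == "H" and trailer[0] == "T"
# The trailer carries the detail record count; reject the file on mismatch.
assert int(trailer[1]) == len(details), "count mismatch - reject file"
print("file accepted")
```

In the real project the same checks would run against the landed file before the staging load begins.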
Now, using the ETL process, we load the data into the staging tables.
Staging Table Load
Now my staging data is ready, and we need to load it into the corresponding target tables.
As a project we will test the data load from file to staging as Phase I, and from staging to target tables as Phase II.
In this project we do not have any complex transformation rules, so we will insert the records from the staging tables into the corresponding target tables.
All customers related to India will go to the India table (CUST_IND), and so on. In my next blog I will come up with a few more complex examples for target table loads.
The Source to Target document will look like this:
Customers from India will be loaded by the below logic:

| S.No | Source Column | Source Table | Transformation Rule | Target Table | Key / Value | Target Column |
|---|---|---|---|---|---|---|
| 1 | Customer Name | CUST_STG | If source value is null, do not load; else load | CUST_IND | Key | CUST_NAME |
| 2 | Customer ID | CUST_STG | If source value is null, do not load; else load | CUST_IND | Key | CUST_ID |
| 3 | Customer Open Date | CUST_STG | Assign source value | CUST_IND | Value | CUST_OP_DT |
| 4 | Last Consulted Date | CUST_STG | Assign source value | CUST_IND | Value | CUST_CONS_DT |
| 5 | Vaccination Type | CUST_STG | Assign source value | CUST_IND | Value | CUST_VAC_TYPE |
| 6 | Doctor Consulted | CUST_STG | Assign source value | CUST_IND | Value | CUST_DR_CONS |
| 7 | State | CUST_STG | If source value is null, do not load; else load | CUST_IND | Value | CUST_STATE |
| 8 | Country | CUST_STG | If source value is null, do not load; else load | CUST_IND | Value | CUST_COUNTRY |
| 9 | Post Code | CUST_STG | If source value is null, do not load; else load | CUST_IND | Value | CUST_POS_C |
| 10 | Date of Birth | CUST_STG | Assign source value | CUST_IND | Value | CUST_DOB |
| 11 | Active Customer | CUST_STG | Assign source value | CUST_IND | Value | CUST_STATUS |
So based on the above rules we will load the data into the CUST_IND table.
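A minimal sketch of those rules, using SQLite in place of the real ETL tool; the three staged rows and the reduced column set are illustrative:

```python
import sqlite3

# Reduced-column sketch of the CUST_STG -> CUST_IND load above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUST_STG (cust_name TEXT, cust_id INTEGER, cust_country TEXT)")
conn.execute("CREATE TABLE CUST_IND (CUST_NAME TEXT, CUST_ID INTEGER, CUST_COUNTRY TEXT)")
conn.executemany("INSERT INTO CUST_STG VALUES (?, ?, ?)", [
    ("Asik", 123456, "IN"),
    (None, 123457, "IN"),     # null key column - must not be loaded
    ("Sanjay", 123458, "AU"), # different country - goes to another table
])

# "If source value is null, do not load" on the key columns, plus the
# per-country split described earlier.
conn.execute("""
    INSERT INTO CUST_IND
    SELECT cust_name, cust_id, cust_country FROM CUST_STG
    WHERE cust_name IS NOT NULL AND cust_id IS NOT NULL
      AND cust_country = 'IN'""")

loaded = conn.execute("SELECT COUNT(*) FROM CUST_IND").fetchone()[0]
print(loaded)  # 1
```

A tester would write an equivalent query independently of the ETL code and compare row counts and column values against the target table.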
Thanks for viewing my blog so far! I want to keep sharing my knowledge as much as I can – Happy Reading 🙂
In this blog I am going to explain Test Data. In other testing practices you don't need predefined test data, since we are the users who enter the values, either in the front-end applications or in some other application. But in data warehousing we are completely dependent on our upstream (front-end) applications, and we test whatever data those upstream teams used.
We testers are the intermediate guys who buy the data from one team, then validate and transform it and sell it to the next team 🙂
Let me explain in plain terms: you are running a business, so you need to buy goods from your supplier and sell them to your customer!
What all do you need to do?
– Need to buy quality products
– Need to check the count against what you ordered
– Need to check the specification matches what you asked for
– Need their support in fixing issues
– If they can't fix an issue, you need to repair it yourself and sell it to your customer
Now come to our Business Intelligence scenarios,
As a tester, what do you need to do with the data?
– Check the quality of the data (is it as per the specifications?)
– Check the count is correct using the trailer records (refer to my Source Data Validation blog)
– If the data is not as per the requirement, ask the upstream guys (the people who gave you the data) to fix the data issues
– Check you have got all the data you require; if not, ask your upstream to provide more data, and if they deny, manufacture your own data 😦
As I said already in my earlier blogs we have below types of DWH testing
1. Loads from file to staging tables to IM tables
2. Loads from table to staging tables to IM tables
3. DIMN/FACT projects
4. Extract projects
Fabricating test data for staging-to-target loads is quite easy, but for extract testing and report testing it is a bit difficult 😦
For most of the positive scenarios the upstream will provide the data for us!! But we testers want to break the code, so we will have a few negative scenarios for which we will not get data from upstream. And for the scenarios that happen only once in a while, we will not get the data either, so we need to create it.
For extracts, the Teradata SQL is written by joining more than 10 tables, each with its own WHERE clauses. If, after extracting the records using the SQL, you find that a few scenarios are missing, you need to set up data that flows across all the tables. (This will be explained in my Know about Extracts blog.)
As I said in my Test Environment blog about access, testers should have CREATE TABLE access in the test environment and SELECT access to the view database.
So using Create Table access we can create our own test data as follows:
Please read the steps below very carefully; if you don't understand, please send me an email: email@example.com
1. Create a TEMP table in your test environment using a CREATE TABLE query.
2. Insert sample records into your TEMP table using an INSERT INTO query.
3. Modify the data as per your requirement.
4. Insert the modified data back into your actual table.
5. Your test data is ready to use 🙂
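A sketch of the five steps above, with SQLite standing in for Teradata and illustrative table names:

```python
import sqlite3

# Existing source table with one real sample row; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUST_STG (cust_id INTEGER, cust_status TEXT)")
conn.execute("INSERT INTO CUST_STG VALUES (1001, 'A')")

# 1. Create a TEMP table shaped like the source table.
conn.execute("CREATE TABLE TEMP_CUST AS SELECT * FROM CUST_STG WHERE 0")
# 2. Copy sample records into the TEMP table.
conn.execute("INSERT INTO TEMP_CUST SELECT * FROM CUST_STG")
# 3. Modify the data to create the missing scenario (a closed account).
conn.execute("UPDATE TEMP_CUST SET cust_id = 2001, cust_status = 'C'")
# 4. Insert the modified rows back into the actual table.
conn.execute("INSERT INTO CUST_STG SELECT * FROM TEMP_CUST")

# 5. The fabricated closed-account row is now available for testing.
fabricated = conn.execute(
    "SELECT COUNT(*) FROM CUST_STG WHERE cust_status = 'C'").fetchone()[0]
print(fabricated)  # 1
```

Working in a TEMP table first means the real table is only touched once, with rows you have already shaped and reviewed.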
Hope you guys understood the test data manufacturing technique. If you need more support, please comment on this blog 🙂
In general, BI projects have 3 environments:
Production – where our tested code will function in real time
Development – where developers develop and unit test the code
Test Environment (System Test / SIT / E2E / UAT) – where the developed code will be deployed for testing
Setting up a test environment for a Business Intelligence project is critical, because the data for different levels of testing is different. Developers develop the ETL code in the DEV (development) environment, and when the testing phase kicks off they point their code at the test environment.
Why do we need a separate environment?
1. ACCESS and Test Environment
Developers are the ones who design the code, and they keep changing it until it works; we don't have version control in DataStage or Teradata. The development environment is an open space for all the developers involved, with a high possibility of irregular updates to the code, so the DEV environment is loosely controlled. Developers have INSERT / UPDATE / DELETE access to all objects in the DEV and test environments, but testers have only SELECT access on views (I will explain tables vs. views in my next blog). Testers can only verify that the data is as per the ETL code; they can't update any record to make it correct as per the requirement. Developers should also agree that whatever defects are found in our environment are fixed in the DEV environment.
2. Data and Test Environments
The data used for developing the code should be different from the data used for testing, because developers create test data for unit testing and are happy once the functionality works as expected; they are not interested in source data quality. If any data causes them trouble, they simply delete it and load the rest. But testers should be very conscious about the data, because data is the key for us to proceed with testing. So testers should have their own data in their own environment, and should not depend on source data that lives in the development environment.
Which components does a tester need access to?
1. File Landing Directory – If your project is a file-to-table load, the source system will send the files to a specified location. This location is different for development and testing, and testers should get access to this directory.
2. SELECT access to Source Views – Testers can easily check whether they have SELECT access by simply querying the tables. If you do not have access, create a GRANT statement and send it to your DBA.
[GRANT SEL ON DB_Name to User_ID;]
Good practices for availing the access:
- A tester should run queries against views, not tables, so leads should make sure testers do not have SELECT access to the tables before execution starts.
- Developers create views over the tables, and when the code is deployed to PRODUCTION it will point to the VIEWs, not the TABLEs.
- Developers might have introduced filter statements in the views to stop duplicates, which can otherwise cause errors. If testers validate the target data using tables, errors related to the views cannot be identified.
- Testers should not have Insert, Update, Delete access to Source and Target tables.
- Testers should have Create table access on Test Database for manufacturing data (will be explained in Test Data Management Blog)
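The view-vs-table point above can be sketched as follows (SQLite, with illustrative names): a duplicate-suppressing view returns a different row count than the underlying table, so testing against the table would miss the filter.

```python
import sqlite3

# Table with a duplicate customer row; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUST_T (cust_id INTEGER, load_ts INTEGER)")
conn.executemany("INSERT INTO CUST_T VALUES (?, ?)", [
    (1, 100), (1, 200),  # same customer loaded twice
    (2, 100),
])

# The deployed code points at a view that keeps only the latest row
# per customer - a duplicate-suppressing filter the table lacks.
conn.execute("""
    CREATE VIEW CUST_V AS
    SELECT cust_id, MAX(load_ts) AS load_ts
    FROM CUST_T GROUP BY cust_id""")

table_count = conn.execute("SELECT COUNT(*) FROM CUST_T").fetchone()[0]
view_count = conn.execute("SELECT COUNT(*) FROM CUST_V").fetchone()[0]
print(table_count, view_count)  # 3 2
```

Because PRODUCTION code reads the view, the view's count (2) is the one that matters; validating against the table (3) would hide a view-level defect.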
Availability of the Source Data in Test Environment:
If the data load is into existing warehouse tables, check that historical data is available in the test environment by simply selecting from the table. If there are no records in the test environment, request a developer to copy sample records from PRODUCTION into the test environment:
[INSERT INTO TEST_DB.TABLE SEL * FROM PROD_DB.TABLE SAMPLE 1000;]
Q: Why should we have existing data in the test environment if the load is into an existing table?
A: Because our load into an existing table should add only the delta records and must not disturb the existing records. This can be verified using the record count before and after the load – the existing count should not change.
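A sketch of that before/after count check, with illustrative table name and row counts:

```python
import sqlite3

# Existing target table with three pre-load rows; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TGT (cust_id INTEGER)")
conn.executemany("INSERT INTO TGT VALUES (?)", [(1,), (2,), (3,)])

before = conn.execute("SELECT COUNT(*) FROM TGT").fetchone()[0]
# Delta load adds two new customers.
conn.executemany("INSERT INTO TGT VALUES (?)", [(4,), (5,)])
after = conn.execute("SELECT COUNT(*) FROM TGT").fetchone()[0]

# Existing rows must be undisturbed: growth must equal the delta size.
assert after - before == 2, "existing records were disturbed"
print(before, after)  # 3 5
```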
If the data load is into new warehouse tables then the tables should not contain any data in the Test Environment.
Q: Why should the new target tables not contain any data?
A: Because these tables do not exist in PRODUCTION and we are the ones going to load the data into them, so they should not contain any data yet.
Reference Tables in Test Environment
MAP and TYPE tables are used for referential integrity. A project can use the existing MAP or TYPE tables, or create its own based on the project requirement.
- Testers should verify that all the MAP and TYPE tables mentioned in the S2T (even if not used in an S2T transformation) are created in the test environment.
- If a MAP or TYPE table already exists, testers should verify that the data in the test environment and the PROD environment is the same.
- If the MAP or TYPE tables are created project-specific, testers should verify that the data is inserted correctly as per the INSERT statements provided in the S2T.
- Testers should cross-check that the values inserted into the MAP and TYPE tables match the DDS (refer to the Appendix section for the reference data). If you find any data missing from or added to a table, please raise a defect with the data designer.
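The test-vs-PROD comparison can be sketched with a set difference (Teradata would use MINUS; SQLite uses EXCEPT; the table names and codes are illustrative):

```python
import sqlite3

# Two copies of a TYPE table, standing in for PROD and test databases;
# names and codes are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PROD_ACCT_TYPE (code TEXT, descr TEXT)")
conn.execute("CREATE TABLE TEST_ACCT_TYPE (code TEXT, descr TEXT)")
conn.executemany("INSERT INTO PROD_ACCT_TYPE VALUES (?, ?)",
                 [("SAV", "Savings"), ("CHQ", "Cheque")])
conn.executemany("INSERT INTO TEST_ACCT_TYPE VALUES (?, ?)",
                 [("SAV", "Savings")])  # CHQ missing in test

# Rows present in PROD but not in test - each one is a defect candidate.
missing = conn.execute("""
    SELECT * FROM PROD_ACCT_TYPE
    EXCEPT
    SELECT * FROM TEST_ACCT_TYPE""").fetchall()
print(missing)  # [('CHQ', 'Cheque')]
```

Running the difference in both directions catches rows that are missing from test as well as rows added to test that PROD does not have.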
Now we have everything in our environment 🙂 Next we will go through validation of the specification in the next blog.
See you @ my next Blog.
Regards – Asik