Test Data Management
Hope all are doing great… it’s been a very long time since I posted on my blog. My sincere apologies 😦
Test data availability is one of the most significant issues that lead to schedule slippage in DWH projects.
So testers and test managers should know all the requirements for the test data, and they need to define the test data strategy.
Test Data Management should cover the points below:
- Subset of PROD data catered for Testing Requirements
If you are working on an enhancement project, you can use the existing data that is already loaded into the warehouse. Even here you can’t use data pulled directly from Production as-is, since it is sensitive to the bank and its customers.
- Mask the data as per regulatory and data privacy requirements
All customer-related information should be masked. Masked data is harder to test: for example, if a Name column should accept 20 characters (CHAR) and the data is masked, we cannot do boundary value analysis (BVA) on that column.
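One common way to keep masked data at least partly testable is length-preserving masking, where the masked value keeps the original length and character classes. A minimal sketch in Python (the masking rule here is illustrative, not any bank’s actual policy):

```python
import random
import string

def mask_name(value: str) -> str:
    """Replace each letter with a random letter of the same case,
    keeping length and non-letter characters intact, so that
    length checks (e.g. a CHAR(20) Name column) still hold."""
    out = []
    for ch in value:
        if ch.isupper():
            out.append(random.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(random.choice(string.ascii_lowercase))
        else:
            out.append(ch)  # keep spaces, hyphens, apostrophes as-is
    return "".join(out)

masked = mask_name("John O'Brien")
```

Because the masked value has the same length as the original, BVA on the column width is still possible even though the content is hidden.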
- Fabricate test data in case of unavailability
Your source system cannot provide data for all your testing scenarios, and lack of test data is a major issue in data warehouse projects. A tester should analyze the system and create the test data in the form of INSERT scripts. The test data should be extracted correctly by the query before it is loaded into the target tables.
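A tester can script this fabrication rather than hand-writing every row. A small sketch that renders fabricated rows as INSERT scripts (the STG_CUSTOMER table and its columns are made up for illustration):

```python
def to_insert(table: str, row: dict) -> str:
    """Render one fabricated row as an INSERT script line.
    Strings are quoted and escaped; numbers and NULLs pass through."""
    cols = ", ".join(row)
    vals = []
    for v in row.values():
        if v is None:
            vals.append("NULL")
        elif isinstance(v, str):
            vals.append("'" + v.replace("'", "''") + "'")
        else:
            vals.append(str(v))
    return f"INSERT INTO {table} ({cols}) VALUES ({', '.join(vals)});"

script = to_insert("STG_CUSTOMER", {"CUST_ID": 101, "NAME": "Test User", "CITY": None})
```

Generating the scripts from a spreadsheet of scenarios keeps the fabricated data reviewable and repeatable across test cycles.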
Typical Challenges involved in Test Data Management:
- Delay in Schedule
The test data from upstream should be UAT-tested, or our testing will not be effective. If we are in SIT and our upstream has not completed UAT, our schedule will slip because UAT-tested data from upstream is unavailable.
- Issues to cover Business Cases
Test data for a few business scenarios cannot be produced upstream or through data fabrication. Take the reopening of an account: a customer had an account and closed it a few years ago, and his records are kept under ACCT_STATUS as closed records. When the customer comes back, the bank will identify his existing account and try to reopen it across all the tables, to avoid creating one more record in the warehouse. For scenarios like this it is difficult to manufacture the data in the given time.
- Data Privacy & Compliance
Banks are cautious about customer data because of the risk of data theft. So when we get data from PROD for our regression testing, most of the customer-related data will be masked as per the compliance policy, which means we cannot produce the data or test those scenarios in our environment.
- Dependency on SMEs
SME availability is one of the biggest issues in the data warehouse world.
System Architect – accountable for the system design based on the business requirements. He works closely with the Data Modeler to design the Source-to-Target (S2T) mapping. The System Architect needs to fit the current requirement into the existing landscape, or create an application landscape for the new requirement.
Data Modeler – accountable for designing the S2T. The Data Modeler should know the physical and logical design of the system and the database. For any data warehouse project, the S2T is the most important document.
ETL Developers – accountable for the high-level and low-level design of the requirements into code. An ETL developer should be capable of designing the ETL code without compromising the performance of the extraction and loading mechanisms, and should know what kind of transformation mechanism to design for a particular requirement.
Test Analysts / Test Managers – Test Managers should foresee whether all the technical and business requirements can be tested in the given time frame, because the test data and system behaviour might change, which can cause schedule slippage.
- Availability of Adequate Data
We should start our system testing with the source system’s UAT-tested data. If, due to some issue, the source system has not completed its UAT and can provide only system-tested data, that data is not sufficient to continue our system testing, so we need to fabricate the test data in order to kick off our testing.
- Volume of Test Data
The data threshold requirement will be declared by the system owners, but we cannot create that declared volume of data in our environment to test performance above and below the threshold limit, so there might be spool space issues when we go to Production.
- Multiple sources & formats of data
When multiple source systems participate in the ETL, it is difficult to get the data from the different source systems at the same time to begin our testing; in this case, again, we need to create mock test files to begin our testing.
- Version control of data
Versioning of data is very difficult in the data warehouse world, although we can see the history of the data using a few housekeeping columns (EFFT_D, EXPY_D). Still, the data loads and extracts should be versioned to avoid confusion.
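Those housekeeping columns can be exercised directly in tests. A minimal sketch of an “as of” lookup over EFFT_D/EXPY_D (the account rows here are fabricated):

```python
from datetime import date

# Each row carries housekeeping columns EFFT_D (effective) and EXPY_D (expiry).
history = [
    {"acct": "A1", "status": "OPEN",   "EFFT_D": date(2012, 1, 1), "EXPY_D": date(2013, 6, 30)},
    {"acct": "A1", "status": "CLOSED", "EFFT_D": date(2013, 7, 1), "EXPY_D": date(9999, 12, 31)},
]

def as_of(rows, acct, when):
    """Return the version of a record that was current on a given date."""
    for r in rows:
        if r["acct"] == acct and r["EFFT_D"] <= when <= r["EXPY_D"]:
            return r
    return None
```

A versioned load should produce exactly one current row per key for any as-of date; asserting that after each load cycle is a cheap history check.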
In this post I would like to share my knowledge of non-functional testing in data warehouse projects.
There are different types of non-functional testing in the testing world; some of them are:
- Baseline testing
- Compatibility testing
- Compliance testing
- Documentation testing
- Endurance testing
- Load testing
- Localization testing and Internationalization testing
- Performance testing
- Recovery testing
- Resilience testing
- Security testing
- Scalability testing
- Stress testing
- Usability testing
- Volume testing
To me, non-functional testing is something that does not deliver business value directly; it deals with the environment. When we extract data from heterogeneous source systems, we might need to think of handling:
Verifying the volume of the data
No business can guarantee the volume of data they will send; they can only give an approximation. Our code should be capable of pulling the maximum number of records the source system can send at any point in time. To manage the volume of data, Teradata has utilities called MultiLoad (MLoad) and TPump. When developers design the system, they fix a limit by which data will be loaded into the warehouse.
- MLoad – if we get a data file with 100 or more records, the records will be loaded by the MLoad functionality
- TPump – if we get a data file with fewer than 100 records, the records will be loaded by TPump
What we need to test here is: send a file with 100 records and check that the records are loaded by MLoad. This can be verified using the load job names.
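The routing rule itself is easy to model in a test harness. A sketch using the 100-record limit from the example (the real threshold is whatever the developers configured):

```python
MLOAD_THRESHOLD = 100  # illustrative limit from the example above

def choose_load_utility(record_count: int) -> str:
    """Pick the Teradata load utility by file size: bulk files go to
    MultiLoad, trickle feeds go to TPump."""
    return "MLOAD" if record_count >= MLOAD_THRESHOLD else "TPUMP"
```

A boundary test then sends files with 99, 100 and 101 records and checks that the load job names match the expected utility.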
Verifying the date and time of data file arrival in the Unix landing directory
Most companies do not function on weekends and public holidays, so our source systems will not send any transactional data on those days. Because of this, developers design their jobs to archive any files arriving on these days.
Normally, Monday’s transactional data will come to us for loading early on Tuesday morning, and so on until Friday’s transactional data hits us early on Saturday morning.
We as testers need to verify these schedules are working as per the specification. This can be achieved by:
- sending a file on a weekend and checking that the file is archived
- sending a file on a public holiday and checking that the file is archived
- verifying that Monday’s transactional data is received on Tuesday morning, and so on until Saturday morning
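The archive-or-process decision can be sketched like this (the holiday calendar is a made-up example; real jobs read it from a reference table):

```python
from datetime import date

PUBLIC_HOLIDAYS = {date(2013, 12, 25)}  # illustrative holiday calendar

def file_action(arrival: date) -> str:
    """Archive files that land on weekends or public holidays,
    process everything else."""
    if arrival.weekday() >= 5 or arrival in PUBLIC_HOLIDAYS:
        return "ARCHIVE"
    return "PROCESS"
```

The three test cases in the list above then become three calls: a Saturday, a holiday, and a normal weekday, each checked against the job’s actual behaviour.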
Verifying Purging and Truncate Loads
I have already mentioned about Purging and Truncate loads in my earlier blogs.
Purging – the AutoSys jobs will purge the data, leaving only the required data in the staging table. Suppose I have loaded the 10th, 11th and 12th of January data into the staging table; when I load the 13th of January data, the 10th of January data will be purged.
Truncate – simply load day_01 data, and when you load day_02 data, the day_01 data will be deleted.
We as testers need to verify that truncating and purging happen as per the design requirement.
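The purge rule above (keep the latest three load dates) can be expressed as a tiny model that the tester’s expected results are checked against. The staging rows here are fabricated:

```python
def purge(staging, keep_days=3):
    """Keep only the most recent `keep_days` load dates in the staging
    table, mimicking the AutoSys purge described above."""
    recent = sorted({r["load_date"] for r in staging})[-keep_days:]
    return [r for r in staging if r["load_date"] in recent]

staging = [{"load_date": d} for d in
           ("2013-01-10", "2013-01-11", "2013-01-12", "2013-01-13")]
after_purge = purge(staging)
```

After the day-4 load, the model says the 10th of January rows should be gone; comparing this against the actual staging table verifies the purge job.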
Verifying File Watcher Script
There will be a file watcher script that looks for files until they arrive in the Unix landing directory. Say the source system promises to send the Day_01 file on 10-01-2013, so we set that date in the file watcher script. If the source system sends the records on 10-01-2013, the file watcher script checks the date in the file header; if both match, it processes the file into the staging table. If the source system fails to send the data on 11-01-2013, the file watcher job looks for the file on 11-01-2013 for a given time interval, and if it has not arrived, an automated e-mail is sent to the concerned source system saying the file has not arrived.
So we as testers need to verify the file watcher job is working as expected.
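The watcher’s decision logic can be sketched as below. The file name, header layout and return codes are all illustrative; real watchers are usually shell scripts driven by the scheduler:

```python
def watch(landing_files, expected_file, expected_date):
    """Sketch of a file watcher decision: alert if the file never
    arrived in the polling window, reject it if the header date does
    not match the expected business date, otherwise load it."""
    if expected_file not in landing_files:
        return "EMAIL_SOURCE_SYSTEM"   # file missed its slot
    if landing_files[expected_file]["header_date"] != expected_date:
        return "REJECT_FILE"           # wrong business date in header
    return "LOAD_TO_STAGING"
```

Testing then means three runs: file on time, file with a wrong header date, and no file at all, checking for the alert e-mail in the last case.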
Cheers – Asik.
Hi guys,
So we are ready with test environments, test plans and test cases to begin with static testing.
When we design our test cases we need to write test cases and test conditions right from source validation through to reports testing.
1. Test cases to validate the source data
What do you want to do with the source data? Why is it important? Why do we need to know the importance of the source data?
Let me explain with an everyday scenario!!!
Say you want to purchase a mobile phone. You start analyzing the market through different channels: suggestions from friends, surfing the internet, checking newspaper advertisements, checking at outlets. You will not get the same details from all of these channels, so you note down all the data retrieved from the different channels and choose what you need!!!
Like the above scenario, a customer who approaches an organization will go through multiple channels and share information many times, but in the end, what you need for your reports will be designed, developed, tested and extracted by us 🙂
As far as I know, if you are 90% OK with the source data, there is a high possibility of not finding even a single defect in any of the test phases, apart from careless mistakes or code defects introduced by developers. The defects leaked by developers can be easily fixed because the ETL code is under our control. If a defect is leaked because of the data, it has a huge impact on the project, because we need to go back to the source system and ask them to fix it!!!!
What do you need to verify in the source data?
Just think of yourself as an intermediary between a seller (the source) and a buyer (the target). Would you buy a defective product from your seller? Would your buyer accept defective products from you? No 😦 So please ensure you are buying a non-defective product and selling a 90% defect-free product to your customer.
Coming to Business Intelligence 😉
1. The data that we need will be sent to our address (the Unix landing directory); the tester should verify the data has arrived at the correct address (landing directory).
2. Our source system will tell us in what format they are sending the files, so the tester should verify the name of each file is what the source told us.
3. The date component in the file name is critical for us; this is the date when the file was extracted for us.
4. The source data will be sent to us in different formats; mostly it will be .DAT (data) or .DLY (daily) files.
5. Along with the DATA files they will send a CTL (empty) file to confirm all the files arrived – an acknowledgement copy 🙂
6. Now you have validated the address (landing directory) and the file names.
7. Now open the file: you will see numeric, alphanumeric and alpha characters separated by [,] or [|] – you need to verify how they are separated.
8. Most of the files will have 3 sections – header, detail and trailer records. Testers should verify these sections are received as per the specifications.
9. As a tester you should be interested in how the data is organized as per the specification: does the order of the columns match the order given in the specification? Do the data types of the columns match the data types mentioned in the specification?
Order of the columns:
In the spec you have the order |Name|Place|Number|DOB|, but in the file you are getting |Name|DOB|Number|Place| — wohooo, it’s a defect 🙂
Number of columns:
In the spec you have the order |Name|Place|Number|DOB|, but in the file you are getting |Name|DOB|Place| — wohooo, it’s a defect 🙂
Data type of the columns:
In the spec you have |Name as VARCHAR|Place as VARCHAR|Number as INTEGER|DOB as DATE|, but in the file you are getting |Name as VARCHAR|Place as VARCHAR|Number as CHAR|DOB as DATE| — wohooo, it’s a defect 🙂
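Checks like these are easy to automate. A minimal sketch that validates one pipe-delimited detail record against a spec (the four-column spec mirrors the example above and is purely illustrative):

```python
SPEC = [("Name", "VARCHAR"), ("Place", "VARCHAR"),
        ("Number", "INTEGER"), ("DOB", "DATE")]  # column order + type per spec

def validate_record(line, delimiter="|"):
    """Return a list of defects (empty list = record passes) covering
    column count and a simple INTEGER type check."""
    defects = []
    fields = line.strip(delimiter).split(delimiter)
    if len(fields) != len(SPEC):
        defects.append("expected %d columns, got %d" % (len(SPEC), len(fields)))
        return defects
    for value, (name, dtype) in zip(fields, SPEC):
        if dtype == "INTEGER" and not value.lstrip("-").isdigit():
            defects.append("%s: '%s' is not an INTEGER" % (name, value))
    return defects
```

Feeding it the defective orderings from the examples above should return a non-empty defect list; a clean record returns an empty one.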
File name specification:
What is the file name, how should it be named, what is the extension of the file (.DAT), and what is the business timestamp?
The file name should be as per the spec, because developers mention the exact file name in their DataStage file watcher job. If the file name is not as per the spec, the job fails due to the mismatch. The timestamp field (date + time) should be validated; for example, if the spec says the date should be DDMMYYYY and the source sends it as MMDDYYYY, our DataStage job fails due to an invalid date.
File landing location:
Each project has a specific location to X-Com the files from the source system. Testers need to verify the file arrived at the correct location.
(Note: if the tester cannot view the file due to its size, he can view it using Unix commands in PuTTY.)
File header specification:
In general every file must have a header, and the header record will contain the date when the file was processed. Testers should be aware of the difference between the date in the external file name and the date in the header details:
i) Date in the external file name: the date when the file was X-Com’d.
ii) Date in the header record: the date when the file was processed.
Suppose the SAP system extracts the file on 3/12/2013 and X-Coms it on 4/12/2013; then the external file name carries 4/12/2013 and the header date is 3/12/2013. The header is very important because the developer will use the header date as the EFFECTIVE date for data loads.
File detailed records specification:
The detail records section of the file contains the actual data that needs to be loaded into the warehouse. Detail records are separated by a pipe delimiter in general, though this may change from source to source. Testers should be very conscious of the detail records. The spec describes each column as:
|File Position|Column Name|Key Column|Null / Not Null|Data Type|Field Length|
The testers should verify:
1. The columns are in the correct order as per the spec.
2. If columns are marked as key columns, the source should send us only one record for that combination; if we find more than one record after loading into the staging tables, raise a defect.
3. If columns are marked as Not Null in the spec, the source should not send us NULLs (||); if we do get NULLs, our DataStage job fails due to NULLs present in Not Null columns – raise a defect to the source system.
4. Data type verification is critical: if a column is defined as INT, the source should send INT only; if not, raise a defect to the source system.
5. The field length should be as per the spec. If the source sends less than what the spec says, that is OK; if it exceeds the spec, raise a defect to the source system, because it will fail our DataStage load jobs.
File Trailer record specification:
Trailer records will tell you how many records are in the file, and the count may include the header and trailer. If the file is related to transactional data, we might also have a hash total count column.
The tester should use the trailer record count to verify the reconciliation between the source file and the staging table data.
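The reconciliation itself can be sketched as below. This sketch assumes one header line, one trailer line of the form T|&lt;count&gt;, and a trailer count that covers detail records only (your spec may count the header and trailer too):

```python
def reconcile(file_lines, staging_count):
    """True when the trailer count, the number of detail lines in the
    file, and the staging table count all agree."""
    trailer_count = int(file_lines[-1].split("|")[1])
    detail_count = len(file_lines) - 2   # minus header and trailer
    return trailer_count == detail_count == staging_count

file_lines = ["H|20130110", "D|1|x", "D|2|y", "D|3|z", "T|3"]
```

A mismatch in any direction (short file, over-count in the trailer, dropped rows in staging) makes the check fail, which is exactly the defect we want to catch.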
So now we have validated our source 🙂 In my next blog we are going to see how to deal with the requirement documents for execution.
Thanks – Asik
The Test Planning phase consists of test estimation, test plan preparation and test case design.
Test estimation is an art. Estimation is a variable component of the SDLC; however, it can be mastered with experience. A person who estimates project timelines should be technically sound.
Test estimates are typically required to be produced at 3 stages:
- Ball Park Estimate (BPE) should be produced in the Plan & Analyse phase of the SDLC.
- High Level Estimate (HLE) should be produced towards the end of High Level Solution Design in the Design phase.
- Detailed Estimate (DE) should be produced towards the end of Detailed Design in the Build phase.
In the data warehouse world the estimates can differ for each of the components mentioned below:
- Loads from file to staging tables to Integrated Model testing
- Loads from table to staging tables to IM tables testing
- DIMN /FACT projects testing
- Extract projects testing
- Reports testing
- Non-functional tests (TPump, MLoad, file watcher catch-up logic, purging loads, truncate loads)
Test Plan:
The Test Plan document identifies the risks to the program/project. The test analyst needs to ask as many questions as possible about the project; the document that details them is called the Test Plan. Here are common questions that you need to ask:
What are you testing? When are you starting and finishing testing? Who is doing the testing? Are there any risks? What is in scope and out of scope? Who is giving data to you? To whom are you giving your inputs? Who depends on you? How do you depend on others? What files/tables/extracts are you testing? How are you going to publish your reports? Are you maintaining the testing data in Quality Center?
- The Test Plan is the ‘WHAT’: it deals with the actions that will be taken to mitigate the identified risks.
- It is also the ‘HOW’ and ‘WHO’: the Test Plan may deal with a single test phase or multiple phases, depending on the number of authors and the size/complexity of each phase.
- The phases will be identified in the Test Strategy. The typical test phases where an organization either authors or reviews Test Plans are System, Application Acceptance, System Integration, End-to-End and Production Assurance Testing.
Test Case Design
The difference between a Test Condition and a Test Case is that a Test Condition is “what” we need to test and a Test Case is “how” we will test the Test Condition.
- Test Conditions are the ‘What’ we need to test and are derived from the identified requirements.
- Test Conditions individually identify the functional behaviors that make up a requirement. (Note: this differs from what is currently done, namely attempting to cover more than one function in a single Test Condition.)
For data warehouse projects we will have the test scenarios below:
1. Verify the reconciliation – are the counts between source and target the same after the ETL loads?
2. Verify inserting a new record into the warehouse – a brand new record which has not been loaded into the warehouse before.
3. Verify how a delta record is processed – an update to an existing record in the warehouse.
4. Verify the values that are loaded by the ETL code.
5. Verify the values that are loaded by DataStage jobs.
6. Verify deleting a record (expiring without updating).
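Scenarios 2 and 3 boil down to an insert-versus-update decision on the business key. A minimal sketch (the keys and rows are fabricated):

```python
def classify(incoming_rows, warehouse_keys):
    """Tag each incoming record: a key already in the warehouse is a
    delta UPDATE, an unseen key is a brand-new INSERT."""
    return [("UPDATE" if r["key"] in warehouse_keys else "INSERT", r["key"])
            for r in incoming_rows]

actions = classify([{"key": 1}, {"key": 2}], warehouse_keys={2})
```

The tester’s expected results for these scenarios can be generated from such a model and compared with what the ETL load actually did to the target table.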
Folks, I will explain each of the above in detail in upcoming blogs. Please provide feedback to improve my blogs 🙂
Many Thanks – Asik
Business Intelligence and Testing!!!!
Intelligence is a very general mental capability that involves the ability to reason, plan, solve problems, comprehend complex ideas, learn quickly and learn from experience.
Put more simply – it is “catching on,” “making sense” of things, or “figuring out” what to do.
Business Intelligence is collecting data from the various sources of an organization to aid in planning, solving business problems and learning from experience, thereby taking proper decisions to expand the business and maximize profits.
Let’s understand what a warehouse is and how we can connect it to Information Technology!!
Consider a person who wants to set up a provision store, and he wants to know how to stock the goods to maintain the supply-and-demand life cycle. He starts a shop in a small room and gets goods in limited quantities – 100 kg rice, 50 kg wheat, 100 kg dal, 3-4 varieties of chocolates, 3-4 varieties of soaps – and starts his business.
People start coming to his store and buying things. In 15 days his stock is almost sold, and then he analyses how the sales went. To achieve this he maintains all the data on paper and calculates (add, subtract, multiply, divide).
He continues this for some time and forecasts from the data he has collected.
- If the analysis says that the goods are selling as per his calculations and he is making a profit, he will start getting goods in large quantities. Since the store is very small he cannot accommodate goods in bulk, so he will set up a warehouse where he can store all the goods, and move the small amount of goods required for a month to the store for selling.
- If the analysis says that a few goods in his shop are not selling and he is not making the expected profit, but he is not making a loss either, then he can change the goods he is selling and repeat his analysis to earn profits.
- If the analysis says that the goods in his shop are not selling at all and he is making a huge loss, then he might have to stop his business.
As a shopkeeper he needs to analyze the goods available and the sales in the warehouse to continue or stop buying things (forecasting). He creates different racks to keep different goods so he can access them easily and quickly, and he should know where he stored the goods and how they are related. For example, he would store all the groceries in one rack, home maintenance goods in another, canned foods in another, and so on. If he stores the goods logically and physically, he will easily be able to know about the stock.
Basic analysis needs to be done by the shopkeeper before he decides to set up a warehouse.
How are demand and supply in the warehouse? To answer this he needs to answer the following questions:
- Find out the existing stock of goods
- Does he need new goods? – Inserting new records in the data warehouse
- Does he need more of the existing goods? – Updating records in the data warehouse
- Does he need to remove goods which are not in demand? – Deleting records in the data warehouse
If a small provision store needs the available data to decide its progress, we can imagine the power of data in any big organization. Just think: 10 years ago a 500 MB USB drive was sold for INR 2500.00, but now we can get a 500 GB flash drive for INR 2500.00. What does that mean? The data has grown, so we need more space for a lesser price!!!
As of today the world population is about 7,195,883,823, and most people are connected with one or more public or private organizations – banks, schools, colleges, employers, etc. Like the shopkeeper above, these organizations need to maintain the data relating to all those people. In today’s world all of us are connected through the internet, be it social networking, e-commerce or online banking, so experts in the Information Technology industry are developing robust techniques to maintain and validate such huge data and assist businesses to completely change the way they have been working.
Just as the analysis of warehouse data and customers’ buying trends helps a shopkeeper plan his strategy to buy goods and maximize profits, any organization needs to analyze its own internal data (about the various departments and people in the organization) and its customer data to forecast its business and strategize to maximize profits. So they seek Information Technology specialists’ expertise for designing a data warehouse. Technology specialists propose solutions at various levels to capture and maintain the data for analysis and forecasting.
This gave birth to a closed group called Business Intelligence (Group Information Systems). There are many roles involved in capturing the customer data. Let’s discuss the responsibilities of a few of them.
1. Solution Architect, who designs the whole system:
- How is the data retrieved from the source?
- How is the retrieved data loaded into the warehouse?
- How is the data used for analytic purposes?
2. Business Analysts, who:
- Analyze the business needs
- Create the rules for capturing the right data from the source systems
3. Data Analysts, who design how the source system and the target (warehouse) interact with each other:
- What data can be loaded?
- How should it be sent into the data warehouse?
4. Developers, who convert the business needs into code that will:
a. Extract the data from the source system
b. Load the data into the warehouse based on the rules provided by the data modelers and business analysts
5. Testers, who validate:
a. whether the source is sending the data they agreed to send
b. the data received, based on the business needs
c. the transformation rules defined by the business analysts in the Source-to-Target document
d. that the code developed by the developers follows the business/transformation rules provided
Folks – please comment on how useful this write-up is!!! I will be posting about Test Planning (estimation, test plans, test cases and test conditions) in my next blog.
Many Thanks – Asik
This post is updated with the valuable inputs of Harika Dommeti (Thanks :-))