Being a gap given today my blog about Process Control and its importance in ETL Testing.
Everything in the world should be controlled by something, like how a mom controls her kids 🙂 Why we need to control the ETL process? and why it does matter for the Testers?
If we do not control the ETL process then we might not do the Correct Inserts and Updates to the target system. The code may be worn out and it behaves crazy. And if something goes wrong then you cant find the root cause of the issue. If you would have lost few records in ETL process without controlling the process then those records will be unknown for ever. Data fix is easy if something goes wrong. We can easily control the source behavior by controlling the ETL Process.
The extract , load and transfer mechanism is controlled by few attributes.
1. Load Date
2. Load Time
4. Process Identifiers
5. Record Active Date
6. Record Expiry Date
Testers should verify these attributes because
1. Are the data loaded for today’s run?
2.Are the data updated correctly?
3.Are we loading the correct data from source system?
4.Are we maintaining the historical data?
5.Are the Insert and Update process is happening correctly?
6. Are we doing the correct Catch up for the delta records
(Catch up means, I have inserted a record today and its delta record is coming one week later, this record should expire the existing record and should insert the new record.)
7. Are the reconciliation is achieved in both Insert and Update scenarios?
In ETL process, the first step is loading into Intermediate Tables – just dumping all the source data into a table for further processing. You cant keep the data in the Intermediate tables for ever, because of the Database size. So you need to delete the data. So you can Truncate or Purge the data.
Truncate – Every time delete and load the data into intermediate tables.
Purge – Keep few days of data into Intermediate tables and delete the older data from Intermediate tables. This will be useful for capturing the historical data to do the catch up.
Here testers need to verify Truncating of records are functioning or not using the Date Columns in the Intermediate tables and Purging of records are happening for given data intervals using the Date and Time columns.
So now we are extracted the Source data and its controlled for Transforming – and we do not have any Delta Logic’s in the Intermediate tables.
Coming to the Warehouse tables, we have Inserts, Updates and Logical Deletes , let me explain you how these process are controlled and how testers needs to verify the same.
Again let me take my Face Book account for better understanding.
Inserts and Process Control
Today I have created a profile using the Face Book sign up page !! and when Face Book engineer load my data into their ware house they attach today’s date as my Effective Date that is 21/12/2013 and Expiry Date as HIGH DATE (coz they don’t know when I will deactivate the account) and they will give me a Process Active Identifier (to denote it is a Insert or update record) for inserting the record and as the record is active so they will not give Process Expiry Identifier .
> Testers needs to verify Date columns and the Identifiers are for the current day load or not.
Updates and Process Control
The very next data I have updated my Display Name, now my record is going for Delta,
what will happen to existing record’s Effective Date,Effective Date,Process Active Identifier ,Process Expiry Identifier ?
what will happen to new record’s Effective Date,Effective Date,Process Active Identifier ,Process Expiry Identifier ?
Above questions are based on our requirements. So Testers should verify the transformation for these Process Control values are achieved as per the Source to Target specifications.
Logical Deletes and Process Control
On the third day I’m deactivating the Face Book account. Now my record in the warehouse should be deactivate in the case
what will happen to columns Effective Date,Effective Date,Process Active Identifier ,Process Expiry Identifier for my delete record?
Again testers should verify the correct transformation is achieved for the Deleted records as per the Transformation rules.
By ensuring this Process Control attributes we can ensure that
“Are we inserting the correct records into warehouse”
“Are we updating the correct records”
“Are we logically delete the correct records”
“Catch of the records are achieved”
“Maintaining the historical data and so on :-)”
Hi Guys ,
So we are ready with Test Environments, Test Plans, Test Cases to begin with Static Testing.
When we design our test cases we need to write test cases and test conditions right from Source validation till the Reports testing.
1. Test cases to validated the Source Data .
What you wanted to do with source data ? Why it is important ? Why do we need to know the importance of the source Data ?
Let me explain you with normal Scenario!!!
You wanted to purchase a Mobile so what you will do , start analyzing the markets with different channels like [Suggestions from friends] ,[Surfing in Internet], [Checking in news Papers Advertisements], [Checking @ Outlets]. In this you will not get the same details from all of these channels. So we will note down all the data retrieved from different channels and choose what you need !!!
Like the above scenario, a customer who approaches an Organization will go through multiple channels and share multiple information but at the end what you need for your Reports will be Designed , Developed , Tested and Extracted by Us 🙂
As far as I know, if you are 90% OK with the source data then there are high possibility of not finding even a single defect @ any of the test phases unless care less mistakes or code defects found by developers. The defects leaked by Developers can be easily fixed because ETL code is under our control. If a defect is leaked because of the data then it is a huge impact to the projects because we need to go back to the source system and inform them to fix !!!!
What all you need to verify in Source data ?
Just think you are intermediate between a seller (Source) and a target (Seller) . Do you buy defective product from your buyer? Does your buyer buy defective products from you ? No 😦 so please ensure you are buying a non defective product and a selling a 90% defect free product to your customer.
Coming to Business Intelligence 😉
1.The data that we need will be sent to our address (Unix Landing Directory) tester should verify the data has arrived to the correct address (Landing Directory).
2.Our source system will tell us on what Format they are sending the files. So the tester should verify the Name of the file is as what source said to us.
3.Date component in the file name is critical for us, this date is the date when the file is extracted for us.
4.The Source data will be sent to use is different formats , mostly it will be .DAT (data) or .DLY (daily) files.
5.Along with DATA files they will send CTL (Empty file) to make sure all the files arrived -as Acknowledgement copy 🙂
6.Now you validated the address (Landing Directory) and File names.
7. Now open the file you will see Numbers, Alpha-Numeric, Alpha characters separated by [,] or [|] – you need to verify how its separated
8.Most of the files will have 3 sections – Header , Detail and Trailer records.Testers should verify these sections are retrieved as per the specifications.
9.As a tester you should be interested on the data how it is Organized as per the Specification? the order of the column is matching with the order given in Specification ? Data types of the columns are matching with the data types mention in the Specification?
Order of the column:
In the spec you have this Order |Name|Place|Number|DOB| but in the file you are getting as |Name|DOB|Number|Place| —Wohooo its a Defect 🙂
Number of the Columns :
In the spec you have this Order |Name|Place|Number|DOB| but in the file you are getting as |Name|DOB|Place| —Wohooo its a Defect 🙂
Data Type of the Columns:
In the spec you have this Order |Name as VARCHAR |Place as VARCHAR |Number as INTEGER |DOB as DATE | but in the file you are getting as |Name as VARCHAR |Place as VARCHAR |Number as CHAR|DOB as DATE | Wohooo its a Defect 🙂
File name specification :
What are the file name, how it should be named, and extension of the file (.DAT), business time stamp?
File name should be as per the Spec, because developers mention the exact file name in their Data Stage File watched job. If the file name is not as per the Spec then the Job fail due to the mismatch. The time stamp filed (Date + Time) should be validated, example the Spec says the Date should be DDMMYYYY if source is sending the date as MMDDYYYY then our Data Stage job fails due to Invalid Date.
File landing location :
Each project has specific location to X-Com the files from Source System. Testers need to verify the file is arrived
to the correct location.
(Note: If tester cannot able to view the file due to File size, he can view the file using Unix Commands in PUTTY
File Header specification :
In general every file must have Header and the header record will contain when the file is processed. The testers
should aware of the Date in External file name and the date in the Header details.
i) Date in External file name: The date when the file is X-Com’d
ii) Date in Header record: The date when the file is processed.
Suppose if SAP system is extracting the file on 3/12/2013 and X-Com’ing on 4/12/2013, then External file name is
4/12/2013 and Header date is 3/12/2013. The Header is very important because Developer will use the Header
date as EFFECTIVE Date for Data loads.
File detailed records specification:
Detailed records section of the file contains the actual data that need to be loaded into Warehouse. Detail records
are separated by Pipe delimiter in general and it may change from source to source. Testers should be very
conscious about detailed records.
|File Position||Column Name||Key Column||Null / Not Null||Data Type||Filed Length|
The testers should verify:
1.The columns are in the correct order as per the Spec
2.If the columns are mentioned as Key then source should send us only one record for that combination, if
find more than one record after loading into the Staging tables then raise defect
3.If the columns are mentioned as Not Null in the spec then we should not raise Nulls (||) from the source,
suppose if we get NULLs , our Data Stage job fails due to NULLs presents in Not Null Columns , now
raise defect to Source system
4.Data type verification is the critical stuff, if a column is defined as INT then the source should send INT only
if not please raise defect to source system.
5. Filed length should be as per the spec, if the source sends less then what spec says then its OK, if it
exceeds then please raise defects to source system, because it will fail our Data Stage load jobs.
File Trailer record specification:
Trailers records will tell you how many records in the file. And the count could be including Header and Trailer. If
the file is related to transactional data then we might have a Hash total count column.
The tester should use the Trailer record count to verify the reconciliation between source file to staging table data.
So now we have validated our source 🙂 @ my next blog we are going to see how to deal with the Requirement documents for execution.
Thanks – Asik
In general BI Projects will have 3 environments
Production – where our tested code will function in real time
Development – where developers develops and Unit tests the code
Test Environment (System Test / SIT / E2E / UAT) – where the developed code will be deployed for testing
Setting up a Test Environment for Business Intelligence project is critical because the data for different level of testing is different. Developers will develop the ETL code in DEV (development) environment and when the testing phase kicks of they will point their codes to Testing environments.
Why we need a separate environment?
1. ACCESS and Test Environment
Because Developers are the one who designs the code and they will keep on changing the code until it works. We don’t have the version control in Data Stage or in Teradata. The development environment is the open space for all the Developers involved and there are high possibility of irregular updates on the code so the DEV environment is loosely controlled. Developers will have INSERT / UPDATE / DELETE access to all the designs in DEV and Test Environments. But testers will have only VIEW access on VIEWS (will explain what is Table and View in my next blog). Tester can only verify the data is as per ETL Code , they cant update any record to make them correct as per requirement. And developers also should agree what ever defects found in our environment should be open in DEV environments.
2. Data and Test Environments
Data that used for Developing the code should be different to the data is used for testing why because developers used to create Test Data for UNIT testing, they are happy when the functionality is working fine as expected. So they are not interested source data quality. If any data causing trouble to them , they simply delete it and load the rest of the data. But testers should be very conscious about the data because data is the key for us to proceed the testing. So testers should have their own data in their own Environment. They should not depend on the source data that relies on Development environment.
What all components needs access for a Tester ?
1. File Landing Directory – If your project is File to Table load then Source system will send the Files to a specified location. This location is different for both Development and Testing. Testers should get the access to this Directory.
2. Select ACCESS to Source Views – Tester can easily check whether they have Select access to the tables by simply querying the tables.If you are not having access to the tables then please create a Grants statement and send to your DBA.
[GRANT SEL ON DB_Name to User_ID;]
Good practice availing the access:
- A tester should run his queries in Views not in Tables, so Leads should make sure that the testers should not have Select access to tables before they start execution.
- Developers are creating Views over the tables, and when they deploy the code into PRODUCTION their code will points to the VIEWs not TABLEs.
- Developers might have introduced few filter statements in Views in order to stop duplicates, which might cause errors. Suppose if testers are tested the Target data using Tables then the errors related to Views could not be identified.
- Testers should not have Insert, Update, Delete access to Source and Target tables.
- Testers should have Create table access on Test Database for manufacturing data (will be explained in Test Data Management Blog)
Availability of the Source Data in Test Environment:
If the data load is into Existing warehouse tables then check the historical data is available in the Test environment by simply selecting the table. If there no records in test environment then request developer to copy sample records from PRODUCTION data into Test Environment.
[INSERT INTO TEST_DB.TABLE SEL * FROM PROD_DB.TABLE SAMPLE 1000;]
Q: Why we should have existing data into Test Environment if it is load into existing table?
A: Because our load into existing should not be delta records for the existing records. This can be verified using the record count before and after the load – the count existing count should not be disturbed.
If the data load is into new warehouse tables then the tables should not contain any data in the Test Environment.
Q: Why we should not have any data New target tables?
A: Because these tables do not exist in PRODUCTION and we are the one going to load the data into these tables.
So it should not contain any data.
Reference Tables in Test Environment
MAP and TYPE tables are used for Referential Integrity. A project can use the existing MAP or TYPE tables or they
can create their own based on the project requirement
- Testers should verify all the MAP and TYPE tables mentioned in the S2T (even if it is not used in S2T transformation) are created in the Test Environment
- If the MAP or TYPE table exists then the testers should verify the data in Test Environment and PROD environment should be same.
- If the MAP or TYPE tables are created project specific then testers should verify the data is inserted correctly as per the Insert statements provide in S2T
- Tester should cross the values inserted into MAP and TYPE tables are matching with DDS (refer Appendix section for Reference data). If you find any data missing or added in the table please raise defect to Data designer.
Now we have everything in our Environment 🙂 Now we are going for Validation of Specification on next Blog.
See you @ my next Blog.
Regards – Asik
Business Intelligence and Testing!!!!
Intelligence is a very general mental capability that, involves the ability to reason, plan, solve problems, comprehend complex ideas, learn quickly and learn from experience.
Further simply put – It is “catching on,” “making sense” of things, or “figuring out” what to do
Business Intelligence is collecting the data from various sources of any organization to aid in planning, solving business problems and learning from experience there by taking proper decisions to expand its business and maximize profits.
Lets understand what a Warehouse is and how we can connect it to Information and Technology!!
Consider a person wants to set up a Provisional Store and he wants to know how to stock the Goods to maintain Supply and Demand life cycle. He’ll start a shop in a small room and gets goods in a limited quantity. 100 kg rice, 50 kg wheat, 100 kg dal, 3-4 varieties chocolates, 3-4 varieties of soaps and starts his business.
People start coming to his store and start buying things. In 15 days his stock is almost sold and then he analyses how the sales were. To achieve this he will maintain all the data on a paper and calculate (Add, Subtract, Multiply, Divide)
He’ll continue this for sometime and forecast the data he collected.
- If analysis says that the goods are sold as per his calculations and he’s making profit then he’ll start getting goods in large quantities. Since the store is very small he cannot accommodate goods in bulk in the store so he will setup a warehouse where he can store all the goods. And small amount of goods required for a month, he’ll move them to the provisional store for selling.
- If analysis says that the few goods in his shop are not sold and he is not making profit as expected but he is not making loss, then he can change the goods he is selling and try his analysis to ear profits
- If analysis says that the goods in his shop are not sold at all and he is in huge loss then he might have to stop his business.
As a shopkeeper he needs to do analysis on the Goods available and the sales in the warehouse to continue or stop buying things (forecasting), he creates different racks to keep different goods to access them easily and very quickly. He should know where he stored the goods and how they are related say for an example. He would store all the Groceries in one rack, Home Maintenance goods in one rack, canned foods in one rack etc., if he stores the goods logically and physically he will be able to get to know about the Stock easily.
Basic analysis to be done by the shopkeeper before he decides to setup a warehouse.
How is the Demand and supply in the warehouse? To answer this he needs to answer the following questions.
- Find out the existing stock of goods
- Does he need new goods – Inserting new records in Data warehouse
- Does he need more number of existing goods -Updating of records in Data warehouse
- Does he need to remove goods which or not in demand – Deleting of records in Data warehouse
If a small provisional store needs to decide the progress based on the available data we can imagine the power of data in any big organization. 10 years ago the cost of a 500 MB USB Drive was sold for INR 2500.00, but now we can get 500 GB Flash Drive for INR 2500.00
As on today we have 7,195,883,823 population in this world. Most of them are connected with one or more Public or Private organization , it could be Banks, Schools, Colleges, Employees, Employers etc.,. Like the above shop keeper the organization needs to maintain the Data relates to 7,195,883,823 people. So Experts in Information and Technology industries are developing robust techniques to maintain and validate the data.
Just think 10 years ago the cost of a 500 MB USB Drive was sold for INR 2500.00, but now we are getting 500 GB Flash Drive for INR 2500.00 – what does it mean ? the data is increased !! so we need more space for lesser price !!!
All organizations are very much interested on Customer data to enhance and forecast their business. So they are seeking Information and Technology specialists for designing. Technology specialist are proposing solutions at various levels to capture and maintain the data for Analysis and Forecasting.
In today’s world, all of us are connected through internet be it social networking, ecommerce, online banking etc. So the experts in Information and Technology industries are developing robust techniques to maintain and validate such huge data and assist the businesses to completely change the way they have been working.
Just like analysis of Warehouse data and the trends of customer’s buying trends help a shopkeeper to plan his strategy to buy goods to maximize the profits, any organization needs to analyze the data of its own organization (Internal data about various departments and people in the organization) and the Customer data to forecast its business and strategize to maximize profits. So they are seeking Information and Technology specialists’ expertise for designing a data warehouse. Technology specialists are proposing solutions at various levels to capture and maintain the data for Analysis and Forecasting.
This gave birth to a closed group called Business Intelligence (Group Information System). There are many roles involved to capture the customer data. Let’s discuss responsibilities of few of them.
- Solution Architect who designs the whole System right from
- How the data is retrieved from source?
- How the retrieved data is loaded into Warehouse?
- How the data is used for analytic purposes?
- Business Analysts who
- Analyze the business needs
- Create the rules for capturing the right data from the source systems.
3. Data Analysts who design how the source system and target (warehouse) interact with each other.
- What data can be loaded?
- How should it be sent into the data warehouse?
4. Developers who convert business needs into coding that will
a. Extract the data from the source system
b. Load the data into warehouse based on the Rules provided by Data modelers and business analysts
- Testers who validate
a. whether source is sending the data what they agreed.
b. Validating the data received based on the business needs
c. Validating the transformation rules defined by the business analysts in Source to target document
d. Validating the code developed by developers are as per the business / transformation rules provided.
Folks – please comment how useful the write up is!!! I will be posting about Test Planning (Estimation,Test Plan, Test Cases and Test Conditions) in my next blog.
Many Thanks – Asik
This post is updated with the valuable inputs of Harika Dommeti (Thanks :-))