After a gap, today my blog is about Process Control and its importance in ETL Testing.
Everything in the world should be controlled by something, like how a mom controls her kids 🙂 Why do we need to control the ETL process? And why does it matter for testers?
If we do not control the ETL process, then we might not make the correct Inserts and Updates to the target system. The code may wear out and behave crazily, and if something goes wrong you can't find the root cause of the issue. If you lose a few records in an uncontrolled ETL process, those records will remain unknown forever. With process control in place, a data fix is easy when something goes wrong, and we can easily track the source's behavior through the ETL process.
The extract, transform and load mechanism is controlled by a few attributes.
1. Load Date
2. Load Time
3. Process Identifiers
4. Record Active Date
5. Record Expiry Date
Testers should verify these attributes to answer questions such as:
1. Is the data loaded for today's run?
2. Is the data updated correctly?
3. Are we loading the correct data from the source system?
4. Are we maintaining the historical data?
5. Are the Insert and Update processes happening correctly?
6. Are we doing the correct catch-up for the delta records?
(Catch-up means: I inserted a record today and its delta record arrives one week later; that delta should expire the existing record and insert the new record.)
7. Is reconciliation achieved in both Insert and Update scenarios?
In the ETL process, the first step is loading into Intermediate Tables – just dumping all the source data into a table for further processing. You can't keep the data in the intermediate tables forever because of the database size, so you need to delete the data. You can either Truncate or Purge it.
Truncate – delete and reload the data in the intermediate tables on every run.
Purge – keep a few days of data in the intermediate tables and delete anything older. This is useful for retaining historical data to do the catch-up.
Here testers need to verify that truncation of records is working, using the date columns in the intermediate tables, and that purging of records happens for the given data intervals, using the date and time columns.
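As a rough sketch of the two cleanup strategies (the row layout and the 5-day retention window are made-up assumptions, not from any real project), the difference looks like this:

```python
from datetime import date, timedelta

# Hypothetical intermediate-table rows: (load_date, payload)
staging = [
    (date(2013, 12, 14), "row-1"),
    (date(2013, 12, 18), "row-2"),
    (date(2013, 12, 21), "row-3"),
]

def truncate(rows):
    """Truncate strategy: wipe the table completely before each new load."""
    return []

def purge(rows, keep_days, today):
    """Purge strategy: keep only the last `keep_days` days of history."""
    cutoff = today - timedelta(days=keep_days)
    return [r for r in rows if r[0] >= cutoff]

# Keep 5 days of history as of 21/12/2013: row-1 is too old and gets purged
kept = purge(staging, keep_days=5, today=date(2013, 12, 21))
```

A tester would run the equivalent date-column checks in SQL against the real intermediate tables.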
So now we have extracted the source data and it is under control for transformation – and we do not have any delta logic in the intermediate tables.
Coming to the warehouse tables, we have Inserts, Updates and Logical Deletes. Let me explain how these processes are controlled and how testers need to verify them.
Again, let me take my Facebook account for better understanding.
Inserts and Process Control
Today I created a profile using the Facebook sign-up page !! When a Facebook engineer loads my data into their warehouse, they attach today's date as my Effective Date, that is 21/12/2013, and an Expiry Date of HIGH DATE (because they don't know when I will deactivate the account). They will also give the record a Process Active Identifier (to denote whether it is an Insert or Update record), and as the record is active they will not give it a Process Expiry Identifier.
> Testers need to verify that the date columns and the identifiers are for the current day's load.
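A minimal sketch of the insert scenario (the attribute names, the process identifier value, and the HIGH DATE constant are illustrative assumptions, not Facebook's actual schema):

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # conventional "never expires" placeholder

def insert_record(key, run_date):
    """Build a newly inserted warehouse record with its process-control attributes."""
    return {
        "key": key,
        "effective_date": run_date,        # the day the record entered the warehouse
        "expiry_date": HIGH_DATE,          # active record: end date unknown
        "process_active_id": "INS_BATCH",  # hypothetical id of the inserting process
        "process_expiry_id": None,         # never expired, so no expiry identifier
    }

rec = insert_record("asik_profile", date(2013, 12, 21))

# Tester's checks: the dates and identifiers belong to the current day's load
assert rec["effective_date"] == date(2013, 12, 21)
assert rec["expiry_date"] == HIGH_DATE
assert rec["process_expiry_id"] is None
```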
Updates and Process Control
The very next day I updated my Display Name, so now my record goes through as a delta.
What will happen to the existing record's Effective Date, Expiry Date, Process Active Identifier and Process Expiry Identifier?
What will happen to the new record's Effective Date, Expiry Date, Process Active Identifier and Process Expiry Identifier?
The answers to the above questions depend on our requirements. So testers should verify that the transformations for these Process Control values are achieved as per the Source to Target specifications.
Logical Deletes and Process Control
On the third day I deactivate my Facebook account. Now my record in the warehouse should be deactivated. In this case,
what will happen to the Effective Date, Expiry Date, Process Active Identifier and Process Expiry Identifier columns for my deleted record?
Again, testers should verify that the correct transformation is achieved for the deleted records as per the transformation rules.
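The update and logical-delete scenarios can be sketched together in an SCD Type 2 style. This is only one common way of answering the questions above – the actual behavior must follow your Source to Target specification, and all the names here are assumptions:

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)

def expire_active(history, run_date):
    """Close the currently active record (the one whose expiry_date is HIGH_DATE)."""
    for rec in history:
        if rec["expiry_date"] == HIGH_DATE:
            rec["expiry_date"] = run_date - timedelta(days=1)
            rec["process_expiry_id"] = "DELTA"

def apply_delta(history, new_values, run_date, deleted=False):
    """Updates and logical deletes both expire the old record and add a new one."""
    expire_active(history, run_date)
    history.append({
        **new_values,
        "effective_date": run_date,
        "expiry_date": HIGH_DATE,
        "process_active_id": "DELTA",
        "process_expiry_id": None,
        "logically_deleted": deleted,
    })

# Day 1: insert; Day 2: display-name update; Day 3: logical delete
history = []
apply_delta(history, {"name": "Asik"}, date(2013, 12, 21))
apply_delta(history, {"name": "Asik Ali"}, date(2013, 12, 22))
apply_delta(history, {"name": "Asik Ali"}, date(2013, 12, 23), deleted=True)

assert len(history) == 3                           # full history is retained
assert history[0]["expiry_date"] == date(2013, 12, 21)
assert history[-1]["logically_deleted"] is True
```

Notice that the first record's Expiry Date becomes the day before the delta's Effective Date, which is exactly what a tester reconciles against the spec.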
By verifying these Process Control attributes we can ensure that:
"We are inserting the correct records into the warehouse"
"We are updating the correct records"
"We are logically deleting the correct records"
"Catch-up of the delta records is achieved"
"We are maintaining the historical data, and so on :-)"
Have a pleasant week ahead !!!
In this post let me explain a Project Structure.
I have a Multi-Specialty Hospital located across the world. My hospital is famous for vaccination. Patients who come to my hospital across the globe are given a User Card with which they can access any of my hospitals in the world.
Now we maintain all customers in one database. There are heaps and heaps of customers related to my hospital, so I decided to split up the customers based on country and load them into the corresponding country tables. Elaborated details of this requirement will be mentioned in the Business Requirement Specification.
To pull the customers per country, a designer should know all the places where the customer data is available. The data extraction will be done by our Source System: they will pull all the relevant customer data and give us a data file.
In the Design Documents you will have:
Source File Specification
1) File Name Specification – name string, extension of the files
2) Date and Time format of the file – YYYYMMDD, HHMMSSTT or any other format
3) Control File Specification – name string, extension of the files
4) Header Record Layout – |H|Customer_Records|20131012|1245872|
5) Detail Record Layout – |D|Asik|123456|20101012|20121013|MVD|Sanjay|NSW|AU|06031987|A
6) Trailer Record Layout – |T|10|
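The header/detail/trailer layout above lends itself to a simple reconciliation check. Here is a sketch that verifies the trailer's declared record count against the actual number of detail records (the sample lines are taken from the layouts above, with the trailer count adjusted to match one detail record):

```python
def reconcile_file(lines):
    """Return True when the trailer count matches the number of detail records."""
    details = [l for l in lines if l.startswith("|D|")]
    trailers = [l for l in lines if l.startswith("|T|")]
    if len(trailers) != 1:
        return False                      # exactly one trailer record expected
    declared = int(trailers[0].strip().strip("|").split("|")[1])
    return declared == len(details)

sample = [
    "|H|Customer_Records|20131012|1245872|",
    "|D|Asik|123456|20101012|20121013|MVD|Sanjay|NSW|AU|06031987|A",
    "|T|1|",
]
assert reconcile_file(sample) is True
```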
The Detail Record layout tells you what data you are getting from the source, its data type, whether it is mandatory or not, and the length of each column.
|File Position|Column_Name|Field Length|Data Type|Mandatory (Y/N)|Key Column|
|3|Customer Open Date|8|DATE|Y|N|
|4|Last Consulted Date|8|DATE|N|N|
|10|Date of Birth|8|DATE|N|N|
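Field-level validation against such a layout can be sketched like this (the check mirrors the spec table above; the YYYYMMDD date format comes from the file specification):

```python
from datetime import datetime

def check_field(value, length, dtype, mandatory):
    """Validate one field against its spec row: presence, length, and data type."""
    if value == "":
        return not mandatory              # empty is acceptable only when optional
    if len(value) != length:
        return False
    if dtype == "DATE":
        try:
            datetime.strptime(value, "%Y%m%d")   # e.g. 20101012
        except ValueError:
            return False
    return True

assert check_field("20101012", 8, "DATE", True) is True
assert check_field("", 8, "DATE", True) is False           # mandatory but missing
assert check_field("20101340", 8, "DATE", False) is False  # month 13 is invalid
```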
Please refer to my post – Source File Validation – to know how to validate the above details.
The sample file will follow the layouts above. Now, using the ETL process, we load the data into the Staging Tables; the intermediate tables will look like below.
Staging Table Load
Now my staging data is ready and we need to load it into the corresponding Target Tables.
As a project, we will test the data load from file to Staging as Phase I, and from Staging to the Target tables as Phase II.
In this project we do not have any transformation rules, so we will simply insert the records from the staging tables into the corresponding target tables.
All customers related to India will go to Table_India, and so on. In my next blog let me come up with a few more complex examples for the target table loads.
The Source to Target document will look like:
|Customers from India will be loaded by the below logic|
|S.No|Source Column|Source Table|Transformation Rule|Target Table|Key / Value|Target Column|
|1|Customer Name|CUST_STG|If source value is null, do not load; else load|CUST_IND|Key|CUST_NAME|
|2|Customer ID|CUST_STG|If source value is null, do not load; else load|CUST_IND|Key|CUST_ID|
|3|Customer Open Date|CUST_STG|Assign source value|CUST_IND|Value|CUST_OP_DT|
|4|Last Consulted Date|CUST_STG|Assign source value|CUST_IND|Value|CUST_CONS_DT|
|5|Vaccination Type|CUST_STG|Assign source value|CUST_IND|Value|CUST_VAC_TYPE|
|6|Doctor Consulted|CUST_STG|Assign source value|CUST_IND|Value|CUST_DR_CONS|
|7|State|CUST_STG|If source value is null, do not load; else load|CUST_IND|Value|CUST_STATE|
|8|Country|CUST_STG|If source value is null, do not load; else load|CUST_IND|Value|CUST_COUNTRY|
|9|Post Code|CUST_STG|If source value is null, do not load; else load|CUST_IND|Value|CUST_POS_C|
|10|Date of Birth|CUST_STG|Assign source value|CUST_IND|Value|CUST_DOB|
|11|Active Customer|CUST_STG|Assign source value|CUST_IND|Value|CUST_STATUS|
So, based on the above rules, we will load the data into the CUST_IND table.
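The routing rule plus the null-key check can be sketched as follows (the row dictionaries and the country codes are illustrative; a real load would be SQL driven by the S2T document):

```python
def load_targets(staging_rows):
    """Route staging rows to per-country target tables, skipping rows with null keys."""
    targets = {}
    for row in staging_rows:
        # Key columns per the S2T: if the source value is null, do not load
        if not row.get("customer_name") or not row.get("customer_id"):
            continue
        table = "CUST_" + row["country"]          # e.g. CUST_IND for India
        targets.setdefault(table, []).append(row)
    return targets

rows = [
    {"customer_name": "Asik", "customer_id": "123456", "country": "IND"},
    {"customer_name": None,   "customer_id": "999999", "country": "IND"},  # dropped
    {"customer_name": "John", "customer_id": "777777", "country": "USA"},
]
assert len(load_targets(rows)["CUST_IND"]) == 1
assert "CUST_USA" in load_targets(rows)
```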
Have you created, updated and deleted a Facebook account to learn about DWH concepts? Today in this post let me explain the necessity and importance of data types in ETL Testing.
We will start with some familiar examples:
Can you fill 10 liters of water into a 5-liter container?
"No, the container can hold only 5 liters of water; if you fill more than its capacity, it will burst :-("
Can you use salt instead of sugar to make tea?
"No, then everyone will stop drinking tea :-("
Can we name a kid using numbers?
"No – if we used numbers, how many duplicate persons would exist in this world? Just imagine if I were named 10215 !!!"
Can anyone have their bank balance as a whole number?
"No, because the money you spend is a fractional amount! You can't have $5 all the time; it would be $5.28 most of the time."
Can your mobile number have more than 10 digits? Can your mobile number contain alphabets?
"No, because the mobile number length is pre-defined and the number can't contain alphabets."
Like the above examples, our source files, source tables and target tables are constructed with limitations. You can't keep whatever data you want – you can only keep the data that the system can accept.
Every programming language has these data types, and most of those who read this post know the basics of them:
INTEGER, CHAR, VARCHAR, DECIMAL, FLOAT, etc.
Most of the problems developers and testers encounter because of data types in the data warehouse world are:
1. The correct data type is not chosen in the source tables.
2. The correct length of data is not received from the source system in the source file.
3. The source is not giving values as per the data types mentioned.
4. Data gets truncated when loading it into the target tables.
5. The amount column's precision is not populated correctly, as Teradata rounds the value off.
6. Regarding dates, the source sends them as VARCHAR, but when we load them into the target tables we keep them as DATE, and the format must match.
The data type and its length are defined in the table's DDL – Data Definition Language. If you want to know a table's properties, then please use the queries below (in Teradata):
a) "SHOW TABLE Database.Table_Name" – this gives you the full definition: data types, data lengths, NOT NULL/NULL and Primary Key definitions.
b) "HELP TABLE Database.Table_Name" – this gives you information about the table's columns.
As a tester, what do we need to verify?
Again, as I said:
Check that the data matches the data type mentioned in the spec.
Check whether any data truncation happened when the source data was loaded into the staging tables.
Check that the data in the staging tables is as per the staging table DDL.
Check that the target table columns are loaded as per the target table DDL.
If it is a VARCHAR column from the source, then take care of spaces, invalid characters etc. from the source through to the staging tables, because DataStage will not accept special characters.
If it is a DECIMAL column, then make sure the precision is carried through to the target load.
If it is an INTEGER column, then make sure you do not get blanks and spaces from the source.
If it is a CHAR column, then check that the length of the data fits into the column.
Like the above, we can add many more scenarios for data type verification.
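These checks can be automated with small helpers. Here is a sketch with assumed lengths and precisions, not tied to any real DDL:

```python
import re
from decimal import Decimal, InvalidOperation

def check_varchar(value, max_len):
    """No surrounding spaces or special characters, and within the defined length."""
    return (value == value.strip()
            and len(value) <= max_len
            and re.fullmatch(r"[A-Za-z0-9 ]*", value) is not None)

def check_decimal(value, precision, scale):
    """Value must fit DECIMAL(precision, scale) without losing fractional digits."""
    try:
        parts = Decimal(value).as_tuple()
    except InvalidOperation:
        return False
    return len(parts.digits) <= precision and -parts.exponent <= scale

def check_integer(value):
    """Integer columns should not contain blanks, spaces, or non-digit characters."""
    return value.isdigit()

assert check_varchar("Asik", 10) and not check_varchar("As#k", 10)
assert check_decimal("5.28", 5, 2) and not check_decimal("5.283", 5, 2)
assert check_integer("123") and not check_integer(" 123")
```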
Hope this blog helps you understand the importance of data types in ETL Testing.
See you in my next post.
Cheers – Asik