About Source to Target (S2T) Document or Mapping Document
S2T document is the bible of any ETL Projects. This will be designed by the Data Modeler’s using the FSD’s (Functional Specification Documents).
Data Modeler’s are interested in
- the Source Data’s nature
- Source Data is expected as it is mentioned in FSD
- Are the source is sending the correct data type as they promised?
- All the Date and Time fields are expected
- Columns that are participating in Referential Integrity are falling under the data set given in FSD
- Source Table and Target Table nature – like Set, Multiset etc.,
- Is the Source is Time Variant or Non Time Variant
- What all are the Key columns that makes the Record unique from the source?
- What all are the value columns that triggers the delta from the Source?
- Logical connections of the Source Tables participating in ETL Loads
- Logical connections of the Target Tables participating in ETL Loads
- Physical connections of the Source Tables participating in ETL Loads
- Physical connections of the Target Tables participating in ETL Loads
Most of the S2T’s are written in Excel spread sheet. Where you will find many columns, each column is important for the ETL Load to be designed.
Major components of an S2T,
- Version Number – To maintain and track the ongoing changes happening in the S2T are tracked
using the Version Numbers.
SourceDatabase– Source Database name or names will be mentioned in this space
Source Column– All the Source columns from respective Source Database are mentioned in this space (these columns will undergo the transformations if required.
Extraction Criteria– As I already mentioned, we are not going to pull all the data from the source before transformation however we need to pull the data that we required for the transformation and this will be mentioned in this space. All the Joins and Unions will be done here so as a Tester we need to understand and analysis this area with more care.
Example – Now I wanted to load Customers details and their balance from Savings account. Customers Details will be fetched from Cust_Detl table and the Balance from ACCT_BALANCE table, so you need to perform join in between. This extracting filters only Customers Savings account. So I consider this is my Data extraction criteria.
Filter Criteria– After we extracted the data from the source we need to filter the data if required for few or more target tables and this will be covered in this space.
Example – I have extracted 100 records from the source which is linked to Savings account and to ACCT_OVER_LIMIT , we just want to load , customer’s who’s acct balance is more than $100, so Data Modelers, will use the filter rule just below to the Record extraction criteria.
Target Database – Target Database name or names will be mentioned in this space
Target Column – All the Target columns from respective Target Database are mentioned in this space (these columns will undergo the transformations if required.
Key or Value Columns–Next to the Target Column names you could see Key / Value. Key column means the column that makes the record unique and Value means, what makes the record time variant.
Comments and Version Changes – This space will explain us what was in the S2T before and what changed now. Comments will tell us more about the transformation.
What we need to look in the S2T?
As soon as you get the S2T, please query your Staging Source tables and check the data that you have got will satisfy your transformation rule. S2T’s will be written with the SIT phase data, and the rules mentioned might change as soon as we get the UAT Phase data. So to achieve the good quality of testing we always interested in UAT data.
Example: Suppose if Data modeler mentioned a column as Key column (which should not change consecutive loads). But you have noticed that column values are changing from source on consecutive loads then you should notify the data modeler and the source system , either one should be correct and have to make the changes from their end.
Most of us will confuse the below transformation logic
- Difference between IN and NOT IN operator
Example: If the source columns are NOT NULL BLANK SPACE 0 and DO NOT LOAD the record. In this case, we might look for the record in the target when it has Unknown values.
- Joins – Even if the Data modelers mentioned the Joins in the Extraction rules we need to ensure the joins are working with the source data provide. This should be done because the data modelers will not experiment the transformation rules on the source data given.
- Reference tables – In every S2T you will have separate spread sheets that will have reference table names and the data needs to be inserted and the Insert scripts prepared by the Data Modelers.
Thanks for reading – Comments are Expected 🙂
Cheers – Asik
Hope you are enjoying your weekend. Have a wonderful weekend ahead !!!
Today I am going to explain how we need to take care of Nulls Blanks Spaces 0’s from Source System. These Nulls Blanks Spaces 0’s are extracted from the upstream applications and given to us to load it into Ware house. But our Data Modelers may or may not want them to put into ware house as it will not give meaning to the business.
Origination of Nulls Blanks Spaces 0s from Source System
I hope everyone used Online applications for your job search. You would have seen a ( * ) symbol in most of the field , so you need to enter some value.
1.In some field , you cant enter Numeric values
2.In some field , you can only enter Numeric Values
3.In some field , dates as per the given format
4.In some fields , you need to give your contact number as per the country standards
5.In some fields , you need to give your correct Email Id.
Like above you will be having so many check points for an online application. Say for an example, 2000 people applied for a Test Engineer job in my Company. And my Online Application has few Free Text field, applicant can enter what ever wanted to. In that few of them gave
1. Spaces in the Applicant Name (by mistake)
2. 0’s in the Experience column (because he is fresher)
3. Combinations of Spaces and 0s in the Phone Number field
4.They didn’t mention date fields.
My source system refers this Online webpage and pulls data and stores into their database. And will extract the data as files.
We will get the file as
D|Asik|8|06031987|9880112345|IND|PERSON@GMAIL.COM – Valid Record
D||8|06031987|9880112345|IND|PERSON@GMAIL.COM – In Valid Record
We load these records into our source tables we will not consider these NULLs Blanks 0s and Spaces as the Staging table data types are VarChar (it can have all types of data).
Our Source to Target specification will be taking care of these Unknowns.
a) If the source field_1 is NULL then populate Null in the corresponding target column_1
b) If the source field_2 is NULL then do not load the record
c) If the source field_3 is 0 then populate Null in the corresponding target column_3
d) If the source field_4 is 0 then do not load the record
e) If the source field_5 is BLANK then populate Null in the corresponding target column_5
f) If the source field_6 is BLANK then do not load the record
g) If the source field_7 is Space then populate Null in the corresponding target column_7
h) If the source field_8 is Space then do not load the record
This verification can be done for each columns separately or collectively to one column like,
If Column_1 from source is NULL, BLANK, SPACE, 0 then populate NULL to the corresponding target column_1
If Column_1 from source is NULL, BLANK, SPACE, 0 then Do Not load the record into ware house.
Whatever the circumstances our ETL code should stop Unknown records loading into warehouse.
Most of the S2T extraction criteria designed to stop the records into Warehouse if (mentioned) column values are any of NULL BLANK SPACE. Source system does not have any validations to stop Unknowns When they extract from their upstream. These unknown checks are closely dependent on Data Types of the mentioned columns.When tester designs the SQL, to test these scenarios, he should consider the Data Type of the columns.
1. Integer: Teradata accepts only NULLs and 0s for INTEGER columns. If we insert BLANK or SPACE it will take it as 0s only
2. Decimal: Teradata accepts only NULLs and 0.00s for Decimal columns. If we insert BLANK or SPACE it will take it as 0.0s only
3. CHAR: All NULL BLANK SPACE 0 can be possible for CHAR columns, so we need to test all the Unknown possibilities for CHAR columns
4. VARCHAR: All NULL BLANK SPACE 0 can be possible for VARCHAR columns, so we need to test all the
Unknown possibilities for VARCHAR columns
When we test VARCHAR column we should be very conscious because:
Source Record from file,
Scenario_01 – (0s with Space in VARCHAR columns)
|D| Asik| 51302138| 0|
These values will be loaded with spaces into staging tables. The tester designed SQL (where Column<>0) then this
record will not be stopped! So it should be (where TRIM(Column)<>0).
If you find any issues like this please raise the concern to Developer and ask them to add TRIM for
Scenario_02 (multiple 0000s in VRACHAR or CHAR columns)
|D| Asik| 51302138| 00| –> CHAR with 00
|D| Asik| 51302138|000000| –> VARCHAR with 000000
|D| Asik| 51302138|0| –>CHAR or VARCHAR with 0
Source can send 0, 00, 00000 to VARCHAR or CHAR columns, but the meaning is same its 0. If the tester designed SQL (where Column<>0) then only the 3rd record will be stopped whereas other two records will be loaded to in Target tables.
If you find any issues like this then please raise concerns to developer to add TRIM Leading 0 and check <> 0.
Hope you all will Stop Unknown records into my warehouse.
My Next Post is will be a Sample Project !!!!
Chapter 8 – Business Intelligence – Do you know how to Make Cofee? Then you know (E)xtract T(ransform) & L(oad)
[I am using Civil Engineers and Coffee making scenarios for Better Understanding for Beginners]
Most widely spoken concept in Business Intelligence world is ETL – Extract Transform and Load. The persons who are in IT field already they call it as ETL Architecture 🙂 .
ETL Architecture :
Who use the word Architecture most commonly ? – Civil Engineers. They say Building Architecture so when you are constructing something you are using the word Architecture. You cant construct anything without prior planning. So a Civil engineer will draw a diagram that will tell how the building going to look like.
Likewise in Business Intelligence world, we have Solution Designers (Civil Engineers) who designs the ETL System and they are also called System Architectures!!!
So now you understood why we are calling it as ETL System Architect 🙂
Now let me explain how Extract Transform and Load process works.
How to make Coffee?
Asik in the below image is preparing coffee for his wife 😦 .
In the first step, he is preparing decoction (extracting the essence of the coffee from Beans).
In the second step, he is transforming it to Coffee by adding correct proportion Milk,Sugar
In the third Step, he loaded the Coffee into Container for Drinking
After drinking the coffee (testing) he found (defects) he needs to add more sugar 😦 he added and ready for (reporting) studying.
Now how ETL works ?
Developers involved in designing the Extract Transform Load Jobs. What is a Job? to make it understand ! we can say it is a Function.
These jobs will be written using Unix Scripting.
Extract Job (decoction ready) will bring all the data into Staging tables – testers needs to verify all the data is moved from file to Staging table
Transform Job (added milk and sugar) will apply rules and filter the records , alters the records as per documents – testers needs to verify all the data transformed as expected
Load Jobs (coffee ready ) once the transform jobs ran successful then Load jobs will load the data into Intermediate tables (will be used for delta check in Day_02) – testers needs to verify the the transformed data is loaded into target tables
1.In Load job we have Insert as well as Delta
Hope you guys understood the ETL Logic!!!
In my next blog, I will explain about INSERTS , UPDATES , Logically DELETES
Thanks for viewing my blog so far ! I wanted to keep sharing my knowledge as much as I can – Happy Reading 🙂
In this blog I am going to explain about Test Data. In other testing practices you don’t need predefined test data, as we are the user who updated the values either in the Front End applications or some other applications. But in Data warehouse we are the ones who are completely depend on our Upstream (front end application) applications. We will test what all are the data that used in the Upstream guys.
We testers are the intermediate guys who buys the data from one team and validate , transform and sell it to other guys 🙂
Let me explain you in normal terms, you are doing a business , you need to buy the goods from your buyer and sell to u r customer!
What all you need to do?
– Need to buy quality products
– Need to check the count what you ordered
– Need to check the specification what you asked for?
– Need their support in fixing the issues
– if they cant fix it, then you need to repair it and sell it to your customer
Now come to our Business Intelligence scenarios,
As a tester what you need to do with the data ?
– Check the quality of the data (is as per the specifications)
– Check the count is correct using Trailer records (refer my Source Data validation Blog)
– Check if the data is not as per requirement then ask upstream guys (the people who gave the data to you) to fix the data issues
– Check you have got all the data that you required, if not ask your upstream to provide more data, suppose if the deny, please manufacture your data 😦
As I said already in my earlier blogs we have below types of DWH testing
1. Loads from file to staging tables to IM tables
2. Loads from table to staging tables to IM tables
3. DIMN/FACT projects
4. Extract projects
Test Data fabricating for Staging table to Target tables are quite easy, but for Extract Testing and Reports testing are bit difficult 😦
For most of the positive scenarios Upstream will provide the data for us !! but we tester wants to break the code so we will have few negative scenarios for which we will not get data from upstream. And for the scenarios which will happen once in a while we will not get the data so we need to create the data.
For Extracts, the Teradata SQL codes will be written by joining more than 10 tables , each tables will have its own WHERE clauses. After extracting the records using the SQL , you found that few scenarios are missing then you need to set up a data that needs to flow across all the tables.( Will be explained in my Know about Extracts Blog)
As I said in my Test Environment blog about the access, the testers should have Create table access in Test Environment and Select access to View Data Base.
So using Create Table access we can create our own test data as follows:
Please read the image very carefully, if you don’t understand please send me an EmailTo : firstname.lastname@example.org
1 Create a TEMP table in you test Environment using Create Table Qry.
2. Insert sample records into your TEMP tables using INSERT INTO Qry.
3. Modify the data as per your requirement
4.Insert back the modified data into your actual table.
5.Your test data is ready to use 🙂
Hope you guys understood the test data manufacturing technique . If you guys need more support please comment in this blog 🙂