Change Data Capture and Slowly Changing Dimension for Nearly Real Time Data Warehouse by Using Triggers, Delta View and Staging Tables
Abstract— In the big data era, data have become increasingly important for Business Intelligence and enterprise data analytics. The load cycle of a traditional data warehouse is fixed and long, so it cannot respond in a timely manner to rapid, real-time data changes. Reducing the load cycle while performing full data loads would require more computational resources and would be very expensive. To bypass the problems related to using computational resources to process large volumes of data, a zero-latency ETL (Extract, Transform and Load) technique can be used, which works by constantly processing small data loads. The process of identifying changed records in order to reduce the extract volume is known as CDC (Change Data Capture). Among the extraction techniques of zero-latency ETL are the use of logs, triggers, materialized views and timestamps. This paper proposes a structure capable of performing this task by means of triggers, together with a tool developed to automatically generate the SQL (Structured Query Language) code that creates these triggers, and shows its performance compared to other CDC techniques. This is helpful for producing a NRTDWH. The paper also explains approaches for loading the data into target tables using incremental load methods (SCD). Dimension delta view generation and a staging table ETL framework are the two methods explained here to process historical data and real-time data separately; both perform SCD Type 2 incremental loading. The delta view and staging table framework for a dimension encompasses its entire set of source tables and produces the set of keys that should be incrementally processed. We have employed this approach in a real-world project and have observed effective real-time data ETL and a reduction in the loading time of large dimensions.
Keywords— CDC (Change Data Capture); ETL (Extract, Transform and Load); Delta View; Incremental Load; Nearly Real Time Data Warehouse (NRTDWH); Slowly Changing Dimension (SCD).
I. INTRODUCTION
One of the leading challenges for large corporations is to obtain and analyze large volumes of data coming from various sources. To support this need, different architectures have been proposed, with the data warehouse model as the main approach. Extraction is the operation of retrieving data from a source system for further use in a data warehouse environment; it is the first step of the ETL process. After extraction, the data can be transformed and loaded into the data warehouse.
The data acquisition process represents one of the most technically demanding and one of the most expensive parts of building a data warehouse. Usually, a traditional data warehouse supports only historical information query and analysis and cannot provide up-to-date, real-time data. After the initial load, the data warehouse must be periodically refreshed to keep up with source data updates. Data warehouse loading is usually done periodically, by a background process. The update patterns (daily, weekly, etc.) of traditional data warehouses and the data integration process result in data that are outdated to a greater or lesser extent [4]. A real-time data warehouse is able to reflect the ETL results at nearly the same time as the transactions occur in the source systems [2].
Traditionally, DW updates are processed with an ETL (Extract, Transform, and Load) tool. ETL is a software program that extracts the data from the source systems, transforms and cleanses the data, and then loads it into the DW [15]. To produce a NRTDWH, ETL can therefore be implemented by applying Change Data Capture [2]. CDC is used to detect changes in the data sources and capture them so that they can be delivered to the destination databases that need them. Reducing the cost and resources associated with updating an enterprise data warehouse is a key problem for business and IT. The interaction of CDC with ETL is shown in Fig. 1. Our tool provides a new approach to moving information into a data warehouse and reduces the cost and resources associated with updating an enterprise data warehouse.
Fig. 1: Working of CDC with ETL

Incremental loading is an approach in which the data warehouse is gradually changed in accordance with the changes that have occurred in the data sources since the last synchronization. Only the data that has changed since the previous load needs to be transformed and processed, so an incremental load runs more quickly and efficiently than a full reload. Fact tables benefit most from the incremental loading style since they typically contain a huge number of records. Dimension tables, on the other hand, are usually relatively small and can be populated in a full-reload manner. However, there are cases when dimension tables are also fairly large, both in terms of the number of records and the number of attributes. For such large dimensions, it is beneficial to construct the "delta view" – a view that contains the keys of records that need to be addressed in the ETL procedure. The delta view contains the keys of deleted, updated and inserted records of the relational source table, and that information is then used, in accordance with the dimension type, to update the dimension.
II. LITERATURE SURVEY
A considerable amount of work has been done on data warehouse architecture, and different techniques have been proposed to improve its performance. A traditional data warehouse has a static schema structure and cannot easily accommodate dynamic changes; because it is refreshed only periodically, it does not allow continuous integration of data. To resolve this integration problem, solutions have been proposed in which the data warehouse is integrated during execution with no effect on performance, minimizing the delay between a transaction in the source systems and the corresponding update of the decision-support database. These works focus on the loading processes and the use of a staging data area for efficient integration. The views that a data warehouse maintains and keeps updated are materialized views.
H. Gupta and I. S. Mumick [4] note that data warehouse incremental reloading is evidently related to the incremental maintenance of materialized views, as both areas cope with the same issue – how to update physically integrated data under a given time constraint. The core idea of incremental materialized view maintenance is to compute the new content from the materialized parent view (the outdated view that needs to be refreshed) and its incremental update, often referred to as the delta view.
Mitchell J Eccles, David J Evans and Anthony J Beaumont [2] model the change data capture process using a set of web services. Their model uses web services to facilitate a real-time data warehouse and introduces an architecture with a multi-level real-time data cache. Meanwhile, [14] models ETL for a real-time data warehouse using a scheduling algorithm to balance query and update threads, with trigger-based control on the ETL machine.
Xiaofang Li and Yingchi Mao [3] present a real-time data ETL framework to process historical data and real-time data. Combined with an external dynamic storage area, a dynamic mirror replication technology is proposed to avoid contention between OLAP queries and OLTP updates. Finally, experiments based on the TPC-H benchmark are set up to evaluate the performance of the proposed real-time data ETL framework.
Mayuri B. Bokade, Prof. S. S. Dhande and Prof. H. R. Vyavahare [15] propose a framework for change data capture based on the transaction log in a real-time data warehouse. Change data capture working with ETL successfully provides a new approach to move information into the data warehouse and also makes information integration more efficient in real time, i.e. it delivers information at the right time. They implement ETL and CDC using only Oracle.
J. Shi, Y. Bao and F. Leng [11] observe that various works in the area of data extraction for ETL tools have been published, especially about the extraction of only altered data, also known as Change Data Capture (CDC). With this method it is possible to obtain a significant performance gain due to the reduction of the mass of data to be processed.
C. R. Valencio, M. H. Marioto, G. F. Donega Zafalon, J. M. Machado and J. C. Momente [7] introduce a tool that enables the extraction of any subset of the information held in the database, allowing the use of both online and offline extraction techniques. The total overload time caused by this method is proportional to the portion of data captured online. Furthermore, the use of triggers causes structural changes only in the database being monitored and avoids overloads on other databases managed by the same DBMS.
III. EXISTING SYSTEM
A. Current Change Data Capture Approaches
There are several widely accepted approaches to accomplish CDC. This section discusses the strengths and weaknesses of each approach and explains why each one is incapable of delivering changes in real time.
1) Timestamp based Capture
Timestamp-based data capture is performed by adding a timestamp column (specifying the time and date when the observed row was last modified) to each table of interest. These timestamps provide the selection criteria for capturing changed records. Fig. 2 shows the timestamp approach to CDC. The need to change the structure of each table of interest by adding a timestamp column is the first disadvantage of this approach. Second, it is not possible to capture intermediate states of the data: if a record changed state more than once since the last ETL run, only the current (that is, the last) state of the record is available. Accordingly, when using this method, inserted records cannot be distinguished from updated ones.
An additional problem with this technique is the lack of a way to detect deleted records [2]. A possible solution is to label deleted records as inactive until they have been processed, but this leads to other problems in the operational systems: for example, all applications working with operational data have to treat inactive records as deleted. This method also causes frequent writes of timestamp attributes in the operational systems and frequent reads whenever the ETL process runs. Indexing the timestamps can speed up reading but, on the other hand, slows down INSERT and UPDATE operations.
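As an illustration, a minimal sketch of the selection step of timestamp-based capture is given below. The table, column and parameter names (customer, last_modified, :last_etl_run) are hypothetical and not taken from the paper; :last_etl_run stands for the watermark recorded at the end of the previous ETL cycle.

-- Timestamp-based capture sketch (hypothetical names).
SELECT customer_id, name, city_id, last_modified
FROM   customer
WHERE  last_modified > :last_etl_run;   -- only rows touched since the last ETL run
-- Note: inserts and updates are indistinguishable here, and deleted rows never
-- appear in the result, which is exactly the limitation discussed above.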
2) Log Scanning
This technique employs the logging and recovery capabilities of the DBMS. Since transaction logs are used by the DBMS to store transaction information for logging and recovery, these logs contain all the information required to capture changed data [1]. A specialized, vendor-specific application must be written to monitor the log files and capture the data of interest. Fig. 3 shows how log-based CDC works: as new data arrives in the database, the log is updated, and the CDC tool polls this log to work out the changes.
The major pitfall of this approach is that not every DBMS or data store is capable of producing a log [11]. Additionally, each DBMS that creates logs has its own log format, meaning a different algorithm is needed to parse the logs produced by different DBMSs; this issue can be circumvented by purchasing an off-the-shelf product capable of parsing different logs. A potential source of error in log-based CDC arises when logs are purged or archived offline. Logs can become large files in a short space of time, so they are either moved to offline storage or purged to keep the file size to a minimum. If purging or archiving takes place before the CDC scan, changes will inevitably be missed.
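For illustration only (this is not the method evaluated in this paper), PostgreSQL exposes its write-ahead log through logical decoding, which a CDC tool could poll roughly as sketched below; the slot name is hypothetical, and the server must run with wal_level = logical.

-- Create a logical replication slot once (PostgreSQL 9.4+, test_decoding plugin).
SELECT * FROM pg_create_logical_replication_slot('cdc_slot', 'test_decoding');

-- The CDC tool then polls the slot; each call returns and consumes the
-- INSERT/UPDATE/DELETE entries written to the WAL since the previous call.
SELECT lsn, xid, data
FROM   pg_logical_slot_get_changes('cdc_slot', NULL, NULL);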
Fig. 2. Timestamp CDC: new data is timestamped. The CDC tool polls the database and extracts records that have been added or modified since the last scan.
Fig. 3. Log-based CDC: as new data arrives in the database, the log is updated and the CDC tool polls the log to work out the changes.
3) The Cost of Polling
The CDC technologies described above all require some resource to be polled. When setting up a polling mechanism, a time interval is specified to govern how often the resource is polled. For CDC this means the poll interval is the determining factor in the timeliness of changes being delivered to targets. Moreover, polling increases the latency between detecting and delivering changes; this effect prevents a true real-time CDC solution. Additionally, the poll interval has to be set to a value that is respectful of the resource being polled, in order to avoid a denial of service. In log-based CDC, if the log is polled too frequently, there is a risk of locking the log file so that the DBMS cannot write to it, meaning the DBMS's resources are consumed to temporarily cache log data. In timestamp-based CDC, the operational database is polled directly, meaning the DBMS has to deal with the CDC workload as well as the normal transactional workload; highly frequent polling inevitably affects transactional performance.
IV. IMPLEMENTATION DETAIL
A. System Overview
1) System Architecture of CDC
In a zero-latency ETL tool, the CDC step is added to the extraction, since it must also be constantly active, storing altered data so that the other steps can process only that information instead of all the information stored in the source database. As this capture process happens in parallel with the use of the system, it is called online extraction, represented in Fig. 4.
The trigger-based approach mentioned in the above architecture is explained below.
This approach involves the use of triggers to track changes in the source database. Whenever a modification takes place in the source database, a corresponding log record is written into a dedicated log table. These dedicated log tables serve to determine the rows changed since the last ETL processing. The log tables usually contain a timestamp column that provides the exact time and date when a given row was last modified; the statements executed when the trigger fires keep this timestamp column accurate, i.e. they update it with the current time. This technique assumes that the DBMS supports active behaviour, i.e. triggers. A database trigger is a piece of user-written code that executes automatically when a given set of events has occurred [2].

Fig. 4. Model of an online extraction.
The set and granularity of the events and conditions that cause a trigger to fire are database specific. This approach enables capturing all types of events (INSERT, UPDATE, DELETE, MERGE). For updated rows, both the before and after images can be available. When changed data arrives, a trigger fires and sends the changed data to a cache area; the CDC tool then polls this cache to retrieve the changes. Using triggers means the database application code does not need to be modified, and all of the CDC work is done by the DBMS.
One of the problems of zero-latency ETL tools is the processing time, as each step depends on the previous steps. Therefore, if the extraction step takes a long time, the whole process is affected. To perform this task, it is necessary to create a structure that is capable of automatically mapping the alterations made to the data stored in the source repositories, making it possible to apply the transformation and loading steps only to the mapped data. Furthermore, the structure must be able to efficiently map only those changes that are of interest to the data warehouse, as not all alterations need to be integrated into the final repository.
All the structure needed to capture altered data is created automatically. With this structure in place, each command executed on one of the chosen tables activates the following flow of extraction actions within the DBMS:
• The SQL command of the user ends its execution;
• If the executed operation (insertion, update or deletion) is one of the operations that must be reproduced in the data warehouse, a trigger is activated; otherwise, the extraction does not happen;
• After the trigger is activated, a function associated with the table undergoing the operation is executed. This function stores the altered tuple's information and concludes the extraction.
Only the tuples that need to be migrated to the data warehouse are captured and identified, so it is possible to recover altered data in a more direct and precise manner than with the offline integration strategy. Another advantage of this structure is that both the user's SQL command and the extraction operation are kept within the same transaction, therefore guaranteeing that there is no loss of data in this step. Fig. 5 shows the flow of executions during an online extraction step.
2) Format of the Script Generated by the Tool
The proposed structure works with the PostgreSQL DBMS. For the storage of altered data, a new instance containing a single table is created in the target database. This table, called the log table, stores the identifiers of the altered tuples and has six columns:
• log id – an integer field containing a sequential number that records the execution order of the operations;
• instance – a text field that stores the name of the instance whose tuple was altered;
• table – a text field that stores the name of the table whose tuple was altered;
• operation – stores the type of operation that was executed: INSERT, UPDATE or DELETE;
• pk column – a text field that stores the name of the column used as the unique identifier in the table whose tuple was altered;
• tuple id – the value of the unique identifier of the altered tuple.
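A minimal sketch of such a log table in PostgreSQL is shown below; the schema and column names (cdc, log_table, table_name, etc.) are illustrative assumptions, not the exact output of the generation tool.

-- Illustrative definition of the log table described above.
CREATE SCHEMA IF NOT EXISTS cdc;
CREATE TABLE cdc.log_table (
    log_id     bigserial PRIMARY KEY,  -- sequential execution order of the operations
    instance   text      NOT NULL,     -- instance (schema) of the altered table
    table_name text      NOT NULL,     -- name of the altered table ("table" is a reserved word)
    operation  char(1)   NOT NULL,     -- 'I'nsert, 'U'pdate or 'D'elete
    pk_column  text      NOT NULL,     -- column used as the unique identifier
    tuple_id   text      NOT NULL      -- value of that identifier in the altered tuple
);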
Besides this new instance and table, a trigger and an extraction function are created for each table to be monitored. The triggers are created together with the monitored table and are responsible for correctly invoking the extraction functions, which, in turn, are created in the new instance generated by the script and are responsible for registering the data in the log table, shown in Table I.
TABLE I. LOG TABLE
LOG ID   ID     PK            TAB        INST     OPER
1        1500   customer_id   Customer   Person   U
2        1765   product_id    Product    Store    I
3        2399   customer_id   Customer   Person   D
4        3612   sales_id      Sale       Store    I
5        4792   sales_id      Sale       Store    I
6        8350   sales_id      Sale       Store    I
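The transformation and loading steps can then consume the captured changes in their original execution order; a minimal sketch (reusing the illustrative log table above) is:

-- Read captured changes in execution order for further processing (illustrative).
SELECT log_id, instance, table_name, operation, pk_column, tuple_id
FROM   cdc.log_table
ORDER  BY log_id;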
Fig.5. Execution flow along the extraction step.
B. System Architecture Of Slowly Changing Dimension (SCD)
1) Delta view
When we know which records have changed in the source system, we can use that information to incrementally update the fact and dimension tables in the data warehouse. Fact tables benefit most from the incremental loading style since they typically contain a huge number of records. Dimension tables, on the other hand, are usually relatively small and can be updated in a full-reload manner. However, there are cases when dimension tables are also fairly large, both in terms of the number of records and the number of attributes. Such dimensions are sometimes referred to as "monster dimensions", whose size can exceed 100 million rows.
Typically, these are customer or client dimensions in large enterprises. Such dimensions may take a long time to load, since they have a large record count as well as a considerable number of attributes that need to be calculated along the way (e.g. a student's Grade Point Average, rank, etc.). We believe that, for such large dimensions, it is beneficial to construct the "delta view" – a view that contains the keys of records that need to be addressed in the ETL procedure. Unlike for fact tables, the delta view records are deduced not only from the base source table of the dimension, but also from the many other tables used in the dimension's ETL process. In our real-world case, a change in any of these relational tables can impact the content of the dimension table.
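To make the SCD Type 2 handling concrete, a hedged SQL sketch follows. It assumes a hypothetical dim_student table with valid_from, valid_to and is_current columns, source tables student and city, and a delta_student view that yields the natural keys produced by the delta view construction described in this paper; none of these names come from the original text.

-- Step 1: expire the currently active dimension rows whose source records changed.
UPDATE dim_student d
SET    valid_to = CURRENT_TIMESTAMP,
       is_current = FALSE
WHERE  d.is_current
AND    d.student_id IN (SELECT student_id FROM delta_student);

-- Step 2: insert a new version for every changed or newly arrived source record.
-- (Records deleted at the source are only expired in step 1, since they no
--  longer appear in the source join below.)
INSERT INTO dim_student (student_id, name, city, gpa, valid_from, valid_to, is_current)
SELECT s.student_id, s.name, c.city_name, s.gpa,
       CURRENT_TIMESTAMP, NULL, TRUE
FROM   student s
JOIN   city    c ON c.city_id = s.city_id
WHERE  s.student_id IN (SELECT student_id FROM delta_student);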
C. Algorithm
1) Algorithm of CDC
CREATE TRIGGER cdc_table ON db.table AS
DECLARE @OPER, @INST, @TAB, @PK, @ID
SELECT @TAB  = table of the altered tuple
SELECT @PK   = primary key of @TAB
SELECT @INST = instance to which @TAB belongs
BEGIN
    IF (@OPER = DELETE)
        SELECT @ID = ID from the tuple before deletion;
    ELSE
        SELECT @ID = ID from the tuple after insertion or update;
    END IF
    INSERT INTO log table: @ID, @PK, @TAB, @INST, @OPER;
END;
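Since the prototype targets PostgreSQL, the generated trigger for a monitored table can be sketched in PL/pgSQL as follows. The schema, table and column names (person.customer, customer_id, cdc.log_table) are illustrative assumptions, not the exact code emitted by the tool.

-- Extraction function for a monitored table (illustrative names).
CREATE OR REPLACE FUNCTION cdc.log_customer() RETURNS trigger AS $$
DECLARE
    v_id text;
BEGIN
    IF TG_OP = 'DELETE' THEN
        v_id := OLD.customer_id::text;   -- identifier taken from the tuple before deletion
    ELSE
        v_id := NEW.customer_id::text;   -- identifier taken from the tuple after insert/update
    END IF;
    INSERT INTO cdc.log_table (instance, table_name, operation, pk_column, tuple_id)
    VALUES (TG_TABLE_SCHEMA, TG_TABLE_NAME, left(TG_OP, 1), 'customer_id', v_id);
    RETURN NULL;                         -- AFTER ROW trigger: the return value is ignored
END;
$$ LANGUAGE plpgsql;

-- Trigger created together with the monitored table; it fires inside the same
-- transaction as the user's SQL command, as required by the approach above.
CREATE TRIGGER cdc_customer
AFTER INSERT OR UPDATE OR DELETE ON person.customer
FOR EACH ROW EXECUTE PROCEDURE cdc.log_customer();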
D. Mathematical Model
1) Mathematical model of SCD
The delta view construction idea is, starting from the base table (e.g. Student), to recursively traverse the dependency graph and gather the primary keys of the related tables. The related tables are divided into two sets:
• REFD – the set of tables referenced by the current table (e.g. REFD(Student) = {City}, REFD(City) = {State});
• REFING – the set of tables referencing the current table (e.g. REFING(Student) = {YearEnroll, StudAcdmYear}, REFING(City) = {CityName, StudAcdmYear, Student}).
Finally, the delta view delta(t) for the dimension table t is given by:
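In relational-algebra terms, a plausible formulation of this expression, reconstructed from the description above (the exact notation in [1] may differ), is:

\[
\mathrm{delta}(t) \;=\; \pi_{pk(t)}(\Delta t)\;\cup\; \bigcup_{r \in REFD(t)} \pi_{pk(t)}\bigl(t \bowtie \mathrm{delta}(r)\bigr)\;\cup\; \bigcup_{r \in REFING(t)} \pi_{pk(t)}\bigl(t \bowtie \mathrm{delta}(r)\bigr)
\]

where \(\Delta t\) denotes the keys of rows of t that were themselves inserted, updated or deleted (taken from the CDC log), \(\pi_{pk(t)}\) is the projection onto the primary key of t, each join is over the corresponding foreign key, and delta(r) is computed recursively in the same way.

A hedged SQL sketch of such a view for the Student dimension is given below; the table and key names (student, city, stud_acdm_year, cdc.log_table) are hypothetical, and only one level of the graph is shown, the same pattern being applied recursively along the REFD/REFING edges.

-- Keys of Student records that must be (re)processed by the dimension ETL (illustrative).
CREATE OR REPLACE VIEW delta_student AS
SELECT s.student_id                          -- Student rows changed directly
FROM   student s
JOIN   cdc.log_table l
  ON   l.table_name = 'student' AND l.tuple_id = s.student_id::text
UNION
SELECT s.student_id                          -- changes in a referenced table: REFD(Student) = {City}
FROM   student s
JOIN   cdc.log_table l
  ON   l.table_name = 'city' AND l.tuple_id = s.city_id::text
UNION
SELECT y.student_id                          -- changes in a referencing table: StudAcdmYear in REFING(Student)
FROM   stud_acdm_year y
JOIN   cdc.log_table l
  ON   l.table_name = 'stud_acdm_year' AND l.tuple_id = y.stud_acdm_year_id::text;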
V. EXPECTED RESULT
A. Trigger based CDC for real time data results
The trigger-based CDC approach (the proposed CDC algorithm) should provide almost zero latency and operate on real-time data. It shows an increase in processing time when measured against the number of records inserted, deleted and updated over time. The basic nature of the performance, in terms of time in ms for a given set of tuples, is shown in Fig. 7. This approach gives a 5-13% increase in processing time, but compared against a full unload of the source, and given that it provides zero delay for real-time data, this overhead is worth accepting.
B. SCD (Delta View/SCD Type 2) results
The ETL target database tables should maintain the history of the records, including all expired records along with the active records. The proposed SCD Type 2 method should enhance the performance of the load of the ETL tables.
Overall, the CDC and SCD methods described above should work on NRTDWH data and should increase the performance of the ETL framework.
Fig. 7. Real-system test for trigger-based CDC.
VI. CONCLUSION
The proposed CDC work presented here fulfills its objective, which is to automatically create all the structure necessary to perform online extraction of all altered data within a determined time interval, in order to feed a zero-latency ETL tool. Results obtained from a test using fictitious data showed that the generated overload time was higher than expected, the acceptable rate reported in the literature being between 5% and 13%. Nevertheless, tests using a real system showed an acceptable average overload time of 13%.
Another important point is that the tool also enables the extraction of any subset of the information held in the database, enabling the use of both online and offline extraction techniques. Also, the use of triggers causes structural changes only in the database being monitored and avoids overloads on other databases managed by the same DBMS. In this paper we have also presented an algorithm for delta view construction that enables incremental loading of dimension tables. This method is suitable for large, complicated dimensions (both in terms of attribute and row count), because such dimensions can have a significant impact on the ETL process. We have shown that the delta view, yielding the keys of records that need to be incrementally (re)processed, can be constructed from the graph of data-flow dependencies, which can be fairly easily assembled.
Hence, using the proposed CDC method and the delta view SCD method, a NRTDWH can be achieved.
Acknowledgment
The authors would like to thank the researchers as well as the publishers for making their resources available, and our teachers for their guidance. We are thankful to the authorities of Savitribai Phule University of Pune and the concerned members of the cPGCON2016 conference for their constant guidance and support. We also thank the college authorities for providing the required infrastructure and support; it is only because of their support that we were able to prepare this paper.
References
[1] I. Mekterovic and L. Brkic, "Delta View Generation for Incremental Loading of Large Dimensions in a Data Warehouse," in MIPRO 2015.
[2] M. J. Eccles, D. J. Evans and A. J. Beaumont, "True Real-Time Change Data Capture With Web Service Database Encapsulation," in 2010 IEEE 6th World Congress on Services.
[3] X. Li and Y. Mao, "Real-Time Data ETL Framework for Big Real-Time Data Analysis," in 2015 IEEE International Conference on Information and Automation.
[4] H. Gupta and I. S. Mumick, "Incremental Maintenance of Aggregate and Outerjoin Expressions," Information Systems, pp. 435-464, 2006.
[5] W. Liang, H. Wang and M. Orlowska, "Materialized view selection under the maintenance time constraint," Data & Knowledge Engineering, vol. 37, pp. 203-216, 2001.
[6] R. Prabhu and D. Lyman, "Extracting delta for incremental data warehouse maintenance," in IEEE 16th International Conference on Data Engineering (ICDE'00), San Diego, CA, USA, 2000.
[7] C. R. Valencio, M. H. Marioto, G. F. Donega Zafalon, J. M. Machado and J. C. Momente, "Real Time Delta Extraction Based on Triggers to Support Data Warehousing," in International Conference on Parallel and Distributed Computing, Applications and Technologies, 2013.
[8] X. Zhang, W. Sun, W. Wang, Y. Feng and B. Shi, "Generating Incremental ETL Processes Automatically," in Proceedings of the First International Multi-Symposiums on Computer and Computational Sciences (IMSCCS'06), 2006.
[9] T. Jorg and S. Desloch, "Towards Generating ETL Processes for Incremental Loading," in IDEAS '08, Coimbra, Portugal, 2008.
[10] R. Rocha, F. Cardoso and M. Souza, "Performance tests in data warehousing ETLM process for detection of changes in data origin," in Proceedings of Data Warehousing and Knowledge Discovery (DaWaK 2003), LNCS 2737, 2003.
[11] J. Shi, Y. Bao, F. Leng and G. Yu, "Study on log-based change data capture and handling mechanism in real-time data warehouse," in CSSE '08: Proceedings of the 2008 International Conference on Computer Science and Software Engineering, Washington, DC, USA: IEEE Computer Society, 2008, pp. 478-481.
[12] J. Xu and Y. Pei, "Overview of data extraction, transaction and loading," Computer Science, vol. 38, no. 4, 2011.
[13] Lin Ziyu, Yang Dongqing and Song Guojie, "Study on change data capture in real-time data warehouse," Journal of Computer Research and Development, vol. 44, pp. 447-451, 2007.
[14] J. Song, Y. Bao and J. Shi, "A Triggering and Scheduling Approach for ETL," in 2010 IEEE 10th International Conference on Computer and Information Technology (CIT), 2010, pp. 91-98.
[15] M. B. Bokade, S. S. Dhande and H. R. Vyavahare, "Framework Of Change Data Capture And Real Time Data Warehouse," IJERT, vol. 2, issue 4, 2013.