Wednesday, March 3, 2010

Informatica vs DataStage


Note: The following is based on differences between Informatica 8.6.1 and IBM DataStage 8.0.1. Informatica's industry term for a source-target pipeline is a mapping; IBM DataStage's is a job.

Pipeline Partitioning –

IBM's thrust is DataStage's pipeline partitioning. With IBM DataStage, data can be segmented across multiple partitions, processed, and then re-collected. DataStage lets you control partitioning in the job design based on the logic of the processing instead of defaulting the whole pipeline flow to one partition type, and it offers seven different multi-processing partitioning methods.

Informatica offers dynamic partitioning, which is applied by default at the workflow/session level rather than at every stage/object level within a mapping/job. Informatica offers other partitioning choices at the workflow level as well.
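To make the partitioning idea concrete (this is not either vendor's actual engine, just a minimal Python sketch under assumed, made-up field names), here is what hash and round-robin partitioning followed by a re-collect step look like conceptually:

from collections import defaultdict

def hash_partition(rows, key, num_partitions):
    """Route each row to a partition based on the hash of its key column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

def round_robin_partition(rows, num_partitions):
    """Distribute rows evenly across partitions regardless of content."""
    partitions = defaultdict(list)
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions

def collect(partitions):
    """Re-collect the processed partitions back into a single stream."""
    return [row for p in sorted(partitions) for row in partitions[p]]

if __name__ == "__main__":
    rows = [{"customer_id": i, "amount": i * 10} for i in range(8)]
    parts = hash_partition(rows, "customer_id", num_partitions=4)
    # Each partition could now be processed on its own process/CPU before collect().
    print(collect(parts))

The point of the sketch is simply that the choice of method (hash keeps related keys together, round-robin balances load) can differ stage by stage, which is what DataStage exposes in the job design.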

Designer/Monitoring GUIs –

Informatica offers access to the development and monitoring effort through its 4 client GUIs - PowerCenter Designer, Repository Manager, Workflow Manager, and Workflow Monitor.

IBM DataStage caters to development and monitoring of its jobs through 3 GUIs - the DataStage Designer (for development), the Job Sequence Designer (for workflow design), and the Director (for monitoring).

Version Control –

Informatica offers built-in version control through its repository server, managed with the "Repository Manager" GUI console. A work-in-progress mapping cannot be opened by others until it is saved and checked back into the repository.

Version control was offered as a component up through Ascential DataStage 7.5.x. Ascential was then acquired by IBM, and when DataStage was integrated into IBM Information Server at version 8.0.1, support for version control as a component was discontinued.

Repository based flow –

Informatica offers a step-by-step approach to creating a data integration solution. Each object created while mapping a source to a target gets saved into the repository project folder, categorized as Sources, Targets, Transformations, Mappings, Mapplets, User-defined Functions, Business Components, Cubes, and Dimensions. Each object can be shared and dropped into a mapping across cross-functional development teams, thus increasing re-usability. Projects are folder based and viewable across one another.

IBM DataStage offers a project-based integration solution; projects are not viewable across one another, and every project needs role-based access. The step-by-step effort of mapping a source to a target becomes a job. To share objects across jobs, separate objects called containers (local or shared) need to be created.

Creating a source-target pipeline –

Within Informatica's PowerCenter Designer, first a source definition is created using the "Source Analyzer", which imports the metadata; then a target definition is created using the "Target Designer"; then a transformation is created using the "Transformation Developer"; and finally the source-transformation-target flow is wired together using the "Mapping Designer".

IBM lets you drag and drop functionality, i.e. stages, onto a single canvas area to build a source-target job. Within the "DataStage Designer", both source and target metadata must be imported, after which the job is built from the variety of stages offered - database stages, transformation stages, etc.

The biggest difference between the two vendors in this area is that Informatica forces you to be organized through a step-by-step design process, while IBM DataStage leaves the organization to you and gives you the flexibility to drag and drop objects based on the logic flow.

Code Generation and Compilation –

Informatica's thrust is auto-generated code. A mapping is created by dropping in a source, transformations, and a target, and it does not need to be compiled.

IBM DataStage requires a job to be compiled in order to run successfully. Changing business requirements affect change-control management of IBM DataStage jobs, since re-compilation is required for every change.

Reusability –

Informatica offers ease of re-usability through Mapplets and Worklets for re-using mappings and workflows.

IBM DataStage offers re-usability of a job through containers (local & shared). To re-use a Job Sequence (workflow), you need to make a copy, compile, and run.

Change Data Capture (CDC) –

Informatica offers CDC through a separate edition, the Real-Time Edition. In IBM DataStage, CDC is a drag-and-drop stage within the Designer.
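Neither vendor's CDC internals are shown here; as a rough, hedged illustration of the idea only, the Python sketch below derives inserts, updates, and deletes by comparing two snapshots of a hypothetical customer table keyed on customer_id (real CDC products typically read database logs rather than diffing snapshots):

def capture_changes(previous, current, key="customer_id"):
    """Compare two snapshots keyed on `key` and classify the differences."""
    prev = {row[key]: row for row in previous}
    curr = {row[key]: row for row in current}
    inserts = [curr[k] for k in curr.keys() - prev.keys()]
    deletes = [prev[k] for k in prev.keys() - curr.keys()]
    updates = [curr[k] for k in curr.keys() & prev.keys() if curr[k] != prev[k]]
    return inserts, updates, deletes

if __name__ == "__main__":
    yesterday = [{"customer_id": 1, "city": "Austin"}, {"customer_id": 2, "city": "Dallas"}]
    today     = [{"customer_id": 1, "city": "Plano"},  {"customer_id": 3, "city": "Waco"}]
    # Expect: insert for id 3, update for id 1, delete for id 2.
    print(capture_changes(yesterday, today))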

Data Encryption/Masking -
Data masking or encryption needs to be done before the data reaches the IBM DataStage server. Informatica has an offering within PowerCenter Designer as a separate transformation called the "Data Masking Transformation".
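As a hedged sketch of what a masking transformation does conceptually (not Informatica's actual algorithm; the column names are hypothetical), here is a small Python example that partially masks an SSN-style column and replaces the local part of an email with a deterministic hash so joins on the masked value still work:

import hashlib

def mask_ssn(ssn):
    """Keep the last four digits, mask the rest (e.g. 123-45-6789 -> XXX-XX-6789)."""
    return "XXX-XX-" + ssn[-4:]

def mask_email(email):
    """Replace the local part with a short deterministic hash, keep the domain."""
    local, domain = email.split("@", 1)
    digest = hashlib.sha256(local.encode()).hexdigest()[:8]
    return f"{digest}@{domain}"

row = {"ssn": "123-45-6789", "email": "jane.doe@example.com"}
masked = {"ssn": mask_ssn(row["ssn"]), "email": mask_email(row["email"])}
print(masked)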

Variety of Transformations –

Informatica offers about 30 general transformations for processing incoming data.

IBM DataStage offers about 40 data transformation stages/objects.

Impact Analysis –

Informatica offers a separate edition, the Advanced Edition, that helps with data lineage and impact analysis.

IBM DataStage offers this through the Designer: right-clicking on a job lets you perform dependency or impact analysis.

Real-Time Integration –

IBM DataStage, within the Designer, offers out-of-the-box real-time solutions for WISD, XML, Web Services, WebSphere MQ, and Java-based services.

Informatica offers SOA/real-time integration through its Real-Time Edition.

Monitoring –

The Informatica Workflow Monitor offers different levels of run-statistics information. Tracing is offered at 4 levels - Normal, Terse, Verbose Initialization, and Verbose Data. These tracing levels control the degree of information captured about source/target rows, caching, and transformation statistics for each mapping.

IBM DataStage offers operational statistics from the DataStage Director. The start, elapsed, and end times can be viewed within the Director GUI, and row statistics can be obtained for every processing stage/object through the monitor option within the Director.

Friday, February 12, 2010

New buzz for the industry.. MDM solutions.. So what is it?

We have had so many MDM-like terms before: master file, master data, customer master file, product master file, etc. So what's new with MDM's inception? Is it different from before?

Yes, MDM is a new technique around the maintenance of the old customer/product master file. Electronic data processing (EDP) has evolved from the 1950s to the 2000s, and it was about time to change the maintenance procedure of the master file; hence the buzzword "Master Data Management". Companies have invested billions of dollars in creating MDM solutions. The following companies offer MDM solutions as a product -

1. Oracle Siebel UCM/Oracle CDH
2. IBM MDM through InfoSphere
3. SAP MDM
4. Informatica Siperian (Informatica acquired Siperian in January 2010)
5. SAS DataFlux
6. Initiate Systems
7. D&B Purisma
8. Tibco
9. VisionWare
10. Sun
11. Talend MDM (open source MDM)

MDM is a concept for keeping all packaged, enterprise, and analytical applications up to date with a single view of key entities, mainly Customer and Product. Beware: MDM is a concept, not a framework. An MDM solution can be built using home-grown techniques or through MDM product offerings.

So what is MDM anyways?
MDM is a technique for generating a "golden record" of an organization's essential entities such as Customer, Product, Account, etc. Some people argue that a data warehouse already contains a "single version of the truth" for these entities. Well said, but the truth of the matter is that data warehouses are flattened out from third normal form, whereas an MDM store is still in third normal form and can be leveraged by transactional systems. A transactional record gets updated frequently and does not show up in the analytical/data warehouse layer immediately; MDM data stores get updated immediately through real-time processing, which helps the business gain confidence in making a decision. MDM relies on data cleansing, matching, de-duplication, and survivorship techniques to uniquely identify the "golden record", along with data governance procedures and the roles and responsibilities of data stewards.
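As a rough, hedged illustration of matching and survivorship (not any vendor's algorithm; every field name and rule here is made up for the example), the Python sketch below clusters customer records on a naive match key and then survives the most recently updated non-empty value per field to build a golden record:

from collections import defaultdict

def match_key(rec):
    """Naive match rule: same normalized last name + zip code."""
    return (rec["last_name"].strip().lower(), rec["zip"])

def survive(cluster):
    """Survivorship: prefer the most recently updated non-empty value per field."""
    golden = {}
    for rec in sorted(cluster, key=lambda r: r["updated"]):
        for field, value in rec.items():
            if value:  # later, non-empty values win
                golden[field] = value
    return golden

def build_golden_records(records):
    clusters = defaultdict(list)
    for rec in records:
        clusters[match_key(rec)].append(rec)
    return [survive(cluster) for cluster in clusters.values()]

records = [
    {"last_name": "Doe",  "zip": "75001", "phone": "",         "updated": "2010-01-02"},
    {"last_name": "doe ", "zip": "75001", "phone": "555-1234", "updated": "2010-02-01"},
]
print(build_golden_records(records))

Real MDM products use far more sophisticated fuzzy matching and configurable survivorship rules; the sketch only shows why cleansing, matching, and survivorship are treated as distinct steps.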

For the year 2010, Gartner predicts that 70% of Global 2000 companies will be MDM-ready, and Forrester projects $6 billion in spending on MDM implementations by 2010.

Thursday, February 11, 2010

ELT - Extract, Load and Transform. When to use it?

We heard so much about "ELT" around 2005-06, and now the buzzword has faded. So what exactly is ELT? Why can't I traditionally use ETL? Well, ELT can do pretty much the same transformations as ETL; it is for you to decide why, when, where, and how to do the transformations.

Mostly, if I/O is a concern, then ELT is your answer. ELT does not create memory/process/swap overhead on the hosting server since all calculations are done in the database. Talend's offerings for ELT-based processing are a must-watch. Also, ELT is mostly relied on for 3rd-normal-form to 3rd-normal-form data loading. If you have an enterprise data warehouse with batch/real-time loading, then ELT is not your choice; the answer is always a yes to using ETL for data warehouse loading.

Talend offers open-source ETL and ELT. An interesting feature of Talend is that you can use both ELT and ETL objects in the mapping you are designing. Talend also lets you choose between Java and Perl as the generated source-code type, with just a click before creating a mapping.

A typical ELT mapping/job contains a source query object (E), then the creation of a staging table followed by loading it (L), and finally a transformation (T) such as an aggregation before loading into the target table.
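A minimal Python/SQL sketch of that E-L-T order is below; it uses sqlite3 purely so it runs anywhere, and the table and column names are invented for the example. The key point is that the transformation is pushed down to the database engine as SQL rather than executed in the tool's own memory:

import sqlite3

conn = sqlite3.connect(":memory:")

# E: "extract" is just a query against the source table (stubbed here with sample rows).
conn.execute("CREATE TABLE src_orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO src_orders VALUES (?, ?)",
                 [(1, 100.0), (1, 50.0), (2, 75.0)])

# L: load the raw rows into a staging table inside the target database.
conn.execute("CREATE TABLE stg_orders AS SELECT * FROM src_orders")

# T: transform inside the database engine (aggregation pushed down as SQL),
#    writing the result into the target table.
conn.execute("""
    CREATE TABLE tgt_customer_totals AS
    SELECT customer_id, SUM(amount) AS total_amount
    FROM stg_orders
    GROUP BY customer_id
""")

print(conn.execute("SELECT * FROM tgt_customer_totals").fetchall())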

The old-school method of doing ELT was a stored procedure that created temp tables, loaded them, and then performed calculations before loading the target tables. InfoPump and Microsoft DTS were other old-school ways of doing ELT.

With hardware costs falling, ELT seems like an almost obsolete concept now.

Disclaimer: The above are my personal opinions based on my observation and do not represent my employer's view in any way.

Competitive price offerings from expressor-software

Booyah! I watched expressor-software's deep-dive webinar today, hosted by two of their highly knowledgeable staff: the Director of Field Engineering (DFE) and the Marketing Director (MD). After the webinar I had tons of questions about the framework I would perhaps need to design for a large EDW if I used expressor-software.
The DFE spoke about scalability, stress testing, the semantic repository, and operational metadata; I feel it was an amazing kick-off. The following features fascinated me right away -
1. Semantic repository based on a rules engine. What that means is a data steward who creates business rules can log them into the semantic engine, and a developer can pick them up during the development phase. Amazing, isn't it?
2. Re-usability - On financial and banking projects we often hear developers' frustrations: "Do I have to recreate the business rule? Phew! The tool could have offered something like a shared container (IBM DataStage term) or a mapplet (Informatica lingo)." expressor-software's thrust is business-rule re-usability.
3. Pipeline Partitioning - Ascential introduced pipeline partitioning into DataStage at version 5 in the early 2000s, and Informatica introduced pipeline partitioning later in its offerings. For a startup tool, it is great to have this in its 2nd version already! Good visioning, expressor-software!
4. Pricing Model - Channel-based pricing. In their pricing model, one channel means sequential processing; for multi-processing, multiple channels need to be bought. An excellent price chart starting with 6-month, 1-year, and 2-year subscriptions. It feels like you are at a Redbox renting a tool.

What I really look forward to from expressor-software is -
1. Although every tool should and can be configured with an enterprise scheduler in this era of fast-paced data integration, every tool needs to have its own workflow designer. I'm waiting for expressor-software to come out with its own workflow-based designer, so developers can call it a fully packaged enterprise tool and operators can receive notifications at wee and fiesta hours. :)

Disclaimer: The above are my personal opinions based on my observation and do not represent my employer's view in any way.

Data integration market getting more crowded

Too many offerings.. too many vendors.. small, big, financial viability.. confused? With open-source offerings in ETL, ELT, MDM, etc., the market seems more crowded than ever before. The following are the vendor offerings out in the stiff market of the data integration space.
Top picks out in the market -
1. Informatica
2. IBM DataStage
3. Ab Initio
4. Oracle
5. SAP BusinessObjects Data Integrator
6. Microsoft SSIS

Top Challengers -
1. SAS/DataFlux
2. iWay Software
3. Talend
4. Syncsort
5. Pervasive Software
6. Pitney Bowes Insight

Competitive price vendors -
1. Informatica Cloud edition
2. expressor-software (pricing based on channels, i.e. multi-processing)

Open Source vendors -
1. Talend
2. Apatar
3. Pentaho Kettle
4. SnapLogic

Cloud/SaaS Based vendors -
1. Informatica
2. SnapLogic
Both were started by entrepreneur Gaurav Dhillon.

Top Pipeline Partitioning vendors -
1. IBM DataStage (process based)
2. Informatica (multi thread based)
3. expressor-software (hybrid based)

Top Message oriented/Real-time processing vendors -
1. IBM DataStage
2. Informatica

Best salesforce.com Integration vendors -
1. Informatica (both Cloud edition based and adaptor based support)
2. IBM DataStage (Adaptor/Pack based support)


Top ELT architecture based vendors -
1. Talend (excellent ELT based objects to drag and drop in the designer)
2. IBM DataStage (provides options to create tables before loading)

Disclaimer: The above are my personal opinions based on my observation and do not represent my employer's view in any way.