DataStage - Data Integration


DataStage job hours estimate question

  • 1.  DataStage job hours estimate question

    Posted 11 days ago
    Edited by Jessica Long 11 days ago
    Hi there. I'm looking to get a rough estimate for a number of hours needed to do the following:

    Build a job that will run weekly to load incremental data updates into existing Hadoop tables (9 tables in total). The data source is a secure FTP location (an existing connection via VPN). ETL steps:

    1. retrieve a single archived, pipe-separated flat file containing multiple record types
    2. extract the rows for each target table - the first column holds a 3-character identifier for the table; all other columns in that row are the actual data to be loaded
    3. change format for date/datetime fields
    4. load data into HDFS (Hadoop)

    Please provide the approximate number of hours you think would be needed to build an enterprise-grade solution - build in the development environment, test, document, deploy to production, test again

    Thank you in advance!


  • 2.  RE: DataStage job hours estimate question

    Posted 11 days ago
    Oleksiy,

    There's not enough information to provide a precise estimate, but based on what you have shared, the logic to fulfill the technical requirements is very straightforward. Building the actual flow logic could be accomplished in a couple of minutes; here's one design option:
    • Complex Flat File (CFF) stage as source; Pipe-delimited, 9 output links, one for each target, use Record ID to determine which output link
    • 9 Transformer stages, one for each CFF output link; Date/DateTime transformations
    • 9 File Connector stages, one for each Transformer output link (table); define HDFS properties
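    Outside of DataStage, the split-and-transform logic this design describes can be sketched in a few lines of Python. The record IDs, column positions, and date formats below are hypothetical placeholders; the real values would come from the file specification for the nine target tables:

```python
import csv
from datetime import datetime

# Hypothetical 3-character record-type identifiers (one per target table)
TARGETS = {"CUS", "ORD", "INV"}
# Hypothetical positions (after the identifier) that hold date values
DATE_COLUMNS = {0}

def reformat_date(value, in_fmt="%m/%d/%Y", out_fmt="%Y-%m-%d"):
    """Convert a date string to the format expected by the Hadoop tables."""
    return datetime.strptime(value, in_fmt).strftime(out_fmt)

def split_rows(lines):
    """Route each pipe-delimited row to its target table by its first column,
    reformatting date fields along the way; unknown record types are dropped."""
    outputs = {t: [] for t in TARGETS}
    for row in csv.reader(lines, delimiter="|"):
        record_id, data = row[0], row[1:]
        if record_id in outputs:
            outputs[record_id].append(
                [reformat_date(v) if i in DATE_COLUMNS else v
                 for i, v in enumerate(data)]
            )
    return outputs

sample = ["CUS|01/15/2024|Acme", "ORD|02/01/2024|A-100", "XXX|junk|row"]
result = split_rows(sample)
```

    In the DataStage design, the `record_id` routing corresponds to the CFF output links, the date reformatting to the Transformer stages, and the per-table outputs to the File Connector stages writing to HDFS.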

    These other factors could affect the timeline, but some of these are issues that only need to be addressed once:
    • Developer skill level
    • Unknown or unusual metadata, data type translations, handling nulls
    • Additional complex transformations
    • Parent/Child relationship requirements
    • Establishing Hadoop connectivity
    • Coordination with enterprise scheduling team
    • Complicated migration logistics

    Anybody else have other design options or timeline factors?

    ------------------------------
    Brian Sprecher
    ------------------------------



  • 3.  RE: DataStage job hours estimate question

    Posted 10 days ago
    Edited by Oleksiy Nazarenko 10 days ago

    Brian,

    Thank you very much for your answer! Especially for listing some other factors that should be considered and that could severely impact development time. The hours estimate for this activity provided by our IT department is ridiculously high, so I decided to post my question here to get experts' opinions and validate my suspicions.

    Now, back to the factors you've listed: there is a dedicated DataStage resource (expert skill level, I'd assume). There weren't any issues with metadata, types, or the data itself, and there is no need for any other transformations or relationship/table load ordering (I had to load the historical data manually). The connection to Hadoop should already be established, as it is our Data Lake. Everything else could be treated as an exception and would result in additional time added to the original estimate.

    Is it fair to say that a 30-hour estimate would give plenty of time to have this built, tested, and documented?





  • 4.  RE: DataStage job hours estimate question

    Posted 10 days ago
    Edited by Brian Sprecher 10 days ago
    Oleksiy,

    An estimate from me would be unfair to your IT department, because I don't have full disclosure of all the variables, nor would I have accountability for my answer.

    Would the IT department be willing to share a project timeline breakdown by task? If they don't have that level of detail, you could ask them how long it would take to build, test, document, and implement the simplest DataStage flow they can imagine. Then have them estimate the most complex, then somewhere in-between.

    ------------------------------
    Brian Sprecher
    ------------------------------



  • 5.  RE: DataStage job hours estimate question

    Posted 10 days ago
    Edited by Oleksiy Nazarenko 10 days ago
    Brian,

    Thanks again. I understand that no one would be able or willing to provide an estimate for a specific project without knowing all its aspects. I was expecting to get a generic Rough Order of Magnitude (ROM) estimate from people who work with DataStage, given the technical requirements above - something like "it would take me X to Y hours". At this point I don't want to ask for a task breakdown or anything else.

    I'm going to wait a few days and see if anyone else is willing to share an opinion or provide an estimate.
    FYI - we were told that this DataStage flow will take 300 hours to develop.




  • 6.  RE: DataStage job hours estimate question

    Posted 7 days ago
    Maybe these guidelines could help:

    Introduction

    This HowTo attempts to define a method for estimating ETL job development for DataStage jobs by calculating job complexity based on the job pattern and specific challenges within the job.

    This estimation model covers the developer only: the design and development of the job, unit testing to ensure the job is in a running state and fulfilling the specification, and defect-fix support for the job through the other testing phases.

    Job Complexity Estimation

    Identify your set of ETL job patterns. Large data migration projects tend to have a common, repetitive set of job patterns.

    A set of job patterns defines a base time for building each type of job. A set of general overheads lists some stage types that may add more time to the base time. A set of special overheads can apply to specific job patterns.

    General Job Complexity Overheads

    These are stages and actions that can add overhead (in days) to the complexity and time to develop any job:

    • 0.25 for key lookup, join or merge
    • 0.5 for range lookup
    • 0.5 for change data capture
    • 0.5 for field level validation (checking nulls, dates, numbers)
    • 0.25 for modify stage
    • 0.25 for undocumented source file definition (sequential, xml)

    These stages have no real overhead, taking little time to load and configure; they are part of the base estimate for each job pattern: sort, aggregation, filter, copy, transformer, database, dataset, sequential file.

    DataStage Job Patterns

    These are the common ETL job patterns. What follows is the name of the job, the number of base days to develop the job and any specific overheads for that job pattern. Each job is made up of three parts: the base job time, the general overheads from above and the specific overheads for that job pattern. Add these together for a final estimate.

    The base time is the time to create the job, put all the stages together, import the metadata and do unit testing to get it to a running state.

    Relational Table Prepare Job Prepares a data source for a load into a relational database table.

    • 1 day base
    • 0.25 for rejection of rows with missing parents.
    • 0.5 for output of augmentation requests for missing parents.
    • 0.5 for any type of hierarchy validation.

    Table Append, Load, Insert, Update or Delete Loads data into a relational table via insert, update or delete.

    • 0.5 day base
    • 0.25 for a bulk load.
    • 0.25 for user-defined SQL.
    • 0.25 for before-sql or after-sql requirements.
    • 0.5 rollback table after failed load.
    • 1 restart from last row after failed load.

    Dimension Prepare Job Takes a source database SQL query or a source flat file and creates staging files with transformed data. These jobs typically involve extract, lookup, validation, changed data capture and transformation. They take about the same amount of time to build as relational table loads but need extra unit testing time due to the increased combinations of changed-data scenarios.

    • 1 day base
    • 0.5 type I unit testing
    • 0.5 type II unit testing

    Fact Prepare Job Loads fact data. The validation against dimensions is a general lookup overhead. Fact jobs tend to be the most complex jobs having a lot of source tables and validating against multiple dimensions. They also tend to have the most complex functions.

    • 3 days base
    • 0.25 per source table (adds to SQL select, lookup and change capture complexity)
    • 0.25 for calculations (this is a custom setting)

    End to End Load A job that extracts data, prepares it and loads it all in one is a combination of the estimates from above. Just merge the times for the prepare and load jobs into one.

    The fact job has a high base estimate and attracts a lot of general overheads such as lookups and joins.

    Skill Weighting

    The skill weighting alters the estimated time based on the experience and confidence of the developer. For lack of an alternative, skill is defined as the number of years of experience with the tool. An experienced consultant has the base weighting of 1 (no effect on estimates), with less experienced staff attracting more time.

    • 4+ years = 1
    • 3-4 years = 1.25
    • 1-2 years = 1.5
    • 6 months to 1 year = 1.75
    • Up to 6 months = 2
    • Novice = 3

    The total number of days estimated for each job is multiplied by the skill weighting to get a final estimate.

    Data Volume Weighting

    Very high volumes of data can take longer to develop: more time is spent making the job as efficient as possible. Unit testing of large volumes takes more time. Optimisation testing takes time.

    • Low volume = 1
    • Medium volume = 1.25
    • High volume = 1.5
    • Very high volume = 2

    Examples

    I extract a list of invoices from a system and load it to some relational data store tables. An invoice is made up of invoice header records and invoice item records (an invoice can have many items on it).

    • Extract invoice header to flat file: 1 day
    • Extract invoice item to flat file: 1 day
    • Prepare invoice header file: 1 (base) + 1 (four joins) + 0.5 (change data capture) + 0.5 (validate fields) = 3 days.
    • Load invoice header data: 1 day.
    • Prepare invoice item file: 1 (base) + 0.5 (two joins) + 0.25 (reject missing header) + 0.5 (change data capture) = 2.25 days.
    • Load invoice item data: 1 (base) + 0.5 (before-sql disable constraints, bulk load, after-sql enable constraints) = 1.5.

    An expert (4+ years) would complete all jobs in 9.75 days. A novice would take 29.25 days with the x3 weighting.
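    As a rough illustration, the model and the invoice example above can be reproduced with a small estimator sketch. The job list and overhead tuples below mirror the worked example; the function names are my own, not part of any DataStage tooling:

```python
# Skill weightings from the scale above (a subset, for brevity)
SKILL_WEIGHT = {"expert": 1.0, "novice": 3.0}

def estimate_job(base, overheads=()):
    """One job's estimate: base days plus its general and specific overheads."""
    return base + sum(overheads)

def estimate_project(jobs, skill="expert", volume_weight=1.0):
    """Total days across all jobs, scaled by skill and data-volume weighting."""
    return sum(jobs) * SKILL_WEIGHT[skill] * volume_weight

jobs = [
    estimate_job(1),                         # extract invoice header
    estimate_job(1),                         # extract invoice item
    estimate_job(1, (4 * 0.25, 0.5, 0.5)),   # prepare header: joins, CDC, validation
    estimate_job(1),                         # load invoice header
    estimate_job(1, (2 * 0.25, 0.25, 0.5)),  # prepare item: joins, reject, CDC
    estimate_job(1, (0.5,)),                 # load item: before/after-sql, bulk load
]

expert_days = estimate_project(jobs)             # 9.75 days
novice_days = estimate_project(jobs, "novice")   # 29.25 days
```

    Guidelines like this could be dropped into a modelling spreadsheet, with one row per job and the weightings applied at the bottom.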

    Conclusion

    These are just guidelines. I'm happy for people to jump in and add their own estimates for difficult stages or tasks in a job. These guidelines could then go into a modelling spreadsheet that estimates all the ETL jobs on a project.

    This does not take into account the complexity of business rules (having to write a lot of custom transformation code or custom stages), nor a large number of columns in change-capture and validation jobs.



    ------------------------------
    VICTOR JIMENEZ SANCHEZ
    DataStage Consultant
    IBM
    ------------------------------