Brian,

Thank you very much for your answer, and especially for listing some other factors that should be considered and that could severely impact development time. The number of hours the IT department estimated for this activity is ridiculously high, so I decided to post my question here to get expert opinions and validate my suspicions.

Now, back to the factors you've listed: there is a dedicated DataStage resource (an expert skill level, I'd assume). There weren't any issues with metadata, types, or the data itself, and there is no need for any other transformations or a particular relationship/table load order (I had to load the historical data manually). The connection to Hadoop should already be established, as it is our Data Lake. All the other factors can be treated as exceptions and will result in additional time added to the original hours estimate.

Is it fair to say that a 30-hour estimate would give plenty of time to have this built, tested, and documented?
Original Message:
Sent: Fri April 30, 2021 09:44 AM
From: Brian Sprecher
Subject: DataStage job hours estimate question

Oleksiy,

There's not enough information to provide a precise estimate, but based on what you have shared, the logic to fulfill the technical requirements is very straightforward. Building the actual flow logic could be accomplished in a couple of minutes; here's one design option:
This HowTo attempts to define a method for estimating ETL job development time for DataStage jobs by calculating job complexity from the job pattern and the specific challenges within the job.
This estimation model covers the developer only: the design and development of the job, unit testing to ensure the job is in a running state and fulfilling the specification, and defect-fix support for the job through the other testing phases.
Identify your set of ETL job patterns. Large data migration projects tend to have common, repetitive job patterns.
A set of job patterns defines a base time for building each type of job. A set of general overheads lists some stage types that may add more time to the base time. A set of special overheads can apply to specific job patterns.
These are the stages and actions that can add overhead to the complexity and development time of any job:
These stages have no real overhead, taking little time to load and configure, and are part of the base estimate for each job pattern: sort, aggregation, filter, copy, transformer, database, dataset, sequential file.
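As a minimal sketch in Python, the general overheads could be held as a simple lookup of extra days per stage or action, with the base-only stages contributing nothing. All the day values (and the slowly changing dimension entry) are assumptions for illustration, not figures from this HowTo; lookups and joins are named later in the text as typical general overheads.

# General overheads: extra days added to any job pattern that uses these
# stages or actions. All day values are assumed placeholders.
GENERAL_OVERHEAD_DAYS = {
    "lookup": 0.25,
    "join": 0.25,
    "slowly_changing_dimension": 0.5,   # assumed example
}

# Stages that are part of the base estimate and carry no extra overhead.
NO_OVERHEAD_STAGES = {
    "sort", "aggregation", "filter", "copy",
    "transformer", "database", "dataset", "sequential_file",
}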
These are the common ETL job patterns. What follows is the name of the job, the number of base days to develop it and any specific overheads for that job pattern. Each job estimate is made up of three parts: the base job time, the general overheads from above and the specific overheads for that job pattern. Add these together for a final estimate (a sketch of this calculation follows the pattern list below).
The base time is the time to create the job, put all the stages together, import the metadata and do unit testing to get it to a running state.
Relational Table Prepare Job: Prepares a data source for a load into a relational database table.
Table Append, Load, Insert, Update or Delete: Loads data into a relational table via insert, update or delete.
Dimension Prepare Job: Takes a source database SQL or a source flat file and creates staging files with transformed data. These jobs typically involve extract, lookup, validation, changed data capture and transformation. They take the same amount of time to build as relational table loads but need extra unit testing time due to the increased combinations of changed data scenarios.
Fact Prepare Job: Loads fact data. The validation against dimensions is a general lookup overhead. Fact jobs tend to be the most complex jobs, having a lot of source tables and validating against multiple dimensions. They also tend to have the most complex functions.
End to End Load: A job that extracts data, prepares it and loads it all in one is a combination of the estimates from above; just merge the times for the prepare and load jobs into one.
The fact job has a high base estimate and attracts a lot of general overheads such as lookups and joins.
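Here is a sketch, continuing the Python tables above, of how the per-pattern base times, the pattern-specific overheads and the three-part sum might be wired together. Every day value is an assumption for illustration only, not a figure from this HowTo.

# Base days per job pattern: create the job, put the stages together,
# import the metadata and unit test it to a running state. Values assumed.
BASE_DAYS = {
    "relational_table_prepare": 1.0,
    "table_load": 0.5,
    "dimension_prepare": 1.5,
    "fact_prepare": 2.0,
}

# Pattern-specific overheads, e.g. extra changed-data unit testing for
# dimension prepare jobs. Values assumed.
SPECIAL_OVERHEAD_DAYS = {
    "dimension_prepare": {"changed_data_test_scenarios": 0.5},
    "fact_prepare": {"multi_dimension_validation": 0.5},
}

def job_estimate_days(pattern, general=(), special=()):
    """Base time + general overheads + pattern-specific overheads."""
    days = BASE_DAYS[pattern]
    days += sum(GENERAL_OVERHEAD_DAYS[name] for name in general)
    days += sum(SPECIAL_OVERHEAD_DAYS[pattern][name] for name in special)
    return days

# Hypothetical fact prepare job: two lookups, one join, validated
# against multiple dimensions.
fact_days = job_estimate_days(
    "fact_prepare",
    general=("lookup", "lookup", "join"),
    special=("multi_dimension_validation",),
)
print(fact_days)  # 2.0 + 0.75 + 0.5 = 3.25 days before skill weighting

An end to end load, as described above, would simply sum the prepare and load pattern estimates, e.g. job_estimate_days("dimension_prepare") + job_estimate_days("table_load").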
The skill weighting alters the estimated time based on the experience and confidence of the developer. For lack of an alternative, skill is defined as the number of years of experience with the tool. An experienced consultant has the base weighting of 1 (no effect on estimates), with less experienced staff attracting more time.
The total number of days estimated for each job is multiplied by the skill weighting to get a final estimate.
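A sketch of that final step, continuing the example above. The weighting of 1 for an experienced consultant and the x3 novice weighting come from this HowTo and its worked example; the intermediate value is an assumption.

SKILL_WEIGHTING = {
    "expert_4_plus_years": 1.0,   # experienced consultant, no effect on estimates
    "intermediate": 2.0,          # assumed
    "novice": 3.0,                # x3 weighting from the worked example below
}

def weighted_estimate_days(total_job_days, skill):
    """Total estimated days for all jobs, multiplied by the skill weighting."""
    return total_job_days * SKILL_WEIGHTING[skill]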
Jobs handling very high volumes of data can take longer to develop: more time is spent making the job as efficient as possible, unit testing with large volumes takes more time, and optimisation testing takes time.
I extract a list of invoices from a system and load it to some relational data store tables. An invoice is made up of invoice header records and invoice item records (an invoice can have many items on it).
An expert (4+ years) would complete all jobs in 9.75 days. A novice would take 29.25 days with the x3 weighting.
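Using the helper sketched above, the weighting arithmetic for this example checks out:

total = 9.75  # summed estimates for all invoice jobs, expert basis
print(weighted_estimate_days(total, "expert_4_plus_years"))  # 9.75
print(weighted_estimate_days(total, "novice"))               # 9.75 * 3 = 29.25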
These are just guidelines. Happy for people to jump in and add their own estimates for difficult stages or tasks in a job. These guidelines could then go into a modelling spreadsheet that estimates all ETL jobs on a project.
This does not take into account the complexity of business rules, such as having to write a lot of custom transformation code or custom stages. Nor does it take into account a large number of columns for change capture and validation jobs.