Modernization with IBM Power

Legacy modernization using AI/ML - AI-driven Data cleansing and Migration

By RajKumar Chindanuru posted 6 days ago

  

As noted in my previous blog post, "Migrating data from legacy IBM i systems to modern platforms can be complex due to inconsistent formats, redundant records, and missing or inconsistent values." Migrating data to modern relational databases (such as PostgreSQL or MySQL) involves all of these challenges, and fixing the issues manually is time-consuming and error-prone. AI-powered tools can automate data analysis, cleansing, and transformation, making migration faster and more accurate.

Let us see how AI can help address those challenges:

Challenge                 | How AI addresses the problem
--------------------------|-------------------------------------------------------------
Duplicate data            | AI can identify possible duplicate records even if names or fields are slightly misspelled.
Missing values            | ML models can fill in missing fields by learning from existing data patterns.
Improper file layouts     | AI/ML models can read RPG copybooks or DDS and auto-generate table schemas.
Incompatible data formats | AI can learn transformation rules (e.g., date formats, units) and apply them during extraction, transformation, and loading operations.
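As a small illustration of the last row, here is a minimal sketch of a date-normalization rule applied during ETL. The candidate formats below are assumptions about what a legacy file might hold; a real pipeline would derive them by profiling the source data:

import pandas as pd

# Candidate legacy formats are assumptions for this sketch; a real tool
# would infer them by profiling the source data.
CANDIDATE_FORMATS = ["%Y%m%d", "%m/%d/%Y", "%d-%b-%y"]

def normalize_date(value):
    """Try each known legacy format and return an ISO 8601 date string."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return pd.to_datetime(str(value), format=fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # leave unparseable values for manual review

print(normalize_date("20240115"))    # -> 2024-01-15
print(normalize_date("01/15/2024"))  # -> 2024-01-15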

Next, let us see how AI can help with data cleansing, using a sample input from DB2 for i:

Cust Id | Name       | Phone      | State | Zip
--------|------------|------------|-------|------
CUST001 | John Smith | 1234567890 | TX    |
CUST002 | J. Smith   | 1234567890 | Texas | 77001
CUST003 | Jane       | 9876543210 | TX    | 77002

After AI-driven cleansing:

Cust Id | Name       | Phone      | State | Zip   | Duplicate of
--------|------------|------------|-------|-------|-------------
CUST001 | John Smith | 1234567890 | TX    | 77001 | NULL
CUST003 | Jane Doe   | 9876543210 | TX    | 77002 | NULL
CUST002 | J. Smith   | 1234567890 | TX    | 77001 | CUST001

The process includes the following steps:

· Extract the DB2 for i data using SQL scripts and export it to a customers.csv file, for example with a small extraction script like the sketch below.
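A minimal extraction sketch, assuming ODBC connectivity to the IBM i system; the system name, credentials, and library/file names below are placeholders to adjust for your environment:

import pandas as pd
import pyodbc

# Connection details are placeholders -- adjust for your environment.
conn = pyodbc.connect(
    "DRIVER={IBM i Access ODBC Driver};SYSTEM=myibmi;UID=myuser;PWD=mypassword"
)

# Pull the customer file into a DataFrame and export it as CSV.
df = pd.read_sql("SELECT CUSTID, NAME, PHONE, STATE, ZIP FROM MYLIB.CUSTOMER", conn)

# Rename to the friendlier headers used in the scripts below.
df.columns = ["Cust Id", "Name", "Phone", "State", "Zip"]
df.to_csv("customers.csv", index=False)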

· Execute the following Python script to detect duplicate records using fuzzy string matching.

from thefuzz import fuzz
import pandas as pd

df = pd.read_csv("customers.csv")

def find_duplicates(df):
    """Return (i, j) index pairs of rows that appear to be the same customer."""
    duplicates = []
    for i, row1 in df.iterrows():
        for j, row2 in df.iterrows():
            if i >= j:
                continue
            # A shared phone number is a strong duplicate signal; it also catches
            # abbreviated names such as "J. Smith" that fuzzy name matching
            # alone can score below the threshold.
            same_phone = str(row1['Phone']) == str(row2['Phone'])
            similar_name = fuzz.token_set_ratio(str(row1['Name']), str(row2['Name'])) > 85
            if same_phone or similar_name:
                duplicates.append((i, j))
    return duplicates

print(find_duplicates(df))  # [(0, 1)] for the sample data above
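A hedged follow-up sketch showing how the detected pairs could produce the cleaned table above: mark the later record as a duplicate, borrow a missing ZIP from its match, and normalize the state value (the one-entry replace rule here is a stand-in for a learned normalization):

# Continues the script above: df and find_duplicates are already defined.
df['Duplicate of'] = None
df['State'] = df['State'].replace({'Texas': 'TX'})  # stand-in for a learned rule

for i, j in find_duplicates(df):
    # The later record (j) is flagged as a duplicate of the earlier one (i).
    df.loc[j, 'Duplicate of'] = df.loc[i, 'Cust Id']
    # Borrow a missing ZIP from the matching record.
    if pd.isna(df.loc[i, 'Zip']) and not pd.isna(df.loc[j, 'Zip']):
        df.loc[i, 'Zip'] = df.loc[j, 'Zip']

print(df)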

· Predict the missing ZIP codes using scikit-learn (a machine learning library):

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# ZIP is read as a float column when blanks are present; normalize to
# strings, with "0" marking a missing value.
df['Zip'] = df['Zip'].fillna(0).astype(int).astype(str)

# scikit-learn needs numeric features, so encode State as integer codes.
df['StateCode'] = pd.factorize(df['State'])[0]

train = df[df['Zip'] != "0"]
model = RandomForestClassifier()
model.fit(train[['StateCode']], train['Zip'])

missing = df[df['Zip'] == "0"]
df.loc[missing.index, 'Zip'] = model.predict(missing[['StateCode']])

Note: On the three-row sample above, the classifier has almost nothing to learn from (in the cleaned table, CUST001's ZIP actually comes from its duplicate, CUST002); the approach becomes useful at scale, where the model can learn state-to-ZIP patterns from thousands of complete rows. The code above may also need refinement for performance on large data sets.

Similarly, AI can help map legacy file layouts to suitable modern data structures.

Let us say we have a DDS source member like this:

A          R CUSTREC

A            CUSTID         10A

A            NAME           30A

A            PHONE          10S 0

A            STATE           2A

A            ZIP             5S 0

AI tools can convert this into a modern schema:

CREATE TABLE customers (

  cust_id VARCHAR(10),

  name VARCHAR(30),

  phone BIGINT,

  state CHAR(2),

  zip INTEGER

);
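To make the mapping concrete, here is a tiny rule-based sketch of the DDS-to-DDL conversion logic. An AI-assisted tool would generalize this and also infer keys, constraints, and CHAR-versus-VARCHAR choices; the parsing rules below are simplified assumptions:

import re

# Matches DDS field lines such as "A    CUSTID    10A" or "A    ZIP    5S 0".
DDS_FIELD = re.compile(r"^A\s+(\w+)\s+(\d+)([AS])\s*(\d+)?")

def dds_to_ddl(dds_lines, table_name):
    columns = []
    for line in dds_lines:
        m = DDS_FIELD.match(line)
        if not m:
            continue  # skip record-format lines like "A   R CUSTREC"
        name, length, dds_type, decimals = m.groups()
        if dds_type == "A":                    # alphanumeric -> VARCHAR
            sql_type = f"VARCHAR({length})"
        elif decimals and int(decimals) > 0:   # zoned decimal with decimals
            sql_type = f"DECIMAL({length},{decimals})"
        elif int(length) > 9:                  # too wide for INTEGER
            sql_type = "BIGINT"
        else:
            sql_type = "INTEGER"
        columns.append(f"  {name.lower()} {sql_type}")
    return f"CREATE TABLE {table_name} (\n" + ",\n".join(columns) + "\n);"

Applied to the CUSTREC layout above, this yields essentially the CREATE TABLE statement shown, with state emitted as VARCHAR(2) rather than CHAR(2), one of the judgment calls a smarter tool would make.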

AI-powered data migration accelerates the process while preserving accuracy, and it scales to millions of records.

To conclude, AI combined with traditional ETL processes can modernize legacy systems with less manual effort. Machine learning models help fill data gaps, eliminate duplicates, and transform file-based records into relational databases while preserving business logic. This approach not only accelerates digital transformation but also ensures higher data quality and prepares organizations for cloud-native or analytics-ready systems.
