From my previous blog post: "Migrating data from legacy IBM i systems to modern platforms can be complex due to inconsistent formats, redundant records, and missing or inconsistent values." Migrating data to modern relational databases (such as PostgreSQL or MySQL) runs into exactly these challenges. Manually fixing these issues is time-consuming and error-prone. AI-powered tools can automate data analysis, cleansing, and transformation, making migration faster and more accurate.
Let us see how AI can help address those challenges:
| Challenge | How AI addresses the problem |
| --- | --- |
| Duplicate data | AI can identify possible duplicate records even if names or fields are slightly misspelled. |
| Missing values | ML models can fill in missing fields by learning from existing data patterns. |
| Improper file layouts | AI/ML models can read RPG copybooks or DDS and auto-generate table schemas. |
| Incompatible data formats | AI can learn transformation rules (e.g., date formats, units) and apply them during extract, transform, and load operations. |
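To make the last row concrete, here is a minimal sketch of date-format normalization during the transform step, assuming pandas and python-dateutil are installed (the sample values are invented for illustration):

from dateutil import parser
import pandas as pd

# Mixed legacy date formats of the kind often found in DB2 for i extracts
raw = pd.Series(["01/31/2024", "2024-01-31", "31 JAN 2024"])

# Normalize everything to ISO 8601 during transformation
normalized = raw.apply(lambda value: parser.parse(value).strftime("%Y-%m-%d"))
print(normalized.tolist())  # ['2024-01-31', '2024-01-31', '2024-01-31']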
Next, we will see how AI can help with data cleansing, using a sample input from DB2 for i:
| Cust Id | Name | Phone | State | Zip |
| --- | --- | --- | --- | --- |
| CUST001 | John Smith | 1234567890 | TX | |
| CUST002 | J. Smith | 1234567890 | Texas | 77001 |
| CUST003 | Jane | 9876543210 | TX | 77002 |
After AI-driven cleansing:
| Cust Id | Name | Phone | State | Zip | Duplicate of |
| --- | --- | --- | --- | --- | --- |
| CUST001 | John Smith | 1234567890 | TX | 77001 | NULL |
| CUST003 | Jane Doe | 9876543210 | TX | 77002 | NULL |
| CUST002 | J. Smith | 1234567890 | TX | 77001 | CUST001 |
The process includes the following steps:
· Extract the DB2 for i data using SQL scripts and export it to a customers.csv file.
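For example, assuming the IBM i Access ODBC driver is configured, a short Python script can pull the table and write the CSV (the DSN, credentials, and MYLIB.CUSTOMER below are placeholders for your environment):

import pandas as pd
import pyodbc  # requires the IBM i Access ODBC driver

# Placeholder connection string; substitute your own DSN and credentials
conn = pyodbc.connect("DSN=MYIBMI;UID=user;PWD=secret")
df = pd.read_sql("SELECT CUSTID, NAME, PHONE, STATE, ZIP FROM MYLIB.CUSTOMER", conn)
df.to_csv("customers.csv", index=False)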
· Execute the following Python script to detect duplicate records using fuzzy string matching:
from thefuzz import fuzz
import pandas as pd

df = pd.read_csv("customers.csv")

def find_duplicates(df):
    duplicates = []
    for i, row1 in df.iterrows():
        for j, row2 in df.iterrows():
            # Compare each pair once; a threshold of 80 catches abbreviated
            # names such as "J. Smith" vs "John Smith" (token_set_ratio ~83)
            if i < j and fuzz.token_set_ratio(row1['Name'], row2['Name']) > 80:
                duplicates.append((i, j))
    return duplicates

print(find_duplicates(df))
· Predict the missing ZIP codes using a machine-learning model from scikit-learn:
from sklearn.ensemble import RandomForestClassifier

# Normalize State first so equivalent rows share one value ("Texas" -> "TX")
df['State'] = df['State'].replace({"Texas": "TX"})

# Mark missing ZIPs with "0"; cast through int so filled values do not become "77001.0"
df['Zip'] = df['Zip'].fillna(0).astype(int).astype(str)

# Encode State as numeric codes, since scikit-learn cannot fit on raw strings
df['StateCode'] = df['State'].astype('category').cat.codes

train = df[df['Zip'] != "0"]
model = RandomForestClassifier()
model.fit(train[['StateCode']], train['Zip'])

missing = df[df['Zip'] == "0"]
df.loc[missing.index, 'Zip'] = model.predict(missing[['StateCode']])
Note: The above code may need to be refined for better performance or to handle large data sets.
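For large data sets, one common refinement of the duplicate-detection step is blocking: group the records by a coarse key such as State and run the pairwise fuzzy comparison only within each group, shrinking the O(n²) comparison count. A minimal sketch, reusing the find_duplicates function from above:

# Compare names only within the same state instead of across all pairs
for state, group in df.groupby('State'):
    print(state, find_duplicates(group.reset_index(drop=True)))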
Similarly, AI can help map legacy file layouts to suitable relational data structures. Let us say we have a DDS definition like this:
A          R CUSTREC
A            CUSTID        10A
A            NAME          30A
A            PHONE         10S 0
A            STATE          2A
A            ZIP            5S 0
AI tools can convert this into a modern schema:
CREATE TABLE customers (
    cust_id VARCHAR(10),
    name    VARCHAR(30),
    phone   BIGINT,
    state   CHAR(2),
    zip     INTEGER
);
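In practice an AI tool would infer this mapping, but the core idea can be illustrated with a small rule-based sketch (the regex and type mapping below are deliberate simplifications, not a full DDS parser):

import re

# Simplified mapping: 'A' = character field, 'S' = zoned decimal with 0 places
def dds_to_sql(dds_lines, table_name):
    columns = []
    for line in dds_lines:
        match = re.match(r"A\s+(\w+)\s+(\d+)([AS])", line)
        if not match:
            continue  # skip the record-format line and anything unrecognized
        name, length, dds_type = match.group(1), int(match.group(2)), match.group(3)
        if dds_type == "A":
            sql_type = f"VARCHAR({length})"
        else:
            # A zoned field with zero decimal places maps to an integer type
            sql_type = "BIGINT" if length > 9 else "INTEGER"
        columns.append(f"    {name.lower()} {sql_type}")
    return f"CREATE TABLE {table_name} (\n" + ",\n".join(columns) + "\n);"

dds = [
    "A          R CUSTREC",
    "A            CUSTID        10A",
    "A            NAME          30A",
    "A            PHONE         10S 0",
    "A            STATE          2A",
    "A            ZIP            5S 0",
]
print(dds_to_sql(dds, "customers"))

A real converter would also refine details this sketch ignores, such as CHAR vs VARCHAR, primary keys, and NOT NULL constraints.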
AI-powered data migration accelerates the process without sacrificing accuracy, and it scales to millions of records.
To conclude, AI combined with traditional ETL processes can modernize legacy systems with less manual effort. Machine learning models help fill data gaps, eliminate duplicates, and transform file-based records into relational databases while preserving business logic. This approach not only accelerates digital transformation but also ensures higher data quality and prepares organizations for cloud-native or analytics-ready systems.