Today, we are thrilled to announce the launch of IBM Synthetic Data Sets, a family of enterprise-grade, artificially generated data sets designed for predictive AI model training and fine-tuning of large language models (LLMs) to benefit IBM Z and IBM LinuxONE clients, ecosystem, and independent software vendors (ISVs). Given the real-time inferencing strengths of the IBM Z and IBM LinuxONE platforms, the top AI use cases among IBM Z and IBM LinuxONE clients are fraud detection of various kinds and anti-money laundering detection. Clients can download curated data sets designed with these use cases in mind to develop predictive AI models and LLMs for financial services, or to optimize existing models for improved accuracy and risk mitigation.
The family of three IBM Synthetic Data Sets includes:
Register now for our webinar on March 6th to learn more.
This launch shows IBM’s continued investment in enabling clients to derive enhanced business value from AI on the mainframe. IBM Synthetic Data Sets are designed to help with training or fine-tuning AI models quickly, enhancing predictive models and LLMs, and validating models with known ground truth. Combining the artificially generated data sets with hardware acceleration positions clients to deliver innovation faster and with higher accuracy, which can translate into cost savings.
Clients can get started faster on AI projects and proofs of concept with synthetic transactional data that lets data scientists build models quickly for faster time to value. Data privacy regulations such as GDPR require stringent compliance, and noncompliance may result in hefty fines and legal action. As a result, companies are rigorously restricting data access, which can sometimes make real data unavailable for use in projects. Alternatively, obtaining permission for data access can take months and multiple rounds of approval, and once the data is received, sensitive data needs to be identified and then redacted or encrypted. This process can take up to six months, which hinders the ability to show business value quickly. With IBM Synthetic Data Sets, the easy-to-download CSV files already contain pre-curated key attributes needed for the specific IBM Z and IBM LinuxONE use cases and do not include any real personally identifiable information (PII), so data scientists can begin model training and show value through a proof of concept in parallel while waiting for access to their real data. For independent software vendors (ISVs) who do not have access to their IBM Z and IBM LinuxONE customers’ data, these synthetic data sets can enable AI solution creation with realistic synthetic transactional data.
Train models anywhere on the platform of your choice with Synthetic Data Sets, and deploy those models on IBM Z and IBM LinuxONE with AI Toolkit for IBM Z and IBM LinuxONE, Cloud Pak for Data on Z, or Machine Learning for z/OS. Perform inference on IBM z16 and IBM LinuxONE 4, leveraging hardware acceleration investments and data gravity to dramatically enhance AI inferencing speed and scale. Powered by the groundbreaking IBM Telum™ processor with on-chip AI acceleration for inferencing, clients can achieve up to 19x higher throughput and 20x lower response time when collocating applications and inferencing. [1]
Secondly, clients can enhance predictive AI models and fine-tune LLMs with additional rich and broad data, leading to significant cost savings in areas such as fraud detection and money laundering prevention. For example, money laundering by nature often goes undetected in real data, as criminals attempt to move illicit funds to conceal their origins. This frequently involves crossing bank and national boundaries, producing complex transaction patterns. IBM Synthetic Data Sets for Core Banking and Money Laundering labels every transaction as “is money laundering” or not, spans the entire banking ecosystem, incorporates global transactions, and even includes cash transactions, which are typically unavailable in real banking data. This rich data set with known ground truth enables data scientists to validate their models and create robust AML models, thereby reducing risk and saving costs for organizations. Moreover, reducing false positives saves countless hours of labor spent investigating flagged instances.
IBM Synthetic Data Sets also label two other common criminal activities: check fraud and Authorized Push Payment (APP) fraud. APP fraud covers scams such as fake bills, fake romantic interests, and fake relatives in distress that induce victims to send funds to criminals. Each instance of check and APP fraud is labeled in the data. Check and APP fraud are also two of the sources of illicit funds that create downstream laundering as described above. Throughout the data sets are attributes that can link one data set to another, providing referential integrity so that the behavior of a user can be understood more broadly across the data sets. Check fraud, APP fraud, and money laundering are an example of where multiple data files can be combined into a rich and broad training data set.
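As a rough illustration of combining data files through a shared attribute, the sketch below joins two small in-memory CSV samples on an account identifier. The column names (`txn_id`, `account_id`, `is_laundering`, `is_check_fraud`) and the sample rows are hypothetical stand-ins, not the actual schema of IBM Synthetic Data Sets.

```python
import csv
import io

# Hypothetical stand-ins for two downloaded CSV files. The real column
# names and contents of IBM Synthetic Data Sets may differ.
TRANSACTIONS_CSV = """txn_id,account_id,amount,is_laundering
t1,a1,120.00,0
t2,a2,9800.00,1
t3,a1,45.50,0
"""

CHECK_FRAUD_CSV = """check_id,account_id,is_check_fraud
c1,a2,1
c2,a3,0
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def join_on_account(transactions, checks):
    """Attach a per-account check-fraud flag to every transaction row."""
    fraud_accounts = {
        row["account_id"] for row in checks if row["is_check_fraud"] == "1"
    }
    joined = []
    for row in transactions:
        enriched = dict(row)
        enriched["account_has_check_fraud"] = (
            "1" if row["account_id"] in fraud_accounts else "0"
        )
        joined.append(enriched)
    return joined


combined = join_on_account(read_rows(TRANSACTIONS_CSV), read_rows(CHECK_FRAUD_CSV))
for row in combined:
    print(row["txn_id"], row["is_laundering"], row["account_has_check_fraud"])
```

With a join like this, each transaction row carries both its laundering label and a signal from the check fraud file, which is the kind of broader training record the paragraph above describes.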
Furthermore, organizations are combining multiple AI models, a technique called “Composite AI”, to increase prediction accuracy. IBM Z and IBM LinuxONE clients can use IBM Synthetic Data Sets in Composite AI methods to enhance predictive capabilities. For example, with IBM Synthetic Data Sets for Homeowners Insurance, claims data helps train a predictive AI model, which can then be complemented by LLMs fine-tuned on the free-text description of why a claim was filed. The combination of models produces a more holistic understanding of the claims that are filed, which then helps detect claims fraud more accurately. At Hot Chips, Telum II was announced for availability in 2025, and the new version of the on-chip accelerator is expected to support Composite AI use cases such as the ones IBM Synthetic Data Sets help enhance.
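The text does not say how the models in a Composite AI setup are combined. One simple possibility, shown here purely as an assumption, is a weighted average of the fraud score from a tabular claims model and the score from a text model. The function name, the weights, the threshold, and the sample scores are all hypothetical.

```python
def blend_scores(tabular_score, text_score, text_weight=0.5):
    """Weighted average of two fraud probabilities, each in [0, 1]."""
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be between 0 and 1")
    return (1.0 - text_weight) * tabular_score + text_weight * text_score


# Hypothetical scores: one from a model trained on structured claims data,
# one from a language model reading the free-text reason for the claim.
claims = [
    {"claim_id": "c1", "tabular_score": 0.20, "text_score": 0.90},
    {"claim_id": "c2", "tabular_score": 0.10, "text_score": 0.15},
]

THRESHOLD = 0.5  # illustrative cut-off, not a recommended value

flagged = [
    claim["claim_id"]
    for claim in claims
    if blend_scores(claim["tabular_score"], claim["text_score"]) >= THRESHOLD
]
print(flagged)
```

In this toy example the first claim looks ordinary to the tabular model but is flagged once the text score is blended in, which is the kind of gain a combination of models is meant to provide.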
Finally, IBM Synthetic Data Sets are designed to help with validating existing models for accuracy. With all transactions labeled as “is money laundering” or not, or “is fraudulent” or not, IBM Synthetic Data Sets serve as an answer sheet and provide ground truth about whether a transaction is fraudulent. Therefore, when data scientists want to gauge a model’s accuracy, the model can be tested on the synthetic data to validate whether it correctly detected fraud.
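A minimal sketch of this kind of validation compares a model's predictions with the ground-truth labels and computes precision and recall. The label and prediction lists below are made-up illustrative values, not drawn from any IBM data set.

```python
def confusion_counts(labels, predictions):
    """Count true/false positives and negatives for binary 0/1 values."""
    tp = fp = fn = tn = 0
    for truth, predicted in zip(labels, predictions):
        if truth == 1 and predicted == 1:
            tp += 1
        elif truth == 0 and predicted == 1:
            fp += 1
        elif truth == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(labels, predictions):
    """Return (precision, recall), using 0.0 when a ratio is undefined."""
    tp, fp, fn, _ = confusion_counts(labels, predictions)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative values: 1 means "is fraudulent", 0 means not.
labels = [0, 0, 1, 1, 0, 1, 0, 0]       # ground truth from the data set
predictions = [0, 1, 1, 0, 0, 1, 0, 0]  # output of the model under test

precision, recall = precision_recall(labels, predictions)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

False positives lower precision and missed fraud lowers recall, so tracking both against known labels shows whether a model change reduces investigation workload without letting more fraud through.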
Why use IBM Synthetic Data Sets over other synthetic data solutions on the market? The alternative solutions are synthetic data generators, which are licensed software that require clients to bring samples of their real data that the generator then augments with some methodology to create larger synthetic data sets. Let’s compare below.
| Comparison area | IBM Synthetic Data Sets | Other synthetic data generators |
| --- | --- | --- |
| Ease of use | Easily downloadable CSV files (with accompanying DDL files for uploading to Db2 or other data tables) for immediate use. | Licensed software that requires a learning curve to use the product. |
| Client seed data required | No. Clients do not need to bring any real data into the solution, which addresses the data access challenges. The data sets do not include any real PII because they are created from population statistics rather than by anonymizing data at the individual level. | Yes. Clients need to bring their real data, which can be challenging with sensitive data and data privacy regulations, especially in regulated industries. |
| Known ground truth | Yes. Ground truth is known: all transactions are labeled as fraud or money laundering, or not. | No. Ground truth about fraud and money laundering is not always known in real data. |
| Logic maintained | Yes. IBM Synthetic Data Sets retain and accurately reflect the complex relationships and constraints present in the real world. This is due to the years of custom inputs and code worked into our agent-based model, which do not come by default in other synthetic data generators on the market. | No. Complex relationships and constraints in the real world do not automatically translate, and they often present challenges when generating data with synthetic data generators. |
| Referential integrity (the relationships between different tables make sense and are accurate, consistent, and up to date) | Yes. Across IBM Synthetic Data Sets, there is referential integrity that is not often found in data produced by standard synthetic data generators. | No. Similar and related to the “logic maintained” issue, the current challenge with generator solutions is that those complex relationships do not hold when using standard synthetic data generators. |
| Use cases | Targeted at IBM Z and IBM LinuxONE real-time inferencing use cases such as fraud detection and anti-money laundering. | A variety of use cases that may not need the maintained logic or referential integrity of real data. |
Based on the above, clients should choose IBM Synthetic Data Sets when they want to get started on AI projects quickly while complying with data privacy regulations, and when they are looking for richer synthetic data that includes known ground truth and maintains business logic for enhanced model training.
To learn more about the exciting new IBM Synthetic Data Sets:
[1] DISCLAIMER: Performance results based on IBM internal tests using a CICS OLTP credit card workload with in-transaction fraud detection. A synthetic credit card fraud detection model was used: https://github.com/IBM/ai-on-z-fraud-detection. On IBM z16, inferencing was done with WMLz on zCX. TensorFlow Serving was used on the compared x86 server. A Linux on IBM Z LPAR, located on the same IBM z16, was used to bridge the network connection between the measured z/OS LPAR and the x86 server. Additional network latency was introduced with the Linux "tc-netem" command to simulate a remote cloud environment with 60 ms average latency. Results may vary. IBM z16 configuration: Measurements were run using a z/OS (V2R4) LPAR with WMLz (OSCE) and zCX with APAR OA61559 and APAR OA62310 applied, 8 CPs, 16 zIIPs, and 8 GB of memory. x86 configuration: TensorFlow Serving 2.4 ran on Ubuntu 20.04.3 LTS on 8 Skylake Intel® Xeon® Gold CPUs @ 2.30 GHz with Hyperthreading turned on, 1.5 TB memory, RAID5 local SSD storage.