With Generative AI based on LLMs (Large Language Models), we can have a model perform several tasks, such as summarization and knowledge extraction from documents or natural language. The benefits are numerous. To cite a few from the world of IT: we could, for example, accelerate incident resolution for business applications by having a Generative AI succinctly summarize the problem and propose fairly promising solutions, or further improve the handling of customer complaints in call centers (banks, mobile telephone operators, etc.).
To achieve this, these foundation models have to be trained on a large corpus of data from various sources: the Internet, system logs, Stack Overflow data, financial data, and so on. Such data may come with profanity, obscenity, hate speech, etc.
So what happens when companies use foundation models without knowing the origin of the data used to train them? What happens when the process of collection, validation, etc. is a black box? A foundation model, depending on the data on which it was trained, will always follow the rule “garbage in = garbage out” and/or “quality in = quality out”. In other words, the quality of a foundation model's responses depends on the quality of the data with which it was trained. Andrew Ng, adjunct professor in the Department of Computer Science at Stanford University, founder and CEO of Landing AI and co-founder of Coursera, has long campaigned for the adoption of a data-centric rather than model-centric culture. According to him, companies should focus on developing systematic engineering practices to improve data in a reliable, efficient way. How can we ensure that the solution we use in business will not hallucinate or use foul language?
At IBM, we have opted for transparency, ethics and governance by meticulously choosing the data used to train our foundation model, called Granite.
In the research article "Granite Foundation Models" that we published recently, we describe how our base foundation model granite.13b (13 for 13 billion parameters), as well as its variants granite.13b.instruct and granite.13b.chat, were trained on a dataset cleaned by IBM Research. The sources of the data used to train the model are:
These sources total more than 6 TB before cleaning, reduced to just over 2 TB after going through a governance process. See the table below: