watsonx.governance


Granite: IBM's highly curated, trustworthy foundation model for enterprises

By Lindsey Sample posted Fri December 01, 2023 03:45 PM

  

In a recent LinkedIn post, IBM watsonx Director Armand Ruiz describes IBM's highly curated large language model (LLM), Granite.13B. Unlike LLMs trained on unfiltered data that may contain false information or hate, abuse, and profanity (HAP), Granite.13B has been governed and filtered down to "enterprise safe" data sources. Ruiz says...

"At IBM, we curated 6.48TB of data to train our LLM Granite.13B. This was reduced to 2.07 TB after pre-processing, a 68% decrease. This step was crucial to ensure that we provide a high-quality, unbiased, ethical, and legal dataset for training our models for enterprise use cases. This is how:

Data Sources used for training:

1) arXiv: Over 1.8 million scientific paper pre-prints posted to arXiv.
2) Common Crawl: Open repository of web crawl data.
3) DeepMind Mathematics: Mathematical question-and-answer pairs.
4) Free Law: Public-domain legal opinions from US federal and state courts.
5) GitHub Clean: Code data from CodeParrot covering a variety of programming languages.
6) Hacker News: News on computer science and entrepreneurship, taken from 2007 to 2018.
7) OpenWeb Text: Open-source version of OpenAI's WebText corpus containing web pages through 2019.
8) Project Gutenberg (PG-19): A repository of free e-books with a focus on older works for which U.S. copyright has expired.
9) PubMed Central: Biomedical and life sciences papers.
10) SEC Filings: 10-K/Q filings from the US Securities and Exchange Commission (SEC) for the years 1934-2022.
11) Stack Exchange: Anonymized set of all user-contributed content on the Stack Exchange network, a popular collection of websites centered around user-contributed questions and answers.
12) USPTO: US patents granted from 1975 to May 2023, excluding design patents.
13) Webhose: Unstructured web content converted into machine-readable data feeds, acquired by IBM.
14) Wikimedia: Eight English Wikimedia projects (enwiki, enwikibooks, enwikinews, enwikiquote, enwikisource, enwikiversity, enwikivoyage, enwiktionary), containing extracted plain text from pages and articles.

Once the data has been cleared and downloaded, it is prepared for model training through a series of steps collectively known as the pre-processing pipeline. These steps include the following:

1) Text extraction
2) De-duplication
3) Language identification
4) Sentence splitting
5) Hate, abuse, and profanity annotation
6) Document quality annotation
7) URL block-listing annotation
8) Filtering
9) Tokenization

Some pre-processing steps adhere to an annotation/filtering pattern, where documents or sentences are first annotated and then filtered during the filtering task based on defined thresholds.

This is how we build trustworthy LLMs for your Business. Kudos to the team from IBM Research that keeps innovating and building great models for our customers." - Armand Ruiz (via LinkedIn)
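The annotation-then-filtering pattern Ruiz describes (annotators such as HAP, document quality, and URL block-listing attach scores to each document, and a later filtering pass drops documents that cross defined thresholds) can be sketched as below. Every function name, score, and threshold here is an illustrative assumption for the pattern in general, not IBM's actual pipeline:

```python
# Minimal sketch of an annotate-then-filter pre-processing pass.
# Real pipelines use trained classifiers; the scoring here is deliberately crude.

def annotate(doc: dict) -> dict:
    """Annotation steps: attach scores/flags without dropping anything yet."""
    text = doc["text"]
    # HAP annotation: placeholder keyword check standing in for a classifier.
    doc["hap_score"] = 0.9 if "badword" in text else 0.0
    # Document quality annotation: crude length-based proxy score in [0, 1].
    doc["quality_score"] = min(1.0, len(text) / 100)
    # URL block-listing annotation: flag documents from blocked domains.
    doc["blocked_url"] = doc.get("url", "").endswith(".bad")
    return doc

def keep(doc: dict, hap_max: float = 0.5, quality_min: float = 0.2) -> bool:
    """Filtering step: apply thresholds to the annotations in one place."""
    return (doc["hap_score"] <= hap_max
            and doc["quality_score"] >= quality_min
            and not doc["blocked_url"])

docs = [
    {"text": "A clean, reasonably long technical paragraph about tokenization.",
     "url": "https://example.org/a"},
    {"text": "short", "url": "https://example.org/b"},          # fails quality
    {"text": "contains badword content", "url": "https://spam.bad"},  # fails HAP + URL
]
kept = [d for d in (annotate(d) for d in docs) if keep(d)]
print(len(kept))  # prints 1: only the first document survives filtering
```

Separating annotation from filtering means thresholds can be re-tuned and re-applied without re-running the expensive annotators over the corpus.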


#watsonx.governance