watsonx.governance


Granite: IBM's highly curated, trustworthy foundation model for enterprises

By Lindsey Sample posted Fri December 01, 2023 03:45 PM


In a recent LinkedIn post, IBM watsonx Director Armand Ruiz shares details about IBM's highly curated large language model (LLM), Granite.13B. Unlike LLMs trained on data that may include false information or hate, abuse, and profanity (HAP), it has been governed and filtered to include only "enterprise-safe" data sources. Ruiz says...

"At IBM, we curated 6.48TB of data to train our LLM Granite.13B. This was reduced to 2.07 TB after pre-processing, a 68% decrease. This step was crucial to ensure that we provide a high-quality, unbiased, ethical, and legal dataset for training our models for enterprise use cases. This is how:

Data Sources used for training:

1) ๐—ฎ๐—ฟ๐—ซ๐—ถ๐˜ƒ: Over 1.8 million scientific paper pre-prints posted to arXiv.
2) ๐—–๐—ผ๐—บ๐—บ๐—ผ๐—ป ๐—–๐—ฟ๐—ฎ๐˜„๐—น: Open repository of web crawl data.
3) ๐——๐—ฒ๐—ฒ๐—ฝ๐— ๐—ถ๐—ป๐—ฑ ๐— ๐—ฎ๐˜๐—ต๐—ฒ๐—บ๐—ฎ๐˜๐—ถ๐—ฐ๐˜€: Mathematical question and answer pairs data.
4) ๐—™๐—ฟ๐—ฒ๐—ฒ ๐—Ÿ๐—ฎ๐˜„: Public-domain legal opinions from US federal and state courts.
5) ๐—š๐—ถ๐˜๐—›๐˜‚๐—ฏ ๐—–๐—น๐—ฒ๐—ฎ๐—ป: Code data from CodeParrot covering a variety of coding languages.
6) ๐—›๐—ฎ๐—ฐ๐—ธ๐—ฒ๐—ฟ ๐—ก๐—ฒ๐˜„๐˜€: News on computer science and entrepreneurship, taken between 2007-2018.
7) ๐—ข๐—ฝ๐—ฒ๐—ป๐—ช๐—ฒ๐—ฏ ๐—ง๐—ฒ๐˜…๐˜: Open-source version of OpenAIโ€™s Web Text corpus containing web pages through 2019.
8) ๐—ฃ๐—ฟ๐—ผ๐—ท๐—ฒ๐—ฐ๐˜ ๐—š๐˜‚๐˜๐—ฒ๐—ป๐—ฏ๐—ฒ๐—ฟ๐—ด (๐—ฃ๐—š-๐Ÿญ๐Ÿต): A repository of free e-books with focus on older works for which U.S. copyright has expired.
9) ๐—ฃ๐˜‚๐—ฏ๐—บ๐—ฒ๐—ฑ ๐—–๐—ฒ๐—ป๐˜๐—ฟ๐—ฎ๐—น: Biomedical and life sciences papers.
10) ๐—ฆ๐—˜๐—– ๐—™๐—ถ๐—น๐—ถ๐—ป๐—ด๐˜€: 10-K/Q filings from the US Securities and Exchange Commission (SEC) for the years 1934-2022.
11) ๐—ฆ๐˜๐—ฎ๐—ฐ๐—ธ ๐—˜๐˜…๐—ฐ๐—ต๐—ฎ๐—ป๐—ด๐—ฒ: Anonymized set of all user-contributed content on the Stack Exchange network, a popular collection of websites centered around user-contributed questions and answers.
12) ๐—จ๐—ฆ๐—ฃ๐—ง๐—ข: US patents granted from 1975 to May 2023, excluding design patents.
13) ๐—ช๐—ฒ๐—ฏ๐—ต๐—ผ๐˜€๐—ฒ: Unstructured web content converted into machine-readable data feeds acquired by IBM.
14) ๐—ช๐—ถ๐—ธ๐—ถ๐—บ๐—ฒ๐—ฑ๐—ถ๐—ฎ: Eight English Wikimedia projects (enwiki, enwikibooks, enwikinews, enwikiquote, enwikisource, enwikiversity, enwikivoyage, enwiktionary). containing extracted plain text from pages and articles.

Once the data has been cleared and downloaded, it is prepared for model training through a series of steps collectively known as the pre-processing pipeline. These steps include the following:

1) Text extraction
2) De-duplication
3) Language identification
4) Sentence splitting
5) Hate, abuse, and profanity annotation
6) Document quality annotation
7) URL block-listing annotation
8) Filtering
9) Tokenization

Some pre-processing steps follow an annotate-then-filter pattern: documents or sentences are first annotated, and then removed during the filtering step when their annotations cross defined thresholds.

This is how we build trustworthy LLMs for your business. Kudos to the team from IBM Research that keeps innovating and building great models for our customers." - Armand Ruiz (via LinkedIn)
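To make the nine-step pipeline Ruiz lists more concrete, here is a minimal Python sketch of how such corpus-level stages can be chained in order. Every function name, stub, and data value below is a hypothetical illustration, not IBM's actual implementation.

```python
# A minimal sketch of a staged pre-processing pipeline. Stage names mirror
# the list above; all implementation details are hypothetical placeholders.

def extract_text(corpus):
    # 1) Text extraction: pull plain text out of raw source files (stubbed).
    return corpus

def deduplicate(corpus):
    # 2) De-duplication: drop exact duplicate documents by hashing.
    seen, unique = set(), []
    for doc in corpus:
        key = hash(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

def identify_language(corpus):
    # 3) Language identification: a real pipeline would run a language-ID
    # model here and tag each document; this stub keeps everything.
    return corpus

# Stages 4-9 (sentence splitting, HAP annotation, quality annotation,
# URL block-listing, filtering, tokenization) would share the same
# corpus-in, corpus-out signature and slot into the list below.
PIPELINE = [extract_text, deduplicate, identify_language]

def preprocess(corpus):
    for stage in PIPELINE:
        corpus = stage(corpus)
    return corpus

print(preprocess(["a doc", "a doc", "another doc"]))
# -> ['a doc', 'another doc']
```

Giving every stage the same corpus-in, corpus-out signature is what lets steps be added, reordered, or removed independently as the curation process evolves.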
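The annotate-then-filter pattern described above can be sketched the same way: annotation stages attach scores to each document, and a later filtering stage drops documents whose scores cross configurable thresholds. The toy lexicon, scoring heuristics, and threshold values below are invented purely for illustration.

```python
# Hypothetical illustration of the annotate-then-filter pattern: scoring
# first, threshold-based removal second. Lexicon and thresholds are toys.

PROFANE = {"darn", "heck"}  # stand-in lexicon; real HAP models are far richer

def annotate(doc):
    words = doc.lower().split()
    hap_hits = sum(w in PROFANE for w in words)
    return {
        "text": doc,
        "hap_score": hap_hits / max(len(words), 1),    # share of flagged words
        "quality_score": min(len(words) / 20.0, 1.0),  # crude length heuristic
    }

def keep(record, max_hap=0.1, min_quality=0.25):
    # Filtering: enforce the thresholds defined for each annotation.
    return record["hap_score"] <= max_hap and record["quality_score"] >= min_quality

docs = [
    "A well-formed paragraph about enterprise data governance and model training.",
    "darn heck darn",
]
filtered = [r["text"] for r in map(annotate, docs) if keep(r)]
print(filtered)  # only the first document survives filtering
```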


#watsonx.governance