In his recent LinkedIn post, IBM watsonx Director Armand Ruiz describes IBM's highly curated large language model (LLM), Granite.13B. Unlike LLMs whose training data may include false information or hate, abuse, and profanity (HAP), this model was trained only on data sources governed and filtered to be "enterprise safe". Ruiz says...
"At IBM, we curated 6.48TB of data to train our LLM Granite.13B. This was reduced to 2.07 TB after pre-processing, a 68% decrease. This step was crucial to ensure that we provide a high-quality, unbiased, ethical, and legal dataset for training our models for enterprise use cases. This is how:
Data Sources used for training:
1) arXiv: Over 1.8 million scientific paper pre-prints posted to arXiv.
2) Common Crawl: Open repository of web crawl data.
3) DeepMind Mathematics: Mathematical question-and-answer pairs data.
4) Free Law: Public-domain legal opinions from US federal and state courts.
5) GitHub Clean: Code data from CodeParrot covering a variety of coding languages.
6) Hacker News: News on computer science and entrepreneurship, taken between 2007-2018.
7) OpenWeb Text: Open-source version of OpenAI's Web Text corpus containing web pages through 2019.
8) Project Gutenberg (PG-19): A repository of free e-books with a focus on older works for which U.S. copyright has expired.
9) PubMed Central: Biomedical and life sciences papers.
10) SEC Filings: 10-K/Q filings from the US Securities and Exchange Commission (SEC) for the years 1934-2022.
11) Stack Exchange: Anonymized set of all user-contributed content on the Stack Exchange network, a popular collection of websites centered around user-contributed questions and answers.
12) USPTO: US patents granted from 1975 to May 2023, excluding design patents.
13) Webhose: Unstructured web content converted into machine-readable data feeds, acquired by IBM.
14) Wikimedia: Eight English Wikimedia projects (enwiki, enwikibooks, enwikinews, enwikiquote, enwikisource, enwikiversity, enwikivoyage, enwiktionary), containing extracted plain text from pages and articles.
Once the data has been cleared and downloaded, it is prepared for model training through a series of steps collectively known as the pre-processing pipeline. These steps include the following:
1) Text extraction
2) De-duplication
3) Language identification
4) Sentence splitting
5) Hate, abuse, and profanity annotation
6) Document quality annotation
7) URL block-listing annotation
8) Filtering
9) Tokenization
Some pre-processing steps follow an annotation/filtering pattern: documents or sentences are first annotated with metadata, and are then removed during the filtering step if they fall outside defined thresholds.
This is how we build trustworthy LLMs for your business. Kudos to the team from IBM Research that keeps innovating and building great models for our customers." - Armand Ruiz (via LinkedIn)
#watsonx.governance
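
To make the annotation/filtering pattern Ruiz describes a little more concrete, here is a minimal Python sketch of what such a pipeline stage might look like. The scoring rules, field names, and thresholds below are illustrative assumptions for this post, not IBM's actual Granite pre-processing code; the point is simply that annotation passes only record metadata, and a single filtering pass applies the thresholds at the end.

from dataclasses import dataclass, field

@dataclass
class Document:
    url: str
    text: str
    annotations: dict = field(default_factory=dict)

# Annotation passes: each pass only records metadata and never drops data.

def annotate_hap(doc, hap_terms):
    """Record the fraction of sentences containing hate/abuse/profanity terms."""
    sentences = [s for s in doc.text.split(".") if s.strip()]
    flagged = sum(any(t in s.lower() for t in hap_terms) for s in sentences)
    doc.annotations["hap_ratio"] = flagged / max(len(sentences), 1)

def annotate_quality(doc):
    """Use mean word length as a crude stand-in for a document-quality score."""
    words = doc.text.split()
    doc.annotations["quality"] = sum(len(w) for w in words) / max(len(words), 1)

def annotate_blocklisted(doc, blocked_domains):
    """Flag documents whose URL matches a block-listed domain."""
    doc.annotations["blocked"] = any(d in doc.url for d in blocked_domains)

# Filtering pass: apply thresholds to the recorded annotations in one place.

def keep(doc, max_hap=0.1, min_quality=3.0):
    a = doc.annotations
    return not a["blocked"] and a["hap_ratio"] <= max_hap and a["quality"] >= min_quality

if __name__ == "__main__":
    docs = [
        Document("https://example.com/post", "A clean, informative article. It explains things well."),
        Document("https://spam.example.net/x", "buy now!!! " * 20),
    ]
    for d in docs:
        annotate_hap(d, hap_terms={"slur"})
        annotate_quality(d)
        annotate_blocklisted(d, blocked_domains={"spam.example.net"})
    corpus = [d for d in docs if keep(d)]
    print(f"kept {len(corpus)} of {len(docs)} documents")  # -> kept 1 of 2 documents

In a production pipeline each annotator would be a far more capable classifier, but separating annotation from filtering is the useful part of the pattern: thresholds can be tuned and re-applied without re-running the expensive annotation passes.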