
# Generative AI - A Primer

### Details about the article

Title:    **Generative AI exists because of the transformer**\
*Link:*    [https://ig.ft.com/generative-ai/](https://ig.ft.com/generative-ai/)\
Author(s): [Visual Storytelling Team](https://www.ft.com/visual-and-data-journalism) and [Madhumita Murgia](https://www.ft.com/madhumita-murgia)\
Published: **9/11/2023** London \
Referred from: [HN post - How Transformers Work](https://news.ycombinator.com/item?id=37774676).

### Summary

Generative AI is software that can create plausible and sophisticated text, images and code at a level that mimics human ability. LLMs are pattern-spotting engines that guess the next best option in a sequence. They are not search engines that look up facts.

To process text, an LLM:

* *tokenizes* the text (breaks it down into sub-words)
* observes each word in the *context of other occurrences* of that word
* processes this data and produces a vector - the **word embedding** - a list of numbers based on how close other words appear to that word

LLMs are underpinned by the Transformer architecture proposed by Google researchers in 2017. The Transformer architecture:

* processes a sequence of words all at once, instead of one word at a time. This gives LLMs better context and lets them spot patterns, so they produce *more accurate* text. The process also *runs faster* because it can be parallelized.
* A key concept of the transformer architecture is **self-attention**, which allows LLMs to understand the *relationships* between words. It looks at *each token* in a body of text and decides which other tokens are most important to *understand its meaning*.
* Prior to transformers, the standard method for language translation was the **RNN (Recurrent Neural Network)**, which scanned each word sequentially - in the forward direction only.
* With self-attention, transformers compute all the words at the same time.

When the *user gives a prompt*, the model tokenizes and encodes the prompt, representing it in a machine-understandable format that captures the meaning, positions and relationships between words. It then predicts the next word, then the one after that, and so on until the output is complete. These predictions also come as tokens, and the model assigns a *probability score* to each of them to indicate the likelihood that the token is the best next word.

So the content generated by LLMs may seem plausible and coherent, but it is *not always factually correct*. Transformer models can *recognize any repeating patterns* - pixels in image, code, notes in music, DNA in proteins etc.

Two ways to predict these tokens:

1. *greedy search* - the model predicts each word in *isolation*. But this could make the whole phrase irrelevant, even when each individual token is meaningful.
2. *beam search* - the model looks at the probability of *a larger set of tokens* instead of each token individually. So it considers multiple routes and finds the best option.

#### Further Reading

* [Transformer: A Novel Neural Network Architecture for Language Understanding](https://blog.research.google/2017/08/transformer-novel-neural-network.html)

### Detailed Notes

* LLM is a giant leap forward in our quest to build intelligence
* Generative AI - software that can create plausible and sophisticated text, images, code at a level that mimics human ability.
* LLM is underpinned by the Transformer architecture proposed by Google researchers in 2017

**How does LLM generate text?**

First, it translates words into a format that it can understand.

* a set of words is broken into tokens - **tokenization**. Tokens are usually sub-words. Example - `We go to work by train`. One of the tokens is `work`.
* in order to understand the meaning of `work`, the LLM observes that word in the context of other occurrences of the word - using enormous sets of training data created from the internet.
* after the training, there will be...
  * a few words that appear next to `work` - `are`, `her`, `friend`, `admirable`, `streamlined`
  * and other words that don't - `dove`, `polka`
* the LLM processes this data and produces a vector - **word embedding** - a list of numbers, based on each word's proximity to the word `work`.
  * word embedding of `work` could be `[.35, .21, .07, .25, .33,....]`
  * each embedding contains hundreds of values, each of which represents some linguistic feature of the word.
  * we don't know exactly what each of these values represents, but words with similar embeddings are usually used in similar (comparable) contexts.
  * Example - `football` and `soccer` are not identical, but have similar meanings, so their embeddings quantify that closeness
  * if we take just two of these values and project them onto a 2-D plane, we can see the distance between words and can identify clusters of similar words.

|                                  |                                  |
| -------------------------------- | -------------------------------- |
| ![](/files/kJKk85SwvIqFhJDQJTW9) | ![](/files/nZc3DXPVq23xMgfT7Nzx) |
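One common way to quantify the closeness of two embeddings is cosine similarity. The sketch below is only an illustration: the 4-value vectors and the helper names `cosine_similarity` and `most_similar` are made up for this note, whereas real embeddings are learned from training data and have hundreds of values.

```python
import math

# Hypothetical 4-value embeddings, invented for illustration.
# Real embeddings are learned during training and are much longer.
EMBEDDINGS = {
    "football": [0.9, 0.8, 0.1, 0.0],
    "soccer": [0.8, 0.9, 0.2, 0.1],
    "polka": [0.0, 0.1, 0.9, 0.8],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, embeddings=EMBEDDINGS):
    """Return the other word whose embedding is closest to `word`."""
    others = (w for w in embeddings if w != word)
    return max(
        others,
        key=lambda w: cosine_similarity(embeddings[word], embeddings[w]),
    )
```

With these made-up vectors, `most_similar("football")` returns `"soccer"`, because the two vectors point in almost the same direction while `polka` points elsewhere.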

#### The key differentiator of LLMs - Transformer Architecture

The Transformer architecture processes a sequence of words all at once, instead of one word at a time. This gives LLMs better context and lets them spot patterns, so they produce *more accurate* text. The process also *runs faster* because it can be parallelized. This architecture was first published by the Google Research team in 2017 - [Transformer: A Novel Neural Network Architecture for Language Understanding](https://blog.research.google/2017/08/transformer-novel-neural-network.html)

A key concept of the transformer architecture is **self-attention**, which allows LLMs to understand the *relationships* between words. It looks at *each token* in a body of text and decides which other tokens are most important to *understand its meaning*.

Prior to transformers, the standard method for language translation was the **RNN (Recurrent Neural Network)**, which scanned each word sequentially - in the forward direction only. With self-attention, transformers compute all the words at the same time.

Example - take the word `interest`. In the sentence `I have no interest in politics`, `interest` is a noun that describes how much the speaker cares about politics. In the sentence `The bank's interest rates continue to rise`, the same word is used in the financial sense. Even when we combine the two usages, `I have no interest in hearing about the rising interest rate of the bank`, the model is able to recognize the meaning of the word in each context. For the first use of `interest`, `no` and `in` get the highest attention. For the second, it is `rate` and `bank`. This also allows the model to substitute other words for `interest` in the right place. For example, `I have no enthusiasm in hearing about the rising...`. This is particularly useful when summarizing content.

**Another example** -

1. `The dog chewed the bone because it was hungry`. Here, `it` refers to the dog.
2. `The dog chewed the bone because it was delicious`. Here, `it` refers to the bone, not the dog.

This self-attention helps LLMs to gather context from a broad area - well beyond the sentence boundaries. This helps you scale things up.
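The attention weights described above can be sketched with scaled dot-product attention, the form used in the original Transformer paper. This is a minimal sketch under a simplifying assumption: each token vector acts as its own query, key and value, whereas a real transformer first multiplies the vectors by learned weight matrices. The 2-value vectors and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention over a list of token vectors.

    Simplifying assumption: each vector is its own query, key and value.
    Returns (outputs, weights), where weights[i][j] is how much token i
    attends to token j.
    """
    dim = len(vectors[0])
    weights = []
    outputs = []
    for query in vectors:
        # Score this token against every token in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # The output is a weighted average of all token vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(row, vectors)) for d in range(dim)]
        )
    return outputs, weights


# Hypothetical 2-value vectors for three tokens, invented for illustration.
tokens = ["interest", "rate", "politics"]
vectors = [[1.0, 1.0], [1.0, 0.8], [-1.0, 0.2]]
outputs, weights = self_attention(vectors)
```

With these made-up vectors, `weights[0]` gives `rate` a larger weight than `politics` for the token `interest`. Each row of `weights` depends only on the input vectors and not on the other rows, which is why all tokens can be processed at the same time.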

#### LLMs available now

1. **OpenAI**'s **GPT-4**
2. **Google**'s **PaLM**, which powers its **Bard** chatbot (and now a newer model, Gemini)
3. Anthropic Claude
4. Meta's LLaMA
5. Cohere's Command
6. Mistral

#### How it predicts the next token

LLMs are trained on a huge corpus of text available on the internet. During training they identify patterns and context in this data, which they capture through word embeddings, positional encoding and self-attention.

When the user gives a prompt, the model tokenizes and encodes the prompt, representing it in a machine-understandable format that captures the meaning, positions and relationships between words. It then predicts the next word, then the one after that, and so on until the output is complete.

These predictions also come as tokens and the model assigns a probability score to each of these tokens - to indicate the likelihood that the token is the best next word.
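A probability score per token is commonly obtained by passing the model's raw scores through a softmax function. The article does not name the function, so treat this as a sketch of the usual approach. The raw scores and the name `next_token_probabilities` are hypothetical.

```python
import math


def next_token_probabilities(scores):
    """Convert raw scores into probabilities with softmax."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Hypothetical raw scores for the token that follows "We go to work by".
scores = {"train": 3.0, "bus": 2.0, "polka": -1.0}
probs = next_token_probabilities(scores)
best = max(probs, key=probs.get)
```

Here `best` is `"train"`, the token with the highest score, and the values in `probs` sum to 1.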

There are two ways to predict these tokens:

1. *greedy search* - the model predicts each word in isolation. But this could make the whole phrase irrelevant, even when each individual token is meaningful.
2. *beam search* - the model looks at the probability of a larger set of tokens instead of each token individually. So it considers multiple routes and finds the best option.

Beam search tends to produce more accurate results and more human-like text.

<figure><img src="/files/8P6ZykO6eTBM6oxC5zhX" alt=""><figcaption><p>Beam search that looks at larger set of tokens</p></figcaption></figure>
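The difference between the two strategies can be seen on a toy example. The sketch below assumes a hypothetical lookup table, `MODEL`, that stands in for a real LLM and maps the tokens generated so far to next-token probabilities. The numbers are invented so that the greedy choice at the first step leads to a less probable phrase overall.

```python
# Hypothetical next-token probabilities, keyed by the tokens generated so far.
MODEL = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"dog": 0.55, "cat": 0.45},
    ("a",): {"dog": 0.9, "cat": 0.1},
}


def greedy_search(model, steps):
    """Pick the single most probable token at every step."""
    seq, prob = (), 1.0
    for _ in range(steps):
        token, p = max(model[seq].items(), key=lambda kv: kv[1])
        seq, prob = seq + (token,), prob * p
    return seq, prob


def beam_search(model, steps, beam_width=2):
    """Keep the `beam_width` most probable partial sequences at every step."""
    beams = [((), 1.0)]
    for _ in range(steps):
        # Extend every surviving sequence with every possible next token.
        candidates = [
            (seq + (token,), prob * p)
            for seq, prob in beams
            for token, p in model[seq].items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]
```

On this table, `greedy_search(MODEL, 2)` commits to `the` and ends with `("the", "dog")`, while `beam_search(MODEL, 2)` also keeps `a` alive and finds the more probable `("a", "dog")`. Practical implementations usually add log-probabilities instead of multiplying probabilities, to avoid numerical underflow on long sequences.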

#### Limitations of LLMs

LLMs are not search engines that look up facts. They are *pattern-spotting engines that guess the next best option in a sequence*. Their output may seem plausible and coherent, but it may not be factually correct. They can fabricate information in a process called **hallucination**: for example, making up references to articles that don't exist or attributing papers to the wrong authors.

Companies are trying to limit the extent of this hallucination in a few ways:

1. put humans in the loop to give feedback and fill in the gaps in information - **RLHF (Reinforcement Learning from Human Feedback)**
2. a method called **grounding** - cross-checks the LLM's output against web search results and gives citations so that people can verify.

#### The potential of LLMs

The power of LLMs goes far beyond text. Transformer models can recognize any repeating patterns (pixels in image, code, notes in music, DNA in proteins).

For decades, AI research had produced specialized models to summarize, translate, search and retrieve. Transformers unified them all into a single structure capable of doing multiple tasks.

