Embeddings: why turn words into vectors¶
You can play with Infinity embeddings online here. It is a very cool project: a low-latency REST API for serving text-embedding, reranking, CLIP, CLAP and ColPali models.
You can try the following request on the embeddings endpoint:
{
  "model": "michaelfeil/bge-small-en-v1.5",
  "encoding_format": "float",
  "user": "string",
  "input": [
    "this is a simple encoding test"
  ],
  "modality": "text"
}
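For reference, here is a minimal Python sketch of what that call looks like. The host and port are assumptions (Infinity listens on 7997 by default; adjust the URL to wherever your instance runs):

```python
import requests

# Hypothetical local Infinity endpoint; adjust host/port to your deployment.
url = "http://localhost:7997/embeddings"

payload = {
    "model": "michaelfeil/bge-small-en-v1.5",
    "encoding_format": "float",
    "input": ["this is a simple encoding test"],
    "modality": "text",
}

response = requests.post(url, json=payload)
response.raise_for_status()

# The response follows the OpenAI embeddings format: a list of float vectors.
embedding = response.json()["data"][0]["embedding"]
print(len(embedding), embedding[:5])
```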
You probably get the idea by now: this article is about how to turn words into vectors (just lists of floats) to feed machine learning algorithms that need numerical input.
Naive numeric encodings lose the meaning of the text, but there's a way to get around this: "text embeddings", or just embeddings. They preserve the meaning of phrases and words, even when these are turned into high-dimensional vectors.
That's why text embeddings are so useful in NLP tasks like categorizing text, analyzing opinions, translating languages, and answering questions.
So if you're looking for a solution that won't break the bank, open-source embedding models are a very good option. Here are some of my favorites:
- Word2Vec - A pioneer of word embeddings: it maps words to vectors in a continuous space to capture how they relate to each other. The original version here is now archived 🤓
- GloVe - This method collects global co-occurrence statistics from massive corpora to create word embeddings that are pretty good at capturing relationships between words.
- BERT - This transformer model looks at both the left and right context of each word when creating its embeddings, which helps it do really well on NLP tasks. The original paper is here if you have the time and passion for it!
- txtai - An open-source one-stop shop for text embeddings that can be used for semantic search, orchestrating large language models, and more. Here is the 60+ use cases bible 🥷
- Chroma - An open-source embedding database (a vector store rather than a model) for storing and querying your vectors. And here is the best Chroma CookBook
In this article we will be using Chroma, Hugging Face and OpenAI to generate embeddings of the content we produced in the previous article about building a YouTube LLM bot.
If you prefer video courses, this is a very good explanation from StatQuest, the free YouTube teacher we all love in the game 👨🏼‍🏫
OpenAI¶
First, let's write a simple function called simple_openai_embedding() that compares two words through their OpenAI vector representations.
Do not forget to set your OpenAI API key according to the documentation 😎
from langchain.embeddings import OpenAIEmbeddings
from langchain.evaluation import load_evaluator

def simple_openai_embedding():
    # Get the embedding for a single word.
    embedding_function = OpenAIEmbeddings()
    vector = embedding_function.embed_query("apple")
    print(f"Vector for 'apple': {vector}")
    print(f"Vector length: {len(vector)}")

    # Compare the vectors of two words.
    evaluator = load_evaluator("pairwise_embedding_distance")
    words = ("apple", "iphone")
    x = evaluator.evaluate_string_pairs(prediction=words[0], prediction_b=words[1])
    print(f"Comparing ({words[0]}, {words[1]}): {x}")
Distances between words¶
Embedding algorithms use various distance metrics to measure the similarity between words, phrases, or documents. These metrics are essential because they tell us how similar or dissimilar two entities are in terms of meaning.
If you haven't seen our article about NLP basics and you want to know more about distances, don't hesitate to take a look 👀
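As a quick illustration (with toy vectors, not real embeddings), here is the cosine distance that the pairwise evaluator above uses by default:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    # 0 means identical direction; values near 1 mean unrelated.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings", purely for illustration.
v_apple = np.array([0.1, 0.8, 0.3])
v_iphone = np.array([0.2, 0.7, 0.4])
print(cosine_distance(v_apple, v_iphone))  # small value = similar meaning
```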
BERT Embeddings¶
BERT stands for Bidirectional Encoder Representations from Transformers. It is a pre-trained language model developed by Google in 2018 that uses deep bidirectional representations from transformer architectures to generate contextualized word embeddings.
In this article we will not go into the details of BERT; instead we will use the transformers.BertTokenizer class to play with BERT embeddings without deeply understanding how it works under the hood.
import transformers
# Load the pre-trained BERT tokenizer (English, uncased vocabulary)
tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
text = "An other course about NLP and embeddings ... "
marked_text = "[CLS] " + text + " [SEP]"
# Tokenize our sentence with the BERT tokenizer.
tokenized_text = tokenizer.tokenize(marked_text)
segments_ids = [1] * len(tokenized_text)
# Map the token strings to their vocabulary indices.
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

# Print out the tokens.
print(tokenized_text)
['[CLS]', 'an', 'other', 'course', 'about', 'nl', '##p', 'and', 'em', '##bed', '##ding', '##s', '.', '.', '.', '[SEP]']
model = transformers.BertModel.from_pretrained("bert-base-uncased")
embedding_layer = model.embeddings.word_embeddings.weight
print(embedding_layer)
Parameter containing:
tensor([[-0.0102, -0.0615, -0.0265,  ..., -0.0199, -0.0372, -0.0098],
        [-0.0117, -0.0600, -0.0323,  ..., -0.0168, -0.0401, -0.0107],
        [-0.0198, -0.0627, -0.0326,  ..., -0.0165, -0.0420, -0.0032],
        ...,
        [-0.0218, -0.0556, -0.0135,  ..., -0.0043, -0.0151, -0.0249],
        [-0.0462, -0.0565, -0.0019,  ...,  0.0157, -0.0139, -0.0095],
        [ 0.0015, -0.0821, -0.0160,  ..., -0.0081, -0.0475,  0.0753]],
       requires_grad=True)
print(tokenizer(['courses']))
{'input_ids': [[101, 5352, 102]], 'token_type_ids': [[0, 0, 0]], 'attention_mask': [[1, 1, 1]]}
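As a small follow-up sketch (not from the original notebook), you can map a token to its vocabulary id and fetch the matching row of the embedding matrix; the id 5352 matches the output above:

```python
# Look up the vocabulary id for a token and grab its embedding row.
token_id = tokenizer.convert_tokens_to_ids("courses")  # 5352, as printed above
vector = model.embeddings.word_embeddings.weight[token_id]
print(vector.shape)  # torch.Size([768]): one 768-dimensional vector per token
```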
Position embeddings¶
In addition to the Token Embeddings described so far, BERT also relies on Position Embeddings. While Token Embeddings are used to represent each possible word or subword that can be provided to the model, Position Embeddings represent the position of each token in the input sequence.
print(model.embeddings.position_embeddings)
Embedding(512, 768)
print(model.embeddings.position_embeddings.weight)
Parameter containing:
tensor([[ 1.7505e-02, -2.5631e-02, -3.6642e-02,  ...,  3.3437e-05,  6.8312e-04,  1.5441e-02],
        [ 7.7580e-03,  2.2613e-03, -1.9444e-02,  ...,  2.8910e-02,  2.9753e-02, -5.3247e-03],
        [-1.1287e-02, -1.9644e-03, -1.1573e-02,  ...,  1.4908e-02,  1.8741e-02, -7.3140e-03],
        ...,
        [ 1.7418e-02,  3.4903e-03, -9.5621e-03,  ...,  2.9599e-03,  4.3435e-04, -2.6949e-02],
        [ 2.1687e-02, -6.0216e-03,  1.4736e-02,  ..., -5.6118e-03, -1.2590e-02, -2.8085e-02],
        [ 2.6413e-03, -2.3298e-02,  5.4922e-03,  ...,  1.7537e-02,  2.7550e-02, -7.7656e-02]],
       requires_grad=True)
While there are 30,522 different Token Embeddings, there are only 512 different Position Embeddings. This is because the largest input sequence accepted by the BERT model is 512 tokens long.
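Here is a rough sketch of how these two embedding types combine (BERT also adds token-type embeddings and applies LayerNorm and dropout, omitted here for clarity):

```python
import torch

# Reuse indexed_tokens from the tokenizer section above.
input_ids = torch.tensor([indexed_tokens])
positions = torch.arange(input_ids.size(1)).unsqueeze(0)

token_emb = model.embeddings.word_embeddings(input_ids)    # what each token means
pos_emb = model.embeddings.position_embeddings(positions)  # where each token sits
combined = token_emb + pos_emb
print(combined.shape)  # torch.Size([1, sequence_length, 768])
```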
If you want to know more about BERT embeddings, you can take a look at the excellent article BERT Embeddings
Store your vectors with ChromaDB¶
Using a vector database can be beneficial for several reasons in your NLP project! There are many ways to do this; here we will be using the open-source solution ChromaDB 🤓
If you understood the previous part of this tutorial, it's pretty much all about finding word similarity 🙏 Vector databases allow for efficient similarity searches between vectors, which is crucial in NLP tasks like text classification, clustering, and recommendation systems.
Scalability & Flexibility: as the number of data points grows, traditional databases may become inefficient due to increased query complexity, but vector databases scale very well with the volume of data, making them great for large-scale NLP applications. They can also handle high-dimensional data and support various similarity metrics, which makes them suitable for a wide range of NLP tasks.
Popular algorithms used in vector databases include FAISS (Facebook AI Similarity Search), developed by Facebook AI Research for efficient similarity search and indexing of dense vectors, and HNSW (Hierarchical Navigable Small World), which provides a scalable and efficient way to index and search high-dimensional data.
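To make that concrete, here is a tiny FAISS sketch (assuming faiss-cpu is installed; this is just an illustration, nothing in the rest of this article depends on it):

```python
import faiss
import numpy as np

d = 768                       # dimension of our vectors (BERT-sized)
index = faiss.IndexFlatL2(d)  # exact L2 search; HNSW-based indexes also exist
vectors = np.random.rand(1000, d).astype("float32")
index.add(vectors)

# Find the 3 nearest stored vectors to the first one.
distances, ids = index.search(vectors[:1], 3)
print(ids, distances)
```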
In summary, by leveraging the power of vector databases you can significantly improve the performance of your NLP models, especially when dealing with large datasets or high-dimensional data.
Generate embeddings and store them with ChromaDB¶
In this part we will see how to generate embeddings from a set of documents (here a folder called generated_files with some markdown documents inside); you can also do it with other formats like txt.
In this example we will be using the data generated in our previous article about Ollama chains of thought; as a reminder, it was about creating a chain that extracts the YouTube transcript of a video and turns it into a markdown course.
# Start the chromadb client
import chromadb

chroma_client = chromadb.HttpClient(host="localhost", port=8000)
chroma_client.heartbeat()  # Returns a nanosecond heartbeat. Useful for making sure the client remains connected.
1731341954144062000
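This assumes a Chroma server is already running locally; if you don't have one, you can start it with the chroma run CLI (it listens on port 8000 by default) before creating the client.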
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores.chroma import Chroma
import os
import shutil
#
# CREATE DB / HuggingFace embeddings
#
CHROMA_PATH = "../db/chroma_db_generated_courses"
DATA_PATH = "../generated_files"
def generate_data_store():
    documents = load_documents()
    chunks = split_text(documents)
    save_to_chroma(chunks)
def load_documents():
    loader = DirectoryLoader(DATA_PATH, glob="*.md")
    documents = loader.load()
    return documents
def split_text(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=350,
        length_function=len,
        add_start_index=True,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

    # Print a sample chunk and its metadata for inspection.
    document = chunks[10]
    print(document.page_content)
    print(document.metadata)

    return chunks
from langchain.embeddings import HuggingFaceEmbeddings
huggingface_embeddings = HuggingFaceEmbeddings(
model_name="sentence-transformers/all-mpnet-base-v2"
)
def save_to_chroma(chunks: list[Document]):
    # Clear out the database first.
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH)

    # Create a new DB from the documents.
    db = Chroma.from_documents(
        chunks, huggingface_embeddings, persist_directory=CHROMA_PATH
    )
    db.persist()
    print(f"Saved {len(chunks)} chunks to {CHROMA_PATH}.")

generate_data_store()
LangChainDeprecationWarning: The class `HuggingFaceEmbeddings` was deprecated in LangChain 0.2.2 and will be removed in 1.0. An updated version of the class exists in the langchain-huggingface package. Run `pip install -U langchain-huggingface` and import as `from langchain_huggingface import HuggingFaceEmbeddings`.
Split 5 documents into 53 chunks.

here are five free apis that you should use in your next application Lauren pixum lets you create placeholder images in just an instant so if you have an application that still needs a UI or has a UI but needs pictures then you can use this and enter the dimensions as parameters Json placeholder lets you create fake data for your applications if you're developing or testing your application you need a bunch of realistic fake data open food facts lets you look up food by barcode and other metrics while looking up its nutritional values so like I can look up goldfish for example and see all of the different ratings it barcode the common names that sucks I lowkey just really do love goldfish Digi dates let you do time and date related functions in your application conversions validation human readable versions Etc what AI lets you use natural language processing within your applications I'm not talking about like you know large language models I'm talking about like the OG version of

{'source': '../generated_files/5cXwOdWWJnM_en_TEST.md', 'start_index': 85}
Saved 53 chunks to ../db/chroma_db_generated_courses.
LangChainDeprecationWarning: Since Chroma 0.4.x the manual persistence method is no longer supported as docs are automatically persisted.
Hugging Face Embeddings¶
In this part, let's explore the HuggingFaceEmbeddings class from LangChain and build a Chroma vector database with all my generated documents 😎
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader

# Define the directory containing the text file and the persistent directory
current_dir = os.path.dirname(os.path.abspath("."))
file_path = os.path.join(current_dir, "generated_files", "vbVc7TxAvAI_en.md")
db_dir = os.path.join(current_dir, "db")

# Check if the text file exists
if not os.path.exists(file_path):
    raise FileNotFoundError(
        f"The file {file_path} does not exist. Please check the path."
    )

# Read the text content from the file
loader = TextLoader(file_path)
documents = loader.load()

# Split the document into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Display information about the split documents
print("\n--- Document Chunks Information ---")
print(f"Number of document chunks: {len(docs)}")
print(f"Sample chunk:\n{docs[0].page_content}\n")

# Function to create and persist a vector store
def create_vector_store(docs, embeddings, store_name):
    persistent_directory = os.path.join(db_dir, store_name)
    if not os.path.exists(persistent_directory):
        print(f"\n--- Creating vector store {store_name} ---")
        Chroma.from_documents(
            docs, embeddings, persist_directory=persistent_directory
        )
        print(f"--- Finished creating vector store {store_name} ---")
    else:
        print(f"Vector store {store_name} already exists. No need to initialize.")

print("\n--- Using Hugging Face Transformers ---")
huggingface_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
create_vector_store(docs, huggingface_embeddings, "chroma_db_huggingface")

print("Embedding demonstrations for OpenAI and Hugging Face completed.")
--- Document Chunks Information ---
Number of document chunks: 7
Sample chunk:
# Transcript and Ollama Response for Video ID: vbVc7TxAvAI
## Language: en
## Transcript:
' print("\n--- Using Hugging Face Transformers ---")\nhuggingface_embeddings = HuggingFaceEmbeddings(\n model_name="sentence-transformers/all-mpnet-base-v2"\n)\ncreate_vector_store(docs, huggingface_embeddings, "chroma_db_huggingface")\n\nprint("Embedding demonstrations for OpenAI and Hugging Face completed.")\n\n '
Query the Chroma vector DB¶
Now that we have saved our data as vectors inside Chroma, your directory should now have a db folder with something like this inside:
db
├── chroma.sqlite3
└── chroma_db_generated_courses
    ├── chroma.sqlite3
    └── d4021e7e-f19f-4905-a088-e4d5e3bec331
        ├── data_level0.bin
        ├── header.bin
        ├── length.bin
        └── link_lists.bin

3 directories, 6 files
Now we can write a query_vector_store function that takes the Chroma store_name, a query string, and an embedding_function, and queries our database in natural language instead of with a traditional SELECT 😎
# Function to query a vector store
def query_vector_store(store_name, query, embedding_function):
    persistent_directory = os.path.join(db_dir, store_name)
    if os.path.exists(persistent_directory):
        print(f"\n--- Querying the Vector Store {store_name} ---")
        db = Chroma(
            persist_directory=persistent_directory,
            embedding_function=embedding_function,
        )
        retriever = db.as_retriever(
            search_type="similarity_score_threshold",
            search_kwargs={"k": 3, "score_threshold": 0.1},
        )
        relevant_docs = retriever.invoke(query)

        # Display the relevant results with metadata
        print(f"\n--- Relevant Documents for {store_name} ---")
        for i, doc in enumerate(relevant_docs, 1):
            print(f"Document {i}:\n{doc.page_content}\n")
            if doc.metadata:
                print(f"Source: {doc.metadata.get('source', 'Unknown')}\n")
    else:
        print(f"Vector store {store_name} does not exist.")

# Define the user's question
query = "What is an API?"

# Query each vector store
query_vector_store("chroma_db_generated_courses", query, huggingface_embeddings)

print("Querying demonstrations completed.")
--- Querying the Vector Store chroma_db_generated_courses ---

--- Relevant Documents for chroma_db_generated_courses ---
Document 1: Generated Course Content for Video ID: 5cXwOdWWJnM Language: en Course Content: Free API Course Outline Introduction APIs (Application Programming Interfaces) are a crucial part of application development. They enable developers to interact with third-party services and access data, features, and functionalities that might not be available within their own applications. In this course, we'll explore the world of free APIs, discussing their benefits, popular options, and how to use them effectively in your application development projects. Main Topics 1. Free APIs for Application Development Overview of Popular Free APIs There are numerous free APIs available, each offering unique features and capabilities. Let's take a look at some popular ones: API Name: OpenWeatherMap API Description: Provides current weather conditions, forecasts, and atmospheric data. Features: Current weather conditions (temperature, humidity, wind speed) Forecasting functionality
Source: ../generated_files/5cXwOdWWJnM_chain_en.md

Document 2: Ollama Response: Based on the transcript, I've created a detailed course outline that covers five free APIs and their potential uses in application development. I've also included Python code examples where possible. Course Title: Exploring Free APIs for Application Development Course Description: In this course, we'll explore five free APIs that can be used to enhance the functionality of your applications. We'll dive into each API's features, provide code examples, and discuss potential use cases for each. Module 1: Lauren Pixum - Placeholder Images Overview of Lauren Pixum and its features Instant placeholder images creation Dimensions as parameters (e.g., width, height, aspect ratio) Code Example: ```python import requests def get_placeholder_image(width, height): params = { 'width': width, 'height': height } response = requests.get('https://picsum.photos', params=params) return response.content Create a placeholder image with dimensions 500x300
Source: ../generated_files/5cXwOdWWJnM_en.md

Document 3: Recommendations for Next Steps and Further Learning Practice working with APIs: Try out different APIs and experiment with their functionality. Expand your skillset: Consider learning more about API design, implementation, and security. By leveraging free APIs like Pixum, Json Placeholder, Open Food Facts, DigiDates, and What AI, you can unlock new possibilities for your applications.
Source: ../generated_files/5cXwOdWWJnM_en_chain.md

Querying demonstrations completed.
Specific search types queries¶
Now let's add some spice with an extra parameter called search_type in our query function, in order to specify the type of search we want to use when querying our database.
# Function to query a vector store with different search types and parameters
persistent_directory = os.path.join(db_dir, "chroma_db_generated_courses")

def query_vector_store(
    store_name, query, embedding_function, search_type, search_kwargs
):
    if os.path.exists(persistent_directory):
        print(f"\n--- Querying the Vector Store {store_name} ---")
        db = Chroma(
            persist_directory=persistent_directory,
            embedding_function=embedding_function,
        )
        retriever = db.as_retriever(
            search_type=search_type,
            search_kwargs=search_kwargs,
        )
        relevant_docs = retriever.invoke(query)

        # Display the relevant results with metadata
        print(f"\n--- Relevant Documents for {store_name} ---")
        for i, doc in enumerate(relevant_docs, 1):
            print(f"Document {i}:\n{doc.page_content}\n")
            if doc.metadata:
                print(f"Source: {doc.metadata.get('source', 'Unknown')}\n")
    else:
        print(f"Vector store {store_name} does not exist.")

# Define the user's question
query = "What is API?"

# Showcase different retrieval methods

# 1. Similarity Search
# This method retrieves documents based on vector similarity.
# It finds the most similar documents to the query vector based on cosine similarity.
# Use this when you want to retrieve the top k most similar documents.
print("\n--- Using Similarity Search ---")
query_vector_store("chroma_db_generated_courses", query,
                   huggingface_embeddings, "similarity", {"k": 3})
--- Using Similarity Search ---

--- Querying the Vector Store chroma_db_generated_courses ---

--- Relevant Documents for chroma_db_generated_courses ---
Document 1: Generated Course Content for Video ID: 5cXwOdWWJnM Language: en Course Content: Free API Course Outline Introduction APIs (Application Programming Interfaces) are a crucial part of application development. They enable developers to interact with third-party services and access data, features, and functionalities that might not be available within their own applications. In this course, we'll explore the world of free APIs, discussing their benefits, popular options, and how to use them effectively in your application development projects. Main Topics 1. Free APIs for Application Development Overview of Popular Free APIs There are numerous free APIs available, each offering unique features and capabilities. Let's take a look at some popular ones: API Name: OpenWeatherMap API Description: Provides current weather conditions, forecasts, and atmospheric data. Features: Current weather conditions (temperature, humidity, wind speed) Forecasting functionality
Source: ../generated_files/5cXwOdWWJnM_chain_en.md

Document 2: Recommendations for Next Steps and Further Learning Practice working with APIs: Try out different APIs and experiment with their functionality. Expand your skillset: Consider learning more about API design, implementation, and security. By leveraging free APIs like Pixum, Json Placeholder, Open Food Facts, DigiDates, and What AI, you can unlock new possibilities for your applications.
Source: ../generated_files/5cXwOdWWJnM_en_chain.md

Document 3: Ollama Response: Based on the transcript, I've created a detailed course outline that covers five free APIs and their potential uses in application development. I've also included Python code examples where possible. Course Title: Exploring Free APIs for Application Development Course Description: In this course, we'll explore five free APIs that can be used to enhance the functionality of your applications. We'll dive into each API's features, provide code examples, and discuss potential use cases for each. Module 1: Lauren Pixum - Placeholder Images Overview of Lauren Pixum and its features Instant placeholder images creation Dimensions as parameters (e.g., width, height, aspect ratio) Code Example: ```python import requests def get_placeholder_image(width, height): params = { 'width': width, 'height': height } response = requests.get('https://picsum.photos', params=params) return response.content Create a placeholder image with dimensions 500x300
Source: ../generated_files/5cXwOdWWJnM_en.md
# 2. Max Marginal Relevance (MMR)
# This method balances between selecting documents that are relevant to the query and diverse among themselves.
# 'fetch_k' specifies the number of documents to initially fetch based on similarity.
# 'lambda_mult' controls the diversity of the results: 1 for minimum diversity, 0 for maximum.
# Use this when you want to avoid redundancy and retrieve diverse yet relevant documents.
# Note: Relevance measures how closely documents match the query.
# Note: Diversity ensures that the retrieved documents are not too similar to each other,
# providing a broader range of information.
print("\n--- Using Max Marginal Relevance (MMR) ---")
query_vector_store(
    "chroma_db_generated_courses",
    query,
    huggingface_embeddings,
    "mmr",
    {"k": 3, "fetch_k": 20, "lambda_mult": 0.5},
)
--- Using Max Marginal Relevance (MMR) ---

--- Querying the Vector Store chroma_db_generated_courses ---

--- Relevant Documents for chroma_db_generated_courses ---
Document 1: Generated Course Content for Video ID: 5cXwOdWWJnM Language: en Course Content: Free API Course Outline Introduction APIs (Application Programming Interfaces) are a crucial part of application development. They enable developers to interact with third-party services and access data, features, and functionalities that might not be available within their own applications. In this course, we'll explore the world of free APIs, discussing their benefits, popular options, and how to use them effectively in your application development projects. Main Topics 1. Free APIs for Application Development Overview of Popular Free APIs There are numerous free APIs available, each offering unique features and capabilities. Let's take a look at some popular ones: API Name: OpenWeatherMap API Description: Provides current weather conditions, forecasts, and atmospheric data. Features: Current weather conditions (temperature, humidity, wind speed) Forecasting functionality
Source: ../generated_files/5cXwOdWWJnM_chain_en.md

Document 2: Examples of static site generators (e.g., Hugo, Jekyll) Python code example using Flask to generate a static site: ```python from flask import Flask, render_template app = Flask(name) @app.route("/") def index(): return render_template("index.html") if name == "main": app.run() ``` Lecture 1.2: Client-Server Architecture Definition of client-server architecture How clients and servers communicate Examples of client-server architectures (e.g., RESTful API, gRPC) Python code example using Flask to create a RESTful API: ```python from flask import Flask, jsonify app = Flask(name) @app.route("/api/data", methods=["GET"]) def get_data(): data = {"message": "Hello World"} return jsonify(data) if name == "main": app.run() ``` Module 2: Serverless Architecture (MPA) Definition of serverless architecture How MPA works Benefits and drawbacks of MPA Python code example using AWS Lambda to create an API Gateway: ```python import boto3
Source: ../generated_files/vbVc7TxAvAI_en.md

Document 3: Example usage: data = perform_nlp("This is a sample text for NLP analysis.") if data is not None: print(data) else: print("Failed to retrieve NLP results") def tokenize_text(text): url = "https://api.nltk.org/v1/nlp/text/tokenize" data = {"text": text} response = requests.post(url, json=data) if response.status_code == 200: return response.json() else: print(f"Failed to retrieve tokenized text. Status code: {response.status_code}") return None Example usage: data = tokenize_text("This is a sample text for NLP analysis.") if data is not None: print(data) else: print("Failed to retrieve tokenized text") ``` Conclusion:
Source: ../generated_files/5cXwOdWWJnM_en_TEST.md
# 3. Similarity Score Threshold
# This method retrieves documents that exceed a certain similarity score threshold.
# 'score_threshold' sets the minimum similarity score a document must have to be considered relevant.
# Use this when you want to ensure that only highly relevant documents are retrieved, filtering out less relevant ones.
print("\n--- Using Similarity Score Threshold ---")
query_vector_store(
    "chroma_db_generated_courses",
    query,
    huggingface_embeddings,
    "similarity_score_threshold",
    {"k": 3, "score_threshold": 0.1},
)
--- Using Similarity Score Threshold ---

--- Querying the Vector Store chroma_db_generated_courses ---

--- Relevant Documents for chroma_db_generated_courses ---
Document 1: Generated Course Content for Video ID: 5cXwOdWWJnM Language: en Course Content: Free API Course Outline Introduction APIs (Application Programming Interfaces) are a crucial part of application development. They enable developers to interact with third-party services and access data, features, and functionalities that might not be available within their own applications. In this course, we'll explore the world of free APIs, discussing their benefits, popular options, and how to use them effectively in your application development projects. Main Topics 1. Free APIs for Application Development Overview of Popular Free APIs There are numerous free APIs available, each offering unique features and capabilities. Let's take a look at some popular ones: API Name: OpenWeatherMap API Description: Provides current weather conditions, forecasts, and atmospheric data. Features: Current weather conditions (temperature, humidity, wind speed) Forecasting functionality
Source: ../generated_files/5cXwOdWWJnM_chain_en.md

Document 2: Recommendations for Next Steps and Further Learning Practice working with APIs: Try out different APIs and experiment with their functionality. Expand your skillset: Consider learning more about API design, implementation, and security. By leveraging free APIs like Pixum, Json Placeholder, Open Food Facts, DigiDates, and What AI, you can unlock new possibilities for your applications.
Source: ../generated_files/5cXwOdWWJnM_en_chain.md

Document 3: Ollama Response: Based on the transcript, I've created a detailed course outline that covers five free APIs and their potential uses in application development. I've also included Python code examples where possible. Course Title: Exploring Free APIs for Application Development Course Description: In this course, we'll explore five free APIs that can be used to enhance the functionality of your applications. We'll dive into each API's features, provide code examples, and discuss potential use cases for each. Module 1: Lauren Pixum - Placeholder Images Overview of Lauren Pixum and its features Instant placeholder images creation Dimensions as parameters (e.g., width, height, aspect ratio) Code Example: ```python import requests def get_placeholder_image(width, height): params = { 'width': width, 'height': height } response = requests.get('https://picsum.photos', params=params) return response.content Create a placeholder image with dimensions 500x300
Source: ../generated_files/5cXwOdWWJnM_en.md

Querying demonstrations with different search types completed.
In this example our database is really small, so we can't really see the differences between these search types, but with a larger one it's pretty obvious 🥸
Wrap it up¶
In this course, we've explored the concept of word embeddings, from traditional methods to advanced models like BERT. We've seen how BERT generates contextualized embeddings that capture nuanced meanings based on context, how to extract those embeddings in practice, and how to store and query our documents in a Chroma vector database with different search types.
Hope you learned a thing or two, happy crafting ⚙️