Creating a Local RAG (Retrieval Augmented Generation) System with PyTorch and Ollama¶
Modern large language models (LLMs) are limited to the information they were trained on. To overcome this limitation, additional "chunks" of information can be injected into the model's context window. Relevant chunks are found by computing cosine similarity scores between the user's query and the chunks stored in a vector index.
Key Steps:

- Consistent Embedding: It is crucial to use the same embedding model throughout the process to ensure accurate results.
- Vectorization and Similarity Calculation: The user query is vectorized, and cosine similarity is computed to identify the most relevant chunks. A score closer to 1 indicates higher similarity.
- Reranking: A reranking algorithm is often applied to refine the search results further.
In this project, Ollama is used to run the LLM locally, while a PyTorch tensor stores the embeddings in place of a traditional vector database. The vectorized user query is compared against every stored embedding (an exact nearest-neighbor search) to retrieve the top 5 most similar chunks, and only chunks with a similarity score above 0.5 are passed to the LLM to generate a detailed response.
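The core retrieval step can be illustrated in a few lines (a minimal sketch with random stand-in vectors; real chunk embeddings come from the loading code below):

import torch
from sentence_transformers import util

# Toy stand-ins: 3 chunk embeddings and 1 query embedding, unit-normalized
chunk_embeddings = torch.nn.functional.normalize(torch.randn(3, 768), dim=-1)
query_embedding = torch.nn.functional.normalize(torch.randn(1, 768), dim=-1)

# For unit-length vectors, the dot product is exactly the cosine similarity
scores = util.cos_sim(query_embedding, chunk_embeddings)[0]

# Rank all chunks, then keep only those above the 0.5 threshold
top_scores, top_indices = torch.topk(scores, k=3)
kept = [i.item() for i, s in zip(top_indices, top_scores) if s.item() >= 0.5]
print(kept)  # random vectors usually score near 0 and are discarded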
Imports¶
import pandas as pd
import numpy as np
import torch
from sentence_transformers import util, SentenceTransformer
import subprocess
Functions¶
def find_top_k_similar_embeddings(df, embeddings, query, embedding_model, k=5, device="cpu"):
    """
    Find the top k most similar embeddings to a given query and return the corresponding chunks.

    Parameters:
    - df (pandas.DataFrame): The DataFrame containing the embeddings and chunk information.
    - embeddings (torch.Tensor): A 2D tensor containing the embeddings.
    - query (str): The query string to compare against the embeddings.
    - embedding_model: The model used to embed the query.
    - k (int): The number of top similar embeddings to return. Default is 5.
    - device (str): The device to perform computations on ("cuda" or "cpu");
      it must match the device the embeddings tensor lives on.

    Returns:
    - top_k_indices (torch.Tensor): The indices of the top k similar embeddings.
    - top_k_scores (torch.Tensor): The similarity scores of the top k embeddings.
    - top_k_chunks (list): The corresponding chunks from the DataFrame.
    """
    # Step 1: Embed the query string using the same model that produced the stored embeddings
    query_embedding = embedding_model.encode(query, convert_to_tensor=True).to(device)

    # Step 2: Compute dot-product scores between the query and all embeddings
    # (equal to cosine similarity here because the embedding model outputs unit vectors)
    dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]

    # Step 3: Get the top k results
    top_k_scores, top_k_indices = torch.topk(dot_scores, k=k)

    # Step 4: Retrieve the corresponding chunks from the DataFrame
    top_k_chunks = df.iloc[top_k_indices.cpu().numpy()]["chunk"].tolist()

    # Print the results in order
    print(f"\nTop {k} most similar results:\n")
    for i in range(k):
        print(f"Index: {top_k_indices[i].item()}\nScore: {top_k_scores[i].item()}\nChunk:\n{top_k_chunks[i]}\n")
        print("-" * 50)

    return top_k_indices, top_k_scores, top_k_chunks
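A note on scoring: util.dot_score returns raw dot products, which equal cosine similarity only when the vectors are unit length. The all-mpnet-base-v2 model used below normalizes its output, so the two coincide here; if you swap in a model that does not, a drop-in replacement for Step 2 (a sketch, not part of the original pipeline) is:

# util.cos_sim normalizes both sides internally, so the scores are
# true cosine similarities regardless of the embedding model used
cos_scores = util.cos_sim(a=query_embedding, b=embeddings)[0]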
def load_and_prepare_embeddings(file_path, model_name_or_path="all-mpnet-base-v2"):
    """
    Load a CSV file containing embeddings, convert them to PyTorch tensors,
    and prepare the embedding model.

    Parameters:
    - file_path (str): The path to the CSV file containing the embeddings.
    - model_name_or_path (str): The name or path of the embedding model to load.

    Returns:
    - df (pandas.DataFrame): The DataFrame containing the original data with embeddings parsed into float arrays.
    - embeddings (torch.Tensor): A PyTorch tensor of embeddings.
    - embedding_model (SentenceTransformer): The embedding model loaded from SentenceTransformers.
    """
    # Load the CSV file into a DataFrame
    df = pd.read_csv(file_path)

    # Parse the stringified embedding column ("[0.1, 0.2, ...]") back into float arrays.
    # Note: np.fromstring with sep is deprecated; ast.literal_eval is a more future-proof parser.
    df["embeddings"] = df["embeddings"].apply(lambda x: np.fromstring(x.strip("[]"), sep=", "))

    # Convert the list of embeddings to a single 2D PyTorch tensor
    embeddings = torch.tensor(df["embeddings"].tolist(), dtype=torch.float32)

    # Set the device to GPU if available, otherwise CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    embeddings = embeddings.to(device)

    # Load the embedding model and move it to the same device
    embedding_model = SentenceTransformer(model_name_or_path, device=device)

    return df, embeddings, embedding_model
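For reference, the CSV loaded below is assumed to come from a separate preprocessing step that chunked the PDF and embedded each chunk. A minimal sketch of writing a compatible file, covering only the columns the loader needs (save_chunk_embeddings and its details are illustrative, not the original preprocessing code):

def save_chunk_embeddings(chunks, out_path, model_name="all-mpnet-base-v2"):
    """Embed a list of text chunks and save them in the CSV layout expected above."""
    model = SentenceTransformer(model_name)
    vectors = model.encode(chunks)  # numpy array of shape (n_chunks, 768)
    df_out = pd.DataFrame({
        "chunk": chunks,
        # Stored as "[0.1, 0.2, ...]" strings, which np.fromstring parses back
        "embeddings": [str(v.tolist()) for v in vectors],
    })
    df_out.to_csv(out_path, index=False)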
Read CSV File Containing Embeddings¶
# Set file path
file_path = r"C:\Users\steph\Desktop\RAG\output_csv\Book-2Designing-data-intensive-applications_processed.csv"
df, embeddings, embedding_model = load_and_prepare_embeddings(file_path)
display(df)
| | page_number | chunk | chunk_char_count | chunk_token_count | page_word_count | page_token_count | embeddings |
|---|---|---|---|---|---|---|---|
| 0 | 4 | 978-1-449-37332-0 LSI Designing Data-Intensive... | 805 | 144 | 250 | 313 | [0.02232709527015686, 0.019246932119131088, -0... |
| 1 | 4 | First Edition Revision History for the First E... | 1022 | 169 | 250 | 313 | [0.00605520186945796, -0.04391200467944145, -0... |
| 2 | 5 | Technology is a powerful force in our society.... | 403 | 76 | 64 | 76 | [0.021445734426379204, 0.09222898632287979, -0... |
| 3 | 7 | Computing is pop culture. ... Pop culture hold... | 397 | 82 | 71 | 82 | [-0.012340748682618141, 0.08791149407625198, -... |
| 4 | 9 | Table of Contents Preface. . . . . . . . . . .... | 682 | 179 | 248 | 270 | [-0.023812390863895416, -0.01836887188255787, ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1200 | 611 | uncertain (transaction status) (see in doubt) ... | 2615 | 521 | 350 | 521 | [0.01760762929916382, -0.026653055101633072, 0... |
| 1201 | 612 | windows (stream processing), 466, 468-472 infi... | 1647 | 373 | 239 | 373 | [-0.016259633004665375, 0.004297784063965082, ... |
| 1202 | 613 | About the Author Martin Kleppmann is a researc... | 1167 | 219 | 437 | 515 | [0.04322214052081108, 0.034118011593818665, -0... |
| 1203 | 613 | Males are larger than females, but the species... | 893 | 175 | 437 | 515 | [0.01628885790705681, 0.035985954105854034, -0... |
| 1204 | 613 | Due to its aggression, it was depicted on the ... | 605 | 121 | 437 | 515 | [0.05848781764507294, -0.007880114018917084, -... |

1205 rows × 7 columns
Test retrieving the top 5 chunks based on similarity score¶
# Set query string and perform similarity search
query = "data modelling best practices"
top_k_indices, top_k_scores, top_k_chunks = find_top_k_similar_embeddings(df, embeddings, query, embedding_model, k=5)
Top 5 most similar results:

Index: 78
Score: 0.5955010056495667
Chunk:
groups of peoplefor example, the engineers at the database vendor and the applica tion developers using their databaseto work together effectively. There are many different kinds of data models, and every data model embodies assumptions about how it is going to be used. Some kinds of usage are easy and some are not supported; some operations are fast and some perform badly; some data transformations feel natural and some are awkward. It can take a lot of effort to master just one data model (think how many books there are on relational data modeling). Building software is hard enough, even when work ing with just one data model and without worrying about its inner workings. But since the data model has such a profound effect on what the software above it can and cant do, its important to choose one that is appropriate to the application. In this chapter we will look at a range of general-purpose data models for data stor age and querying (point 2 in the preceding list). In particular, we will compare the relational model, the document model, and a few graph-based data models. We will also look at various query languages and compare their use cases. In Chapter 3 we will discuss how storage engines work; that is, how these data models are actually implemented (point 3 in the list).
--------------------------------------------------
Index: 98
Score: 0.5462720394134521
Chunk:
Its not possible to say in general which data model leads to simpler application code; it depends on the kinds of relationships that exist between data items. For highly interconnected data, the document model is awkward, the relational model is accept able, and graph models (see Graph-Like Data Models on page 49) are the most natural. Schema flexibility in the document model Most document databases, and the JSON support in relational databases, do not enforce any schema on the data in documents. XML support in relational databases usually comes with optional schema validation. No schema means that arbitrary keys and values can be added to a document, and when reading, clients have no guaran tees as to what fields the documents may contain. Document databases are sometimes called schemaless, but thats misleading, as the code that reads the data usually assumes some kind of structurei.e. , there is an implicit schema, but it is not enforced by the database 20. A more accurate term is schema-on-read (the structure of the data is implicit, and only interpreted when the data is read), in contrast with schema-on-write (the traditional approach of relational Relational Model Versus Document Model | 39
--------------------------------------------------
Index: 76
Score: 0.5414575338363647
Chunk:
CHAPTER 2 Data Models and Query Languages The limits of my language mean the limits of my world. Ludwig Wittgenstein, Tractatus Logico-Philosophicus (1922) Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on how we think about the problem that we are solving. Most applications are built by layering one data model on top of another. For each layer, the key question is: how is it represented in terms of the next-lower layer? For example: 1. As an application developer, you look at the real world (in which there are peo ple, organizations, goods, actions, money flows, sensors, etc.) and model it in terms of objects or data structures, and APIs that manipulate those data struc tures. Those structures are often specific to your application. 2. When you want to store those data structures, you express them in terms of a general-purpose data model, such as JSON or XML documents, tables in a rela tional database, or a graph model.
--------------------------------------------------
Index: 30
Score: 0.5116366147994995
Chunk:
But reality is not that simple. There are many database systems with different charac teristics, because different applications have different requirements. There are vari ous approaches to caching, several ways of building search indexes, and so on. When building an application, we still need to figure out which tools and which approaches are the most appropriate for the task at hand. And it can be hard to combine tools when you need to do something that a single tool cannot do alone. This book is a journey through both the principles and the practicalities of data sys tems, and how you can use them to build data-intensive applications. We will explore what different tools have in common, what distinguishes them, and how they achieve their characteristics. In this chapter, we will start by exploring the fundamentals of what we are trying to achieve: reliable, scalable, and maintainable data systems. Well clarify what those things mean, outline some ways of thinking about them, and go over the basics that we will need for later chapters. In the following chapters we will continue layer by layer, looking at different design decisions that need to be considered when working on a data-intensive application.
--------------------------------------------------
Index: 148
Score: 0.5060915946960449
Chunk:
The Datalog approach requires a different kind of thinking to the other query lan guages discussed in this chapter, but its a very powerful approach, because rules can be combined and reused in different queries. Its less convenient for simple one-off queries, but it can cope better if your data is complex. Summary Data models are a huge subject, and in this chapter we have taken a quick look at a broad variety of different models. We didnt have space to go into all the details of each model, but hopefully the overview has been enough to whet your appetite to find out more about the model that best fits your applications requirements. Historically, data started out being represented as one big tree (the hierarchical model), but that wasnt good for representing many-to-many relationships, so the relational model was invented to solve that problem. More recently, developers found that some applications dont fit well in the relational model either. New nonrelational NoSQL datastores have diverged in two main directions: 1. Document databases target use cases where data comes in self-contained docu ments and relationships between one document and another are rare. 2. Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything. All three models (document, relational, and graph) are widely used today, and each is good in its respective domain.
--------------------------------------------------
PROMPT_TEMPLATE = """
Given the following context:
{context}
User Query: {query}
Please provide a detailed response based on the above context.
"""
def call_ollama(context, query, model="llama3"):
    """
    Build the prompt using the context and user query,
    then call the Ollama CLI via subprocess to get the response.
    The prompt is passed via standard input.
    """
    prompt = PROMPT_TEMPLATE.format(context=context, query=query)

    # Construct the command; the prompt is piped in rather than passed as an argument
    cmd = ["ollama", "run", model]

    try:
        # Pass the prompt as standard input using the 'input' parameter.
        # Note: with text=True and no explicit encoding, the output is decoded with
        # the locale code page on Windows, which can fail on non-ASCII model output
        # (see the UnicodeDecodeError in the first chat transcript below).
        result = subprocess.run(cmd, input=prompt, capture_output=True, text=True, check=True)
        response = result.stdout.strip()
    except subprocess.CalledProcessError as e:
        response = f"Error calling Ollama: {e.stderr}"
    return response
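Alternatively, the subprocess call can be replaced with a request to Ollama's local HTTP API, which sidesteps CLI output parsing and console-encoding issues (a sketch, assuming the Ollama server is running on its default port 11434; call_ollama_http is not part of the original code):

import json
import urllib.request

def call_ollama_http(context, query, model="llama3"):
    """Sketch: query the local Ollama server over HTTP instead of the CLI."""
    prompt = PROMPT_TEMPLATE.format(context=context, query=query)
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]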
def chat(min_similarity_score=0.5):
    print("Welcome to the RAG Chatbot. Type 'exit' to quit.")
    while True:
        user_input = input("You: ")
        if user_input.lower() == "exit":
            break

        # df, embeddings, and embedding_model are assumed to be pre-loaded and in scope
        top_k_indices, top_k_scores, top_k_chunks = find_top_k_similar_embeddings(
            df, embeddings, user_input, embedding_model, k=5
        )

        # Keep only the chunks whose similarity score clears the threshold
        kept_indices = []
        discarded_indices = []
        relevant_chunks = []
        for i in range(len(top_k_scores)):
            if top_k_scores[i].item() >= min_similarity_score:
                relevant_chunks.append(top_k_chunks[i])
                kept_indices.append(i + 1)  # 1-based rank for display
            else:
                discarded_indices.append(i + 1)

        if kept_indices:
            print(f"Kept chunk(s): {', '.join(map(str, kept_indices))}")
        if discarded_indices:
            print(f"Discarded chunk(s): {', '.join(map(str, discarded_indices))}")

        context = "\n".join(relevant_chunks) if relevant_chunks else "No relevant context found."

        # Get the response from Ollama via subprocess call
        result = call_ollama(context, user_input)
        print("Bot:", result)
Entering a query where no relevant chunks can be found will default back to normal LLM behaviour¶
chat()
Welcome to the RAG Chatbot. Type 'exit' to quit.
Top 5 most similar results:

Index: 1202
Score: 0.24690213799476624
Chunk:
About the Author Martin Kleppmann is a researcher in distributed systems at the University of Cam bridge, UK. Previously he was a software engineer and entrepreneur at internet com panies including LinkedIn and Rapportive, where he worked on large-scale data infrastructure. In the process he learned a few things the hard way, and he hopes this book will save you from repeating the same mistakes. Martin is a regular conference speaker, blogger, and open source contributor. He believes that profound technical ideas should be accessible to everyone, and that deeper understanding will help us develop better software. Colophon The animal on the cover of Designing Data-Intensive Applications is an Indian wild boar (Sus scrofa cristatus), a subspecies of wild boar found in India, Myanmar, Nepal, Sri Lanka, and Thailand. They are distinctive from European boars in that they have higher back bristles, no woolly undercoat, and a larger, straighter skull. The Indian wild boar has a coat of gray or black hair, with stiff bristles running along the spine. Males have protruding canine teeth (called tushes) that are used to fight with rivals or fend off predators.
--------------------------------------------------
Index: 1204
Score: 0.23008430004119873
Chunk:
Due to its aggression, it was depicted on the armor and weapons of Scandinavian, Germanic, and Anglo-Saxon warriors. In the Chinese zodiac, it symbolizes determination and impetuosity. Many of the animals on OReilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to animals.oreilly.com. The cover image is from Shaws Zoology. The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the font in diagrams is Adobe Myriad Pro; the heading font is Adobe Myriad Condensed; and the code font is Dal ton Maags Ubuntu Mono.
--------------------------------------------------
Index: 1144
Score: 0.20618656277656555
Chunk:
Welcome to Algorithmic Prison, theatlantic.com, February 20, 2014. 83 Don Peck: Theyre Watching You at Work, theatlantic.com, December 2013. 84 Leigh Alexander: Is an Algorithm Any Less Racist Than a Human? theguar dian.com, August 3, 2016. 85 Jesse Emspak: How a Machine Learns Prejudice, scientificamerican.com, December 29, 2016. 86 Maciej Cegowski: The Moral Economy of Tech, idlewords.com, June 2016. 87 Cathy ONeil: Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Publishing, 2016. ISBN: 978-0-553-41881-1 88 Julia Angwin: Make Algorithms Accountable, nytimes.com, August 1, 2016. 89 Bryce Goodman and Seth Flaxman: European Union Regulations on Algorith mic Decision-Making and a Right to Explanation, arXiv:1606.08813, August 31, 2016.
--------------------------------------------------
Index: 1203
Score: 0.20139464735984802
Chunk:
Males are larger than females, but the species aver ages 3335 inches tall at the shoulder and 200300 pounds in weight. Their natural predators include bears, tigers, and various big cats. These animals are nocturnal and omnivorousthey eat a wide variety of things, including roots, insects, carrion, nuts, berries, and small animals. Wild boars are also known to root through garbage and crop fields, causing a great deal of destruction and earning the enmity of farmers. They need to eat 4,0004,500 calories a day. Boars have a well-developed sense of smell, which helps them forage for underground plant material and burrowing animals. However, their eyesight is poor. Wild boars have long held significance in human culture. In Hindu lore, the boar is an avatar of the god Vishnu. In ancient Greek funerary monuments, it was a symbol of a gallant loser (in contrast to the victorious lion).
--------------------------------------------------
Index: 310
Score: 0.17519617080688477
Chunk:
44 Steve Vinoski: Convenience over Correctness, IEEE Internet Computing, vol ume 12, number 4, pages 8992, July 2008. doi:10.1109/ MIC.2008.75 142 | Chapter 4: Encoding and Evolution
--------------------------------------------------
Discarded chunk(s): 1, 2, 3, 4, 5
Exception in thread Thread-20 (_readerthread):
Traceback (most recent call last):
  File "C:\Users\steph\anaconda3\Lib\threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "C:\Users\steph\anaconda3\Lib\threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\steph\anaconda3\Lib\subprocess.py", line 1599, in _readerthread
    buffer.append(fh.read())
                  ^^^^^^^^^
  File "C:\Users\steph\anaconda3\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 440: character maps to <undefined>
Bot: A query about cats! Given that there is no relevant context, I'll provide a general and informative response.

**What are cats?**
Cats (Felis catus) are small, typically furry, carnivorous mammals. They belong to the family Felidae and are often considered the most popular pet in many parts of the world. There are over 70 recognized breeds of domestic cats, ranging from the tiny Singapura to the massive Maine Coon.

**Physical Characteristics**
Cats are characterized by their slender bodies, with a typical length of around 10-15 inches (25-38 cm) and a weight range of 8-12 pounds (3.5-5.4 kg). They have a short, smooth coat that can vary in color from pure white to deep black or any shade of gray, brown, or tabby. Their eyes are usually bright green, gold, or copper.

**Behavior and Habitat**
Cats are known for their independence, playfulness, and hunting skills. In the wild, they inhabit various environments, including forests, grasslands, and deserts. Domestic cats often adapt to human environments, such as homes, apartments, and farms. They are generally nocturnal, spending most of their time sleeping or resting during the day.

**Diet and Nutrition**
Cats are obligate carnivores, meaning they require a diet rich in protein from animal sources. Their natural diet consists of small mammals, birds, insects, and other invertebrates. In captivity, they are typically fed commercial cat food or homemade diets formulated to meet their nutritional needs.

**Health and Grooming**
Cats have unique grooming habits, using their tongues and claws to clean themselves. They require regular veterinary check-ups and vaccinations to maintain good health. Common health issues include urinary tract infections, dental problems, and obesity.

**History and Domestication**
The domestic cat has a long history dating back to ancient Egypt around 4,000 years ago. Cats were revered as sacred animals and often mummified for burial with their owners. Today, cats are popular pets worldwide, with many people enjoying the companionship and entertainment they provide.

I hope this general information about cats helps answer your query!
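The traceback in the transcript above comes from subprocess decoding Ollama's output: with text=True and no explicit encoding, Python falls back to the Windows locale code page (cp1252 here), which cannot decode some UTF-8 bytes the model emits. A minimal fix is to force the decoding (a sketch of the adjusted subprocess.run call inside call_ollama):

# Forcing UTF-8 decoding avoids the cp1252 UnicodeDecodeError seen above
result = subprocess.run(
    cmd, input=prompt, capture_output=True,
    text=True, encoding="utf-8", errors="replace", check=True,
)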
Entering a query where relevant chunks can be found will pass those chunks in LLM context¶
chat()
Welcome to the RAG Chatbot. Type 'exit' to quit.
Top 5 most similar results:

Index: 210
Score: 0.6181493997573853
Chunk:
Figure 3-9. Example of a star schema for use in a data warehouse. Usually, facts are captured as individual events, because this allows maximum flexi bility of analysis later. However, this means that the fact table can become extremely large. A big enterprise like Apple, Walmart, or eBay may have tens of petabytes of transaction history in its data warehouse, most of which is in fact tables 56. Some of the columns in the fact table are attributes, such as the price at which the product was sold and the cost of buying it from the supplier (allowing the profit mar gin to be calculated). Other columns in the fact table are foreign key references to other tables, called dimension tables. As each row in the fact table represents an event, the dimensions represent the who, what, where, when, how, and why of the event. For example, in Figure 3-9, one of the dimensions is the product that was sold. Each row in the dim_product table represents one type of product that is for sale, including 94 | Chapter 3: Storage and Retrieval
--------------------------------------------------
Index: 209
Score: 0.5844371914863586
Chunk:
More recently, a plethora of open source SQL-on- Hadoop projects have emerged; they are young but aiming to compete with commer cial data warehouse systems. These include Apache Hive, Spark SQL, Cloudera Impala, Facebook Presto, Apache Tajo, and Apache Drill 52, 53. Some of them are based on ideas from Googles Dremel 54. Stars and Snowflakes: Schemas for Analytics As explored in Chapter 2, a wide range of different data models are used in the realm of transaction processing, depending on the needs of the application. On the other hand, in analytics, there is much less diversity of data models. Many data warehouses are used in a fairly formulaic style, known as a star schema (also known as dimen sional modeling 55). The example schema in Figure 3-9 shows a data warehouse that might be found at a grocery retailer. At the center of the schema is a so-called fact table (in this example, it is called fact_sales). Each row of the fact table represents an event that occurred at a particular time (here, each row represents a customers purchase of a product). If we were analyzing website traffic rather than retail sales, each row might represent a page view or a click by a user.
--------------------------------------------------
Index: 211
Score: 0.5843772888183594
Chunk:
its stock-keeping unit (SKU), description, brand name, category, fat content, package size, etc. Each row in the fact_sales table uses a foreign key to indicate which prod uct was sold in that particular transaction. (For simplicity, if the customer buys sev eral different products at once, they are represented as separate rows in the fact table.) Even date and time are often represented using dimension tables, because this allows additional information about dates (such as public holidays) to be encoded, allowing queries to differentiate between sales on holidays and non-holidays. The name star schema comes from the fact that when the table relationships are visualized, the fact table is in the middle, surrounded by its dimension tables; the connections to these tables are like the rays of a star. A variation of this template is known as the snowflake schema, where dimensions are further broken down into subdimensions. For example, there could be separate tables for brands and product categories, and each row in the dim_product table could ref erence the brand and category as foreign keys, rather than storing them as strings in the dim_product table. Snowflake schemas are more normalized than star schemas, but star schemas are often preferred because they are simpler for analysts to work with 55. In a typical data warehouse, tables are often very wide: fact tables often have over 100 columns, sometimes several hundred 51. Dimension tables can also be very wide, as they include all the metadata that may be relevant for analysisfor example, the dim_store table may include details of which services are offered at each store, whether it has an in-store bakery, the square footage, the date when the store was first opened, when it was last remodeled, how far it is from the nearest highway, etc.
--------------------------------------------------
Index: 153
Score: 0.5779402256011963
Chunk:
An Advanced Course, edited by P. M. Stocker, P. M. D. Gray, and M. P. Atkinson, pages 1956, Cambridge University Press, 1984. ISBN: 978-0-521-25430-4 17 Charles W. Bachman: The Programmer as Navigator, Communications of the ACM, volume 16, number 11, pages 653658, November 1973. doi: 10.1145/355611.362534 18 Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton: Architecture of a Database System, Foundations and Trends in Databases, volume 1, number 2, pages 141259, November 2007. doi:10.1561/1900000002 19 Sandeep Parikh and Kelly Stirman: Schema Design for Time Series Data in MongoDB, blog.mongodb.org, October 30, 2013. 20 Martin Fowler: Schemaless Data Structures, martinfowler.com, January 7, 2013. 21 Amr Awadallah: Schema-on-Read vs. Schema-on-Write, at Berkeley EECS RAD Lab Retreat, Santa Cruz, CA, May 2009. 22 Martin Odersky:
--------------------------------------------------
Index: 98
Score: 0.5624552965164185
Chunk:
Its not possible to say in general which data model leads to simpler application code; it depends on the kinds of relationships that exist between data items. For highly interconnected data, the document model is awkward, the relational model is accept able, and graph models (see Graph-Like Data Models on page 49) are the most natural. Schema flexibility in the document model Most document databases, and the JSON support in relational databases, do not enforce any schema on the data in documents. XML support in relational databases usually comes with optional schema validation. No schema means that arbitrary keys and values can be added to a document, and when reading, clients have no guaran tees as to what fields the documents may contain. Document databases are sometimes called schemaless, but thats misleading, as the code that reads the data usually assumes some kind of structurei.e. , there is an implicit schema, but it is not enforced by the database 20. A more accurate term is schema-on-read (the structure of the data is implicit, and only interpreted when the data is read), in contrast with schema-on-write (the traditional approach of relational Relational Model Versus Document Model | 39
--------------------------------------------------
Kept chunk(s): 1, 2, 3, 4, 5

Bot: Based on the provided context, I'll outline some best practices for using the Star Schema in data modeling:

1. **Design for Analytics**: The Star Schema is particularly well-suited for analytics workloads. Design your fact table and dimension tables with analysis requirements in mind.
2. **Use Fact Tables to Store Events**: Each row in the fact table represents an event, such as a customer purchase or a page view. This allows for maximum flexibility in analysis later on.
3. **Dimension Tables Represent Contextual Information**: Dimension tables provide contextual information about each event, such as product details (e.g., SKU, description), date and time, and geographic location.
4. **Use Foreign Keys to Link Fact Table to Dimensions**: Each row in the fact table uses foreign keys to link to specific dimension tables, allowing for easy querying and analysis of related data.
5. **Keep Fact Tables Wide but Not Too Wide**: Aim for 100-200 columns in your fact table, as this allows for efficient storage and retrieval of event-level data without sacrificing query performance.
6. **Use Dimension Tables Wisely**: Dimension tables can become very wide if not managed carefully. Limit the number of columns to only those necessary for analysis, and consider breaking down large dimension tables into smaller, more focused ones.
7. **Schema Normalization**: While Snowflake Schemas are more normalized than Star Schemas, Star Schemas are often preferred because they are simpler for analysts to work with. Keep your fact table well-normalized by minimizing redundant data and using foreign keys judiciously.
8. **Data Modeling Best Practices**: Apply general data modeling best practices, such as:
   * Use meaningful column names and avoid ambiguous or misleading ones.
   * Ensure column types align with the data they store (e.g., date fields should be datetime).
   * Minimize redundant data and use foreign keys to link related tables.
   * Use indexes and constraints to ensure data integrity and query performance.

By following these best practices, you can effectively design and implement a Star Schema for your data warehouse or analytics system.