Creating a Local RAG (Retrieval Augmented Generation) System with PyTorch and Ollama¶
Modern large language models (LLMs) are limited to the information they were trained on. To overcome this limitation, additional "chunks" of information can be injected into the model's context window. Relevant chunks are found by computing cosine similarity scores between the user's query and the chunks stored in a vector index.
Key Steps:

- Consistent Embedding: It is crucial to use the same embedding model throughout the process to ensure accurate results.
- Vectorization and Similarity Calculation: The user query is vectorized, and cosine similarity is computed to identify the most relevant chunks. A score closer to 1 indicates higher similarity.
- Reranking: A reranking algorithm is often applied to refine the search results further.
In this project, Ollama is used to run the LLM locally, while a PyTorch tensor stores the embeddings in place of a traditional vector database. The vectorized user query is compared against every stored embedding (an exact nearest-neighbor search) to retrieve the top 5 most similar chunks, and only chunks with a similarity score above 0.5 are passed to the LLM to generate a detailed response.
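The core retrieval step can be illustrated in a few lines (a minimal sketch with random stand-in vectors; real chunk embeddings come from the loading code below):

import torch
from sentence_transformers import util

# Toy stand-ins: 3 chunk embeddings and 1 query embedding, unit-normalized
chunk_embeddings = torch.nn.functional.normalize(torch.randn(3, 768), dim=-1)
query_embedding = torch.nn.functional.normalize(torch.randn(1, 768), dim=-1)

# For unit-length vectors, the dot product is exactly the cosine similarity
scores = util.cos_sim(query_embedding, chunk_embeddings)[0]

# Rank all chunks, then keep only those above the 0.5 threshold
top_scores, top_indices = torch.topk(scores, k=3)
kept = [i.item() for i, s in zip(top_indices, top_scores) if s.item() >= 0.5]
print(kept)  # random vectors usually score near 0 and are discarded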
Imports¶
import pandas as pd
import numpy as np
import torch
from sentence_transformers import util, SentenceTransformer
import subprocess
Functions¶
def find_top_k_similar_embeddings(df, embeddings, query, embedding_model, k=5, device="cpu"):
    """
    Find the top k most similar embeddings to a given query and return the corresponding chunks.

    Parameters:
    - df (pandas.DataFrame): The DataFrame containing the embeddings and chunk information.
    - embeddings (torch.Tensor): A 2D tensor containing the embeddings.
    - query (str): The query string to compare against the embeddings.
    - embedding_model: The model used to embed the query.
    - k (int): The number of top similar embeddings to return. Default is 5.
    - device (str): The device to perform computations on ("cuda" or "cpu");
      it must match the device the embeddings tensor lives on.

    Returns:
    - top_k_indices (torch.Tensor): The indices of the top k similar embeddings.
    - top_k_scores (torch.Tensor): The similarity scores of the top k embeddings.
    - top_k_chunks (list): The corresponding chunks from the DataFrame.
    """
    # Step 1: Embed the query string using the same model that produced the stored embeddings
    query_embedding = embedding_model.encode(query, convert_to_tensor=True).to(device)

    # Step 2: Compute dot-product scores between the query and all embeddings
    # (equal to cosine similarity here because the embedding model outputs unit vectors)
    dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]

    # Step 3: Get the top k results
    top_k_scores, top_k_indices = torch.topk(dot_scores, k=k)

    # Step 4: Retrieve the corresponding chunks from the DataFrame
    top_k_chunks = df.iloc[top_k_indices.cpu().numpy()]["chunk"].tolist()

    # Print the results in order
    print(f"\nTop {k} most similar results:\n")
    for i in range(k):
        print(f"Index: {top_k_indices[i].item()}\nScore: {top_k_scores[i].item()}\nChunk:\n{top_k_chunks[i]}\n")
        print("-" * 50)

    return top_k_indices, top_k_scores, top_k_chunks
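A note on scoring: util.dot_score returns raw dot products, which equal cosine similarity only when the vectors are unit length. The all-mpnet-base-v2 model used below normalizes its output, so the two coincide here; if you swap in a model that does not, a drop-in replacement for Step 2 (a sketch, not part of the original pipeline) is:

# util.cos_sim normalizes both sides internally, so the scores are
# true cosine similarities regardless of the embedding model used
cos_scores = util.cos_sim(a=query_embedding, b=embeddings)[0]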
def load_and_prepare_embeddings(file_path, model_name_or_path="all-mpnet-base-v2"):
    """
    Load a CSV file containing embeddings, convert them to PyTorch tensors,
    and prepare the embedding model.

    Parameters:
    - file_path (str): The path to the CSV file containing the embeddings.
    - model_name_or_path (str): The name or path of the embedding model to load.

    Returns:
    - df (pandas.DataFrame): The DataFrame containing the original data with embeddings parsed into float arrays.
    - embeddings (torch.Tensor): A PyTorch tensor of embeddings.
    - embedding_model (SentenceTransformer): The embedding model loaded from SentenceTransformers.
    """
    # Load the CSV file into a DataFrame
    df = pd.read_csv(file_path)

    # Parse the stringified embedding column ("[0.1, 0.2, ...]") back into float arrays.
    # Note: np.fromstring with sep is deprecated; ast.literal_eval is a more future-proof parser.
    df["embeddings"] = df["embeddings"].apply(lambda x: np.fromstring(x.strip("[]"), sep=", "))

    # Convert the list of embeddings to a single 2D PyTorch tensor
    embeddings = torch.tensor(df["embeddings"].tolist(), dtype=torch.float32)

    # Set the device to GPU if available, otherwise CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    embeddings = embeddings.to(device)

    # Load the embedding model and move it to the same device
    embedding_model = SentenceTransformer(model_name_or_path, device=device)

    return df, embeddings, embedding_model
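For reference, the CSV loaded below is assumed to come from a separate preprocessing step that chunked the PDF and embedded each chunk. A minimal sketch of writing a compatible file, covering only the columns the loader needs (save_chunk_embeddings and its details are illustrative, not the original preprocessing code):

def save_chunk_embeddings(chunks, out_path, model_name="all-mpnet-base-v2"):
    """Embed a list of text chunks and save them in the CSV layout expected above."""
    model = SentenceTransformer(model_name)
    vectors = model.encode(chunks)  # numpy array of shape (n_chunks, 768)
    df_out = pd.DataFrame({
        "chunk": chunks,
        # Stored as "[0.1, 0.2, ...]" strings, which np.fromstring parses back
        "embeddings": [str(v.tolist()) for v in vectors],
    })
    df_out.to_csv(out_path, index=False)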
Read CSV File Containing Embeddings¶
# Set file path
file_path = r"C:\Users\steph\Desktop\RAG\output_csv\Book-2Designing-data-intensive-applications_processed.csv"
df, embeddings, embedding_model = load_and_prepare_embeddings(file_path)
display(df)
| | page_number | chunk | chunk_char_count | chunk_token_count | page_word_count | page_token_count | embeddings |
|---|---|---|---|---|---|---|---|
| 0 | 4 | 978-1-449-37332-0 LSI Designing Data-Intensive... | 805 | 144 | 250 | 313 | [0.02232709527015686, 0.019246932119131088, -0... |
| 1 | 4 | First Edition Revision History for the First E... | 1022 | 169 | 250 | 313 | [0.00605520186945796, -0.04391200467944145, -0... |
| 2 | 5 | Technology is a powerful force in our society.... | 403 | 76 | 64 | 76 | [0.021445734426379204, 0.09222898632287979, -0... |
| 3 | 7 | Computing is pop culture. ... Pop culture hold... | 397 | 82 | 71 | 82 | [-0.012340748682618141, 0.08791149407625198, -... |
| 4 | 9 | Table of Contents Preface. . . . . . . . . . .... | 682 | 179 | 248 | 270 | [-0.023812390863895416, -0.01836887188255787, ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1200 | 611 | uncertain (transaction status) (see in doubt) ... | 2615 | 521 | 350 | 521 | [0.01760762929916382, -0.026653055101633072, 0... |
| 1201 | 612 | windows (stream processing), 466, 468-472 infi... | 1647 | 373 | 239 | 373 | [-0.016259633004665375, 0.004297784063965082, ... |
| 1202 | 613 | About the Author Martin Kleppmann is a researc... | 1167 | 219 | 437 | 515 | [0.04322214052081108, 0.034118011593818665, -0... |
| 1203 | 613 | Males are larger than females, but the species... | 893 | 175 | 437 | 515 | [0.01628885790705681, 0.035985954105854034, -0... |
| 1204 | 613 | Due to its aggression, it was depicted on the ... | 605 | 121 | 437 | 515 | [0.05848781764507294, -0.007880114018917084, -... |

1205 rows × 7 columns
Test retrieving the top 5 chunks based on similarity score¶
# Set query string and perform similarity search
query = "data modelling best practices"
top_k_indices, top_k_scores, top_k_chunks = find_top_k_similar_embeddings(df, embeddings, query, embedding_model, k=5)
Top 5 most similar results:

Index: 78
Score: 0.5955010056495667
Chunk:
groups of peoplefor example, the engineers at the database vendor and the applica tion developers using their databaseto work together effectively. There are many different kinds of data models, and every data model embodies assumptions about how it is going to be used. Some kinds of usage are easy and some are not supported; some operations are fast and some perform badly; some data transformations feel natural and some are awkward. It can take a lot of effort to master just one data model (think how many books there are on relational data modeling). Building software is hard enough, even when work ing with just one data model and without worrying about its inner workings. But since the data model has such a profound effect on what the software above it can and cant do, its important to choose one that is appropriate to the application. In this chapter we will look at a range of general-purpose data models for data stor age and querying (point 2 in the preceding list). In particular, we will compare the relational model, the document model, and a few graph-based data models. We will also look at various query languages and compare their use cases. In Chapter 3 we will discuss how storage engines work; that is, how these data models are actually implemented (point 3 in the list).
--------------------------------------------------
Index: 98
Score: 0.5462720394134521
Chunk:
Its not possible to say in general which data model leads to simpler application code; it depends on the kinds of relationships that exist between data items. For highly interconnected data, the document model is awkward, the relational model is accept able, and graph models (see Graph-Like Data Models on page 49) are the most natural. Schema flexibility in the document model Most document databases, and the JSON support in relational databases, do not enforce any schema on the data in documents. XML support in relational databases usually comes with optional schema validation. No schema means that arbitrary keys and values can be added to a document, and when reading, clients have no guaran tees as to what fields the documents may contain. Document databases are sometimes called schemaless, but thats misleading, as the code that reads the data usually assumes some kind of structurei.e. , there is an implicit schema, but it is not enforced by the database 20. A more accurate term is schema-on-read (the structure of the data is implicit, and only interpreted when the data is read), in contrast with schema-on-write (the traditional approach of relational Relational Model Versus Document Model | 39
--------------------------------------------------
Index: 76
Score: 0.5414575338363647
Chunk:
CHAPTER 2 Data Models and Query Languages The limits of my language mean the limits of my world. Ludwig Wittgenstein, Tractatus Logico-Philosophicus (1922) Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on how we think about the problem that we are solving. Most applications are built by layering one data model on top of another. For each layer, the key question is: how is it represented in terms of the next-lower layer? For example: 1. As an application developer, you look at the real world (in which there are peo ple, organizations, goods, actions, money flows, sensors, etc.) and model it in terms of objects or data structures, and APIs that manipulate those data struc tures. Those structures are often specific to your application. 2. When you want to store those data structures, you express them in terms of a general-purpose data model, such as JSON or XML documents, tables in a rela tional database, or a graph model.
--------------------------------------------------
Index: 30
Score: 0.5116366147994995
Chunk:
But reality is not that simple. There are many database systems with different charac teristics, because different applications have different requirements. There are vari ous approaches to caching, several ways of building search indexes, and so on. When building an application, we still need to figure out which tools and which approaches are the most appropriate for the task at hand. And it can be hard to combine tools when you need to do something that a single tool cannot do alone. This book is a journey through both the principles and the practicalities of data sys tems, and how you can use them to build data-intensive applications. We will explore what different tools have in common, what distinguishes them, and how they achieve their characteristics. In this chapter, we will start by exploring the fundamentals of what we are trying to achieve: reliable, scalable, and maintainable data systems. Well clarify what those things mean, outline some ways of thinking about them, and go over the basics that we will need for later chapters. In the following chapters we will continue layer by layer, looking at different design decisions that need to be considered when working on a data-intensive application.
--------------------------------------------------
Index: 148
Score: 0.5060915946960449
Chunk:
The Datalog approach requires a different kind of thinking to the other query lan guages discussed in this chapter, but its a very powerful approach, because rules can be combined and reused in different queries. Its less convenient for simple one-off queries, but it can cope better if your data is complex. Summary Data models are a huge subject, and in this chapter we have taken a quick look at a broad variety of different models. We didnt have space to go into all the details of each model, but hopefully the overview has been enough to whet your appetite to find out more about the model that best fits your applications requirements. Historically, data started out being represented as one big tree (the hierarchical model), but that wasnt good for representing many-to-many relationships, so the relational model was invented to solve that problem. More recently, developers found that some applications dont fit well in the relational model either. New nonrelational NoSQL datastores have diverged in two main directions: 1. Document databases target use cases where data comes in self-contained docu ments and relationships between one document and another are rare. 2. Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything. All three models (document, relational, and graph) are widely used today, and each is good in its respective domain.
--------------------------------------------------
PROMPT_TEMPLATE = """
Given the following context:
{context}
User Query: {query}
Please provide a detailed response based on the above context.
"""
def call_ollama(context, query, model="llama3"):
    """
    Build the prompt using the context and user query,
    then call the Ollama CLI via subprocess to get the response.
    The prompt is passed via standard input.
    """
    prompt = PROMPT_TEMPLATE.format(context=context, query=query)

    # Construct the command; the prompt is piped in rather than passed as an argument
    cmd = ["ollama", "run", model]

    try:
        # Pass the prompt as standard input using the 'input' parameter.
        # Note: with text=True and no explicit encoding, the output is decoded with
        # the locale code page on Windows, which can fail on non-ASCII model output
        # (see the UnicodeDecodeError in the first chat transcript below).
        result = subprocess.run(cmd, input=prompt, capture_output=True, text=True, check=True)
        response = result.stdout.strip()
    except subprocess.CalledProcessError as e:
        response = f"Error calling Ollama: {e.stderr}"
    return response
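Alternatively, the subprocess call can be replaced with a request to Ollama's local HTTP API, which sidesteps CLI output parsing and console-encoding issues (a sketch, assuming the Ollama server is running on its default port 11434; call_ollama_http is not part of the original code):

import json
import urllib.request

def call_ollama_http(context, query, model="llama3"):
    """Sketch: query the local Ollama server over HTTP instead of the CLI."""
    prompt = PROMPT_TEMPLATE.format(context=context, query=query)
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]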
def chat(min_similarity_score=0.5):
    print("Welcome to the RAG Chatbot. Type 'exit' to quit.")
    while True:
        user_input = input("You: ")
        if user_input.lower() == "exit":
            break

        # df, embeddings, and embedding_model are assumed to be pre-loaded and in scope
        top_k_indices, top_k_scores, top_k_chunks = find_top_k_similar_embeddings(
            df, embeddings, user_input, embedding_model, k=5
        )

        # Keep only the chunks whose similarity score clears the threshold
        kept_indices = []
        discarded_indices = []
        relevant_chunks = []
        for i in range(len(top_k_scores)):
            if top_k_scores[i].item() >= min_similarity_score:
                relevant_chunks.append(top_k_chunks[i])
                kept_indices.append(i + 1)  # 1-based rank for display
            else:
                discarded_indices.append(i + 1)

        if kept_indices:
            print(f"Kept chunk(s): {', '.join(map(str, kept_indices))}")
        if discarded_indices:
            print(f"Discarded chunk(s): {', '.join(map(str, discarded_indices))}")

        context = "\n".join(relevant_chunks) if relevant_chunks else "No relevant context found."

        # Get the response from Ollama via subprocess call
        result = call_ollama(context, user_input)
        print("Bot:", result)
Entering a query where no relevant chunks can be found will default back to normal LLM behaviour¶
chat()
Welcome to the RAG Chatbot. Type 'exit' to quit.
Top 5 most similar results:

Index: 1202
Score: 0.24690213799476624
Chunk:
About the Author Martin Kleppmann is a researcher in distributed systems at the University of Cam bridge, UK. Previously he was a software engineer and entrepreneur at internet com panies including LinkedIn and Rapportive, where he worked on large-scale data infrastructure. In the process he learned a few things the hard way, and he hopes this book will save you from repeating the same mistakes. Martin is a regular conference speaker, blogger, and open source contributor. He believes that profound technical ideas should be accessible to everyone, and that deeper understanding will help us develop better software. Colophon The animal on the cover of Designing Data-Intensive Applications is an Indian wild boar (Sus scrofa cristatus), a subspecies of wild boar found in India, Myanmar, Nepal, Sri Lanka, and Thailand. They are distinctive from European boars in that they have higher back bristles, no woolly undercoat, and a larger, straighter skull. The Indian wild boar has a coat of gray or black hair, with stiff bristles running along the spine. Males have protruding canine teeth (called tushes) that are used to fight with rivals or fend off predators.
--------------------------------------------------
Index: 1204
Score: 0.23008430004119873
Chunk:
Due to its aggression, it was depicted on the armor and weapons of Scandinavian, Germanic, and Anglo-Saxon warriors. In the Chinese zodiac, it symbolizes determination and impetuosity. Many of the animals on OReilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to animals.oreilly.com. The cover image is from Shaws Zoology. The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the font in diagrams is Adobe Myriad Pro; the heading font is Adobe Myriad Condensed; and the code font is Dal ton Maags Ubuntu Mono.
--------------------------------------------------
Index: 1144
Score: 0.20618656277656555
Chunk:
Welcome to Algorithmic Prison, theatlantic.com, February 20, 2014. 83 Don Peck: Theyre Watching You at Work, theatlantic.com, December 2013. 84 Leigh Alexander: Is an Algorithm Any Less Racist Than a Human? theguar dian.com, August 3, 2016. 85 Jesse Emspak: How a Machine Learns Prejudice, scientificamerican.com, December 29, 2016. 86 Maciej Cegowski: The Moral Economy of Tech, idlewords.com, June 2016. 87 Cathy ONeil: Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Publishing, 2016. ISBN: 978-0-553-41881-1 88 Julia Angwin: Make Algorithms Accountable, nytimes.com, August 1, 2016. 89 Bryce Goodman and Seth Flaxman: European Union Regulations on Algorith mic Decision-Making and a Right to Explanation, arXiv:1606.08813, August 31, 2016.
--------------------------------------------------
Index: 1203
Score: 0.20139464735984802
Chunk:
Males are larger than females, but the species aver ages 3335 inches tall at the shoulder and 200300 pounds in weight. Their natural predators include bears, tigers, and various big cats. These animals are nocturnal and omnivorousthey eat a wide variety of things, including roots, insects, carrion, nuts, berries, and small animals. Wild boars are also known to root through garbage and crop fields, causing a great deal of destruction and earning the enmity of farmers. They need to eat 4,0004,500 calories a day. Boars have a well-developed sense of smell, which helps them forage for underground plant material and burrowing animals. However, their eyesight is poor. Wild boars have long held significance in human culture. In Hindu lore, the boar is an avatar of the god Vishnu. In ancient Greek funerary monuments, it was a symbol of a gallant loser (in contrast to the victorious lion).
--------------------------------------------------
Index: 310
Score: 0.17519617080688477
Chunk:
44 Steve Vinoski: Convenience over Correctness, IEEE Internet Computing, vol ume 12, number 4, pages 8992, July 2008. doi:10.1109/ MIC.2008.75 142 | Chapter 4: Encoding and Evolution
--------------------------------------------------
Discarded chunk(s): 1, 2, 3, 4, 5
Exception in thread Thread-20 (_readerthread):
Traceback (most recent call last):
  File "C:\Users\steph\anaconda3\Lib\threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "C:\Users\steph\anaconda3\Lib\threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\steph\anaconda3\Lib\subprocess.py", line 1599, in _readerthread
    buffer.append(fh.read())
                  ^^^^^^^^^
  File "C:\Users\steph\anaconda3\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 440: character maps to <undefined>
Bot: A query about cats! Given that there is no relevant context, I'll provide a general and informative response.

**What are cats?**
Cats (Felis catus) are small, typically furry, carnivorous mammals. They belong to the family Felidae and are often considered the most popular pet in many parts of the world. There are over 70 recognized breeds of domestic cats, ranging from the tiny Singapura to the massive Maine Coon.

**Physical Characteristics**
Cats are characterized by their slender bodies, with a typical length of around 10-15 inches (25-38 cm) and a weight range of 8-12 pounds (3.5-5.4 kg). They have a short, smooth coat that can vary in color from pure white to deep black or any shade of gray, brown, or tabby. Their eyes are usually bright green, gold, or copper.

**Behavior and Habitat**
Cats are known for their independence, playfulness, and hunting skills. In the wild, they inhabit various environments, including forests, grasslands, and deserts. Domestic cats often adapt to human environments, such as homes, apartments, and farms. They are generally nocturnal, spending most of their time sleeping or resting during the day.

**Diet and Nutrition**
Cats are obligate carnivores, meaning they require a diet rich in protein from animal sources. Their natural diet consists of small mammals, birds, insects, and other invertebrates. In captivity, they are typically fed commercial cat food or homemade diets formulated to meet their nutritional needs.

**Health and Grooming**
Cats have unique grooming habits, using their tongues and claws to clean themselves. They require regular veterinary check-ups and vaccinations to maintain good health. Common health issues include urinary tract infections, dental problems, and obesity.

**History and Domestication**
The domestic cat has a long history dating back to ancient Egypt around 4,000 years ago. Cats were revered as sacred animals and often mummified for burial with their owners. Today, cats are popular pets worldwide, with many people enjoying the companionship and entertainment they provide.

I hope this general information about cats helps answer your query!
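The traceback in the transcript above comes from subprocess decoding Ollama's output: with text=True and no explicit encoding, Python falls back to the Windows locale code page (cp1252 here), which cannot decode some UTF-8 bytes the model emits. A minimal fix is to force the decoding (a sketch of the adjusted subprocess.run call inside call_ollama):

# Forcing UTF-8 decoding avoids the cp1252 UnicodeDecodeError seen above
result = subprocess.run(
    cmd, input=prompt, capture_output=True,
    text=True, encoding="utf-8", errors="replace", check=True,
)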
Entering a query where relevant chunks can be found will pass those chunks in LLM context¶
chat()
Welcome to the RAG Chatbot. Type 'exit' to quit.
Top 5 most similar results:

Index: 210
Score: 0.6181493997573853
Chunk:
Figure 3-9. Example of a star schema for use in a data warehouse. Usually, facts are captured as individual events, because this allows maximum flexi bility of analysis later. However, this means that the fact table can become extremely large. A big enterprise like Apple, Walmart, or eBay may have tens of petabytes of transaction history in its data warehouse, most of which is in fact tables 56. Some of the columns in the fact table are attributes, such as the price at which the product was sold and the cost of buying it from the supplier (allowing the profit mar gin to be calculated). Other columns in the fact table are foreign key references to other tables, called dimension tables. As each row in the fact table represents an event, the dimensions represent the who, what, where, when, how, and why of the event. For example, in Figure 3-9, one of the dimensions is the product that was sold. Each row in the dim_product table represents one type of product that is for sale, including 94 | Chapter 3: Storage and Retrieval
--------------------------------------------------
Index: 209
Score: 0.5844371914863586
Chunk:
More recently, a plethora of open source SQL-on- Hadoop projects have emerged; they are young but aiming to compete with commer cial data warehouse systems. These include Apache Hive, Spark SQL, Cloudera Impala, Facebook Presto, Apache Tajo, and Apache Drill 52, 53. Some of them are based on ideas from Googles Dremel 54. Stars and Snowflakes: Schemas for Analytics As explored in Chapter 2, a wide range of different data models are used in the realm of transaction processing, depending on the needs of the application. On the other hand, in analytics, there is much less diversity of data models. Many data warehouses are used in a fairly formulaic style, known as a star schema (also known as dimen sional modeling 55). The example schema in Figure 3-9 shows a data warehouse that might be found at a grocery retailer. At the center of the schema is a so-called fact table (in this example, it is called fact_sales). Each row of the fact table represents an event that occurred at a particular time (here, each row represents a customers purchase of a product). If we were analyzing website traffic rather than retail sales, each row might represent a page view or a click by a user.
--------------------------------------------------
Index: 211
Score: 0.5843772888183594
Chunk:
its stock-keeping unit (SKU), description, brand name, category, fat content, package size, etc. Each row in the fact_sales table uses a foreign key to indicate which prod uct was sold in that particular transaction. (For simplicity, if the customer buys sev eral different products at once, they are represented as separate rows in the fact table.) Even date and time are often represented using dimension tables, because this allows additional information about dates (such as public holidays) to be encoded, allowing queries to differentiate between sales on holidays and non-holidays. The name star schema comes from the fact that when the table relationships are visualized, the fact table is in the middle, surrounded by its dimension tables; the connections to these tables are like the rays of a star. A variation of this template is known as the snowflake schema, where dimensions are further broken down into subdimensions. For example, there could be separate tables for brands and product categories, and each row in the dim_product table could ref erence the brand and category as foreign keys, rather than storing them as strings in the dim_product table. Snowflake schemas are more normalized than star schemas, but star schemas are often preferred because they are simpler for analysts to work with 55. In a typical data warehouse, tables are often very wide: fact tables often have over 100 columns, sometimes several hundred 51. Dimension tables can also be very wide, as they include all the metadata that may be relevant for analysisfor example, the dim_store table may include details of which services are offered at each store, whether it has an in-store bakery, the square footage, the date when the store was first opened, when it was last remodeled, how far it is from the nearest highway, etc.
--------------------------------------------------
Index: 153
Score: 0.5779402256011963
Chunk:
An Advanced Course, edited by P. M. Stocker, P. M. D. Gray, and M. P. Atkinson, pages 1956, Cambridge University Press, 1984. ISBN: 978-0-521-25430-4 17 Charles W. Bachman: The Programmer as Navigator, Communications of the ACM, volume 16, number 11, pages 653658, November 1973. doi: 10.1145/355611.362534 18 Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton: Architecture of a Database System, Foundations and Trends in Databases, volume 1, number 2, pages 141259, November 2007. doi:10.1561/1900000002 19 Sandeep Parikh and Kelly Stirman: Schema Design for Time Series Data in MongoDB, blog.mongodb.org, October 30, 2013. 20 Martin Fowler: Schemaless Data Structures, martinfowler.com, January 7, 2013. 21 Amr Awadallah: Schema-on-Read vs. Schema-on-Write, at Berkeley EECS RAD Lab Retreat, Santa Cruz, CA, May 2009. 22 Martin Odersky:
--------------------------------------------------
Index: 98
Score: 0.5624552965164185
Chunk:
Its not possible to say in general which data model leads to simpler application code; it depends on the kinds of relationships that exist between data items. For highly interconnected data, the document model is awkward, the relational model is accept able, and graph models (see Graph-Like Data Models on page 49) are the most natural. Schema flexibility in the document model Most document databases, and the JSON support in relational databases, do not enforce any schema on the data in documents. XML support in relational databases usually comes with optional schema validation. No schema means that arbitrary keys and values can be added to a document, and when reading, clients have no guaran tees as to what fields the documents may contain. Document databases are sometimes called schemaless, but thats misleading, as the code that reads the data usually assumes some kind of structurei.e. , there is an implicit schema, but it is not enforced by the database 20. A more accurate term is schema-on-read (the structure of the data is implicit, and only interpreted when the data is read), in contrast with schema-on-write (the traditional approach of relational Relational Model Versus Document Model | 39
--------------------------------------------------
Kept chunk(s): 1, 2, 3, 4, 5

Bot: Based on the provided context, I'll outline some best practices for using the Star Schema in data modeling:

1. **Design for Analytics**: The Star Schema is particularly well-suited for analytics workloads. Design your fact table and dimension tables with analysis requirements in mind.
2. **Use Fact Tables to Store Events**: Each row in the fact table represents an event, such as a customer purchase or a page view. This allows for maximum flexibility in analysis later on.
3. **Dimension Tables Represent Contextual Information**: Dimension tables provide contextual information about each event, such as product details (e.g., SKU, description), date and time, and geographic location.
4. **Use Foreign Keys to Link Fact Table to Dimensions**: Each row in the fact table uses foreign keys to link to specific dimension tables, allowing for easy querying and analysis of related data.
5. **Keep Fact Tables Wide but Not Too Wide**: Aim for 100-200 columns in your fact table, as this allows for efficient storage and retrieval of event-level data without sacrificing query performance.
6. **Use Dimension Tables Wisely**: Dimension tables can become very wide if not managed carefully. Limit the number of columns to only those necessary for analysis, and consider breaking down large dimension tables into smaller, more focused ones.
7. **Schema Normalization**: While Snowflake Schemas are more normalized than Star Schemas, Star Schemas are often preferred because they are simpler for analysts to work with. Keep your fact table well-normalized by minimizing redundant data and using foreign keys judiciously.
8. **Data Modeling Best Practices**: Apply general data modeling best practices, such as:
   * Use meaningful column names and avoid ambiguous or misleading ones.
   * Ensure column types align with the data they store (e.g., date fields should be datetime).
   * Minimize redundant data and use foreign keys to link related tables.
   * Use indexes and constraints to ensure data integrity and query performance.

By following these best practices, you can effectively design and implement a Star Schema for your data warehouse or analytics system.