CS377: Database Design - Vector Databases (100 Points)

Assignment Goals

The goals of this assignment are:
  1. To implement a program to create and populate a vector database using fundamental programming constructs and a flat file data store

Background Reading and References

Please refer to the following readings and examples offering templates to help get you started:

The Assignment

In this assignment, you will create and use a vector database for an application domain of your choosing, and use it to make recommendations by finding nearest neighbors to your query vector of features.

Using a Flat File to Store Vector Data

You can use a database, grid, or other structure to store vector data; however, for simplicity, we will use a flat file using the json library in Python. Flat files are not efficient for storage or retrieval, but will simplify our programming. We will discuss methods to more efficiently store and search data during the course.

The following Collection class will simulate a database of vector data by creating an array of dictionary objects; each dictionary will contain the vector of features. The dimension of the collection is equal to the length of the features vector (in other words, if each item in the collection has 7 features, then the dimension of the collection is 7).

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import json

class Collection:     
    def __init__(self, name, dimension):
        self.name = name
        self.dimension = dimension
        self.data = []
        
    def upsert(self, vectors, ids, metadata):
        for vector, id_, meta in zip(vectors, ids, metadata):
            # find index if id already exists
            index = next((index for (index, d) in enumerate(self.data) if d['id'] == id_), None)
            if index is None:
                # insert new item
                self.data.append({'id': id_, 'vector': vector, 'metadata': meta})
            else:
                # update existing item
                self.data[index]['vector'] = vector
                self.data[index]['metadata'] = meta
 
    def search(self, query_vectors, top_k):
        query_vectors = np.array(query_vectors)
        db_vectors = np.array([d['vector'] for d in self.data])
        scores = cosine_similarity(query_vectors, db_vectors)
        top_indices = scores.argsort()[0][::-1][:top_k]

        # return a list of objects representing the ID and score of each item in the search result
        class DictObj(object):
          def __init__(self, _id, _score):
            self.id = _id
            self.score = _score
            
        return [DictObj(self.data[i]['id'], scores[0][i]) for i in top_indices]

    def save(self, filename):
        with open(f'{filename}.json', 'w') as f:
            json.dump(self.data, f)

    @classmethod
    def load(cls, filename, dimension):
        c = cls(filename, dimension)
        with open(f'{filename}.json', 'r') as f:
            c.data = json.load(f)
        return c

You can instantiate one of these collections as follows, and then you can call upsert and search as you normally would.

restaurants = Collection(name="restaurants", dimension=7)

Part 1: Porting the Restaurants Tutorial

Using the tutorial example from class, set up the restaurants example using a flat file store above, and verify that you are able to query for a given feature vector.

Part 2: Implementing Your Own Vector Database and Making Recommendations with User Input

Be creative here! Modify your program to use feature vectors of your choice. You could search for videos, houses, books, or anything else that you can describe semantically using feature vectors. Create an initial vector database of sample data by reading vector features from a CSV file with well-described headers (so that another person could create sample data for your program!). Then, prompt the user within your program to enter their preferences, and search and return the top k nearest matches.

Part 3: Improving Search Performance Using Approximate Nearest Neighbors (ANN)

Modify the Collection class to support ANN for increased search efficiency over using the cosine similarity metric. Use the example from our class notes to get started with sample code demonstrating the usage of ANN via the annoy library in Python.

Modifying the Flat File

To save and load this into your flat file, you can modify the save and load functions of the Collection class as follows:

    def save(self, filename):
        self.annoy_index.save(f'{filename}.ann')
        with open(f'{filename}.json', 'w') as f:
            json.dump(self.data, f)

    @classmethod
    def load(cls, filename, dimension):
        c = cls(filename, dimension)
        c.annoy_index.load(f'{filename}.ann')
        with open(f'{filename}.json', 'r') as f:
            c.data = json.load(f)
        c.count = len(c.data)
        return c

Passing the Right Input to get_nns_by_vector

When you call annoy_index.get_nns_by_vector, the annoy library assumes that you are passing a single vector of size dimension. The cosine_similarity function takes a 2-dimensional array in which the first item is an array of size dimension. You can simply pass query_vectors[0] to annoy_index.get_nns_by_vector instead of query_vectors to remedy this.

Calculating a Score for each Nearest Neighbor

get_nns_by_vector does not return the scores, but you can obtain distances which you can insert into your result object for each score by calling annoy_index.get_distance(i, j) to get the distance between each item (i) in the return set of get_nns_by_vector and the query_vectors[0] input (j). Since the score is inversely proportional to the distance (that is, higher distances imply a lower score), you can set the score to 1/distance for each distance.

Submission

In your submission, please include answers to any questions asked on the assignment page in your README file. If you wrote code as part of this assignment, please describe your design, approach, and implementation in your README file as well. Finally, include answers to the following questions:
  • Describe what you did, how you did it, what challenges you encountered, and how you solved them.
  • Please answer any questions found throughout the narrative of this assignment.
  • If collaboration with a buddy was permitted, did you work with a buddy on this assignment? If so, who? If not, do you certify that this submission represents your own original work?
  • Please identify any and all portions of your submission that were not originally written by you (for example, code originally written by your buddy, or anything taken or adapted from a non-classroom resource). It is always OK to use your textbook and instructor notes; however, you are certifying that any portions not designated as coming from an outside person or source are your own original work.
  • Approximately how many hours it took you to finish this assignment (I will not judge you for this at all...I am simply using it to gauge if the assignments are too easy or hard)?
  • Your overall impression of the assignment. Did you love it, hate it, or were you neutral? One word answers are fine, but if you have any suggestions for the future let me know.
  • Using the grading specifications on this page, discuss briefly the grade you would give yourself and why. Discuss each item in the grading specification.
  • Any other concerns that you have. For instance, if you have a bug that you were unable to solve but you made progress, write that here. The more you articulate the problem the more partial credit you will receive (it is fine to leave this blank).

Assignment Rubric

Description Pre-Emerging (< 50%) Beginning (50%) Progressing (85%) Proficient (100%)
Algorithm Implementation (60%) The algorithm fails on the test inputs due to major issues, or the program fails to compile and/or run The algorithm fails on the test inputs due to one or more minor issues The algorithm is implemented to solve the problem correctly according to given test inputs, but would fail if executed in a general case due to a minor issue or omission in the algorithm design or implementation A reasonable algorithm is implemented to solve the problem which correctly solves the problem according to the given test inputs, and would be reasonably expected to solve the problem in the general case
Code Quality and Documentation (30%) Code commenting and structure are absent, or code structure departs significantly from best practice, and/or the code departs significantly from the style guide Code commenting and structure is limited in ways that reduce the readability of the program, and/or there are minor departures from the style guide Code documentation is present that re-states the explicit code definitions, and/or code is written that mostly adheres to the style guide Code is documented at non-trivial points in a manner that enhances the readability of the program, and code is written according to the style guide
Writeup and Submission (10%) An incomplete submission is provided The program is submitted, but not according to the directions in one or more ways (for example, because it is lacking a readme writeup or missing answers to written questions) The program is submitted according to the directions with a minor omission or correction needed, including a readme writeup describing the solution and answering nearly all questions posed in the instructions The program is submitted according to the directions, including a readme writeup describing the solution and answering all questions posed in the instructions

Please refer to the Style Guide for code quality examples and guidelines.