View on GitHub

CS377

Database Design

CS377: Database Design - Indexing a Denormalized NoSQL Datastore

Activity Goals

The goals of this activity are:

To create an index in a NoSQL database

Supplemental Reading

Feel free to visit these resources for supplemental background reading material.

The Activity

Directions

Consider the activity models and answer the questions provided. First reflect on these questions on your own briefly, before discussing and comparing your thoughts with your group. Appoint one member of your group to discuss your findings with the class, and the rest of the group should help that member prepare their response. Answer each question individually from the activity, and compare with your group to prepare for our whole-class discussion. After class, think about the questions in the reflective prompt and respond to those individually in your notebook. Report out on areas of disagreement or items for which you and your group identified alternative approaches. Write down and report out questions you encountered along the way for group discussion.

Model 1: Indexes in NoSQL databases

1

2

3

4

5

6

7

8

# TODO: obtain the collection as usual
# ... in a variable called collection
 
collection.ensureIndex({"airport": 1})
 
# you can have more than one index 
# ... depending on the values you intend to query together!
collection.ensureIndex({"student": 1, "grade": -1})

Questions

Look up which sort order is specified by 1 and which by -1.
Insert many records with a loop into a MongoDB database, and query it with and without an index. What speedup do you observe for varying numbers of N documents in your collection? You can use the .explain() method on the cursor returned by your call to find() to obtain information about the time required to execute the query, as well as whether an index (BtreeCursor) was used or not (BasicCursor).

Indexing a Denormalized NoSQL Datastore

In this report, we will explore the concept of indexing a denormalized NoSQL datastore. Denormalization is a technique used to optimize data retrieval in NoSQL databases by reducing joins and improving query performance. However, without proper indexing, denormalized data can result in slower queries and increased storage requirements. Indexing plays a crucial role in optimizing query performance in denormalized NoSQL datastores.

Indexing in NoSQL Datastores

NoSQL databases are designed for high scalability and performance, and they employ various indexing techniques to improve query efficiency. In a denormalized NoSQL datastore, where data is intentionally duplicated to avoid joins, indexing becomes even more important to ensure fast query execution.

Types of Indexes

There are several types of indexes commonly used in NoSQL databases, including:

B-tree indexes: B-tree indexes are widely used in NoSQL databases due to their efficient performance characteristics. They allow efficient range queries and support sorting. B-tree indexes can be used for both primary and secondary key-based lookups.
Hash indexes: Hash indexes use a hashing function to map keys to their corresponding values. They are fast for exact match queries but do not support range queries or sorting.
Bitmap indexes: Bitmap indexes are well-suited for databases with low cardinality attributes. They maintain a bitmap for each possible attribute value, allowing for efficient filtering and aggregations.

Indexing Strategies for Denormalized Datastores

In denormalized NoSQL datastores, where data duplication is intentionally performed, indexes should be carefully chosen and created to optimize query performance. Here are some indexing strategies commonly used in denormalized datastores:

Composite indexes: Composite indexes combine multiple attributes into a single index, enabling efficient querying on multiple attributes simultaneously. This can significantly improve performance in denormalized datastores.
Covering indexes: Covering indexes store all the columns required by a query in the index itself, eliminating the need to access the actual data rows. This helps reduce disk I/O and improves query performance.
Partial indexes: Partial indexes index a subset of the data to reduce the size of the index and improve query performance. This can be useful when working with large denormalized datastores.

Python Examples

Creating an Index in MongoDB

import pymongo

# Connect to MongoDB
client = pymongo.MongoClient("mongodb://localhost:27017")

# Get reference to database and collection
db = client["mydatabase"]
collection = db["mycollection"]

# Create index on "name" attribute
collection.create_index("name")

# Create compound index on "name" and "age" attributes
collection.create_index([("name", pymongo.ASCENDING), ("age", pymongo.ASCENDING)])

Creating an Index in Apache Cassandra

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Connect to Cassandra
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# Create index on "name" attribute
create_index_query = "CREATE INDEX ON mykeyspace.mytable (name)"
session.execute(create_index_query)

# Create compound index on "name" and "age" attributes
create_compound_index_query = "CREATE INDEX ON mykeyspace.mytable (name, age)"
session.execute(create_compound_index_query)

Submission

I encourage you to submit your answers to the questions (and ask your own questions!) using the Class Activity Questions discussion board. You may also respond to questions or comments made by others, or ask follow-up questions there. Answer any reflective prompt questions in the Reflective Journal section of your OneNote Classroom personal section. You can find the link to the class notebook on the syllabus.