CS377: Database Design - Aggregation with NoSQL

Activity Goals

The goals of this activity are:

To define functions to aggregate values in a MongoDB data store

The Activity

Directions

Consider the activity models and answer the questions provided. First reflect on these questions on your own briefly, before discussing and comparing your thoughts with your group. Appoint one member of your group to discuss your findings with the class, and the rest of the group should help that member prepare their response. Answer each question individually from the activity, and compare with your group to prepare for our whole-class discussion. After class, think about the questions in the reflective prompt and respond to those individually in your notebook. Report out on areas of disagreement or items for which you and your group identified alternative approaches. Write down and report out questions you encountered along the way for group discussion.

Model 1: Aggregation Methods

	##########################################################################
	# THE MICRO MONGODB+PYTHON BOOK
	#
	# A simple adaptation to Python + summary of
	# The Little MongoDB Book: https://github.com/karlseguin/the-little-mongodb-book
	#
	# javier arias losada
	# twitter: @javier_arilos

	# Usage suggestion: run line by line bu using pydemo project (https://github.com/pablito56/pydemo)
	# You will need a MongoDB instance up and running.
	# >> pip install pymongo
	# >> pip install git+https://github.com/pablito56/pydemo.git
	# >> pydemo micro-pymongo.py

	# ******** Chapter 1 - The Basics:
	# *
	# * https://github.com/karlseguin/the-little-mongodb-book/blob/master/en/mongodb.markdown#chapter-1---the-basics
	#
	# MongoDB (from "humongous") is an open-source document database, (...) MongoDB features:
	# Document-Oriented: JSON-style, with dynamic schemas.
	# Full Index Support: Index on any attribute.
	# Replication & High Availability: Across LANs and WANs.
	# Auto-Sharding: Scale horizontally.
	# Querying: Rich, document-based queries.
	# Fast In-Place Updates: Atomic modifiers.
	# Map/Reduce: Flexible aggregation and data processing.
	# GridFS: Store files of any size.
	# MongoDB Management: Monitoring and backup.

	# databases >> collections >> documents >> fields

	# Pymongo + MongoDB make it very easy to work:
	# create a client
	import pymongo
	from datetime import datetime
	client = pymongo.MongoClient()

	# use micro database
	db = client['micro-mongodb-python-book']

	# insert a document (a Python dict) into unicors collection
	db.unicorns.insert({'name': 'Aurora', 'gender': 'f', 'weight': 450})

	# both database and collection are, lazy initialized if do not exist

	# mongodb is schemaless, we can insert a document with different structure
	db.unicorns.insert({'client_id': 'Mickey', 'specie': 'mouse', 'weight': 0.450})

	# find returns a Cursor object, we can iterate over it.
	for doc in db.unicorns.find():
	print doc

	# Querying data and selectors:
	# remove documents in the collection
	db.unicorns.remove()

	# First, insert some unicorns' data for later querying
	db.unicorns.insert({'name': 'Horny',
	'dob': datetime(1992,2,13,7,47),
	'loves': ['carrot','papaya'],
	'weight': 600,
	'gender': 'm',
	'vampires': 63});
	db.unicorns.insert({'name': 'Aurora',
	'dob': datetime(1991, 10, 24, 13, 0),
	'loves': ['carrot', 'grape'],
	'weight': 450,
	'gender': 'f',
	'vampires': 43});
	db.unicorns.insert({'name': 'Unicrom',
	'dob': datetime(1973, 1, 9, 22, 10),
	'loves': ['energon', 'redbull'],
	'weight': 984,
	'gender': 'm',
	'vampires': 182});
	db.unicorns.insert({'name': 'Roooooodles',
	'dob': datetime(1979, 7, 18, 18, 44),
	'loves': ['apple'],
	'weight': 575,
	'gender': 'm',
	'vampires': 99});
	db.unicorns.insert({'name': 'Solnara',
	'dob': datetime(1985, 6, 4, 2, 1),
	'loves':['apple', 'carrot',
	'chocolate'],
	'weight':550,
	'gender':'f',
	'vampires':80});
	db.unicorns.insert({'name':'Ayna',
	'dob': datetime(1998, 2, 7, 8, 30),
	'loves': ['strawberry', 'lemon'],
	'weight': 733,
	'gender': 'f',
	'vampires': 40});
	db.unicorns.insert({'name':'Kenny',
	'dob': datetime(1997, 6, 1, 10, 42),
	'loves': ['grape', 'lemon'],
	'weight': 690,
	'gender': 'm',
	'vampires': 39});
	db.unicorns.insert({'name': 'Raleigh',
	'dob': datetime(2005, 4, 3, 0, 57),
	'loves': ['apple', 'sugar'],
	'weight': 421,
	'gender': 'm',
	'vampires': 2});
	db.unicorns.insert({'name': 'Leia',
	'dob': datetime(2001, 9, 8, 14, 53),
	'loves': ['apple', 'watermelon'],
	'weight': 601,
	'gender': 'f',
	'vampires': 33});
	db.unicorns.insert({'name': 'Pilot',
	'dob': datetime(1997, 2, 1, 5, 3),
	'loves': ['apple', 'watermelon'],
	'weight': 650,
	'gender': 'm',
	'vampires': 54});
	db.unicorns.insert({'name': 'Nimue',
	'dob': datetime(1999, 11, 20, 16, 15),
	'loves': ['grape', 'carrot'],
	'weight': 540,
	'gender': 'f'});
	db.unicorns.insert({'name': 'Dunx',
	'dob': datetime(1976, 6, 18, 18, 18),
	'loves': ['grape', 'watermelon'],
	'weight': 704,
	'gender': 'm',
	'vampires': 165});
	# Inserted some unicorns' data for later querying

	# MongoDB selectors are the equivalent to WHERE clauses in relational DB's
	# a selector is an object instructing how to match objects in the collection

	# find all male unicorns
	for doc in db.unicorns.find({'gender': 'm'}):
	print doc

	# find all male unicorns that weigh more than 700 pounds
	for doc in db.unicorns.find({'gender': 'm', 'weight': {'$gt': 700}}):
	print doc


	# female unicorns which either love apples or oranges or weigh less than 500 pounds
	cursor = db.unicorns.find({'gender': 'f', '$or': [{'loves': 'apple'}, {'loves': 'orange'}, {'weight': {'$lt': 500}}]})
	for doc in cursor:
	print doc

	# ******** Chapter 2 - Update:
	# *
	# * https://github.com/karlseguin/the-little-mongodb-book/blob/master/en/mongodb.markdown#chapter-2---updating
	#

	# We have now many unicorn documents, let's update one of them:
	db.unicorns.find_one({'name': 'Roooooodles'})

	# Update takes a minimum of two arguments, one Selector object to match and an update object
	db.unicorns.update({'name': 'Roooooodles'}, {'weight': 590})

	# Let's check the result
	db.unicorns.find_one({'name': 'Roooooodles'})
	# No results? How that can be? Check the result of the following

	db.unicorns.find_one({'weight': 590})
	# Update, by default will replace the matched document with the new document provided

	# How can we update just some fields and keep the rest unchanged? $set operator
	# Let's recover our wrongly updated unicorn:
	db.unicorns.update({'weight': 590}, {'$set': {
	'name': 'Roooooodles',
	'dob': datetime(1979, 7, 18, 18, 44),
	'loves': ['apple'],
	'gender': 'm',
	'vampires': 99}})

	db.unicorns.find_one({'name': 'Roooooodles'})
	# In all previous update operations, the Roooooodles' document _id is the same

	# Other update modifiers:
	# $inc for incrementing/decrementing a numeric field
	db.unicorns.update({'name': 'Roooooodles'}, {'$inc': {'vampires': -2}})
	db.unicorns.find_one({'name': 'Roooooodles'})

	# $push for adding element to an array field
	db.unicorns.update({'name': 'Roooooodles'}, {'$push': {'loves': 'carrot'}})

	# update modifiers can be combined to do many operations on a single operation
	db.unicorns.update({'name': 'Roooooodles'}, {'$push': {'loves': 'potato'}, '$inc': {'vampires': 6}})
	db.unicorns.find_one({'name': 'Roooooodles'})

	# many push and inc can on multiple elements be done in a single operation
	db.unicorns.update({'name': 'Roooooodles'}, {'$push': {'loves': {'$each': ['apple', 'durian']}}, '$inc': {'vampires': 6, 'eggs': 1}})
	db.unicorns.find_one({'name': 'Roooooodles'})

	# mongodb supports Upserts (Update or Insert)
	# Let's count hits on our application pages
	db.hits.update({'page': 'unicorns'}, {'$inc': {'hits': 1}})
	db.hits.find_one()

	db.hits.update({'page': 'unicorns'}, {'$inc': {'hits': 1}}, upsert=True)

	db.hits.find_one()

	# By default, update modifies just one document
	db.unicorns.update({}, {'$set': {'vaccinated': True}});
	for unicorn in db.unicorns.find({'vaccinated': True}):
	print unicorn

	# multiple updates can be applied in a single operation, by setting named parameter multi=True
	db.unicorns.update({}, {'$set': {'vaccinated': True}}, multi=True);
	for unicorn in db.unicorns.find({'vaccinated': True}).limit(5):
	print unicorn

	# ******** Chapter 3 - Mastering Find:
	# *
	# * https://github.com/karlseguin/the-little-mongodb-book/blob/master/en/mongodb.markdown#chapter-3---mastering-find
	#

	# We can ask mongodb to return just some fields when selecting documents
	# _id is included by default unless we explicitly exclude it
	# the rest of fields are not included by default
	for doc in db.unicorns.find({}, {'_id': 0, 'name': 1}):
	print doc

	# Sorting documents
	for doc in db.unicorns.find({}, {'name': 1, 'weight': 1}).sort([('name', pymongo.ASCENDING), ('vampires', pymongo.DESCENDING)]):
	print doc

	# Paging results
	for doc in db.unicorns.find({}).sort([('name', pymongo.ASCENDING)]).skip(3).limit(2):
	print doc

	# Counting results
	db.unicorns.find({'vampires': {'$gt': 50}}).count()

	# ******** Chapter 4 - Data Modelling:
	# *
	# * https://github.com/karlseguin/the-little-mongodb-book/blob/master/en/mongodb.markdown#chapter-4---data-modeling
	#

	# No joins are available in MongoDB.
	# Let's going to create three employees: Leto, Duncan and Moneo. Leto is the manager.
	from bson.objectid import ObjectId
	db.employees.insert({'_id': ObjectId('4d85c7039ab0fd70a117d730'), 'name': 'Leto'})
	db.employees.insert({'_id': ObjectId('4d85c7039ab0fd70a117d731'), 'name': 'Duncan',
	'manager': ObjectId('4d85c7039ab0fd70a117d730')});
	db.employees.insert({'_id': ObjectId('4d85c7039ab0fd70a117d732'), 'name': 'Moneo',
	'manager': ObjectId('4d85c7039ab0fd70a117d730')});

	for employee in db.employees.find({'manager': ObjectId('4d85c7039ab0fd70a117d730')}):
	print employee

	# This relationships can be stored in arrays to model 1:N relationships
	# Let's supose some employees can have many managers
	db.employees.insert({'_id': ObjectId('4d85c7039ab0fd70a117d733'), 'name': 'Siona',
	'manager': [ObjectId('4d85c7039ab0fd70a117d730'),
	ObjectId('4d85c7039ab0fd70a117d732')]
	})

	# The previous query still works
	for employee in db.employees.find({'manager': ObjectId('4d85c7039ab0fd70a117d730')}):
	print employee

	# We can also model by nesting documents inside documents:
	db.employees.insert({'_id': ObjectId('4d85c7039ab0fd70a117d734'), 'name': 'Ghanima',
	'family': {'mother': 'Chani',
	'father': 'Paul',
	'brother': ObjectId('4d85c7039ab0fd70a117d730')}})

	# Embedded documents can be queried with dot-notation:
	for employee in db.employees.find({'family.mother': 'Chani'}):
	print employee

	# DBRef objects allow to reference a collection and document from our document.
	# Some drivers may get automatically the referenced object. Pymongo does so.

	# Want to see the employee => manager example with DBRef? https://gist.github.com/javierarilos/d5fce5d50b1b9b8e2784

	# ******** Chapter 5 - When to use MongoDB:
	# *
	# * https://github.com/karlseguin/the-little-mongodb-book/blob/master/en/mongodb.markdown#chapter-5---when-to-use-mongodb
	#

	# Out of the scope of this doc, see the link if you want to know more.

	# Geospatial queries: http://mongly.openmymind.net/geo/index

	# ******** Chapter 6 - Map and Reduce:
	# *
	# * https://github.com/karlseguin/the-little-mongodb-book/blob/master/en/mongodb.markdown#chapter-6---mapreduce
	#

	# Out of the scope of this doc, see the link if you want to know more.

	# ******** Chapter 7 - Performance and Tools:
	# *
	# https://github.com/karlseguin/the-little-mongodb-book/blob/master/en/mongodb.markdown#chapter-7---performance-and-tools#
	#

	# Creating and deleting an index on name field
	db.unicorns.ensure_index([('name', pymongo.DESCENDING)])
	db.unicorns.drop_index([('name', pymongo.DESCENDING)])

	# An index can be set as Unique (not repeated values for the indexed field)
	db.unicorns.ensure_index([('name', pymongo.DESCENDING)], unique=True)

	# Indexes can be set in embedded fiels and in array fields.
	# Indexes can be compound on different fields:
	db.unicorns.ensure_index([('name', pymongo.DESCENDING), ('vampires', pymongo.ASCENDING)])

	# Explain, get some information on how query executes
	db.unidorns.find().explain()

view raw micro-mongodb-python-book.py hosted with ❤ by GitHub

Questions

Investigate how to limit to two results on a find, but to make those results the second and third documents from the sorted result set.

Model 2: Quantitative Aggregation Methods

# TODO: create the MongoDB pymongo client 
# ... with a variable called client
 
# Compute the average of the $grade field in the collection, in a key called TotalGrade
collection.aggregate([{$group: {"_id": "_id", "TotalGrade": {"$avg": "$grade"}}}}])
 
# Compute the average grade of each assignment type
collection.aggregate([{$group: {"_id": "$asmttype", "TotalGrade": {"$avg": "$grade"}}}}])

Questions

By specifying an _id of _id, each item is individually counted in the average. Suppose asmttype is a key in your document; what do you think using that key as the _id of the aggregation does to the group? Try it to find out!

NoSQL Data Processing and Aggregation

NoSQL databases have gained significant popularity in recent years due to their ability to handle large volumes of data and offer flexible data modeling. One critical aspect of working with NoSQL databases is processing and aggregating data efficiently. We will explore the techniques and best practices for processing and aggregating data in NoSQL databases, along with relevant code examples using Python.

Data Processing Techniques

MapReduce

MapReduce is a widely used technique for processing and aggregating data in NoSQL databases. It divides the data processing task into two stages: Map and Reduce.

In the Map stage, the input data is divided into smaller chunks, and a map function is applied to each chunk independently. The map function transforms the input data into a set of key-value pairs.

In the Reduce stage, the output of the map function is grouped by the keys and passed to a reduce function. The reduce function performs aggregation operations on the grouped data, such as sum, count, or average.

Example using Python:

# Import required libraries
from functools import reduce

# Sample input data
data = [1, 2, 3, 4, 5]

# Map function
mapped_data = list(map(lambda x: (x, x**2), data))

# Reduce function
reduced_data = reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]), mapped_data)

print(reduced_data)

Referring to the above example, the map function transforms each element in the input data into a tuple of the value itself and its square. The reduce function then accumulates the squares and their values into a final result of (15, 55).

Distributed Query Processing

NoSQL databases often support distributed query processing, allowing data processing tasks to be distributed across multiple machines. This technique enables parallel execution of queries, leading to faster processing times for large-scale data.

Distributed query processing involves dividing the data into smaller partitions and processing those partitions in parallel. The results are then combined to produce the final result.

Example using Python and MongoDB:

# Import required libraries
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient()
db = client['mydb']
collection = db['mycollection']

# Perform distributed query processing
result = collection.aggregate([
  { "$group": { "_id": "$category", "totalAmount": { "$sum": "$amount" } } },
  { "$sort": { "totalAmount": -1 } }
])

for doc in result:
  print(doc)

In the above example, we connect to a MongoDB database and perform distributed query processing using the aggregate function. The pipeline consists of two stages: grouping by category and summing the amount field, and sorting the results in descending order.

Improving Performance with an In-Memory Database Using Apache Spark

In this Python example, Apache Spark is used to process data from a NoSQL database (MongoDB). The code reads data from MongoDB, performs some processing on age and gender columns, and writes the processed data back to MongoDB.

# Python code for NoSQL data processing using Apache Spark

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("NoSQL Data Processing") \
    .getOrCreate()

# Read data from MongoDB
mongo_uri = "mongodb://localhost/mydb.myCollection"
df = spark.read.format("mongo").option("uri", mongo_uri).load()

# Process the data using Spark SQL or Spark DataFrame operations
processed_df = df.filter(df["age"] > 30).groupBy("gender").count()

# Write the processed data back to MongoDB
processed_df.write.format("mongo").mode("append").option("uri", mongo_uri).save()

# Stop the Spark session
spark.stop()

Conclusion

NoSQL databases offer powerful data processing and aggregation capabilities through techniques like MapReduce and distributed query processing. These techniques enable efficient processing of large volumes of data while providing flexibility in data modeling.

Effective use of these techniques requires a good understanding of the underlying concepts and the specific features provided by the chosen NoSQL database. Additionally, Python provides rich libraries and frameworks that facilitate working with NoSQL databases efficiently.

Submission

I encourage you to submit your answers to the questions (and ask your own questions!) using the Class Activity Questions discussion board. You may also respond to questions or comments made by others, or ask follow-up questions there. Answer any reflective prompt questions in the Reflective Journal section of your OneNote Classroom personal section. You can find the link to the class notebook on the syllabus.