CS377: Database Design - Aggregation with NoSQL
Activity Goals
The goals of this activity are:
- To define functions to aggregate values in a MongoDB data store
The Activity
Directions
Consider the activity models and answer the questions provided. First reflect on these questions on your own briefly, before discussing and comparing your thoughts with your group. Appoint one member of your group to discuss your findings with the class, and the rest of the group should help that member prepare their response. Answer each question individually from the activity, and compare with your group to prepare for our whole-class discussion. After class, think about the questions in the reflective prompt and respond to those individually in your notebook. Report out on areas of disagreement or items for which you and your group identified alternative approaches. Write down and report out questions you encountered along the way for group discussion.
Model 1: Aggregation Methods
##########################################################################
# THE MICRO MONGODB+PYTHON BOOK
#
# A simple adaptation to Python + summary of
# The Little MongoDB Book: https://github.com/karlseguin/the-little-mongodb-book
#
# javier arias losada
# twitter: @javier_arilos
# Usage suggestion: run line by line by using the pydemo project (https://github.com/pablito56/pydemo)
# You will need a MongoDB instance up and running.
# >> pip install pymongo
# >> pip install git+https://github.com/pablito56/pydemo.git
# >> pydemo micro-pymongo.py

# ******** Chapter 1 - The Basics:
# *
# * https://github.com/karlseguin/the-little-mongodb-book/blob/master/en/mongodb.markdown#chapter-1---the-basics
#
# MongoDB (from "humongous") is an open-source document database. (...) MongoDB features:
#   Document-Oriented: JSON-style, with dynamic schemas.
#   Full Index Support: Index on any attribute.
#   Replication & High Availability: Across LANs and WANs.
#   Auto-Sharding: Scale horizontally.
#   Querying: Rich, document-based queries.
#   Fast In-Place Updates: Atomic modifiers.
#   Map/Reduce: Flexible aggregation and data processing.
#   GridFS: Store files of any size.
#   MongoDB Management: Monitoring and backup.
#
# Hierarchy: databases >> collections >> documents >> fields
# PyMongo + MongoDB make it very easy to work together:
# create a client
import pymongo
from datetime import datetime
client = pymongo.MongoClient()
# use the micro database
db = client['micro-mongodb-python-book']
# insert a document (a Python dict) into the unicorns collection
db.unicorns.insert({'name': 'Aurora', 'gender': 'f', 'weight': 450})
# both database and collection are lazily initialized if they do not exist
# mongodb is schemaless, so we can insert a document with a different structure
db.unicorns.insert({'client_id': 'Mickey', 'specie': 'mouse', 'weight': 0.450})
# find returns a Cursor object that we can iterate over
for doc in db.unicorns.find():
    print(doc)

# Querying data and selectors:
# remove all documents in the collection
db.unicorns.remove()
# First, insert some unicorns' data for later querying
db.unicorns.insert({'name': 'Horny',
                    'dob': datetime(1992, 2, 13, 7, 47),
                    'loves': ['carrot', 'papaya'],
                    'weight': 600,
                    'gender': 'm',
                    'vampires': 63})
db.unicorns.insert({'name': 'Aurora',
                    'dob': datetime(1991, 10, 24, 13, 0),
                    'loves': ['carrot', 'grape'],
                    'weight': 450,
                    'gender': 'f',
                    'vampires': 43})
db.unicorns.insert({'name': 'Unicrom',
                    'dob': datetime(1973, 1, 9, 22, 10),
                    'loves': ['energon', 'redbull'],
                    'weight': 984,
                    'gender': 'm',
                    'vampires': 182})
db.unicorns.insert({'name': 'Roooooodles',
                    'dob': datetime(1979, 7, 18, 18, 44),
                    'loves': ['apple'],
                    'weight': 575,
                    'gender': 'm',
                    'vampires': 99})
db.unicorns.insert({'name': 'Solnara',
                    'dob': datetime(1985, 6, 4, 2, 1),
                    'loves': ['apple', 'carrot', 'chocolate'],
                    'weight': 550,
                    'gender': 'f',
                    'vampires': 80})
db.unicorns.insert({'name': 'Ayna',
                    'dob': datetime(1998, 2, 7, 8, 30),
                    'loves': ['strawberry', 'lemon'],
                    'weight': 733,
                    'gender': 'f',
                    'vampires': 40})
db.unicorns.insert({'name': 'Kenny',
                    'dob': datetime(1997, 6, 1, 10, 42),
                    'loves': ['grape', 'lemon'],
                    'weight': 690,
                    'gender': 'm',
                    'vampires': 39})
db.unicorns.insert({'name': 'Raleigh',
                    'dob': datetime(2005, 4, 3, 0, 57),
                    'loves': ['apple', 'sugar'],
                    'weight': 421,
                    'gender': 'm',
                    'vampires': 2})
db.unicorns.insert({'name': 'Leia',
                    'dob': datetime(2001, 9, 8, 14, 53),
                    'loves': ['apple', 'watermelon'],
                    'weight': 601,
                    'gender': 'f',
                    'vampires': 33})
db.unicorns.insert({'name': 'Pilot',
                    'dob': datetime(1997, 2, 1, 5, 3),
                    'loves': ['apple', 'watermelon'],
                    'weight': 650,
                    'gender': 'm',
                    'vampires': 54})
db.unicorns.insert({'name': 'Nimue',
                    'dob': datetime(1999, 11, 20, 16, 15),
                    'loves': ['grape', 'carrot'],
                    'weight': 540,
                    'gender': 'f'})
db.unicorns.insert({'name': 'Dunx',
                    'dob': datetime(1976, 6, 18, 18, 18),
                    'loves': ['grape', 'watermelon'],
                    'weight': 704,
                    'gender': 'm',
                    'vampires': 165})
# Inserted some unicorns' data for later querying
# MongoDB selectors are the equivalent of WHERE clauses in relational DBs
# a selector is an object instructing how to match documents in the collection
# find all male unicorns
for doc in db.unicorns.find({'gender': 'm'}):
    print(doc)
# find all male unicorns that weigh more than 700 pounds
for doc in db.unicorns.find({'gender': 'm', 'weight': {'$gt': 700}}):
    print(doc)
# female unicorns which either love apples or oranges, or weigh less than 500 pounds
cursor = db.unicorns.find({'gender': 'f', '$or': [{'loves': 'apple'}, {'loves': 'orange'}, {'weight': {'$lt': 500}}]})
for doc in cursor:
    print(doc)
# ******** Chapter 2 - Update:
# *
# * https://github.com/karlseguin/the-little-mongodb-book/blob/master/en/mongodb.markdown#chapter-2---updating
#
# We now have many unicorn documents; let's update one of them:
db.unicorns.find_one({'name': 'Roooooodles'})
# update takes a minimum of two arguments: a selector object to match and an update object
db.unicorns.update({'name': 'Roooooodles'}, {'weight': 590})
# Let's check the result
db.unicorns.find_one({'name': 'Roooooodles'})
# No results? How can that be? Check the result of the following
db.unicorns.find_one({'weight': 590})
# By default, update will replace the matched document with the new document provided
# How can we update just some fields and keep the rest unchanged? The $set operator
# Let's recover our wrongly updated unicorn:
db.unicorns.update({'weight': 590}, {'$set': {
    'name': 'Roooooodles',
    'dob': datetime(1979, 7, 18, 18, 44),
    'loves': ['apple'],
    'gender': 'm',
    'vampires': 99}})
db.unicorns.find_one({'name': 'Roooooodles'})
# In all the previous update operations, the _id of Roooooodles' document stays the same
# Other update modifiers:
# $inc increments/decrements a numeric field
db.unicorns.update({'name': 'Roooooodles'}, {'$inc': {'vampires': -2}})
db.unicorns.find_one({'name': 'Roooooodles'})
# $push adds an element to an array field
db.unicorns.update({'name': 'Roooooodles'}, {'$push': {'loves': 'carrot'}})
# update modifiers can be combined to do several things in a single operation
db.unicorns.update({'name': 'Roooooodles'}, {'$push': {'loves': 'potato'}, '$inc': {'vampires': 6}})
db.unicorns.find_one({'name': 'Roooooodles'})
# $push (with $each) and $inc on multiple fields can also be done in a single operation
db.unicorns.update({'name': 'Roooooodles'}, {'$push': {'loves': {'$each': ['apple', 'durian']}}, '$inc': {'vampires': 6, 'eggs': 1}})
db.unicorns.find_one({'name': 'Roooooodles'})
# mongodb supports upserts (update or insert)
# Let's count hits on our application pages
db.hits.update({'page': 'unicorns'}, {'$inc': {'hits': 1}})
db.hits.find_one()
db.hits.update({'page': 'unicorns'}, {'$inc': {'hits': 1}}, upsert=True)
db.hits.find_one()
# By default, update modifies just one document
db.unicorns.update({}, {'$set': {'vaccinated': True}})
for unicorn in db.unicorns.find({'vaccinated': True}):
    print(unicorn)
# multiple documents can be updated in a single operation by setting the named parameter multi=True
db.unicorns.update({}, {'$set': {'vaccinated': True}}, multi=True)
for unicorn in db.unicorns.find({'vaccinated': True}).limit(5):
    print(unicorn)
# ******** Chapter 3 - Mastering Find:
# *
# * https://github.com/karlseguin/the-little-mongodb-book/blob/master/en/mongodb.markdown#chapter-3---mastering-find
#
# We can ask mongodb to return just some fields when selecting documents:
# _id is included by default unless we explicitly exclude it;
# the rest of the fields are not included by default
for doc in db.unicorns.find({}, {'_id': 0, 'name': 1}):
    print(doc)
# Sorting documents
for doc in db.unicorns.find({}, {'name': 1, 'weight': 1}).sort([('name', pymongo.ASCENDING), ('vampires', pymongo.DESCENDING)]):
    print(doc)
# Paging results
for doc in db.unicorns.find({}).sort([('name', pymongo.ASCENDING)]).skip(3).limit(2):
    print(doc)
# Counting results
db.unicorns.find({'vampires': {'$gt': 50}}).count()
# ******** Chapter 4 - Data Modelling:
# *
# * https://github.com/karlseguin/the-little-mongodb-book/blob/master/en/mongodb.markdown#chapter-4---data-modeling
#
# No joins are available in MongoDB.
# Let's create three employees: Leto, Duncan and Moneo. Leto is the manager.
from bson.objectid import ObjectId
db.employees.insert({'_id': ObjectId('4d85c7039ab0fd70a117d730'), 'name': 'Leto'})
db.employees.insert({'_id': ObjectId('4d85c7039ab0fd70a117d731'), 'name': 'Duncan',
                     'manager': ObjectId('4d85c7039ab0fd70a117d730')})
db.employees.insert({'_id': ObjectId('4d85c7039ab0fd70a117d732'), 'name': 'Moneo',
                     'manager': ObjectId('4d85c7039ab0fd70a117d730')})
for employee in db.employees.find({'manager': ObjectId('4d85c7039ab0fd70a117d730')}):
    print(employee)
# These references can be stored in arrays to model 1:N relationships
# Let's suppose some employees can have many managers
db.employees.insert({'_id': ObjectId('4d85c7039ab0fd70a117d733'), 'name': 'Siona',
                     'manager': [ObjectId('4d85c7039ab0fd70a117d730'),
                                 ObjectId('4d85c7039ab0fd70a117d732')]})
# The previous query still works
for employee in db.employees.find({'manager': ObjectId('4d85c7039ab0fd70a117d730')}):
    print(employee)
# We can also model by nesting documents inside documents:
db.employees.insert({'_id': ObjectId('4d85c7039ab0fd70a117d734'), 'name': 'Ghanima',
                     'family': {'mother': 'Chani',
                                'father': 'Paul',
                                'brother': ObjectId('4d85c7039ab0fd70a117d730')}})
# Embedded documents can be queried with dot notation:
for employee in db.employees.find({'family.mother': 'Chani'}):
    print(employee)
# DBRef objects allow a document to reference a document in another collection.
# Some drivers automatically fetch the referenced object; PyMongo does.
# Want to see the employee => manager example with DBRef? https://gist.github.com/javierarilos/d5fce5d50b1b9b8e2784
# ******** Chapter 5 - When to use MongoDB:
# *
# * https://github.com/karlseguin/the-little-mongodb-book/blob/master/en/mongodb.markdown#chapter-5---when-to-use-mongodb
#
# Out of the scope of this doc; see the link if you want to know more.
# Geospatial queries: http://mongly.openmymind.net/geo/index

# ******** Chapter 6 - Map and Reduce:
# *
# * https://github.com/karlseguin/the-little-mongodb-book/blob/master/en/mongodb.markdown#chapter-6---mapreduce
#
# Out of the scope of this doc; see the link if you want to know more.
# ******** Chapter 7 - Performance and Tools:
# *
# * https://github.com/karlseguin/the-little-mongodb-book/blob/master/en/mongodb.markdown#chapter-7---performance-and-tools
#
# Creating and deleting an index on the name field
db.unicorns.ensure_index([('name', pymongo.DESCENDING)])
db.unicorns.drop_index([('name', pymongo.DESCENDING)])
# An index can be set as unique (no repeated values for the indexed field)
db.unicorns.ensure_index([('name', pymongo.DESCENDING)], unique=True)
# Indexes can be set on embedded fields and on array fields.
# Indexes can be compound, spanning different fields:
db.unicorns.ensure_index([('name', pymongo.DESCENDING), ('vampires', pymongo.ASCENDING)])
# explain gives information on how a query executes
db.unicorns.find().explain()
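Note: the listing above follows the legacy PyMongo 2.x API from the original book. PyMongo 3.x deprecates these calls and PyMongo 4 removes them, so if you are running a current driver you will need the renamed CRUD helpers. A minimal sketch of the modern equivalents (same client and database, same semantics) follows:

# Modern PyMongo (3.x/4.x) equivalents of the legacy calls used above
import pymongo
client = pymongo.MongoClient()
db = client['micro-mongodb-python-book']
db.unicorns.insert_one({'name': 'Aurora', 'gender': 'f', 'weight': 450})        # was insert()
db.unicorns.update_one({'name': 'Aurora'}, {'$inc': {'weight': 10}})            # was update()
db.unicorns.update_many({}, {'$set': {'vaccinated': True}})                     # was update(..., multi=True)
db.unicorns.replace_one({'name': 'Aurora'}, {'name': 'Aurora', 'weight': 460})  # whole-document replace
db.unicorns.delete_many({})                                                     # was remove()
db.unicorns.create_index([('name', pymongo.DESCENDING)])                        # was ensure_index()
db.unicorns.count_documents({'vampires': {'$gt': 50}})                          # was find(...).count()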
Questions
- Investigate how to limit a find to two results, but make those results the second and third documents from the sorted result set.
Model 2: Quantitative Aggregation Methods
# TODO: create the MongoDB pymongo client
# ... with a variable called client

# Compute the average of the $grade field in the collection, in a key called TotalGrade
collection.aggregate([{"$group": {"_id": "_id", "TotalGrade": {"$avg": "$grade"}}}])

# Compute the average grade of each assignment type
collection.aggregate([{"$group": {"_id": "$asmttype", "TotalGrade": {"$avg": "$grade"}}}])
Questions
- By specifying an _id of "_id" (a constant), each item is individually counted in the average. Suppose asmttype is a key in your document; what do you think using that key as the _id of the aggregation does to the group? Try it to find out!
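If you want to try it, here is a minimal sketch of a harness you could adapt; the database name, collection name, and sample documents are assumptions for illustration, not part of the course dataset:

# Hypothetical setup for experimenting with the two pipelines above
import pymongo
client = pymongo.MongoClient()
collection = client['cs377']['grades']   # assumed database and collection names
collection.insert_many([
    {'asmttype': 'homework', 'grade': 90},
    {'asmttype': 'homework', 'grade': 80},
    {'asmttype': 'exam', 'grade': 70},
])
# Constant _id: every document falls into a single group
print(list(collection.aggregate([{'$group': {'_id': '_id', 'TotalGrade': {'$avg': '$grade'}}}])))
# Field reference as _id: run it and compare the groups you get back
print(list(collection.aggregate([{'$group': {'_id': '$asmttype', 'TotalGrade': {'$avg': '$grade'}}}])))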
NoSQL Data Processing and Aggregation
NoSQL databases have gained significant popularity in recent years due to their ability to handle large volumes of data and offer flexible data modeling. One critical aspect of working with NoSQL databases is processing and aggregating data efficiently. We will explore the techniques and best practices for processing and aggregating data in NoSQL databases, along with relevant code examples using Python.
Data Processing Techniques
MapReduce
MapReduce is a widely used technique for processing and aggregating data in NoSQL databases. It divides the data processing task into two stages: Map and Reduce.
In the Map stage, the input data is divided into smaller chunks, and a map function is applied to each chunk independently. The map function transforms the input data into a set of key-value pairs.
In the Reduce stage, the output of the map function is grouped by the keys and passed to a reduce function. The reduce function performs aggregation operations on the grouped data, such as sum, count, or average.
Example using Python:
# Import reduce (moved to functools in Python 3)
from functools import reduce
# Sample input data
data = [1, 2, 3, 4, 5]
# Map stage: transform each element into a (value, value squared) pair
mapped_data = list(map(lambda x: (x, x**2), data))
# Reduce stage: sum the values and the squares element-wise
reduced_data = reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]), mapped_data)
print(reduced_data)  # (15, 55)
Referring to the above example, the map function transforms each element of the input data into a tuple of the value and its square. The reduce function then sums the values and the squares element-wise, producing the final result (15, 55).
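MongoDB itself historically exposed this pattern server-side through the mapReduce command, with the map and reduce functions written in JavaScript. A minimal sketch follows, assuming documents with category and amount fields; note that this API is deprecated in recent MongoDB versions (and the map_reduce helper was removed in PyMongo 4) in favor of the aggregation pipeline shown in the next section:

from pymongo import MongoClient
from bson.code import Code

client = MongoClient()
collection = client['mydb']['mycollection']
# Map: emit a (category, amount) pair for each document
mapper = Code("function () { emit(this.category, this.amount); }")
# Reduce: sum the emitted amounts for each category
reducer = Code("function (key, values) { return Array.sum(values); }")
# Legacy PyMongo 3.x helper; writes results to the 'category_totals' collection
result = collection.map_reduce(mapper, reducer, "category_totals")
for doc in result.find():
    print(doc)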
Distributed Query Processing
NoSQL databases often support distributed query processing, allowing data processing tasks to be distributed across multiple machines. This technique enables parallel execution of queries, leading to faster processing times for large-scale data.
Distributed query processing involves dividing the data into smaller partitions and processing those partitions in parallel. The results are then combined to produce the final result.
Example using Python and MongoDB:
# Import required libraries
from pymongo import MongoClient
# Connect to MongoDB
client = MongoClient()
db = client['mydb']
collection = db['mycollection']
# Run an aggregation pipeline; in a sharded cluster, MongoDB distributes the work across shards
result = collection.aggregate([
{ "$group": { "_id": "$category", "totalAmount": { "$sum": "$amount" } } },
{ "$sort": { "totalAmount": -1 } }
])
for doc in result:
print(doc)
In the above example, we connect to a MongoDB database and perform distributed query processing using the aggregate
function. The pipeline consists of two stages: grouping by category and summing the amount field, and sorting the results in descending order.
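Pipeline stages compose: placing a $match stage first lets each partition filter its documents (and use indexes) before any grouping happens. A small hedged extension of the example above is sketched here; the status field is an assumption for illustration:

# Filter, then group and sort; putting $match first prunes documents early
result = collection.aggregate([
    { "$match": { "status": "active" } },   # assumed field, for illustration
    { "$group": { "_id": "$category", "totalAmount": { "$sum": "$amount" } } },
    { "$sort": { "totalAmount": -1 } }
])
for doc in result:
    print(doc)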
Improving Performance with In-Memory Processing Using Apache Spark
In this Python example, Apache Spark (an in-memory processing engine) is used to process data from a NoSQL database (MongoDB). The code reads data from MongoDB, filters and groups on the age and gender columns, and writes the processed data back to MongoDB.
# Python code for NoSQL data processing using Apache Spark
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder \
.appName("NoSQL Data Processing") \
.getOrCreate()
# Read data from MongoDB
mongo_uri = "mongodb://localhost/mydb.myCollection"
df = spark.read.format("mongo").option("uri", mongo_uri).load()
# Process the data using Spark SQL or Spark DataFrame operations
processed_df = df.filter(df["age"] > 30).groupBy("gender").count()
# Write the processed data back to MongoDB
processed_df.write.format("mongo").mode("append").option("uri", mongo_uri).save()
# Stop the Spark session
spark.stop()
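The mongo data source above is provided by the MongoDB Spark Connector, which must be on Spark's classpath. One way to wire it up is sketched below; the package coordinates correspond to connector 3.x and are an assumption that may need adjusting to your Spark and Scala versions:

from pyspark.sql import SparkSession

# Hypothetical session setup pulling the MongoDB Spark Connector at startup
spark = SparkSession.builder \
    .appName("NoSQL Data Processing") \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1") \
    .config("spark.mongodb.input.uri", "mongodb://localhost/mydb.myCollection") \
    .config("spark.mongodb.output.uri", "mongodb://localhost/mydb.myCollection") \
    .getOrCreate()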
Conclusion
NoSQL databases offer powerful data processing and aggregation capabilities through techniques like MapReduce and distributed query processing. These techniques enable efficient processing of large volumes of data while providing flexibility in data modeling.
Effective use of these techniques requires a good understanding of the underlying concepts and the specific features provided by the chosen NoSQL database. Additionally, Python provides rich libraries and frameworks that facilitate working with NoSQL databases efficiently.