CS377: Database Design - Aggregation in Relational Databases


Activity Goals

The goals of this activity are:
  1. To explain the use of primary and foreign keys in database systems
  2. To aggregate records across multiple tables with referential keys

Supplemental Reading

Feel free to visit these resources for supplemental background reading material.

The Activity

Directions

Consider the activity models and answer the questions provided. First reflect on these questions on your own briefly, before discussing and comparing your thoughts with your group. Appoint one member of your group to discuss your findings with the class, and the rest of the group should help that member prepare their response. Answer each question individually from the activity, and compare with your group to prepare for our whole-class discussion. After class, think about the questions in the reflective prompt and respond to those individually in your notebook. Report out on areas of disagreement or items for which you and your group identified alternative approaches. Write down and report out questions you encountered along the way for group discussion.

Model 1: Tables

FirstNameLastNameAge
AlexSmith20
LeeJones21
BrianMcMullen18
SamanthaJohnson22
LeeJones24

CourseNumCourseName
CS377Database Design
CS173Intro to Computer Science
CS174Object Oriented Programming
CS275Software Engineering

Questions

  1. How might you compare database tables and records, with software classes and objects?
  2. How can you differentiate between the two students named Lee?
  3. Not all cultures use "first" and "last" names; what might be a better choice for these column names?

Model 2: Primary and Foreign Keys

FirstNameLastNameAgeCourse
AlexSmith20CS377 Database Design
LeeJones21CS377 Database Design
BrianMcMullen18CS173 Intro to Computer Science
SamanthaJohnson22CS174 Object Oriented Programming
LeeJones24CS174 Object Oriented Programming
LeeJones21CS275 Software Engineering

Questions

  1. What is unfortunate about the structure of these tables? What could we do to improve upon it?
  2. Modify the tables above to add a "serial number" or a primary key to each of the tables.
  3. Can you think of a way to identify courses without having to number them? What might a more "natural key" be?
  4. What disadvantage could arise from using the data itself (like "CS377") as a key, particularly if that data changes over time?
  5. Eliminate the redundant data in these tables by replacing any redundancies with a new column and row specifying the appropriate key value that can be looked up in the corresponding table. This is called a foreign key.

Model 3: Aggregation

FirstNameLastNameAgeCourse
AlexSmith20CS377 Database Design
LeeJones21CS377 Database Design
BrianMcMullen18CS173 Intro to Computer Science
SamanthaJohnson22CS174 Object Oriented Programming
LeeJones24CS174 Object Oriented Programming
LeeJones21CS275 Software Engineering

Questions

  1. Using your new table structure, describe an algorithm to determine the average age of students enrolled in CS377.
  2. What other types of aggregation functions might be useful across these tables?
  3. Extend the course database to show enrollments in a particular semester; once the class agrees on a design, draw out the tables and data including the key values.

Processing and aggregating data records is a fundamental aspect of data analysis and management. In this report, we will explore the concept of processing and aggregating data records, focusing on how it is done in databases and using the SQL language. We will discuss various techniques and algorithms used for efficient processing and aggregation and provide Python examples to illustrate the concepts.

Database Introduction

SQL (Structured Query Language) is a domain-specific language designed for managing data in a relational database management system (RDBMS). SQL provides various operations for processing and aggregating data records, including filtering, joining, sorting, and grouping.

SQL Aggregating Functions

SQLite provides a range of aggregating functions, such as COUNT, SUM, AVG, MIN, and MAX. These functions can be applied to columns or expressions, and they enable us to perform calculations on groups of rows or the entire result set.

For example, to calculate the total number of records in a table, we can use the COUNT function as follows:

SELECT COUNT(*) FROM table_name;

To compute the average value of a column, we can use the AVG function:

SELECT AVG(column_name) FROM table_name;

Python Examples

Here are some Python examples demonstrating how to process and aggregate data records using pandas, a popular data manipulation library:

import pandas as pd

# Load data from a CSV file
data = pd.read_csv('data.csv')

# Filtering data
filtered_data = data[data['age'] > 30]

# Aggregating data
aggregated_data = data.groupby('category').agg({'value': 'sum', 'count': 'mean'})

In the first example, we filter the data records to only include those where the age is greater than 30. In the second example, we aggregate the data by category, calculating the sum of the ‘value’ column and the mean of the ‘count’ column.

Conclusion

Processing and aggregating data records is a crucial task in data analysis and management. SQL provides powerful tools for performing these operations within databases, while Python libraries like pandas offer similar functionality for working with structured data outside of databases. By efficiently processing and aggregating data, we can gain valuable insights and make informed decisions based on the information stored in our datasets.

Submission

I encourage you to submit your answers to the questions (and ask your own questions!) using the Class Activity Questions discussion board. You may also respond to questions or comments made by others, or ask follow-up questions there. Answer any reflective prompt questions in the Reflective Journal section of your OneNote Classroom personal section. You can find the link to the class notebook on the syllabus.