Introduction
MongoDB is a source-available NoSQL database from MongoDB Inc., first released in 2009. NoSQL rose to popularity because of its schema flexibility and its ability to scale horizontally, which it achieves by following the BASE approach.
Some salient points on NoSQL
- NoSQL stands for Not Only SQL
- No predefined schema
- Supports eventual consistency
- High performance, high availability, and high scalability
- NoSQL uses the BASE approach
MongoDB uses a slightly different naming convention than typical SQL DBs:
Table -> Collection
Row -> Document
Column -> Field
Join -> Embedded Document
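For illustration (the orders collection and its fields are hypothetical), a single document holds fields where SQL has columns, with an embedded sub-document where SQL would need a join:
db.orders.insertOne({
  orderNo: 1001,          // field ~ column
  total: 250.0,
  customer: {             // embedded document ~ joined customer row
    name: 'Alice',
    city: 'Pune'
  }
})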
- Introduction
- BASE VS ACID
- CAP Theorem
- Different types of NoSQL DBs
- Mongo DB Features
- Scalability in Mongo VS RDBMS
- Data Modeling
- Mongo DB Commands
- Database, Collection related operations
- document related operations
- query projection, update, remove
- Create a New Collection
- Logical operators
- Aggregation Pipeline
- Sum of sal per job
- Avg of sal per dept
- Print total sal, avg sal, max sal, min sal per job
- Print all jobs for which total sal is more than 5700
- Display depts total sal in desc order
- Find the dept that spends max on sal
- Find the dept that spends minimum on sal
- Display ename, deptno & sal of all emps whose sal >= 2500
- Display ename, deptno & sal of all emps whose sal >= 2500 in the DESC order of sal
- Analyse data per dept per job [find the count of emps per deptno per job]
- find the job with max AVG sal
- find number of managers, analysts and clerks in company
- Print ename and dept name and dept location
- print depts and emps in that dept
- Print emp name and his manager details
BASE VS ACID
- BASE: BASE provides weaker guarantees than ACID, but it scales well and adapts to change easily.
- Basically Available, i.e., availability is guaranteed even in case of failure.
- Soft state, i.e., the state of the DB can change without application interaction, due to eventual consistency.
- Eventual consistency, i.e., the DB becomes consistent some time after the application input; data is replicated to different nodes and eventually reaches a consistent state.
- ACID: Atomicity, Consistency, Isolation, and Durability
- Atomicity means that transactions must complete or fail as a whole, so a lot of bookkeeping must be done behind the scenes to guarantee this.
- Consistency ensures that a transaction can only bring the database from one valid state to another; any data written to the database must be valid according to all defined rules, including constraints, cascades, and triggers, in any combination. Note that this is different from Consistency in the CAP theorem.
- Isolation ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially.
- Durability guarantees that once a transaction has been committed, it will remain committed even in the case of a system failure.
The SQL API lacks a mechanism to describe queries where ACID’s requirements are relaxed. This is why the BASE databases are all NoSQL.
CAP Theorem
Also known as Brewer’s theorem, it states that it is impossible for a distributed system to provide more than two of the following three properties:
- Consistency : Consistency means that data is the same across the cluster, so you can read or write from/to any node and get the same data
- Availability : Availability means the ability to access the cluster even if a node in the cluster goes down.
- Partition tolerance : Partition tolerance means that the cluster continues to function even if there is a “partition” (communication break) between two nodes (both nodes are up, but can’t communicate).
However, in reality, you can’t just say that MongoDB is CP/AP/CA, because it actually is a trade-off between C, A, and P, depending on both database/driver configurations.
MongoDB is strongly consistent by default – if you do a write and then do a read, assuming the write was successful you will always be able to read the result of the write you just made. This is because MongoDB is a single-master system and all reads go to the primary by default.
MongoDB never allows writing to a secondary, so while a primary is being re-elected writes are unavailable; in CAP terms you give up some availability. If you optionally enable reading from the secondaries, MongoDB becomes eventually consistent, and it is possible to read out-of-date results.
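As a minimal sketch (the product collection and the query are illustrative), reads from secondaries are opted into per query via a read preference:
// default: the read goes to the primary, so the result reflects the latest acknowledged write
db.product.find({ price: { $gt: 3000 } })
// opt in to secondary reads: better read scaling, but results may be stale (eventual consistency)
db.product.find({ price: { $gt: 3000 } }).readPref('secondaryPreferred')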
MongoDB also gets high availability through automatic failover in replica sets:
http://www.mongodb.org/display/DOCS/Replica+Sets
Different types of NoSQL DBs
- Key-Value DB : DynamoDB, Redis
- Data is stored as key-value pairs
- Document DB : MongoDB
- Rows become documents
- Column DB : Cassandra, Redshift
- Data is stored in columns
- Suited for analytics and warehousing.
- Graph DB : Neo4j
- Data is stored as nodes with relationships
Mongo DB Features
- Publicly available since 2009
- JSON documents are stored as BSON in the DB
- Max document size is 16 MB
- Indexes are supported for faster querying.
- Aggregation framework processes data through multi-stage pipelines.
- Supports TTL collections, useful for caching, logging, etc. (see the index sketch after this list)
- Supports large file storage (GridFS).
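A TTL collection is an ordinary collection with a TTL index on a date field; a minimal sketch (the logs collection and the 1-hour expiry are illustrative):
// documents are removed automatically about an hour after their createdAt timestamp
db.logs.createIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 })
db.logs.insertOne({ message: 'user logged in', createdAt: new Date() })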
Scalability in Mongo VS RDBMS
Scalability has to be broken down into its constituents: read scaling and write scaling.
ACID-compliant databases (like traditional RDBMSs) can scale reads. They are not inherently less efficient than NoSQL databases, because the (possible) performance bottlenecks are introduced by things NoSQL (sometimes) lacks (like joins and WHERE restrictions), which you can opt not to use. Clustered SQL RDBMSs can scale reads by introducing additional nodes in the cluster. There are limits to how far read operations can be scaled, but these are imposed by the difficulty of scaling up writes as you introduce more nodes into the cluster.
There are various constraints imposed by the ACID principles. To scale up write operations, or the number of nodes in a cluster beyond a certain point, you have to relax some of the ACID requirements:
Dropping atomicity lets you shorten the duration for which the DB is locked (e.g. MongoDB, CouchDB). Dropping consistency lets you scale write operations across nodes (e.g. Cassandra). Dropping durability lets you acknowledge write commands without flushing to disk (e.g. Memcached, Redis).
NoSQL databases typically follow the BASE model instead of ACID: they give up the A, C, or D requirement to improve scalability.
(adapted from Joeri Sebrechts)
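In MongoDB this durability trade-off is tunable per operation through the write concern; a minimal sketch, reusing the product collection from the command examples later in this doc (the values are illustrative, not recommendations):
// j: false acknowledges the write before it is flushed to the journal: faster, less durable
db.product.insertOne(
  { title: 'product 11', price: 5000, description: 'test product' },
  { writeConcern: { w: 1, j: false } }
)
// w: 'majority', j: true waits for a majority of replica-set members and the journal: slower, durable
db.product.insertOne(
  { title: 'product 12', price: 5500, description: 'test product' },
  { writeConcern: { w: 'majority', j: true } }
)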
Data Modeling
Embedded Data Model
- Suitable for smaller datasets with one-to-one or one-to-many relationships (see the sketch below).
- Faster reads, since related data is fetched in a single DB operation.
- Unsuitable if the embedded data grows large, due to the 16 MB document limit.
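A minimal sketch of the embedded model (the customer collection and its fields are illustrative):
// one-to-many: addresses embedded inside the customer document
db.customer.insertOne({
  name: 'Alice',
  addresses: [
    { type: 'home', city: 'Pune' },
    { type: 'office', city: 'Mumbai' }
  ]
})
// a single read returns the customer together with all addresses
db.customer.findOne({ name: 'Alice' })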
Normalized Data Model
Use a normalized data model to represent more complex many-to-many relationships and to prevent data duplication, similar to traditional SQL databases: documents reference each other instead of embedding.
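The same data in a normalized sketch: addresses live in their own collection and reference the customer by _id (collection and field names are illustrative):
db.customer.insertOne({ _id: 1, name: 'Alice' })
db.address.insertMany([
  { type: 'home', city: 'Pune', customerId: 1 },
  { type: 'office', city: 'Mumbai', customerId: 1 }
])
// fetching the full picture now takes a second query (or a $lookup)
db.address.find({ customerId: 1 })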
Aggregation Pipeline
Data is fed into a multi-stage pipeline; each stage transforms the documents and passes them on to the next stage, and the aggregated result is returned.
Mongo DB Commands
Database, Collection related operations
- show dbs
- use mydb
- db.stats()
- db
- show collections
- db.createCollection('product')
- db.product.drop()
document related operations
- db.product.insert({title: "product1", price: 2000, description: "test product 1"})
- db.product.insertMany([ {title: "product 2", price: 3000, description: "test product"}, {title: "product 3", price: 4000, description: "test product"}, {title: "product 4", price: 4000, description: "test product"}, {title: "product 5", price: 4000, description: "test product"}, {title: "product 6", price: 4000, description: "test product"}, {title: "product 7", price: 4000, description: "test product"}, {title: "product 8", price: 4000, description: "test product"}, {title: "product 9", price: 4000, description: "test product"}, {title: "product 10", price: 4000, description: "test product"} ])
- db.product.find()
- db.product.find().count()
- db.product.find().hasNext()
- db.product.find().limit(5)
- db.product.find().skip(5)
- db.product.find().skip(pageSize * (pageNumber - 1)).limit(pageSize) // pagination pattern: pageSize docs per page
- db.product.find().pretty()
- db.product.find({title: "product 7"})
- db.product.find({price: { $gt: 3000 }})
- db.product.find({price: { $lte: 3000 }})
- db.product.find({price: { $in: [2000, 3000] }})
- db.product.find( { $or: [ {title: "product 6"}, {price: 2000} ] } )
- db.product.find( { $and: [ {title: "product 6"}, {price: 2000} ] } )
query projection, update, remove
- db.product.find({}, {title: 1})
- db.product.find({}, {price: 0, description: 0})
- db.product.update( {_id: ObjectId('61d14117b4044b4a7e194703')}, { title: 'product 1 updated', price: 2000, description: 'test product' } )
- db.product.update( {_id: ObjectId('61d14117b4044b4a7e194703')}, { $set: { price: 4000 } } )
- db.product.updateMany({price: 4000}, {$set: {price: 6000}})
- db.product.remove({"_id" : ObjectId("61d14117b4044b4a7e194703")})
- db.product.deleteOne({"_id" : ObjectId("61d14117b4044b4a7e194703")})
- db.product.deleteMany({price: 6000})
Create a New Collection
use classwork;
show collections;
db.dept.insert({_id:10,dname:"ACCOUNTING",loc:"NEW YORK"});
db.dept.insert({_id:20,dname:"RESEARCH",loc:"DALLAS"});
db.dept.insert({_id:30,dname:"SALES",loc:"CHICAGO"});
db.dept.insert({_id:40,dname:"OPERATIONS",loc:"BOSTON"});
db.emp.insert({_id:7369,ename:"SMITH",job:"CLERK",mgr:7902,sal:800.00,deptno:20, comm:null});
db.emp.insert({_id:7499,ename:"ALLEN",job:"SALESMAN",mgr:7698,sal:1600.00,comm:300.00,deptno:30});
db.emp.insert({_id:7521,ename:"WARD",job:"SALESMAN",mgr:7698,sal:1250.00,comm:500.00,deptno:30});
db.emp.insert({_id:7566,ename:"JONES",job:"MANAGER",mgr:7839,sal:2975.00,deptno:20, comm:null});
db.emp.insert({_id:7654,ename:"MARTIN",job:"SALESMAN",mgr:7698,sal:1250.00,comm:1400.00,deptno:30});
db.emp.insert({_id:7698,ename:"BLAKE",job:"MANAGER",mgr:7839,sal:2850.00,deptno:30});
db.emp.insert({_id:7782,ename:"CLARK",job:"MANAGER",mgr:7839,sal:2450.00,deptno:10});
db.emp.insert({_id:7788,ename:"SCOTT",job:"ANALYST",mgr:7566,sal:3000.00,deptno:20});
db.emp.insert({_id:7839,ename:"KING",job:"PRESIDENT",sal:5000.00,deptno:10});
db.emp.insert({_id:7844,ename:"TURNER",job:"SALESMAN",mgr:7698,sal:1500.00,comm:0.00,deptno:30});
db.emp.insert({_id:7876,ename:"ADAMS",job:"CLERK",mgr:7788,sal:1100.00,deptno:20, comm:null});
db.emp.insert({_id:7900,ename:"JAMES",job:"CLERK",mgr:7698,sal:950.00,deptno:30, comm:null});
db.emp.insert({_id:7902,ename:"FORD",job:"ANALYST",mgr:7566,sal:3000.00,deptno:20});
db.emp.insert({_id:7934,ename:"MILLER",job:"CLERK",mgr:7782,sal:1300.00,deptno:10});
Logical operators
- db.emp.find({$or: [ {sal: {$gt: 4000}}, {job: 'ANALYST'} ]})
- db.emp.find({$and: [ {deptno: {$ne: 20}}, {job: {$ne: 'SALESMAN'}} ]})
- db.emp.find({ deptno: {$ne: 20}, job: {$ne: 'SALESMAN'} })
- db.emp.find( { $or: [ { $and: [ { job: 'MANAGER' }, { deptno: 30 } ] }, { $and: [ { job: 'SALESMAN' }, { deptno: 30 }, { sal: { $lte: 1500 } } ] } ] } )
- db.emp.find( { $or: [ { job: 'MANAGER', deptno: 30 }, { job: 'SALESMAN', deptno: 30, sal: { $lte: 1500 } } ] } )
Aggregation Pipeline
Operators : $project (select), $match (where), $group (group by), $unwind (deconstruct an array), $lookup (left join), $out (write result into a collection)
Sum of sal per job
db.emp.aggregate([
// stage1: get the data grouped by job
{
$group: {
_id: '$job',
sum: {$sum: '$sal'}
}
}
])
Avg of sal per dept
db.emp.aggregate([
// stage1: get the data grouped by deptno
{
$group: {
_id: '$deptno',
averageSalary: {$avg: '$sal'}
}
}
])
Print total sal, avg sal, max sal, min sal per job
db.emp.aggregate([
// stage1: get the data grouped by job
{
$group: {
_id: '$job',
totalSalary: { $sum: '$sal' },
averageSalary: { $avg: '$sal' },
minSalary: { $min: '$sal' },
maxSalary: { $max: '$sal' },
}
}
])
Print all jobs for which total sal is more than 5700
db.emp.aggregate([
// stage1: get the data grouped by job
{
$group: {
_id: '$job',
totalSalary: {$sum: '$sal'}
}
},
// stage2: match the condition [totalSalary > 5700]
{
$match: {
totalSalary: { $gt: 5700 }
}
}
])
Display depts total sal in desc order
db.emp.aggregate([
// stage1: get the data grouped by deptno
{
$group: {
_id: '$deptno',
totalSalary: {$sum: '$sal'}
}
},
// stage2: get the records sorted in desc order
{
$sort: {
totalSalary: -1
}
}
])
Find the dept that spends max on sal
db.emp.aggregate([
// stage1: get the data grouped by deptno
{
$group: {
_id: '$deptno',
totalSalary: {$sum: '$sal'}
}
},
// stage2: get the records sorted in desc order
{
$sort: {
totalSalary: -1
}
},
// stage3: limit the number of records to 1
{
$limit: 1
}
])
Find the dept that spends minimum on sal
db.emp.aggregate([
// stage1: get the data grouped by deptno
{
$group: {
_id: '$deptno',
totalSalary: {$sum: '$sal'}
}
},
// stage2: get the records sorted in asc order
{
$sort: {
totalSalary: 1
}
},
// stage3: limit the number of records to 1
{
$limit: 1
}
])
Display ename, deptno & sal of all emps whose sal >= 2500
db.emp.find(
{ sal: { $gte: 2500 } },
{ ename:1, deptno: 1, _id: 0, sal: 1 })
The same result using the aggregation pipeline:
db.emp.aggregate([
// stage1: find the emp having sal >= 2500
{
$match: {
sal: { $gte: 2500 }
}
},
// stage2: select the required columns
{
$project: {
ename: 1, deptno: 1, _id: 0, sal: 1
}
}
])
Display ename, deptno & sal of all emps whose sal >= 2500 in the DESC order of sal
db.emp.aggregate([
// stage1: find the emp having sal >= 2500
{
$match: {
sal: { $gte: 2500 }
}
},
// stage2: sort the records by sal in desc order
{
$sort: {
sal: -1
}
},
// stage3: select the required columns
{
$project: {
ename:1, deptno: 1, _id: 0, sal: 1
}
}
])
Analyse data per dept per job [find the count of emps per deptno per job]
db.emp.aggregate([
// stage1: find the emp per dept per job
{
$group: {
_id: {
deptno: '$deptno',
job: '$job'
},
totalEmployees: {
$sum: 1
}
}
}
])
find the job with max AVG sal
db.emp.aggregate([
// stage1: group by job and compute the average salary
{
$group: {
_id: '$job',
averageSalary: {
$avg: '$sal'
}
}
},
// stage2: sort the records in desc order
{
$sort: {
averageSalary: -1
}
},
// stage3: limit the result to 1 record
{
$limit: 1
}
])
find number of managers, analysts and clerks in company
db.emp.aggregate([
// stage1: match the job for getting managers, analysts and clerks
{
$match: {
job: { $in: ['MANAGER', 'ANALYST', 'CLERK'] }
}
},
// stage2: count the employees per job
{
$group: {
_id: '$job',
count: {
$sum: 1
}
}
}
])
Print ename and dept name and dept location
db.emp.aggregate([
// stage1: lookup the information from dept collection
{
$lookup: {
from: 'dept', // get the related dept record
localField: 'deptno', // the field which is used for finding the dept from dept collection
foreignField: '_id', // the dept field which matches with deptno of emp collection
as: 'department'
}
},
// stage2: unwind the array into object
{
$unwind: '$department'
},
// stage3: add new fields for dept name and location
{
$addFields: {
deptName: '$department.dname',
deptLocation: '$department.loc'
}
},
// stage4: select required fields
{
$project: {
deptName: 1, deptLocation: 1, ename: 1, _id: 0
}
}
])
print depts and emps in that dept
db.dept.aggregate([
// stage1: find all employees belong to the dept
{
$lookup: {
from: 'emp', // get the related emp records
localField: '_id', // the dept field used for matching employees
foreignField: 'deptno', // the emp field which matches with _id of the dept collection
as: 'employees'
}
},
// stage2: get only dept name and employee name
{
$project: {
dname: 1, 'employees.ename': 1, _id: 0
}
}
])
Print emp name and his manager details
db.emp.aggregate([
// stage1: look up the manager from the same emp collection (self-lookup)
{
$lookup: {
from: 'emp', // get the related manager record from the emp collection itself
localField: 'mgr', // the field which holds the manager's empno
foreignField: '_id', // the emp field which matches with mgr of the current emp
as: 'manager'
}
},
// stage2: unwind the array into an object (emps without a manager, like KING, are dropped by $unwind)
{
$unwind: '$manager'
},
// stage3: add a new field for the manager name
{
$addFields: {
managerName: '$manager.ename',
}
},
// stage4: select required fields
{
$project: {
ename: 1, managerName: 1, _id: 0
}
}
])
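$out is listed among the operators above but never demonstrated; a minimal sketch that writes the per-deptno salary totals into a new collection (the name dept_salary_totals is illustrative):
db.emp.aggregate([
  // stage1: total salary per deptno
  {
    $group: {
      _id: '$deptno',
      totalSalary: { $sum: '$sal' }
    }
  },
  // stage2: write the result into a collection (must be the last stage; replaces the collection if it exists)
  {
    $out: 'dept_salary_totals'
  }
])
// db.dept_salary_totals.find() now returns the grouped totals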