---
tags: IntroToBigData
title: Lab 3 - MongoDB
---
# Lab 3 - MongoDB
**Course:** Big Data - IU S23
**Author:** Firas Jolha
## Dataset
- [sample_mflix_dataset](https://disk.yandex.com/d/ufcQ3bQD3lJsfw)
- ~16MB zipped
- ~60MB unzipped
## Readings
- [The MongoDB 4.2 Manual](https://www.mongodb.com/docs/v4.2/)
# Agenda
[toc]
# Prerequisites
- Installed Hortonworks Data Platform (HDP) Sandbox
- Installed pip and pandas packages
- Added Python interpreter to Zeppelin
# Objectives
- Install MongoDB server on HDP Sandbox
- Install MongoDB Compass.
- Build a document store (document-oriented database) in MongoDB
- Learn how to define collections/documents in MongoDB
- Learn how to find data in MongoDB
- Learn how to perform CRUD operations and aggregations on the data in MongoDB
- Import and retrieve data using MongoDB query language
# Introduction
MongoDB is a NoSQL database which stores data in the form of documents made up of key-value pairs. It is an open-source document database which provides high performance and scalability along with data modelling and management of huge data sets in enterprise applications.
MongoDB also provides auto-scaling. It is a cross-platform database and can be installed on different platforms such as Windows and Linux.
<!-- Install mongodb on HDP cluster
https://www.youtube.com/watch?v=5AUe7y6hnL8
-->
<!-- # Introduction -->
# Install MongoDB
You can install MongoDB on the cluster node online via `yum` (see the [tutorial](https://www.mongodb.com/docs/v4.2/tutorial/install-mongodb-on-red-hat/#install-mongodb-community-edition) for more info) or offline via `wget` and `rpm`. I recommend the offline installation to avoid any issues with `yum`. You need to download the packages *mongodb-org, mongodb-org-mongos, mongodb-org-server, mongodb-org-shell, mongodb-org-tools* for version 4.2.9. I saved the links for you and put them below. You just need to copy the following command line, paste it into the terminal of the cluster node, then press *Enter*.
<!-- Go to the directory:
`/var/lib/ambari-server/resources/stacks/HDP/2.6/services`
and clone the repository. -->
```powershell!
wget https://repo.mongodb.org/yum/redhat/7/mongodb-org/4.2/x86_64/RPMS/mongodb-org-4.2.9-1.el7.x86_64.rpm https://repo.mongodb.org/yum/redhat/7/mongodb-org/4.2/x86_64/RPMS/mongodb-org-mongos-4.2.9-1.el7.x86_64.rpm https://repo.mongodb.org/yum/redhat/7/mongodb-org/4.2/x86_64/RPMS/mongodb-org-server-4.2.9-1.el7.x86_64.rpm https://repo.mongodb.org/yum/redhat/7/mongodb-org/4.2/x86_64/RPMS/mongodb-org-shell-4.2.9-1.el7.x86_64.rpm https://repo.mongodb.org/yum/redhat/7/mongodb-org/4.2/x86_64/RPMS/mongodb-org-tools-4.2.9-1.el7.x86_64.rpm
```
After downloading the packages, you need to install them via `rpm` as follows:
```powershell!
rpm -i --force mongodb-org-4.2.9-1.el7.x86_64.rpm mongodb-org-server-4.2.9-1.el7.x86_64.rpm mongodb-org-shell-4.2.9-1.el7.x86_64.rpm mongodb-org-mongos-4.2.9-1.el7.x86_64.rpm mongodb-org-tools-4.2.9-1.el7.x86_64.rpm
```
:::warning
If you try to install a more recent version of MongoDB, you may get errors. For instance, version 4.4 has some conflicting dependencies and is not recommended, whereas version 5.0 requires a CPU with virtualization instructions (VT-x) enabled for the Docker container.
There is also another way to install MongoDB on the cluster node, which is via the open-source tool published in the following repository:
```powershell!
git clone https://github.com/nikunjness/mongo-ambari.git
```
This open-source tool is quite old and installs MongoDB version 3.2, but it adds MongoDB as a service to the Ambari dashboard, which can make management and configuration a bit easier. This is optional, not essential.
:::
MongoDB comes with several tools for database management, such as `mongo`/`mongosh` for running the MongoDB shell, `mongoimport` for importing files into MongoDB databases, and `mongodump` for dumping databases.

:::info
`mongosh` is the name of the MongoDB shell in recent versions, whereas older versions use the name `mongo`.
:::
# Install PyMongo package
You can work with MongoDB databases directly using the shell `mongo`/`mongosh`, but MongoDB also provides official driver libraries for multiple programming languages, including Python. Working through a driver is often more flexible and convenient, since you are already familiar with Python and can utilize other packages for data manipulation such as `pandas`. We will use the official package `pymongo` version *3.13.0*, which can be installed via `pip` as follows:
```sh
pip2 install pymongo
```
`pip2` will install the latest version for Python 2 *(3.13.0)*, whereas the latest version of the library for Python 3 is *4.3.3*. If you would like to try the Python 3 version, you can install it on your local machine, but for this lab you need to install the package on the cluster node.
:::info
If you added the Python interpreter to Zeppelin, then you can immediately create a notebook and start working with `pymongo`.
:::
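Once `pymongo` is installed, a quick way to verify that the driver can reach the server is to connect and list databases. The following is a minimal sketch, assuming `mongod` is running on the cluster node with the default host and port (adjust the connection string if your setup differs):
```python
from pymongo import MongoClient

# Connect to the MongoDB server (assumes the default localhost:27017).
client = MongoClient("mongodb://localhost:27017/")

# Databases and collections are created lazily on the first write,
# so mflixdb appears in the listing only after data has been imported.
db = client["mflixdb"]

print(client.list_database_names())   # databases on the server
print(db.list_collection_names())     # collections inside mflixdb
```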
<!-- # MongoDB Platform Support Matrix
https://www.mongodb.com/docs/manual/administration/production-notes/#platform-support-matrix
-->
<!-- ## Install MongoDB
You can install the latest version of MongoDB on the cluster node.

The installation is successful.
-->
# MongoDB Databases and Collections
In MongoDB, databases hold collections of documents. MongoDB stores documents in collections. Collections are analogous to tables and documents to records in relational databases. If a database does not exist, MongoDB creates the database when you first store data for it. In the same way, if a collection does not exist, MongoDB creates the collection when you first store data for it.

<center>
<img src='https://i.imgur.com/15fxn4F.png' width="400"/>
<p> A collection in MongoDB
</p>
</center>
A document in MongoDB is simply a set of key-value pairs. Documents have a dynamic schema, which means that documents in the same collection do not need to have the same set of fields.
Since MongoDB is a schema-less database, each collection can hold different types of objects. Every object in a collection is known as a document and is represented in a JSON-like (JavaScript Object Notation) structure. Data is stored and queried in **BSON**, a binary representation of JSON documents that supports more data types than JSON. MongoDB documents are composed of field-and-value pairs. The value of a field can be any of the BSON [data types](https://www.mongodb.com/docs/v4.2/reference/bson-types/), including other documents, arrays, and arrays of documents.
<center>
<img src='https://i.imgur.com/Pu5pJMJ.png' width="400"/>
<p> A document in MongoDB
</p>
</center>
The following document contains 6 fields: `_id` is of type [`ObjectId`](https://www.mongodb.com/docs/v4.2/reference/bson-types/#objectid), `name` holds an **embedded document** containing the fields `first` and `last`, `birth` and `death` hold values of the *Date* type, `contribs` holds an array of strings, and `views` holds a value of the *NumberLong* type.
<!--
![]() -->
```go
var mydoc = {
_id: ObjectId("5099803df3f4948bd2f98391"),
name: { first: "Alan", last: "Turing" },
birth: new Date('Jun 23, 1912'),
death: new Date('Jun 07, 1954'),
contribs: [ "Turing machine", "Turing test", "Turingery" ],
views : NumberLong(1250000)
}
```
Field names are strings. The field name `_id` is reserved for use as the primary key; its value must be unique in the collection, is immutable, and may be of any type other than an array. The default `ObjectId` value combines a timestamp, a machine identifier, a process id, and a counter to keep it unique, but the user can set `_id` to any other value. Field names cannot contain the null character. MongoDB uses dot notation to access the elements of an array and the fields of an embedded document.
:::warning
**Query filter documents** are documents used, notably in the `db.collection.find` method (the equivalent of SELECT in SQL), to specify the conditions that determine which records to select for read, update, and delete operations. You can use `<field>:<value>` expressions to specify equality conditions as well as query operator expressions.
```go
{
<field1>: <value1>,
<field2>: { <operator>: <value> },
...
}
```
:::
## Arrays, Dot Notation and Embedded documents
MongoDB uses the dot notation to access the elements of an array and to access the fields of an embedded document. For example, you access the field `views` for the previous document as `mydoc`**.**`views`. To specify or access an element of an array by the zero-based index position, concatenate the array name with the dot (.) and zero-based index position, and enclose in quotes.
```go
"<array>.<index>"
```
To specify the third element in the `contribs` array, use the dot notation `"contribs.2"`.
To specify or access a field of an embedded document with dot notation, concatenate the embedded document name with the dot (.) and the field name, and enclose in quotes:
```go
"<embedded document>.<field>"
```
For example, given the following field in a document:
```go
{
...
name: { first: "Alan", last: "Turing" },
contact: { phone: { type: "cell", number: "111-222-3333" } },
...
}
```
To specify the number in the phone document in the `contact` field, use the dot notation `"contact.phone.number"`.
:::warning
The maximum BSON document size is 16 megabytes. To store documents larger than that, MongoDB provides the [GridFS API](https://www.mongodb.com/docs/v4.2/core/gridfs/), which divides a large file into parts, or chunks, and stores each chunk as a separate document.
:::
:::warning
The fields in a BSON document are ordered. For example, `{a: 1, b: 1}` is not equal to `{b: 1, a: 1}`.
:::
# Data Description
In this lab, we will work with the [`sample_mflix`](https://www.mongodb.com/docs/atlas/sample-data/sample-mflix/) dataset from the MongoDB website. The dataset contains data on movies and movie theaters, as well as related metadata such as users and comments on specific movies.
<center>
<img src='https://i.imgur.com/3UwL0kY.png' width="200"/>
<p> A fake image of Mflix brand
</p>
</center>
The dataset consists of 5 `.json` files, where each file will be loaded to create a **`collection`** and each line corresponds to a **`document`** in that collection. These files were dumped from a MongoDB database and the data is stored as Extended JSON v2.0. This format was introduced in MongoDB to preserve the data types and attributes of certain fields when converting from BSON to JSON. There is also Extended JSON v1.0, which is used in MongoDB versions before 4.2 and supports fewer data types than v2.0. You can read more about MongoDB Extended JSON v2.0 [here](https://www.mongodb.com/docs/manual/reference/mongodb-extended-json/). You can spot Extended JSON v2.0 in the files by looking for fields which start with the symbol **\$**. For example:
```python
{"theaterId":{"$numberInt":"1000"}}
```
Here, `theaterId` is a field whose value is 1000 stored as an integer.
:::info
When you use `mongoimport` v4.2 to load the data into MongoDB, the data types in Extended JSON format are automatically translated into the corresponding MongoDB types. If you are using an older version of `mongoimport`, some data types will not be translated properly since older versions support only Extended JSON v1.0.
:::
The description of the dataset is presented in the following table.
| File Name | Description |
| -------- | -------- |
| users.json | This collection contains information on mflix users. Each line/document contains a single user, and their name, email, and password. The fields are `_id` as the primary key in the collection, `name` as name of the user, `email` has the unique property, and `password` field which contains the user password. |
| movies.json | This collection contains details on movies. Each document contains a single movie, and information such as its title, release year, and cast. |
| theaters.json | This collection contains movie theater locations. Each document contains a single movie theater and its location in both string and GeoJSON forms. |
| sessions.json | This collection contains metadata about users. Each document contains a user and their corresponding JSON Web Token |
| comments.json | This collection contains comments associated with specific movies. Each document contains the comment text, the user who submitted it, and the movie the comment applies to. |
For more information on the dataset, see [here](https://www.mongodb.com/docs/atlas/sample-data/sample-mflix/#sample_mflix.comments).
# Build a MongoDB document store
You can build a document store by inserting some data into it. In this lab, we will use the `sample_mflix` sample dataset from the official website, which consists of 5 *JSON* files.
:::success
**Exercises on MongoDB**
1. Download the dataset folder `sample_mflix`. Unzip it and copy the files to the folder `/sample_mflix` in the cluster node via `scp` or `docker cp`.
2. Import all data to the database `mflixdb` via `mongoimport` tool (Note: You do not need to create a database or a collection in advance). For example, to import *comments* data from `comments.json` to the database, you can run the following command:
**Note:** To run a multiline command in Windows PowerShell, replace the backslash (\\) with the backtick (\`).
```powershell!
mongoimport --db mflixdb \
--collection comments \
--file comments.json
```
**Note:** The default port for the previous connection is 27017. If your server is running on a different host or port, you can specify them via the options `--host` and `--port`.
Similarly, import the data from other files. Eventually, you will create 5 collections in the database which are *["comments", "sessions", "movies", "theaters", "users"]*.
```powershell!
mongoimport --db mflixdb --collection sessions --file sessions.json
mongoimport --db mflixdb --collection movies --file movies.json
mongoimport --db mflixdb --collection theaters --file theaters.json
mongoimport --db mflixdb --collection users --file users.json
```
**Note:** If the JSON file has an array as its root element, then you need to add the option `--jsonArray`. The option `--headerline` is useful for treating the first line of `csv` and `tsv` files as the field list. You can find more options for the tool `mongoimport` by running the command line `mongoimport --help`.
3. Access the database from the shell `mongo`/`mongosh` or via `pymongo`. To access the database `mflixdb` from `mongo`, you can write the following in the shell:
```sh
> use mflixdb
switched to db mflixdb
```
In the `mongo`/`mongosh` shell, `db` is the variable that references the current database. It is automatically set to the default database `test` when you open the shell, and it changes when you run `use <db>` to switch the current database.
4. You can list all collections in the database as follows.
```go
> db.getCollectionNames()
```
5. You can list the first 10 theaters as follows.
```go
> db.theaters.find().limit(10)
```
OR
```go
> db['theaters'].find().limit(10)
```
:::
# MongoDB CRUD Operations
CRUD operations create, read, update, and delete documents. All examples here are given for the `mongo`/`mongosh` shell. Adapting them to another programming language such as Python is not difficult if you decide to use the `PyMongo` package.
:::warning
The most apparent difference between the `mongo` shell and `pymongo` is that `pymongo` uses snake_case for method names, such as `db.collection.insert_many()`, whereas the `mongo` shell uses camelCase, such as `db.collection.insertMany()`.
:::
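As a hedged sketch of how this mapping looks in practice (the `student` documents below are purely illustrative, and the connection string assumes a local server on the default port):
```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["test"]

# mongo shell: db.student.insertMany([...])  ->  PyMongo: insert_many([...])
db.student.insert_many([
    {"regNo": "3014", "name": "Viraj"},
    {"regNo": "3015", "name": "Alena"},
])

# mongo shell: db.student.updateOne(...)     ->  PyMongo: update_one(...)
db.student.update_one({"regNo": "3014"}, {"$set": {"name": "Viraj"}})
```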
## Create database
If a database does not exist, MongoDB creates the database when you first store data for that database. If a database already exists with the mentioned name, then it just connects to that database.
```javascript!
use <database_name>
```
## Drop database
Before deleting the database, connect to the required database which is to be deleted. The default database when you access the shell is `test`.
```javascript!
db.dropDatabase()
```
## Show databases
You can list the databases in the current DBMS instance as follows (by default, MongoDB has no access control enabled, so there is no default user or password):
```javascript!
show databases
```
## Show collections
You can check the collections in the connected database as follows:
```javascript!
show collections
```
## Create collections
```javascript!
// db.createCollection(name, options)
db.createCollection("student", { capped : true, size : 5242880, max : 5000 } )
```
This will create a collection named `student` with a maximum size of 5 megabytes and a maximum of 5000 documents.
:::warning
If a capped collection reaches its maximum size limit, MongoDB removes older documents from it to make space for new ones.
:::
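If you prefer PyMongo, the same capped collection can be created by passing the options as keyword arguments; a minimal sketch with the same limits as above (connection string and database name are assumptions):
```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["test"]

# Equivalent of db.createCollection("student", { capped: true, size: 5242880, max: 5000 })
db.create_collection("student", capped=True, size=5242880, max=5000)
```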
## Drop collections
```javascript!
// db.collection_name.drop()
db.student.drop()
```
The `drop` method returns true if the collection is dropped successfully; otherwise, it returns false.
<!--
## Update documents
Update operations modify existing documents in a collection.
```javascript!
// db.collection_name.update(filter_query, update_action)
db.student.update(
{"regNo": "3014"},
$set:{"name": "Viraj"}
)
```

-->
## Insert documents
MongoDB provides the following methods to insert documents into a collection:
* `db.collection.insertOne()`
* `db.collection.insertMany()`
To insert a single document:
<!-- ![]() -->
<center>
<img src='https://i.imgur.com/WFs3pUC.png' width="400"/>
<p> Insert a single document in <b>users</b> collection
</p>
</center>
To insert many documents:
```go
db.inventory.insertMany([
{ item: "journal", qty: 25, tags: ["blank", "red"], size: { h: 14, w: 21, uom: "cm" } },
{ item: "mat", qty: 85, tags: ["gray"], size: { h: 27.9, w: 35.5, uom: "cm" } },
{ item: "mousepad", qty: 25, tags: ["gel", "blue"], size: { h: 19, w: 22.85, uom: "cm" } }
])
```
:::danger
In MongoDB, each document stored in a collection requires a unique `_id` field that acts as a primary key. If an inserted document omits the `_id` field, the MongoDB driver automatically generates an `ObjectId` for the `_id` field.
:::
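In PyMongo the corresponding methods are `insert_one()` and `insert_many()`. The sketch below reuses the inventory items from the example above and prints the `_id` values generated by the driver (connection string and database name are assumptions):
```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["test"]

# Insert a single document; the driver generates the _id and reports it back.
result = db.inventory.insert_one(
    {"item": "journal", "qty": 25, "tags": ["blank", "red"],
     "size": {"h": 14, "w": 21, "uom": "cm"}}
)
print(result.inserted_id)    # an ObjectId

# Insert several documents in one call.
result = db.inventory.insert_many([
    {"item": "mat", "qty": 85, "tags": ["gray"],
     "size": {"h": 27.9, "w": 35.5, "uom": "cm"}},
    {"item": "mousepad", "qty": 25, "tags": ["gel", "blue"],
     "size": {"h": 19, "w": 22.85, "uom": "cm"}},
])
print(result.inserted_ids)   # a list of ObjectIds
```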
## Retrieve documents
You can query documents in MongoDB via the method [`db.collection.find(query, projection)`](https://www.mongodb.com/docs/v4.2/reference/method/db.collection.find/#db.collection.find). Given the following `inventory` collection:
```go
db.inventory.insertMany([
{ item: "journal", qty: 25, size: { h: 14, w: 21, uom: "cm" }, status: "A" },
{ item: "notebook", qty: 50, size: { h: 8.5, w: 11, uom: "in" }, status: "A" },
{ item: "paper", qty: 100, size: { h: 8.5, w: 11, uom: "in" }, status: "D" },
{ item: "planner", qty: 75, size: { h: 22.85, w: 30, uom: "cm" }, status: "D" },
{ item: "postcard", qty: 45, size: { h: 10, w: 15.25, uom: "cm" }, status: "A" }
]);
```
Let's run the following queries.
### Select All Documents in a Collection
To select all documents in the collection, pass an empty document as the query filter parameter to the find method. The query filter parameter determines the select criteria. For example, to select all inventory data without any specific criteria:
```go
db.inventory.find( {} )
```
The equivalent SQL statement is:
```sql
SELECT * FROM inventory
```
### Specify Equality Condition
To specify equality conditions, use `<field>:<value>` expressions in the query filter document in the following form:
```go
{ <field1>: <value1>, ... }
```
For example, to select all inventory whose status is "D":
```go
db.inventory.find( { status : "D" } )
```
The equivalent SQL statement is:
```sql
SELECT * FROM inventory WHERE status = "D"
```
### Specify Conditions Using Query Operators
MongoDB provides several [query operators](https://www.mongodb.com/docs/v4.2/reference/operator/query/#query-selectors) to specify conditions on the retrieved data in the following form:
```go
{ <field1>: { <operator1>: <value1> }, ... }
```
For example, to retrieve all documents from the inventory collection where status equals either "A" or "D":
```go
db.inventory.find( { status: { $in: [ "A", "D" ] } } )
```
Equivalently in SQL.
```sql
SELECT * FROM inventory WHERE status in ("A", "D")
```
### Specify AND Conditions
A compound query can specify conditions for more than one field in the collection’s documents. Implicitly, a logical AND conjunction connects the clauses of a compound query so that the query selects the documents in the collection that match all the conditions. For example, to retrieve all documents in the inventory collection where the status equals "A" and qty is less than (`$lt`) 30:
:::spoiler MongoDB
```go
db.inventory.find( { status: "A", qty: { $lt: 30 } } )
```
:::
The operation corresponds to the following SQL statement:
```sql
SELECT * FROM inventory WHERE status = "A" AND qty < 30
```
### Specify OR Conditions
Using the `$or` operator, you can specify a compound query that joins each clause with a logical OR conjunction so that the query selects the documents in the collection that match at least one condition. The following example retrieves all documents in the collection where the status equals "A" or qty is less than (`$lt`) 30:
:::spoiler MongoDB
```go
db.inventory.find( { $or: [ { status: "A" }, { qty: { $lt: 30 } } ] } )
```
:::
The operation corresponds to the following SQL statement:
```sql
SELECT * FROM inventory WHERE status = "A" OR qty < 30
```
### Specify AND as well as OR Conditions
In the following example, the compound query document selects all documents in the collection where the status equals "A" and either qty is less than ($lt) 30 or item starts with the character p:
:::spoiler MongoDB
```go
db.inventory.find( {
status: "A",
$or: [ { qty: { $lt: 30 } }, { item: /^p/ } ]
} )
```
:::
The operation corresponds to the following SQL statement:
```sql
SELECT * FROM inventory WHERE status = "A" AND ( qty < 30 OR item LIKE "p%")
```
:::info
`db.collection.findOne()` is another read method; it returns a single document. Internally, the `db.collection.findOne()` method is the `db.collection.find()` method with a limit of `1`.
:::
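The same read operations translate almost one-to-one to PyMongo. A small sketch over the `inventory` collection defined above (connection details assumed):
```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["test"]

# All documents where status is "A" and qty < 30 (implicit AND).
for doc in db.inventory.find({"status": "A", "qty": {"$lt": 30}}):
    print(doc)

# A single matching document, or None if nothing matches.
print(db.inventory.find_one({"status": "D"}))
```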
### Query on Embedded/Nested Documents
To specify an equality condition on a field that is an embedded/nested document, use the query filter document `{ <field>: <value> }` where `<value>` is the document to match.
For example, the following query selects all documents where the field size equals the document `{ h: 14, w: 21, uom: "cm" }`:
```go
db.inventory.find( { size: { h: 14, w: 21, uom: "cm" } } )
```
Equality matches on the whole embedded document require an **exact match** of the specified `<value>` document, including the **field order**. For example, the following query **does not** match any documents in the inventory collection:
```go
db.inventory.find( { size: { w: 21, h: 14, uom: "cm" } } )
```
### Query on Nested Field
To specify a query condition on fields in an embedded/nested document, use dot notation `("field.nestedField")`.
:::warning
When querying using dot notation, the field and nested field must be inside *"quotation marks"*.
:::
The following example selects all documents where the field uom nested in the size field equals `"in"`:
```go
db.inventory.find( { "size.uom": "in" } )
```
You can also use query operators to specify conditions on nested fields. For example, to retrieve the inventory items whose height is less than 15:
```go
db.inventory.find( { "size.h": { $lt: 15 } } )
```
:::success
**Exercises on MongoDB**
- Retrieve all theaters
```javascript
// db.theaters.find(filter, projection)
db.theaters.find({})
db.theaters.find()
```
- Retrieve the first 3 theaters located in states whose names end with the letter A and whose zip code is greater than 90000. Retrieve only `location` and `theaterId`.
:::spoiler Solution
```javascript!
db.theaters.find(
{
"location.address.state": {
$regex : "A$"
},
$expr: {
$gt: ["location.address.zipcode", 90000]
}
},
{
location : true,
theaterId : true,
_id : false
}
).limit(3)
```
:::
<!-- db.theaters.find({"location.address.state": {$regex : "A$"}, "$expr": {"$gt": [{"$convert":{input: "location.address.zipcode", to: "int", onError:1}}, 90000]}}, {location: 1, theaterId:1, _id:false}).limit(3) -->
## Query operators
:::success
### Comparison operators
- Retrieve all movies which belong to the drama or crime genres, have a number of comments $\ge 100$, were not filmed in the USA, and whose type is movie. Retrieve only the countries.
```javascript!
db.movies.find(
{
genres: {
$in:["Drama", "Crime"]
},
num_mflix_comments: {
$gte: 100
},
countries: {
$nin: ["USA"]
},
type: {
$eq:"movie"
}
},
{countries:1}
)
```
### Logical operators
- Retrieve all movies which belong to the drama genre or (have a number of comments $\ge 100$ and were not filmed in the USA). Retrieve every field except `_id` and `awards`.
```javascript!
db.movies.find(
{
$or: [
{
genres: {
$in: ["Drama"]
}
},
{
$and : [
{
num_mflix_comments: {
$gte: 100
}
},
{
countries:
{
$not: { $in: ["USA"] }
}
}
]
}
]
},
{_id:false, awards:false}
).limit(3)
```
### Element operators
- Retrieve all theaters which have `street2` in their address field and whose first geo coordinate (x) has the data type `number`. Return the number of theaters which satisfy this condition.
```javascript!
db.theaters.find(
{
"location.address.street2":{
$exists:true
},
"location.geo.coordinates.0":{
$type: "number"
}
}
).count()
```
**Note:** You can get the data type of a specific field in the MongoDB shell by running `typeof <object>`. For example:
```javascript!
typeof db.theaters.findOne().location.geo.coordinates[0]
```

### Evaluation operators
There are 6 operators in this group; the most commonly used ones are `$expr`, `$text`, and `$regex`.
`$expr` allows the use of aggregation expressions within the query language. It has the following syntax:
`{ $expr: { <expression> } }`
The arguments can be any valid aggregation expression.
Example:
```javascript!
db.theaters.find(
{
"location.address.zipcode":{
$exists:true
},
"location.address.state": {
$regex : /^A/
},
"$expr": {
"$gt":
[
{
"$convert":{
input: "location.address.zipcode",
to: "decimal", onError:"", onNull:""
}
},
90000
]
}
},
{_id:false}
).count()
```
**Note:** For the operator `$text`, you need to create a text index.
- Create a text index on the fields `title` and `plot` of the `movies` collection.
```javascript!
db.movies.createIndex(
{
title: "text",
plot: "text"
}
)
```
- Retrieve all movies which contain the phrase "card game" and do not contain any of the words "money", "friend", or "criminal". Sort the results according to the score of the match.
```javascript!
db.movies.find(
{
$text:{
$search:"\"card game\" -money -friend -criminal"
}
},
{
score:{
$meta:"textScore"
},
title:true,
plot: true
}
).sort( { score: { $meta: "textScore" } } )
```

You can visit [this tutorial](https://ananya281294.medium.com/mongo-maths-676469e55f78) to see how the text score is calculated.
### Array operators
Array operators return data based on array conditions. The `$all` operator selects the documents where the value of a field is an array that contains all the specified elements. The following query retrieves movies whose `genres` array contains all of Action, Comedy, and Drama, whose `countries` array has an element equal to "Canada" (`$elemMatch`), and which were filmed in more than one country (`$size` inside `$expr`).
```javascript!
db.movies.find(
{
genres:{
$all:['Action', 'Comedy', 'Drama']
},
countries: {
$elemMatch:{
$eq:"Canada"
}
},
$expr: {
$gt : [
{$size: "$countries"},
1
]
}
}
)
```
**Note:** I covered only some of the operators here; you can check the documentation for the full list.
:::
# Aggregation operations in MongoDB
Aggregation operations process data records and return computed results. Aggregation operations group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result.
MongoDB's aggregation pipeline is a framework for data aggregation modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms them into aggregated results. The pipeline consists of **stages**. Each stage transforms the documents as they pass through the pipeline. Pipeline stages do not need to produce one output document for every input document; e.g., some stages may generate new documents or filter out documents. MongoDB provides the `db.collection.aggregate()` method to run the aggregation pipeline.
For example, the following query returns the total amount for each customer only for orders whose status is A.

```go
db.orders.aggregate([
{ $match: { status: "A" } },
{ $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
])
```
**First Stage:** The `$match` stage filters the documents by the status field and passes to the next stage those documents that have status equal to "A".
**Second Stage:** The `$group` stage groups the documents by the `cust_id` field to calculate the sum of the amount for each unique `cust_id`.
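The same pipeline can be run from PyMongo by passing the list of stages to `aggregate()`; this is a sketch assuming an `orders` collection with `status`, `cust_id`, and `amount` fields as in the example above:
```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["test"]

pipeline = [
    {"$match": {"status": "A"}},                                    # first stage: filter
    {"$group": {"_id": "$cust_id", "total": {"$sum": "$amount"}}},  # second stage: group and sum
]

# aggregate() returns a cursor over the documents produced by the last stage.
for doc in db.orders.aggregate(pipeline):
    print(doc)
```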
## Example with Zipcode data
Given the following 20 documents of the `zipcodes` collection:
```go
db.zipcodes.insertMany([
{ "_id" : "01001", "city" : "AGAWAM", "loc" : [ -72.622739, 42.070206 ], "pop" : 15338, "state" : "MA" }
{ "_id" : "01011", "city" : "CHESTER", "loc" : [ -72.988761, 42.279421 ], "pop" : 1688, "state" : "MA" }
{ "_id" : "01012", "city" : "CHESTERFIELD", "loc" : [ -72.833309, 42.38167 ], "pop" : 177, "state" : "MA" }
{ "_id" : "01013", "city" : "CHICOPEE", "loc" : [ -72.607962, 42.162046 ], "pop" : 23396, "state" : "MA" }
{ "_id" : "01020", "city" : "CHICOPEE", "loc" : [ -72.576142, 42.176443 ], "pop" : 31495, "state" : "MA" }
{ "_id" : "01022", "city" : "WESTOVER AFB", "loc" : [ -72.558657, 42.196672 ], "pop" : 1764, "state" : "MA" }
{ "_id" : "01026", "city" : "CUMMINGTON", "loc" : [ -72.905767, 42.435296 ], "pop" : 1484, "state" : "MA" }
{ "_id" : "01057", "city" : "MONSON", "loc" : [ -72.31963399999999, 42.101017 ], "pop" : 8194, "state" : "MA" }
{ "_id" : "01060", "city" : "FLORENCE", "loc" : [ -72.654245, 42.324662 ], "pop" : 27939, "state" : "MA" }
{ "_id" : "01103", "city" : "SPRINGFIELD", "loc" : [ -72.588735, 42.1029 ], "pop" : 2323, "state" : "MA" }
{ "_id" : "01104", "city" : "SPRINGFIELD", "loc" : [ -72.577769, 42.128848 ], "pop" : 22115, "state" : "MA" }
{ "_id" : "01105", "city" : "SPRINGFIELD", "loc" : [ -72.578312, 42.099931 ], "pop" : 14970, "state" : "MA" }
{ "_id" : "01106", "city" : "LONGMEADOW", "loc" : [ -72.5676, 42.050658 ], "pop" : 15688, "state" : "MA" }
{ "_id" : "01107", "city" : "SPRINGFIELD", "loc" : [ -72.606544, 42.117907 ], "pop" : 12739, "state" : "MA" }
{ "_id" : "01108", "city" : "SPRINGFIELD", "loc" : [ -72.558432, 42.085314 ], "pop" : 25519, "state" : "MA" }
{ "_id" : "01109", "city" : "SPRINGFIELD", "loc" : [ -72.554349, 42.114455 ], "pop" : 32635, "state" : "MA" }
{ "_id" : "01118", "city" : "SPRINGFIELD", "loc" : [ -72.527445, 42.092937 ], "pop" : 14618, "state" : "MA" }
{ "_id" : "01119", "city" : "SPRINGFIELD", "loc" : [ -72.51211000000001, 42.12473 ], "pop" : 13040, "state" : "MA" }
{ "_id" : "01128", "city" : "SPRINGFIELD", "loc" : [ -72.48890299999999, 42.094397 ], "pop" : 3272, "state" : "MA" }
{ "_id" : "01129", "city" : "SPRINGFIELD", "loc" : [ -72.487622, 42.122263 ], "pop" : 6831, "state" : "MA" }
]);
```
where each document has the following structure:
```go
{
"_id": "10280",
"city": "NEW YORK",
"state": "NY",
"pop": 5574,
"loc": [
-74.016323,
40.710537
]
}
```
* The `_id` field holds the zip code as a string.
* The `city` field holds the city name. A city can have more than one zip code associated with it as different sections of the city can each have a different zip code.
* The `state` field holds the two letter state abbreviation.
* The `pop` field holds the population.
* The `loc` field holds the location as a longitude latitude pair.
Let's run the following queries.
:::success
**Exercises on MongoDB**
- Return States with Populations above 10 Million
:::spoiler
```go
db.zipcodes.aggregate( [
{ $group: { _id: "$state", totalPop: { $sum: "$pop" } } },
{ $match: { totalPop: { $gte: 10*1000*1000 } } }
] )
```
In this example, the aggregation pipeline consists of the `$group` stage followed by the `$match` stage:
* The `$group` stage groups the documents of the zipcode collection by the state field, calculates the totalPop field for each state, and outputs a document for each unique state. The new per-state documents have two fields: the `_id` field and the `totalPop` field. The _id field contains the value of the state; i.e. the group by field. The totalPop field is a calculated field that contains the total population of each state. To calculate the value, $group uses the $sum operator to add the population field (pop) for each state.
After the $group stage, the documents in the pipeline resemble the following:
```go
{
"_id" : "AK",
"totalPop" : 550043
}
```
* The `$match` stage filters these grouped documents to output only those documents whose totalPop value is greater than or equal to 10 million. The `$match` stage does not alter the matching documents but outputs the matching documents unmodified.
The equivalent SQL for this aggregation operation is:
```sql
SELECT state, SUM(pop) AS totalPop
FROM zipcodes
GROUP BY state
HAVING totalPop >= (10*1000*1000)
```
:::
:::success
**Exercises on MongoDB**
- Return Average City Population by State
:::spoiler
```go
db.zipcodes.aggregate( [
{ $group: { _id: { state: "$state", city: "$city" }, pop: { $sum: "$pop" } } },
{ $group: { _id: "$_id.state", avgCityPop: { $avg: "$pop" } } }
] )
```
In this example, the aggregation pipeline consists of the `$group` stage followed by another `$group` stage:
- The first $group stage groups the documents by the combination of city and state, uses the $sum expression to calculate the population for each combination, and outputs a document for each city and state combination (A city can have more than one zip code associated with it as different sections of the city can each have a different zip code).
After this stage in the pipeline, the documents resemble the following form:
```go
{
"_id" : {
"state" : "CO",
"city" : "EDGEWATER"
},
"pop" : 13154
}
```
- A second `$group` stage groups the documents in the pipeline by the `_id.state` field (i.e. the state field inside the _id document), uses the $avg expression to calculate the average city population (avgCityPop) for each state, and outputs a document for each state.
The documents that result from this aggregation operation resembles the following:
```go
{
"_id" : "MN",
"avgCityPop" : 5335
}
```
:::
### Aggregation pipelines
In the `db.collection.aggregate()` method, pipeline stages appear in an array. Documents pass through the stages in sequence. It has the following schema:
```javascript!
db.collection.aggregate( [ { <stage> }, ... ] )
```
:::warning
All stages, except `$out` and `$merge`, can appear multiple times in the same pipeline.
:::
The `$group` stage separates documents into groups according to a "group key". The output is one document for each unique group key.
```javascript!
{
$group:
{
_id: <expression>, // Group key
<field1>: { <accumulator1> : <expression1> },
...
}
}
```
The `<accumulator>` operator must be one of the accumulator operators, such as `$sum`, `$avg`, `$min`, `$max`, `$first`, `$last`, `$push`, `$addToSet`, `$stdDevPop`, and `$stdDevSamp`.
`<expressions>` can include field paths, literals, system variables, expression objects, and expression operators. Expressions can be nested. Field path is the path to a field in the document. To specify a field path, use a string that prefixes the field name with a dollar sign (\$). For example, `"$user"` to specify the field path for the user field or `"$user.name"` to specify the field path to `"user.name"` field. `"$<field>"` is equivalent to `"$$CURRENT.<field>"` where the `CURRENT` is a system variable. Check [here](https://www.mongodb.com/docs/v4.4/reference/aggregation-variables/#std-label-agg-system-variables) for a list of system variables.
Check the [documentation](https://www.mongodb.com/docs/v4.4/meta/aggregation-quick-reference) for other stages and pipeline operators.
:::success
**Exercises on MongoDB**
- Retrieve the average number of comments for each movie type.
```javascript!
db.movies.aggregate(
[
{
$group:{
_id: "$type",
average: {$avg: "$num_mflix_comments"
}
}
}
]
)
```
- Retrieve the average number of comments for each movie type filmed in China, ordered by the average in ascending order.
```javascript!
db.movies.aggregate(
[
{
$match:{
countries:{
$in:['China']
}
}
},
{
$group:{
_id: "$type",
average: {
$avg: "$num_mflix_comments"
}
}
},
{
$sort:{
average: 1,
}
}
]
)
```
- Retrieve the average cast size for each movie type.
```javascript!
db.movies.aggregate(
[
{
$match:
{
cast:
{
$ne:null
}
}
},
{
$addFields:
{
cast_size:
{
$size : "$cast"
}
}
},
{
$group:
{
_id: "$type",
cast_average:
{
$avg: "$cast_size"
}
}
},
{
$addFields:
{
cast_average_int:
{
$toInt:"$cast_average"
}
}
}
]
)
```
- Retrieve the number of movies and series for each country, ordered by count in descending order. Return only the top 5 results and store them in a new collection `country_movie_count`.
```javascript!
db.movies.aggregate(
[
{
$match:
{
type:
{
$ne:null
},
countries:
{
$type:"array"
}
}
},
{
$unwind: "$countries"
},
{
$group:
{
_id: [
"$countries",
"$type"
],
count:
{
$sum: 1
}
}
},
{
$sort:
{
count:-1
}
},
{
$limit:5
},
{
$project:
{
_id:false,
'Country':
{
$arrayElemAt: ["$_id", 0]
},
'Movie':
{
$arrayElemAt:["$_id", 1]
},
count:true
}
} ,
{
$out: "country_movie_count"
}
]
)
```
:::
## Joins
We can join collections in MongoDB by using the `$lookup` aggregation stage. `$lookup` performs a left outer join with another collection and lets you filter the merged data. The matched documents from the "joined" collection are returned as a subarray in each document of the original collection.
In MongoDB's join operation, the goal is to connect one set of data to another. To join two collections, we use the `$lookup` operator, whose syntax is defined below:
```go
{
$lookup:
{
from: <collection to join>,
localField: <field from the input documents>,
foreignField: <field from the documents of the "from" collection>,
as: <output array field>
}
}
```
The `$lookup` stage accepts a document containing these fields:
- **`from`:** The collection in the same database to join with (sharded collections have some restrictions).
- **`localField`:** The field from the input documents to the `$lookup` stage; `$lookup` performs an equality match between `localField` and `foreignField` from the `from` collection. If an input document does not have a value for `localField`, the operator treats it as null for matching purposes.
- **`foreignField`:** The field from the documents of the `from` collection that is matched against `localField`. If a document in the `from` collection does not have a value for `foreignField`, the operator treats it as null for matching purposes.
- **`as`:** The name of the new array field to add to the input documents. The array contains the matching documents from the `from` collection. If the specified name already exists in the input document, the existing field is overwritten.
**Example:**
```go
db.address.insertMany(
[
{
"name": "Bob",
"blk_no": 22,
"street" : "dewey street",
"city" : "United States of America"
},
{
"name": "Jack",
"blk_no": 25,
"street" : "gordon street",
"city" : "New Zealand"
}
]
);
{
"acknowledged" : true,
"insertedIds" : [
ObjectId("613594cdb59313217373673c"),
ObjectId("613594cdb59313217373673d")
]
}
db.userInfo.insertMany(
[
{
"contact_name": "Bob",
"age": 27,
"sex" : "male",
"citizenship" : "Filipino"
},
{
"contact_name": "Jack",
"age": 22,
"sex" : "male",
"citizenship" : "Filipino"
}
]
);
{
"acknowledged" : true,
"insertedIds" : [
ObjectId("613594dbb59313217373673e"),
ObjectId("613594dbb59313217373673f")
]
}
```
The join operation is:
```go
db.userInfo.aggregate([
{ $lookup:
{
from: "address",
localField: "contact_name",
foreignField: "name",
as: "address"
}
}
]).pretty();
```
:::spoiler Output
The output is:
```go
Output:
{
"_id" : ObjectId("613594dbb59313217373673e"),
"contact_name" : "Bob",
"age" : 27,
"sex" : "male",
"citizenship" : "Filipino",
"address" : [
{
"_id" : ObjectId("613594cdb59313217373673c"),
"name" : "Bob",
"blk_no" : 22,
"street" : "dewey street",
"city" : "United States of America"
}
]
}
{
"_id" : ObjectId("613594dbb59313217373673f"),
"contact_name" : "Jack",
"age" : 22,
"sex" : "male",
"citizenship" : "Filipino",
"address" : [
{
"_id" : ObjectId("613594cdb59313217373673d"),
"name" : "Jack",
"blk_no" : 25,
"street" : "gordon street",
"city" : "New Zealand"
}
]
}
```
:::
To learn more about the aggregation framework, I recommend the online book [`https://www.practical-mongodb-aggregations.com`](https://www.practical-mongodb-aggregations.com).
:::success
### Exercises on MongoDB
- Retrieve the number of movies commented on by each user, sorted by movie count in descending order.
```javascript!
db.movies.aggregate(
[
{
$lookup:
{
from: "comments",
foreignField: "movie_id",
localField: "_id",
as : "movie_comments"
}
},
{
$out: "movie_comments"
}
]
)
db.movie_comments.aggregate(
[
{
$group:
{
_id:"$movie_comments.email",
movie_count:
{
$sum: 1
}
}
},
{
$match:
{
_id:
{
$not:{$size: 0}
}
}
},
{
$sort:{
movie_count:-1
}
}
]
)
```
:::
## Text search
If your documents include text fields, then you can perform fast text search using the [`$text`](https://www.mongodb.com/docs/v4.2/reference/operator/query/text/) operator, but you first need to create a text index on those fields. `$text` performs a text search on the content of the fields indexed with a text index. A `$text` expression has the following syntax:
```go
{
$text:
{
$search: <string>,
$language: <string>,
$caseSensitive: <boolean>,
$diacriticSensitive: <boolean>
}
}
```
Example:
The following examples assume a collection articles that has a version 3 `text` index on the field subject:
```go
db.articles.createIndex( { subject: "text" } )
```
Populate the collection with the following documents:
```go
db.articles.insert(
[
{ _id: 1, subject: "coffee", author: "xyz", views: 50 },
{ _id: 2, subject: "Coffee Shopping", author: "efg", views: 5 },
{ _id: 3, subject: "Baking a cake", author: "abc", views: 90 },
{ _id: 4, subject: "baking", author: "xyz", views: 100 },
{ _id: 5, subject: "Café Con Leche", author: "abc", views: 200 },
{ _id: 6, subject: "Сырники", author: "jkl", views: 80 },
{ _id: 7, subject: "coffee and cream", author: "efg", views: 10 },
{ _id: 8, subject: "Cafe con Leche", author: "xyz", views: 10 }
]
)
```
### Search for a Single Word
The following query specifies a `$search` string of "coffee":
```go
db.articles.find( { $text: { $search: "coffee" } } )
```
This query returns the documents that contain the term coffee in the indexed subject field, or more precisely, the stemmed version of the word:
```go
{ "_id" : 2, "subject" : "Coffee Shopping", "author" : "efg", "views" : 5 }
{ "_id" : 7, "subject" : "coffee and cream", "author" : "efg", "views" : 10 }
{ "_id" : 1, "subject" : "coffee", "author" : "xyz", "views" : 50 }
```
### Exclude Documents That Contain a Term
The following example searches for documents that contain the word coffee but do not contain the term shop (or, more precisely, the stemmed versions of the words):
```go
db.articles.find( { $text: { $search: "coffee -shop" } } )
```
The query returns the following documents:
```go
{ "_id" : 7, "subject" : "coffee and cream", "author" : "efg", "views" : 10 }
{ "_id" : 1, "subject" : "coffee", "author" : "xyz", "views" : 50 }
```
### Return the Text Search Score
The following query searches for the term cake and returns the score assigned to each matching document:
```go
db.articles.find(
{ $text: { $search: "cake" } },
{ score: { $meta: "textScore" } }
)
```
The returned document includes an additional field score that contains the document’s score associated with the text search. [`$meta`](https://www.mongodb.com/docs/v4.2/reference/operator/aggregation/meta/#exp._S_meta) operator is an aggregation operator.
### Text Search with Additional Query and Sort Expressions
The following query searches for documents where the author equals "xyz" and the indexed field subject contains the terms coffee or bake. The operation also specifies a sort order of ascending date, then descending text search score:
```go
db.articles.find(
{
author: "xyz",
$text: { $search: "coffee bake" }
},
{
score: { $meta: "textScore" } }
).sort(
{
date: 1, score: { $meta: "textScore"}
}
)
```
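For completeness, the last query can also be issued from PyMongo; this sketch assumes the text index on `subject` has already been created as shown above (connection details are assumptions):
```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["test"]

# Filter by author and text search, project the text score,
# then sort by ascending date and descending text score.
cursor = db.articles.find(
    {"author": "xyz", "$text": {"$search": "coffee bake"}},
    {"score": {"$meta": "textScore"}},
).sort([("date", 1), ("score", {"$meta": "textScore"})])

for doc in cursor:
    print(doc)
```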
<!--
# Retrieve data
In MongoDB, we use the function `find` for
## Single collection
## Joining collections
## Using operators
## Full-Text search
## Aggregation pipelines
## Using aggregation operators
-->
<!-- :::success -->
<!-- **Exercises on MongoDB** -->
<!--
6. Retrieve all users whose name is "Jon Snow".
7. Retrieve all theaters whose id < 1010.
8. Retrieve all movies who have more than 450 mflix_comments.
9. For each movie, retrieve the number of users who added comments.
10. Retrieve the total number of users who commented on movies.
11. Search for theaters who are in `LA` state.
12. Search for movies who belong to more than two genres. -->
<!-- ::: -->
# References
- [PyMongo 3.13.0 documentation](https://pymongo.readthedocs.io/en/3.13.0/)
- [MongoDB v4.2 docs](https://www.mongodb.com/docs/v4.2/introduction/)
- [MongoDB Join example](https://hevodata.com/learn/mongodb-join-two-collections/#l2)
- [Intro to MongoDB](https://www.studytonight.com/mongodb/introduction-to-mongodb)