OpenSource For You

The Importance of Data Modelling in MongoDB

MongoDB achieves scalability with ease due to its unique architecture. The key component of this architecture is the data model, which is based on schemaless documents and collections. While adopting MongoDB in any application, it is important to base it


MongoDB is one of the most popular NoSQL databases, and its adoption continues to grow rapidly. More developers are using it, and an increasing number of applications are built on it. As organisations adopt MongoDB as an integral architectural component of their solutions, it is important that they build on the solid foundation of data modelling.

Data modelling is the first step in using any database, be it relational or NoSQL. It refers to the process of iteratively creating a database design that meets the application’s needs. It involves analysing and depicting the application’s data entities and their relationships.

Traditional databases are primarily relational in nature, with the database schema defined at design time based on data objects. The data structure is static, and data must follow the rules of this static schema to be stored and retrieved. If data does not follow the database schema, it cannot be stored in the database. Moreover, database structures are normalised and decomposed into multiple smaller tables to avoid data repetition and to implement the get-what-you-want pattern of data retrieval. These tables have relationships built into them to enforce consistency and integrity, and the database engine guarantees the atomicity and durability of transactions.

On the other hand, NoSQL databases are generally schema-less. Although a database schema may be defined to start with, data does not have to follow it strictly. NoSQL databases accept data of any structure within their collections (the counterpart of tables in the relational world), irrespective of whether it matches the schema or not. In other words, they are flexible about enforcing a schema on data. They do not have inherent, inbuilt relationships between their collections; rather, applications need to implement the necessary logic for atomicity and consistency of data between collections. Inbuilt constructs for joining data from multiple collections are also limited; in MongoDB, the $lookup aggregation stage has been available only since version 3.2.

MongoDB provides two different ways to create relationships between data objects: References and Embedment.

Data Modelling in the NoSQL world

Data modelling is as important in the NoSQL world as it is in the relational world. There are important facets of applications that cannot be realised without proper, optimised data models. Data modelling is not an afterthought; it is an important process in the application planning and design phases. Some of the important reasons for data modelling are listed below.

Scalability: Scalability refers to an application’s ability to handle an increase in workload caused by an increase in traffic. Applications should be designed to perform well as their usage increases. From the NoSQL perspective, this means that collections and data entities should be modelled for both the current and the future demand on the application. An increase in the number of users or transactions should not lead to non-availability or degraded performance of the database.

Performance: Database modelling is about trade-offs between security, availability, scalability and performance. The right balance between these architectural blocks helps in creating the optimal database design for an application. A data model helps in understanding the read and write needs of an application and in identifying its update patterns: some data is updated frequently, while other data remains static most of the time. It helps in creating an application that performs as expected under an increased workload over time.

Application needs: Different applications place different demands on the database. Some are read-intensive while others are write-centric. Applications can be OLTP-based or meant for reporting, and some have multiple facets built into them. Different data modelling strategies should be used based on the nature of the application. Reporting applications are read-heavy and writes should not reduce their performance, while transactional systems should be able to read and write with equal ease.

Data consistency: Data modelling helps in reducing redundant data and in understanding relationships, their update patterns and their taxonomy. It helps in storing only the required data in its correct form.

Capacity: Each document is allocated a defined size and, generally, performance falls when that space has to be re-allocated because the document has outgrown it. Data modelling helps in creating optimally sized documents and in reducing redundant data in each document. It also helps in identifying the overall capacity needs of the database.

MongoDB support for data modelling

MongoDB provides two constructs for data modelling, References and Embedment, which define the structure and relationships of documents. It is important to understand these two constructs before getting into data modelling.

References

In relational databases, relationships are defined by matching data contained in the columns of different tables. In MongoDB, these relationships are defined based on semantics. The MongoDB engine does not enforce a relationship; it is entirely up to the application to implement and respect it while reading and writing data in collections. References store the relationships between data by including links or references from one document to another. Applications can resolve these references to access the related data. Data models based on References are also known as Normalised data models. The example illustrated next shows a Users document that references a PlaceofBirth document.

{
  "_id": "Ritesh",
  "name": "Ritesh Modi",
  "placeofBirth": "RiteshPOB"
}

{
  "PlaceofBirth": {
    "_id": "RiteshPOB",
    "street": "123 xyz street",
    "city": "xyzcity",
    "state": "xyzState",
    "zip": "12345"
  }
}

A References relationship should be used:

To implement one-to-many relationships between documents.

To implement many-to-many relationships between documents.

If the referenced entities are updated frequently.

If the referenced entities grow indefinitely.
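The application-side resolution of a reference described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two dictionaries stand in for MongoDB collections, no database driver is involved, and the helper name resolve_place_of_birth is hypothetical.

```python
# Two in-memory "collections", keyed by _id, standing in for MongoDB collections.
users = {
    "Ritesh": {"_id": "Ritesh", "name": "Ritesh Modi", "placeofBirth": "RiteshPOB"},
}
places_of_birth = {
    "RiteshPOB": {
        "_id": "RiteshPOB",
        "street": "123 xyz street",
        "city": "xyzcity",
        "state": "xyzState",
        "zip": "12345",
    },
}


def resolve_place_of_birth(user_id):
    """Hypothetical helper: follow the reference with a second lookup."""
    user = users[user_id]                                # first read
    place = places_of_birth.get(user["placeofBirth"])    # second read
    # Nothing enforces the link, so the target may be missing (place is None).
    return {**user, "placeofBirth": place}


resolved = resolve_place_of_birth("Ritesh")
print(resolved["placeofBirth"]["city"])  # xyzcity
```

The point of the sketch is the cost model: every resolved reference is an extra read that the application has to issue and validate itself.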

Embedment

Embedded relationships in documents refer to storing related documents within an original document. The related data is part of the schema of the embedding document. In effect, all of the data is stored together within a single document, with the related data stored as an array or sub-object. Data models based on Embedment are also known as De-Normalised data models. The example illustrated next shows the placeOfBirth entity embedded in the Users document.

{
  "_id": "Ritesh",
  "name": "Ritesh Modi",
  "placeOfBirth": {
    "street": "123 xyz Street",
    "city": "xyzcity",
    "state": "xyzState",
    "zip": "12345"
  }
}

Embedded documents should be used when:

There is a contained relationship between entities.

The embedded entity is an integral part of the document.

The embedded entities are not updated frequently.

The embedded entities do not grow indefinitely.

Relationships range from one to a few, between embedding and embedded entities.
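As a rough illustration of Embedment, the sketch below folds a referenced document into its parent so that one read returns both. It uses plain Python dictionaries rather than a MongoDB driver, and the embed helper is a hypothetical name introduced only for this example.

```python
import copy


def embed(parent, field, child):
    """Hypothetical helper: return a copy of parent with child embedded under field."""
    embedded = copy.deepcopy(parent)
    # The embedded sub-document does not need its own _id.
    embedded[field] = {k: v for k, v in child.items() if k != "_id"}
    return embedded


user = {"_id": "Ritesh", "name": "Ritesh Modi", "placeOfBirth": "RiteshPOB"}
place = {
    "_id": "RiteshPOB",
    "street": "123 xyz Street",
    "city": "xyzcity",
    "state": "xyzState",
    "zip": "12345",
}

user_embedded = embed(user, "placeOfBirth", place)
# A single lookup now returns the user and the place of birth together.
print(user_embedded["placeOfBirth"]["city"])  # xyzcity
```

The trade-off is that the embedded data now lives, and grows, inside the parent document.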

Important considerations for MongoDB data modelling

While designing the document structure and data models in MongoDB, special consideration should be given to the following aspects in order to deploy highly scalable, performance-centric and efficient databases. Note that these aspects are not mutually exclusive and should be evaluated in combination with each other.

Data usage: While designing a data model, emphasis should be laid on the patterns that the applications will use to access the data. The patterns refer to the reading, writing, updating and deletion of data. Some applications are completely read-centric (like a reporting application), while others are write-centric (like an e-commerce application). Some are a combination of both. In some applications, one feature is read-heavy while others are write-heavy. Even within a single document, some data may be updated frequently while other data remains static. Based on these patterns, appropriate strategies should be devised using relationships, indexes, growth in document size and atomicity. Documents with Embedded relationships perform better than documents with References relationships when the parent and the related data are read together, because a single read returns both.

Atomicity: Atomicity in database parlance means that an operation either succeeds or fails as a single unit. If a parent transaction contains multiple sub-operations, the parent fails if any of its sub-operations fails. In MongoDB, a write operation is atomic at the level of a single document. A single write operation affects a single collection; even if it attempts to affect multiple documents or collections, these are treated as separate operations. Without multi-document transactions (which MongoDB introduced only in version 4.0), the database engine does not roll back the operations that have already succeeded if a later sub-operation fails. The application should implement the logic for keeping multiple collections consistent.

If related data must be updated atomically, Embedded relationships should be used, because all of the data is available within a single document and there is no risk that only a part of the operation succeeds. References relationships can be used when it does not matter if some of the sub-operations fail.
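The application-level logic for affecting multiple collections can be sketched as a compensating write: if the second write fails, the application undoes the first one itself. The code below is a minimal illustration with in-memory dictionaries standing in for collections; update_both and the fail_second switch are hypothetical names used only to simulate a failure.

```python
import copy


def update_both(users, places, user_id, new_name, new_city, fail_second=False):
    """Apply two dependent writes; undo the first if the second fails."""
    backup = copy.deepcopy(users[user_id])    # saved for the compensating write
    users[user_id]["name"] = new_name         # write 1 (first collection)
    try:
        if fail_second:
            raise RuntimeError("simulated failure of the second write")
        place_id = users[user_id]["placeofBirth"]
        places[place_id]["city"] = new_city   # write 2 (second collection)
    except RuntimeError:
        users[user_id] = backup               # application-level rollback
        return False
    return True


users = {"Ritesh": {"_id": "Ritesh", "name": "Ritesh Modi", "placeofBirth": "RiteshPOB"}}
places = {"RiteshPOB": {"_id": "RiteshPOB", "city": "xyzcity"}}

ok = update_both(users, places, "Ritesh", "R. Modi", "newcity", fail_second=True)
print(ok, users["Ritesh"]["name"])  # False Ritesh Modi
```

With an Embedded relationship, both fields would sit in one document and a single write would cover them, so no compensating step would be needed.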

Document structure: Document structure plays a crucial role in data modelling, because the application is written against the structure of its documents. Documents can be designed using the References or the Embedment relationship.

Document growth: In the MMAPv1 storage engine, MongoDB allocates a fixed amount of space to each document when it is first stored. When the document grows beyond its allocated space, the storage engine relocates it on the disk. With MongoDB 3.0.0, however, the default use of power-of-two-sized allocations minimises the occurrence of such re-allocations and also allows for the effective reuse of the freed record space.

When using Embedded documents, it should be carefully analysed whether the sub-object can grow out of bounds. If it can, there is the possibility of performance degradation when the size of the document crosses its limit. In such cases, a References relationship should be used to ensure that growth in document size stays within limits.
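The effect of power-of-two-sized allocations can be illustrated with a small calculation. The sketch below is a simplified model, not MongoDB's actual allocator: the 32-byte minimum is an assumption, and any special handling of very large documents is ignored. The function names are hypothetical.

```python
def allocated_size(document_size, minimum=32):
    """Round a size in bytes up to the next power of two (simplified model)."""
    size = minimum
    while size < document_size:
        size *= 2
    return size


def needs_relocation(old_size, new_size):
    """True when the grown document no longer fits in its original slot."""
    return new_size > allocated_size(old_size)


print(allocated_size(700))           # 1024
print(needs_relocation(700, 1000))   # False: still fits in the 1024-byte slot
print(needs_relocation(700, 1100))   # True: the document must be moved
```

The padding left by rounding up is what absorbs moderate growth; an embedded array that keeps growing will eventually exceed any slot.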

Indexing: Indexes are especially useful in improving performance while retrieving data. They help in fetching sorted data, which spares applications the need to sort it explicitly. Collections that are frequently accessed for read operations should implement indexes on the fields on which frequent searches are made. While indexes are beneficial during read operations, they add overhead to write operations, because every index has to be updated along with the data. Indexes should therefore be built on fields that are updated infrequently and queried frequently. Another drawback of indexes is that they consume additional storage space, so they should be considered carefully before being implemented.
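The read/write trade-off of an index can be seen in a toy, in-memory index. This is a conceptual sketch only (MongoDB's own indexes are B-tree based and managed by the engine); the FieldIndex class and its methods are hypothetical names.

```python
import bisect


class FieldIndex:
    """Sorted list of (value, _id) pairs for one field."""

    def __init__(self):
        self.entries = []

    def add(self, value, doc_id):
        # Every write pays the cost of keeping the index sorted.
        bisect.insort(self.entries, (value, doc_id))

    def find(self, value):
        # Binary search to the first match instead of scanning every document.
        start = bisect.bisect_left(self.entries, (value,))
        ids = []
        for entry_value, doc_id in self.entries[start:]:
            if entry_value != value:
                break
            ids.append(doc_id)
        return ids

    def sorted_ids(self):
        # Sorted output comes for free; no explicit sort is needed.
        return [doc_id for _, doc_id in self.entries]


index = FieldIndex()
for doc_id, year in [("b2", 2017), ("b1", 2016), ("b3", 2016)]:
    index.add(year, doc_id)

print(index.find(2016))    # ['b1', 'b3']
print(index.sorted_ids())  # ['b1', 'b3', 'b2']
```

Reads benefit from the ordering, while each insert does extra work and the index itself occupies additional memory, mirroring the drawbacks noted above.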

Sharding: Sharding is a database load balancing technique fully supported by MongoDB. It refers to the horizontal partitioning of data across multiple MongoDB instances, with each instance holding specific and unique data. Each instance is referred to as a ‘shard’ and hosts a portion of the overall collection data. Sharding is typically employed for collections that hold large datasets and have heavy operations on them.
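Horizontal partitioning can be illustrated by routing each document to a shard based on a hash of one of its fields. The sketch below is a deliberately simplified model (a stable hash taken modulo the number of shards) and does not reproduce how MongoDB itself distributes data; the field chosen as the key and the function names are assumptions for the example.

```python
import hashlib


def shard_for(shard_key, shard_count):
    """Map a key value to a shard number using a stable hash."""
    digest = hashlib.sha256(str(shard_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def distribute(documents, key_field, shard_count):
    """Place every document on exactly one shard, chosen by its key field."""
    shards = [[] for _ in range(shard_count)]
    for doc in documents:
        shards[shard_for(doc[key_field], shard_count)].append(doc)
    return shards


documents = [{"_id": i, "author": f"author{i % 5}"} for i in range(20)]
shards = distribute(documents, "author", 3)
print([len(shard) for shard in shards])
```

Documents with the same key value always land on the same shard, which is why the choice of key has a direct effect on how evenly the load is spread.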

Strategies for MongoDB data modelling

The strategies described here combine the References and Embedment relationships with the cardinality of the relationship between entities.

One-to-one with Embedded relationship: In this strategy, one data entity is embedded into another data entity, where the two entities have a one-to-one relationship with each other.

An example of a one-to-one Embedded relationship, between the user and the details of his place of birth, is illustrated here:

{
  "_id": "Ritesh",
  "name": "Ritesh Modi",
  "placeOfBirth": {
    "street": "123 xyz Street",
    "city": "xyzcity",
    "state": "xyzState",
    "zip": "12345"
  }
}

A one-to-one Embedded relationship should be used when:

Both the name and place of birth are retrieved together frequently.

Both the name and place of birth are updated together.

The place of birth sub-entity is not growing.

One-to-one with References relationship: In this strategy, one data entity references another data entity, where the two entities have a one-to-one relationship with each other.

An example of a one-to-one Referenced relationship, between the user and the details of his place of birth, is illustrated here:

{
  "_id": "Ritesh",
  "name": "Ritesh Modi",
  "placeofBirth": "RiteshPOB"
}

{
  "PlaceofBirth": {
    "_id": "RiteshPOB",
    "street": "123 xyz Street",
    "city": "xyzcity",
    "state": "xyzState",
    "zip": "12345"
  }
}

One-to-one Referenced relationships should be used when:

The name and place of birth are not usually retrieved together.

The name and place of birth are updated using different operations.

The place of birth sub-entity is not growing.

One-to-many with Embedded relationship: In this strategy, multiple data entities are embedded into another data entity, where they have a one-to-many relationship with each other.

An example of a one-to-many Embedded relationship, between an author and the books he has authored, is illustrated here:

{
  "_id": "Ritesh",
  "name": "Ritesh Modi",
  "booksAuthored": [
    {
      "name": "Windows server 2016",
      "publisher": "Self Publishing",
      "year": "2016",
      "price": "30"
    },
    {
      "name": "Ubuntu Linux",
      "publisher": "Self Publishing",
      "year": "2017",
      "price": "40"
    }
  ]
}

One-to-many Embedded relationships should be used when:

Both the author and the books published are retrieved together frequently.

Both the author and the books published are updated together.

The books authored sub-entity does not grow out of bounds, i.e., there is a one-to-few relationship between the entities.

One-to-many with References relationship: In this strategy, collections are referenced where they have a one-to-many relationship with each other.

An example of a one-to-many Referenced relationship between authors and books published is illustrated here:

{
  "_id": "Ritesh",
  "name": "Ritesh Modi"
}

{
  "_id": "bookid",
  "authorid": "Ritesh",
  "books": [
    {
      "name": "Windows server 2016",
      "publisher": "Self Publishing",
      "year": "2016",
      "price": "30"
    },
    {
      "name": "Ubuntu Linux",
      "publisher": "Self Publishing",
      "year": "2017",
      "price": "40"
    }
  ]
}

One-to-many Referenced relationships should be used when:

The author and the books published are not usually retrieved together.

The author and the books published are updated at different times in different operations.

Books authored can grow out of bounds.
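The criteria listed for the strategies above can be condensed into a small decision helper. This is only a rule-of-thumb sketch: the function name and its three boolean parameters are hypothetical, and mixed cases (for example, data that is read together but updated separately) are resolved here in favour of References as a simplifying assumption.

```python
def recommend_relationship(retrieved_together, updated_together, unbounded_growth):
    """Return "embedded" or "references" using the rules of thumb above."""
    if unbounded_growth:
        # Keep the parent document from growing without limit.
        return "references"
    if retrieved_together and updated_together:
        return "embedded"
    return "references"


# Author with a handful of books that are read and updated together.
print(recommend_relationship(True, True, False))   # embedded
# Same access pattern, but the list of books can grow out of bounds.
print(recommend_relationship(True, True, True))    # references
```

In practice these factors should be weighed together with indexing, atomicity and sharding needs rather than applied mechanically.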
