In a relational (SQL) database, you typically normalize your data into many small, distinct tables. MongoDB takes a different approach: you structure your data to match the way your application queries it. This often means denormalizing, that is, keeping related data together in a single document.

The most important decision you'll make when designing your schema is whether to embed related data or to reference it.

The Core Decision: Embedding vs. Referencing

1. Embedding (Denormalization)

Embedding means storing related data as a sub-document or an array of sub-documents inside a parent document.

When to use it:

  • For "one-to-few" relationships.
  • When the data is intrinsically tied to the parent document and is not accessed on its own.
  • When you want to retrieve all related data in a single database query.

Example: A blog Post that has a few Comments. The comments "belong" to the post.

{
  "_id": "post123",
  "title": "My First Post",
  "content": "This is the content...",
  "comments": [
    { "author": "Alice", "text": "Great post!" },
    { "author": "Bob", "text": "Thanks for sharing." }
  ]
}

  • Pros: Fast reads. You get the post and all its comments in one database call. Writes to a single document are atomic, so changing a post and its comments together (e.g., updating the title and adding a comment) is simple.
  • Cons: Large documents. If the embedded array (comments) can grow without bound, the document could exceed MongoDB's 16 MB document size limit.
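
As a rough illustration of the atomic update mentioned above, the sketch below builds one update document that changes the title and appends a comment in the same write. It only constructs plain dictionaries, so it runs without a database. The helper name build_post_update, the posts collection handle in the comments, and the commenter "Carol" are assumptions made for this example.

```python
def build_post_update(new_title, comment):
    """Return one update document that renames a post and appends a comment.

    Both changes target the same document, so MongoDB applies them atomically.
    """
    return {
        "$set": {"title": new_title},
        "$push": {"comments": comment},
    }


# Filter that selects the post from the example above.
post_filter = {"_id": "post123"}

# "Carol" is a made-up commenter used only for illustration.
post_update = build_post_update(
    "My First Post (edited)",
    {"author": "Carol", "text": "Nice read."},
)

# With PyMongo, and assuming `posts` is a Collection handle, the write and the
# single-query read would look like:
#   posts.update_one(post_filter, post_update)
#   post = posts.find_one(post_filter)  # post and comments in one call
```

Because the comments live inside the post, neither the write nor the read needs to touch a second collection.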

2. Referencing (Normalization)

Referencing means storing related data in separate collections and using a unique ID to link them, similar to a foreign key in SQL.

When to use it:

  • For "one-to-many" (where "many" is very large) or "many-to-many" relationships.
  • When the related data is frequently accessed on its own.
  • To avoid data duplication.

Example: An e-commerce application with Products and Orders. A single product can be part of thousands of orders.

// In the 'products' collection
{
  "_id": "product456",
  "name": "Laptop",
  "price": 1200
}

// In the 'orders' collection
{
  "_id": "order789",
  "userId": "userABC",
  "items": [
    {
      "productId": "product456", // Reference to the product
      "quantity": 1
    }
    // ... other items
  ]
}

  • Pros: Smaller documents. Avoids data duplication (the product name and price aren't copied into every order). No risk of unbounded arrays.
  • Cons: Slower reads. To get the full order details (with product names and prices), your application needs a second query against the products collection. Alternatively, the join can run on the server with the $lookup aggregation stage.
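
The second query and the application-side join can be sketched as follows. This is a minimal in-memory illustration using the documents from the example above, not a definitive implementation. The function names and the db handle mentioned in the comments are assumptions.

```python
def product_ids(order):
    """Collect the distinct product IDs referenced by an order, in first-seen order."""
    ids = []
    for item in order["items"]:
        if item["productId"] not in ids:
            ids.append(item["productId"])
    return ids


def products_filter(order):
    """Build the filter for the second query: fetch every referenced product at once."""
    return {"_id": {"$in": product_ids(order)}}


def join_order(order, products):
    """Merge product details into each order item (the application-side join)."""
    by_id = {product["_id"]: product for product in products}
    lines = []
    for item in order["items"]:
        product = by_id[item["productId"]]
        lines.append({
            "productId": item["productId"],
            "name": product["name"],
            "price": product["price"],
            "quantity": item["quantity"],
        })
    return lines


order = {
    "_id": "order789",
    "userId": "userABC",
    "items": [{"productId": "product456", "quantity": 1}],
}

# With PyMongo, and assuming `db` is a Database handle, this list would come from:
#   products = list(db.products.find(products_filter(order)))
products = [{"_id": "product456", "name": "Laptop", "price": 1200}]

order_lines = join_order(order, products)
```

Fetching all referenced products with a single $in filter keeps the cost at two queries per order, regardless of how many items the order contains.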

Common Schema Design Patterns

  • Attribute Pattern: Useful when documents have many fields of a similar type that you often query on. Instead of dozens of top-level fields, each needing its own index, group them into an array of key-value pairs. A single compound index on the key and value fields (here, specs.k and specs.v) can then serve queries on any attribute.

    Example: A product with many specifications.

{
  "name": "Laptop",
  "specs": [
    { "k": "ram_gb", "v": 16 },
    { "k": "screen_in", "v": 14 },
    { "k": "cpu_mhz", "v": 3200 }
  ]
}
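
To show how the key-value layout might be indexed and queried, the sketch below converts a flat set of specifications into the specs array, then builds a compound index key and an $elemMatch filter as plain Python values. The helper names and the "at least 16 GB of RAM" query are assumptions chosen for illustration.

```python
def specs_to_attributes(specs):
    """Turn {"ram_gb": 16, ...} into [{"k": "ram_gb", "v": 16}, ...]."""
    return [{"k": key, "v": value} for key, value in specs.items()]


def spec_filter(key, min_value):
    """Match documents with one specs element whose key is `key` and whose
    value is at least `min_value`. $elemMatch requires both conditions to
    hold for the same array element."""
    return {"specs": {"$elemMatch": {"k": key, "v": {"$gte": min_value}}}}


# One compound index key that covers every attribute stored in the array.
SPECS_INDEX = [("specs.k", 1), ("specs.v", 1)]

laptop = {
    "name": "Laptop",
    "specs": specs_to_attributes({"ram_gb": 16, "screen_in": 14, "cpu_mhz": 3200}),
}

ram_query = spec_filter("ram_gb", 16)

# With PyMongo, and assuming `products` is a Collection handle:
#   products.create_index(SPECS_INDEX)
#   matches = list(products.find(ram_query))
```

Adding a new specification later only adds another array element. The index definition does not change.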

  • Extended Reference Pattern: A hybrid approach. You reference another document but also duplicate a few frequently needed fields, so common reads avoid a second query. The trade-off is that the duplicated fields must be updated whenever the source changes, so the pattern works best for fields that rarely change.

    Example: A comment on a post. You reference the authorId but also duplicate the authorUsername for easy display.

{
  "_id": "comment999",
  "postId": "post123",
  "text": "This is a comment.",
  "authorId": "userABC",
  "authorUsername": "Alice" // Duplicated data for faster reads
}
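
A minimal sketch of how the duplicated field might be written and later kept in sync, again using plain dictionaries so it runs without a database. The shape of the user document ({"_id": ..., "username": ...}), the helper names, and the comments collection handle in the code comments are assumptions, not part of the example above.

```python
def build_comment(comment_id, post_id, text, author):
    """Create a comment that references its author and copies the username."""
    return {
        "_id": comment_id,
        "postId": post_id,
        "text": text,
        "authorId": author["_id"],             # the reference
        "authorUsername": author["username"],  # duplicated for faster reads
    }


def build_username_sync(user_id, new_username):
    """Return (filter, update) that refreshes the copied username on every
    comment written by the given user."""
    return {"authorId": user_id}, {"$set": {"authorUsername": new_username}}


# Hypothetical user document; only the two fields used here are assumed.
alice = {"_id": "userABC", "username": "Alice"}

comment = build_comment("comment999", "post123", "This is a comment.", alice)

# If the user renames themselves, the copies must be updated too. With PyMongo,
# and assuming `comments` is a Collection handle:
#   comments.update_many(sync_filter, sync_update)
sync_filter, sync_update = build_username_sync("userABC", "Alice2")
```

The extra update on rename is the cost of the pattern. It pays off when the copied field is read far more often than it changes.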