
MongoDB 5.0 Throughput Performance vs. Data Integrity Balance - A Comparison with 4.4

Jul 30th '21 · mongodb · throughput · performance

Introduction

In an age where data volumes are continuously growing and the number of online users is unprecedented, data throughput is one of the most crucial factors in the success of Big Data architectures. High throughput capacity allows companies to capture more data from users and IoT devices, which can be used to solve customers' problems and improve their experience.

MongoDB is currently one of the most used NoSQL document databases. It's available directly from MongoDB as well as in the form of managed services from all the major Cloud providers. In this article, we'll be analyzing the throughput capacity of MongoDB 5.0, which was announced on the 13th of July 2021, and comparing it with the previous version (4.4).

Key Performance Indicators & Benchmark Environment

Performance Indicators Definition

Before conducting the analysis, it's important to define the key performance indicators we are interested in. In our case, we will be mainly focused on raw throughput, which represents the number of write transactions that the MongoDB instance is capable of processing per second.

Since MongoDB is optimized for high volumes of data, measuring its performance on small batches of transactions can yield imprecise figures. Thus, we will measure the throughput over 1 million transactions and normalize it to get a per-second rate.
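To make the normalization concrete, here is a minimal sketch (the helper function and the example timing are illustrative and not part of the benchmark script; the ~102k figure corresponds to the result reported later in this article):

// Illustrative helper: convert a stopwatch reading in milliseconds
// into a writes-per-second rate for a given number of documents.
function writesPerSecond(documents_num, elapsed_ms) {
    return documents_num / (elapsed_ms / 1000)
}

// Example: 1 million documents written in roughly 9.8 seconds
// corresponds to ~102k writes per second.
print(Math.round(writesPerSecond(1000000, 9800)) + " writes/second")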

The Choice of Technologies

In order to reduce variance and maximize the reproducibility of our tests, we will be building a benchmark script and conducting the various measurements with it. For the programming language, we will be using JavaScript for several reasons, one being that its objects map natively to the documents MongoDB stores (JSON), removing the need for a driver and the overhead that comes with it. Another reason is that JavaScript is the only language supported by mongosh, which is what we used for this test.

The Experimental Environment

To reduce overhead and get the raw throughput rate, we decided to eliminate the following potential bottlenecks:

  • Network Latency: Mongosh is running directly on the MongoDB instance server and connecting to the local IP.
  • Raw Compute & Storage Performance: All the benchmarks were conducted on an AWS c5ad.4xlarge instance, which is running a 2nd generation AMD EPYC 7002 with 16 allocated vCores clocked at 3.3GHz, 32Gb of RAM and a 600Gb Gen 3.0 NVME SSD.
  • Driver / Programming Language Overhead: we chose to run a native Javascript script directly in Mongo Shell without the use of any NodeJS framework or driver.
  • Objects Creation / Loading: The objects that we will be inserting will be created in advance in memory before starting the insertion process. This is important to isolate since it depends on the used technologies and the specific case and application. The objects themselves will be simple JSON objects containing a random Integer, a random hash, and a timestamp.

The Benchmark: Implementation

As explained in the previous section, we start by generating a set of random objects made of an Integer, a String, and a timestamp. These objects are loaded into memory with minimal manipulation.

// Pre-generate documents_num documents in memory so that object creation
// does not interfere with the write throughput measurement.
var random_documents = []

for (let index = 0; index < documents_num; index++) {

    // A random integer between 0 and 1,000,000
    var rand_int = Math.round(Math.random() * 1000000)
    // A random hash-like string built from three random hexadecimal fragments
    var rand_hash = [1, 2, 3].map(() => Math.random().toString(16).substring(2)).join('')
    // The current timestamp
    var current_date = new Date()

    random_documents.push({
        rand_int,
        rand_hash,
        current_date
    })
}

After preparing the set of documents, we start a stopwatch and initiate the write operations. The insertion is done in batches, and to analyze the performance impact of the batch size as well, we tested the throughput starting with a batch of 1 (single-document writes) and going up to a batch of 400,000 documents.

While inserting large volumes of data into a MongoDB instance, it's highly recommended to use bulk insertion methods such as insertMany or bulkWrite. These methods write multiple documents at once and bring significant performance gains compared to the single insert method.

// Start the stopwatch right before the first write operation.
var stopwatch = new Date();

// Insert the pre-generated documents in slices of batch_size.
for (let index = 0; index < documents_num; index += batch_size)
    db.benchmark_collection.insertMany(
            random_documents.slice(index, index + batch_size)
    )

// Elapsed time in milliseconds, later normalized to writes per second.
print(new Date() - stopwatch)
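For reference, here is a minimal sketch of the same insertion loop expressed with bulkWrite, the other bulk method mentioned above (this is not part of the benchmark script; insertMany was used for all the measurements):

// Same batching loop, but each document is wrapped in an insertOne
// operation, as required by the bulkWrite API.
for (let index = 0; index < documents_num; index += batch_size)
    db.benchmark_collection.bulkWrite(
            random_documents
                .slice(index, index + batch_size)
                .map((doc) => ({ insertOne: { document: doc } }))
    )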

In MongoDB, there is a maxWriteBatchSize setting that defines the maximum number of write operations allowed in a single batch. Its value was 1,000 before being raised to 100,000 in MongoDB 3.6. Any operation that exceeds this limit is divided into groups of at most 100,000 operations.
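As a quick sketch of this grouping rule (illustrative arithmetic only; the splitting is done transparently by the server):

// A single insertMany call with 400,000 documents (the largest batch size
// we tested) is internally executed as Math.ceil(400000 / 100000) = 4 groups.
var max_write_batch_size = 100000
print(Math.ceil(400000 / max_write_batch_size))   // 4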

Below, we show the performance results obtained by inserting 1 million distinct random documents into a new collection, using batch sizes ranging from 1 to 400,000.

Paperboat - MongoDB Performance per Batch Size

Based on the results, we can see that throughput keeps increasing with the batch size. However, past a batch of 1,000 records, the performance gains decrease dramatically to the point of being negligible. For this reason, we will be using a batch size of 1,000 records for the rest of the benchmark.

To start the benchmark, we launch it using mongosh directly like the following:

mongosh benchmark --eval "var documents_num=1000000;batch_size=1000" benchmark.js

We pass the number of documents and the batch size as execution variables to the mongosh command, as shown in the example above.

Results (Using Default Write Concerns)

The maximum performance we got with MongoDB 5.0 is 102k writes / second, which is impressive compared to relational databases such as MySQL (around 8k writes / second).

Since MongoDB 5.0 is relatively new, and to have a baseline comparison with the previous version, we ran the same benchmark on MongoDB 4.4 with the exact same hardware configuration. The result is a maximum throughput of ~102k writes / second, which is exactly the same performance as the new release. This is in line with the changelog of the new version and confirms that there was no significant performance impact when moving to version 5.0.

Results (Using Custom Write Concerns)

The performance results were obtained by using the default writeConcern and ordered flags. However, these can be customized for each use case.

db.benchmark_collection.insertMany(
   [ <document A> , <document B>, <document C>, ... ],
   { writeConcern: <document>, ordered: <boolean> }
)

In its early versions, MongoDB's default configuration was set to write data in an unsafe way in order to get the maximum performance. Later on, a safer default Write Concern was introduced, which offers better data integrity by default.

As described by the official MongoDB documentation: Write Concerns describe the level of acknowledgment requested from MongoDB for write operations. The writeConcern object is defined as follows:

{ w: <value>, j: <boolean>, wtimeout: <number> }

The w flag defines the minimum number of mongod instances to get a write acknowledgement from. It can have 3 possible values:

  • A number N: By specifying a number, MongoDB returns a positive acknowledgment only if the write operation succeeded on at least this number of mongod instances. The value w=1 was historically the default configuration in MongoDB (since version 5.0, the implicit default is majority). With this value, the acknowledgement is requested from the standalone mongod or the primary instance of a replica set. With a value of w=0, no acknowledgements are requested. This configuration deactivates the check for successful writes, which may potentially increase performance. With w = N > 1, acknowledgment is requested for successful writes on the primary node as well as N-1 secondary nodes.
  • majority: When this value is set, acknowledgement is requested from the primary node and N-1 secondary nodes, where N is the size of the majority of the voting nodes. For example, a replica set of 5 members (P-S-S-S-S) has a majority of 3 nodes; in this case acknowledgement is requested from the primary and 2 secondary nodes. This setting can be considered a generalization of the first one, and it is well adapted to elastic clusters where the number of voting nodes can change throughout the lifetime of the cluster.
  • tag: With a custom value, acknowledgment is requested from nodes which have the specified tag, as defined in settings.getLastErrorModes.

The j flag is a configuration related to journaling acknowledgment. MongoDB offers automatic on-disk journaling that can be used for recovery in case of a failure. This flag is a boolean:

  • True: When this value is set, an acknowledgment that mongod has written to the on-disk journal is requested.
  • False: If this value is set, no journaling acknowledgements are requested.

The wtimeout setting specifies the maximum duration, in milliseconds, to wait for the requested acknowledgements before considering the write a failure.
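For illustration, here is a sketch of a stricter-than-default configuration combining the three flags (the values are illustrative and are not the ones used in our benchmark): acknowledgment from a majority of the voting nodes, on-disk journaling confirmation, and a 5-second timeout.

// Illustrative only: wait for a majority of voting nodes to acknowledge the
// write, require journaling confirmation, and time out after 5000 ms.
db.benchmark_collection.insertMany(
    random_documents.slice(0, batch_size),
    { writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
)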

For production use cases, studying acknowledgements and setting them according to the use case is highly recommended. However, for our experimentation, we will disable all acknowledgements and on-disk journaling (in-memory journaling can't be disabled) and run our tests to see whether that removes any potential overhead from the write operations.

The setting we will be using is the following:

var stopwatch = new Date();

// Same batching loop, but with all write acknowledgments disabled (w: 0)
// and journaling acknowledgment turned off (j: false).
for (let index = 0; index < documents_num; index += batch_size)
    db.benchmark_collection.insertMany(
            random_documents.slice(index, index + batch_size),
            { writeConcern: { w: 0, j: false } }
    )

// Elapsed time in milliseconds, normalized to writes per second as before.
print(new Date() - stopwatch)

By removing all acknowledgments, we obtained the following results:

Paperboat - MongoDB Performance Write Concerns

Conclusion

As published in MongoDB's blog, the new version 5.0 introduces many new features such as native time series collections and Live Resharding. Even with the introduction of these new high-level features, and based on our benchmark, the performance of the engine remains essentially unchanged compared to the previous version. In terms of Write Concerns impact, throughput increased only marginally when completely disabling all Write Concerns and automatic journaling. The gains are not significant, especially considering the risk taken in terms of data integrity checks, which is why it's highly recommended to either keep the default configuration for basic use cases or fine-tune it for large-scale architectures.
