In an age where data volumes keep growing and the number of online users is unprecedented, data throughput is one of the most crucial factors in the success of Big Data architectures. High throughput capacity allows companies to capture more data from users and IoT devices, which can then be used to solve customers' problems and improve their experience.
MongoDB is currently one of the most widely used NoSQL document databases. It's available directly from MongoDB as well as in the form of managed services from all the major cloud providers. In this article, we'll analyze the throughput capacity of MongoDB 5.0, which was announced on the 13th of July 2021, and compare it with the previous version (4.4).
Before conducting the analysis, it's important to define the key performance indicators we are interested in. In our case, we will mainly focus on raw throughput, which represents the number of write transactions that the MongoDB instance is capable of processing per second.
Since MongoDB is optimized for high volumes of data, measuring its performance on small batches of transactions can yield imprecise results. Thus, we will measure the throughput over 1 million transactions and normalize it to get the rate per second.
In order to reduce variance and maximize the reproducibility of our tests, we will build a benchmark script and use it for all measurements. For the programming language, we will use JavaScript, mainly because its objects map naturally to the JSON-like documents that MongoDB handles natively, without going through a driver, which reduces overhead. Another reason is that JavaScript is the only language supported by mongosh, the shell we used for this test.
To reduce overhead and get the raw throughput rate, we decided to eliminate the following potential bottlenecks:
As explained in the previous section, we start by generating a set of random objects made of an Integer, a String, and a timestamp. These objects are loaded into memory with minimal manipulation.
var random_documents = new Array()
// Generate documents_num random documents, each made of an integer, a hash-like string, and a timestamp
for (let index = 0; index < documents_num; index++) {
    var rand_int = Math.round(Math.random() * 1000000)
    // Concatenate three random fractions rendered in base 16 to get a longer hash-like string
    var rand_hash = [1, 2, 3].map((el) => Math.random().toString(16).substring(2)).join('')
    var current_date = new Date()
    random_documents.push({
        rand_int,
        rand_hash,
        current_date
    })
}
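For reference, one element of random_documents would look roughly like this (the values below are made up for illustration, since every run produces different random data):
// Illustrative example of a single generated document (values are not from an actual run)
{
    rand_int: 483920,
    rand_hash: "a3f1c9d20b4e7f6b1c2d9e0a4b5c6d7e8f9a0b",
    current_date: new Date("2021-07-20T10:15:30.000Z")
}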
After preparing the set of documents, we start a stopwatch and initiate the write operations. The insertion is done in batches, and to also analyze the performance impact of the batch size, we tested the throughput starting with a batch of 1 (single-document writes) up to batches of 400,000 documents.
When inserting large volumes of data into a MongoDB instance, it's highly recommended to use bulk insertion methods such as insertMany or bulkWrite. These methods write multiple documents in a single call and yield significant performance gains compared to repeated single-document insert calls.
// Start the stopwatch, then insert the documents in batches of batch_size using insertMany
var stopwatch = new Date();
for (let index = 0; index < documents_num; index += batch_size) {
    db.benchmark_collection.insertMany(
        random_documents.slice(index, index + batch_size)
    )
}
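For completeness, the same loop could also be written with bulkWrite, the other bulk method mentioned above, by wrapping each document in an insertOne operation. This is only an equivalent sketch, not the code we benchmarked:
// Sketch only: equivalent batched insertion using bulkWrite instead of insertMany
for (let index = 0; index < documents_num; index += batch_size) {
    db.benchmark_collection.bulkWrite(
        random_documents
            .slice(index, index + batch_size)
            .map((doc) => ({ insertOne: { document: doc } }))
    )
}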
In MongoDB, there is a maxWriteBatchSize limit that defines the maximum size of a write batch. Its value was 1,000 before it was increased to 100,000 in MongoDB 3.6. Any bulk operation that exceeds this limit is divided into multiple groups of at most 100,000 operations.
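As a quick illustration of this limit, a hypothetical 400,000-document bulk call would be processed internally as four groups:
// Illustration only: number of internal groups a single bulk call would be split into,
// given the 100,000-operation limit introduced in MongoDB 3.6
var maxWriteBatchSize = 100000
var requested_batch = 400000
var internal_groups = Math.ceil(requested_batch / maxWriteBatchSize)  // => 4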
Below, we show the performance results we got from inserting 1 million distinct random documents into a new collection, using batch sizes ranging from 1 to 400,000.
Based on the results, we can see that performance keeps increasing with the batch size. However, past a batch of 1,000 records, the gains decrease dramatically to the point where we can consider them negligible. For this reason, we will use a batch size of 1,000 records for the benchmark.
To start the benchmark, we launch it directly with mongosh as follows:
mongosh benchmark --eval "var documents_num=1000000;batch_size=1000" benchmark.js
We pass the number of documents and the batch size as execution variables to the mongosh command, as shown in the example above.
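The reported rate is obtained by reading the stopwatch once the insert loop has finished and normalizing by the elapsed time. Below is a minimal sketch of that final step; the variable names are ours, not necessarily those of the original script:
// Sketch of the normalization step: convert the total elapsed time into writes per second
var elapsed_ms = new Date() - stopwatch
var writes_per_second = documents_num / (elapsed_ms / 1000)
print("Inserted " + documents_num + " documents in " + elapsed_ms + " ms (" + Math.round(writes_per_second) + " writes/s)")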
The maximum performance that we got with MongoDB 5.0 is 102k writes/second, which is impressive compared to relational databases such as MySQL (around 8k writes/second).
Since MongoDB 5.0 is relatively new, and to have a baseline comparison with the previous version, we ran the same benchmark on MongoDB 4.4 using the exact same hardware configuration. The result is a maximum throughput of ~102k writes/second, exactly the same performance as the new release. This is consistent with the changelog of the new version and confirms that moving to version 5 has no significant impact on write performance.
The performance results were obtained by using the default writeConcern and ordered flags. However, these can be customized for each use case.
db.benchmark_collection.insertMany(
    [ <document A>, <document B>, <document C>, ... ],
    { writeConcern: <document>, ordered: <boolean> }
)
In its early versions, MongoDB's default configuration was set to write data in an unsafe way to get maximum performance. Later on, a safer default Write Concern was introduced, which offered better data integrity out of the box.
As described in the official MongoDB documentation, write concern describes the level of acknowledgment requested from MongoDB for write operations. The writeConcern object is defined as follows:
{ w: <value>, j: <boolean>, wtimeout: <number> }
The w flag defines the minimum number of mongod instances to get a write acknowledgement from. It can have 3 possible values:
The j flag is related to journaling acknowledgment. MongoDB offers automatic on-disk journaling that can be used to recover data in case of a failure. This value is a boolean:
The wtimeout setting specifies the maximum duration, in milliseconds, to wait for the requested acknowledgements before considering the write a failure.
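As an illustration (not part of our benchmark), a stricter configuration on the same call could request acknowledgement from a majority of replica set members, with journaling and a 5-second timeout:
// Illustration only: a strict write concern requiring majority acknowledgement,
// on-disk journaling, and a 5,000 ms timeout
db.benchmark_collection.insertMany(
    random_documents.slice(0, batch_size),
    { writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
)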
For production use cases, studying acknowledgements and setting them according to the use case is highly recommended. However, for our experiment, we will disable all acknowledgements and on-disk journaling (in-memory journaling can't be disabled) and rerun our tests to see whether that removes any remaining overhead from the write operations.
The setting we will be using is the following:
// Same insert loop, but with acknowledgements and on-disk journaling disabled via the write concern
var stopwatch = new Date();
for (let index = 0; index < documents_num; index += batch_size) {
    db.benchmark_collection.insertMany(
        random_documents.slice(index, index + batch_size),
        { writeConcern: { w: 0, j: false } }
    )
}
After removing all acknowledgements, here are the results we obtained:
As published on MongoDB's blog, the new version 5.0 introduces many new features such as native time series collections and Live Sharding. Even with these new high-level features, and based on our benchmark, the performance of the engine remains practically unchanged compared to the previous version. As for the impact of Write Concerns, throughput increased only marginally when completely disabling all acknowledgements and automatic journaling. The gains aren't significant, especially considering the risk taken in terms of data integrity, which is why it's highly recommended to either keep the default configuration for basic use cases or fine-tune it for large-scale architectures.