Introduction
Counting distinct values is a fundamental data aggregation task that can be complex depending on the structure of your data. MongoDB offers a flexible aggregation framework to carry out this operation effectively. In this tutorial, we’ll walk you through the steps of counting distinct values in each group using MongoDB.
Understanding $group and $addToSet
Before diving deep into counting distinct values, let’s understand the ‘$group’ stage in the MongoDB aggregation pipeline. The ‘$group’ stage groups input documents by a specified identifier expression and applies accumulator expressions to each group. ‘$addToSet’ is an accumulator operator that adds unique values to a set.
db.collection.aggregate([
{ $group: {
_id: '$fieldToGroupBy',
uniqueValues: { $addToSet: '$fieldToGetDistinctValuesFrom' }
}}
])
This pattern groups documents by ‘fieldToGroupBy’ and accumulates unique ‘fieldToGetDistinctValuesFrom’ values into an array for each group.
Counting Distinct Values with $size
Now that we have groups with unique values, to count them, we can use another operator ‘$size’ that returns the number of elements in an array.
db.collection.aggregate([
{ $group: {
_id: '$fieldToGroupBy',
uniqueValues: { $addToSet: '$fieldToGetDistinctValuesFrom' }
}},
{ $project: {
countDistinct: { $size: '$uniqueValues' }
}}
])
The ‘$project’ stage here computes the number of distinct values by getting the size of the ‘uniqueValues’ array.
Example: Group by Category
Consider a products collection where each document represents a product with a category and supplier. To count distinct suppliers for each category:
db.products.aggregate([
{ $group: {
_id: '$category',
distinctSuppliers: { $addToSet: '$supplier' }
}},
{ $project: {
numberOfSuppliers: { $size: '$distinctSuppliers' }
}}
])
Output would look something like this:
[
{ _id: "Electronics", numberOfSuppliers: 5 },
{ _id: "Clothing", numberOfSuppliers: 3 },
...
]
Handling Large Sets and Performance
While using ‘$addToSet’ and ‘$size’ works well for small datasets, when dealing with large datasets, this can lead to performance issues since ‘$addToSet’ holds all unique values in memory. A better approach for large sets is to use the ‘$sum’ accumulator with a sub-pipeline that handles the distinct count using ‘$group’.
db.collection.aggregate([
{ $group: {
_id: {
groupField: '$fieldToGroupBy',
distinctField: '$fieldToGetDistinctValuesFrom'
}
}},
{ $group: {
_id: '$_id.groupField',
countDistinct: { $sum: 1 }
}}
])
This approach first groups documents by a compound key including the group field and the distinct values. Subsequently, it regroups by just the group field, and counts each unique value once.
Indexing and Performance Optimization
For performance gains when counting distinct values, ensure proper indexing. Creating a compound index on the fields used for ‘$group’ can significantly improve aggregation speed.
db.collection.createIndex({ fieldToGroupBy: 1, fieldToGetDistinctValuesFrom: 1 })
After indexing, the aggregation queries will perform efficiently, even on larger datasets.
Use Case: Count Distinct Status Codes
Let’s consider a more concrete example: a logs collection with fields for userId and statusCode. We want to count how many distinct status codes each user has encountered.
db.logs.aggregate([
{ $group: {
_id: {
user: '$userId',
code: '$statusCode'
}
}},
{ $group: {
_id: '$_id.user',
uniqueStatusCodes: { $sum: 1 }
}}
])
Results would display the count of distinct status codes experienced by each user.
Advanced Techniques: Using $facet for Multiple Aggregations
When you need to perform multiple distinct counts in one query or gather additional statistics, ‘$facet’ provides a way to execute several aggregation pipelines within a single stage and return the results in a single document.
db.collection.aggregate([
{
$facet: {
"statusCounts": [
{ $group: { _id: "$status", count: { $sum: 1 } } }
],
"distinctUsersPerStatus": [
{ $group: { _id: "$status", users: { $addToSet: "$user" } } },
{ $project: { count: { $size: "$users" } } }
]
}
}
])
This query produces two separate counts: one counting the occurrences of each status and one counting the distinct users per status.
Conclusion
Counting distinct values in each group in MongoDB is a powerful technique for data analysis. It can be straightforward or involve more complex aggregation, especially for performance optimization in large data sets. Indexing is key for the best performance.