MongoDB Aggregation I: Counting and Grouping

November 22, 2009

This is the first in a series of articles detailing the syntax, patterns, and use cases for MongoDB’s aggregation functions:

  1. Counting and Grouping
  2. Grouping Elaborated
  3. Map-reduce Basics

Here, as an introduction, we cover the basics of count(), distinct(), and group().

Count

count() simply returns the number of documents in a collection.

Suppose we have a collection where each document represents a pageview. We can get the total number of pageviews like so:

// Javascript
db.pageviews.count()
# Ruby
@db['pageviews'].count

Admittedly more useful would be the number of pageviews from a given month:

// Javascript
db.pageviews.count({date: 
  {'$gte': new Date("Nov 1, 2009"), '$lt': new Date("Dec 1, 2009")}})
# Ruby
@db['pageviews'].find({:date => 
  {'$gte' => Time.utc(2009, 11), '$lt' => Time.utc(2009, 12)}}).count

Notice that we’ve passed a document selector to the count() method. This ensures that only documents matching the selector are included in the total.

Distinct

Another simple but useful aggregator is distinct(), which returns an array of unique values for a given key in a collection. We can specify a root-level key or a nested key. Here, we request unique ip-addresses and user agents:

// Javascript
db.pageviews.distinct('ip-address')
db.pageviews.distinct({'user.agent'})
# Ruby
@db['pageviews'].distinct('ip-address')
@db['pageviews'].distinct('user.agent')

Group

MongoDB’s group() command provides some of the same functionality as SQL’s GROUP BY, although the syntax is rather different. Like most database commands, group() takes a document whose keys designate the parameters of the operation:

// Group by user-agent, returning a sum for each user-agent.
db.pageviews.group(
{
 key: {'user.agent': true}, 
 initial: {sum: 0}, 
 reduce: function(doc, prev) { prev.sum += 1}
});

// Returns (Note: simplified)
[
  {
    'user.agent': 'Mozilla 5.0 / Gecko'
    'sum': 241
  },
  {
    'user.agent': 'Mozilla 5.0 / Webkit' 
    'sum': 79
  }
]

Looking at the command itself, the first parameter is clear enough: key specifies which key or keys to group by.

The next two parameters, initial and reduce, may be less familiar. Their use is analgous to most programming language implementations of inject.

With initial, we provide a base, aggregator document for each grouped result. Here, we’re just saying that the base document should contain one key, sum, having a value of 0.

// Initial, aggregator document
{sum: 0}

The reduce function take two parameters: 1) the current document being processed, and 2) the document we’re aggregating over (which starts out as the initial document described above).

// Reduce function
function(doc, prev) {
  prev.sum += 1;
}

In this simple case, we don’t even use the doc parameter; we just increment our initial document’s sum key. Since a new document is used for each grouping, the group() function returns all those documents as an array, adding any keys we’re grouping by, so that we’re given a result like this:

// Result (Note: simplified)
[
  {
    'user.agent': 'Mozilla 5.0 / Gecko'
    'sum': 241
  },
  {
    'user.agent': 'Mozilla 5.0 / Webkit' 
    'sum': 79
  }
]

The command we’ve been describing could be run from Ruby as follows:

# Group from Ruby
@db['pageviews'].group(['user.agent'], nil, {'sum' => 0},
  "function(doc, prev) { prev.sum += 1}")

With the second parameter, here nil, we can provide a query selector, so that group() will only operate over a certain subset of the collection. In the next installment, we explore that along with some of group()’s more advanced features.

In the meantime, fire the up the MongoDB shell, experiment with some of these functions, and don’t hesitate to consult the docs.

funky dingbat
blog comments powered by Disqus