Monary: High-performance column queries for MongoDB

Jan 22 2011 Published by admin under Python

MongoDB is a document-oriented database, organized for quick access to records (or rows) of data. When doing analytics on a large data set, it is often desirable to have it in a column-oriented format. Columns of data may be thought of as mathematical vectors, and a wealth of techniques exist for gathering statistics about data that is stored in vector form.

For small to medium sized collections, it is possible to materialize several columns of data in the memory of a modern PC. For example, an array of 100 million double-precision numbers consumes 800 million bytes, or about 0.75 GB. For larger problems, it’s still possible to materialize a substantial portion of the data, or to work with data in multiple segments. (Very large problems require more powerful weapons, such as map/reduce.)

Extracting column data from MongoDB using Python is fairly straightforward. In PyMongo, collection.find() generates a sequence of dictionary objects. When dealing with millions of records, the trick is not to keep these dictionaries in memory, as they tend to be large. Fortunately, it’s easy to move the data into arrays as it is loaded.

First, let’s create 3.5 million rows of test data:

import random
import pymongo

NUM_BATCHES = 3500
BATCH_SIZE = 1000
# 3500 batches * 1000 per batch = 3.5 million records

c = pymongo.Connection("localhost")
collection = c.mydb.collection

for i in xrange(NUM_BATCHES):
    stuff = [ ]
    for j in xrange(BATCH_SIZE):
        record = dict(x1=random.uniform(0, 1),
                      x2=random.uniform(0, 2),
                      x3=random.uniform(0, 3),
                      x4=random.uniform(0, 4),
                      x5=random.uniform(0, 5)
                 )
        stuff.append(record)
    collection.insert(stuff)

Here’s an example that uses numpy arrays:

import numpy
import pymongo

c = pymongo.Connection("localhost")
collection = c.mydb.collection
num = collection.count()
arrays = [ numpy.zeros(num) for i in range(5) ]

for i, record in enumerate(collection.find()):
    for x in range(5):
        arrays[x][i] = record["x%i" % (x + 1)]

for array in arrays: # prove that we did something...
    print numpy.mean(array)

With 3.5 million records, this query takes 85 seconds on an EC2 Large instance running Ubuntu 10.10 64-bit, and takes 88 seconds on my MacBook Pro (2.66 GHz Intel Core 2 Duo with 8 GB RAM).

These timings might seem impressive, given that they’re loading 200,000+ values per second. However, closer examination reveals that much of that time is spent by pymongo as it reads each query result and transforms the BSON result to a Python dictionary. (If you watch the CPU usage, you’ll see Python is using 90% or more of the CPU.)

It is possible to get (much) more speed from the query if we bypass the PyMongo driver. To demonstrate this, I’ve developed monary, a simple C library and accompanying Python wrapper which make use of the MongoDB C driver. The code is designed to accept a list of desired fields, and to load exactly those fields from the BSON results into some provided array storage.

Here’s an example of the same query using monary:

from monary import Monary
import numpy

monary = Monary("127.0.0.1")
arrays = monary.query(
    "mydb",                         # database name
    "collection",                   # collection name
    {},                             # query spec
    ["x1", "x2", "x3", "x4", "x5"], # field names
    ["float64"] * 5                 # field types
)
monary.close()

for array in arrays:                # prove that we did something...
    print numpy.mean(array)

Monary is able to perform the same query in about 4 seconds, for a rate of about 4 million values per second (20 times faster!). Here’s a quick summary of how this Monary query stacks up against PyMongo:

Operation     | Time for EC2 Large (s) | Time for MacBook Pro (s)
------------- | ---------------------- | ------------------------
Insert Data   | 102                    | 76
PyMongo Query | 85                     | 88
Monary Query  | 5.4                    | 3.8


Of course, this test has created some fairly ideal circumstances: it queries every record in the collection, the records contain only the queried data (plus ObjectIDs), and the database is running locally. Performance may degrade with a remote server, with larger records, or when querying for only a subset of the records (which requires either that more records be scanned, or that an index be used).
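
As an illustration, here is a minimal sketch of such a subset query, using the same monary.query() call shown above; the filter document is just a standard MongoDB query spec, and the field names match the test data from earlier:

from monary import Monary
import numpy

# A sketch of querying only a subset of the records: the query spec is a
# standard MongoDB filter document, applied by the server before the
# matching columns are loaded into arrays.
monary = Monary("127.0.0.1")
x1, x2 = monary.query(
    "mydb",                  # database name
    "collection",            # collection name
    {"x1": {"$gt": 0.5}},    # query spec: only records where x1 > 0.5
    ["x1", "x2"],            # field names
    ["float64"] * 2          # field types
)
monary.close()

print numpy.mean(x1), numpy.mean(x2)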

At present, Monary only knows how to retrieve four types:

* ObjectID
* Int32
* Float64
* Bool

Monary’s source code is available on bitbucket. It includes a copy of the Mongo C driver, and requires compilation and installation, which can be done via the included “setup.py” file. (The installation script works, but is in a somewhat rough state. Any help from a distutils guru would be greatly appreciated!) To run Monary from Python, you will need to have the pymongo and numpy packages installed.

In this first release, Monary has several limitations. Hopefully many of these will be lifted in a future release:

* Only supports a few field types (need to add more numeric types + string support)
* Does not yet support fetching nested fields (e.g. “x.y”)
* Lacks support for MongoDB authentication
* Requires PyMongo and NumPy to be installed (should also support Python’s array module if numpy isn’t installed)

I hope that you find Monary useful. If you have any questions or suggestions, please leave a comment.

* Monary Source Code (on bitbucket)

13 responses so far

  • Just to make sure I’m following the basics — the practical application for heavy-read queries of this sort of database is for graph theory applications, right?

    If this is correct, could you give some background on the proportion of CPU-computation to data-query time one would experience in a typical application (e.g. a commercial product recommendations engine) that this would help? In other words, with a 20x speedup on these reads, how much does that improve the overall performance of a graph-theory-based application?

    • Beach says:

      Not exactly. MongoDB is a document database, so it doesn’t really make any special design affordances for graph databases.

      Monary provides a high-speed interface for extracting whole columns from a table (or query result). Monary is roughly 20x faster than the same query would be from Python using the standard PyMongo driver. (And MongoDB is typically much faster than a conventional RDBMS.) This makes Python+MongoDB+Monary+NumPy a powerful combination for near “real-time” analytics.

      For example, suppose you had a table that contained 100,000 orders, and you wanted to extract the id, subtotal, discount, total_amount, and customer_zip columns. You could use Monary to extract this data in a few seconds, with the results returned as 5 arrays corresponding to these columns. Moreover, if you assume a 12-byte ID, 3 x 4-byte prices, and a 5-byte zip, you could store this information using (12+4+4+4+5 =) 29 bytes per record, for a total of about 2,900,000 bytes, or roughly 2.9 MB.

      Now, suppose you’re doing analytics on this data. Perhaps you’d like to find the zip codes which had the largest average order size, or the largest total order. Maybe you want to find a correlation between zip code and discount received. You can do these computations with relative ease using the numpy library. Best of all, you can have this information available in a matter of seconds, whereas many conventional analytic queries take orders of magnitude more time.
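
      As a rough sketch of that kind of analysis (the arrays below are hypothetical stand-ins for columns loaded via Monary, assuming the zip codes were stored as integers):

      import numpy

      # Hypothetical stand-ins for the order columns described above; in
      # practice these would be the arrays returned by a Monary query.
      customer_zip = numpy.random.randint(10000, 10010, size=100000)
      discount = numpy.random.uniform(0, 20, size=100000)
      total_amount = numpy.random.uniform(20, 500, size=100000) - discount

      # Average order size for each zip code:
      for z in numpy.unique(customer_zip):
          print z, total_amount[customer_zip == z].mean()

      # Correlation between discount received and order total:
      print numpy.corrcoef(discount, total_amount)[0, 1]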

  • Dustin says:

    Great work — I’m sure a lot of people will find this very useful, because Python has excellent numerical libraries. Do you have plans to implement support for variable-width numeric arrays – i.e. so one could reconstruct a sparse matrix from distributed records?

    • Beach says:

      Yes, this is something I’d be interested in supporting. One thing I’m not sure about is what the interface for such a feature might look like. Suggestions are welcome…

  • Larry says:

    It’s very fast! I need to retrieve the Date type from MongoDB.
    Would you give me some help on adding it to Monary?

    Thanks,

    Larry

    • Beach says:

      Sure, I’d be happy to add something like that into the mix. Let me see if I can get around to that early this week.
      Any idea how this should be supported in numpy? (I haven’t looked into numpy’s support for dates before.)

  • Luca says:

    Very useful tool. I want to share a few benchmarks. I created a math engine for solving financial formulas (built on top of sympy).
    Test query (on a table with 60+ fields and 76,000 rows, an exact replica of a MySQL table):

    Query1: “SUM(field1)”

    PyMongo (Map Reduce): time of execution 4.63 sec [low mem usage]
    PyMongo (in memory sum): time of execution 4.32 sec [high mem usage]
    Monary (in memory sum): time of execution 0.8 sec [high mem usage]
    MySQL equivalent (SELECT SUM(field1) FROM table): time of execution 0.38 sec [low mem usage]

    Query2: “SUM(field1)+SUM(field2)+SUM(field3)+SUM(field4)+SUM(field5)”

    PyMongo (Map Reduce): time of execution 23 sec
    PyMongo (in memory sum): time of execution 16 sec
    Monary: time of execution (around) 1.6 sec
    MySQL equivalent (SELECT SUM(field1)+SUM(field2)+SUM(field3)+SUM(field4)+SUM(field5) FROM table): time of execution 0.42 sec

    I’m a bit disappointed by Mongo (especially Map Reduce, which btw should improve in 2.x releases).

    Note1: The math engine takes about 0.004 sec to parse the formula and call the numpy sum.
    Note2: When selecting multiple fields I call monary once with the columns [field1, field2, field3, ...], querying the database a single time and calling the numpy sum on every dimension involved in the formula (calling monary many times proves to be quite disastrous in terms of both performance and memory usage). A sketch of this pattern is shown below.
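
    A minimal sketch of that pattern (the database, collection, and field names here are hypothetical):

    from monary import Monary
    import numpy

    # Sketch of the "query once, then sum each column" pattern described
    # above; the database, collection, and field names are hypothetical.
    fields = ["field1", "field2", "field3", "field4", "field5"]

    m = Monary("127.0.0.1")
    columns = m.query("mydb", "table", {}, fields, ["float64"] * len(fields))
    m.close()

    # SUM(field1) + SUM(field2) + ... + SUM(field5)
    print sum(numpy.sum(col) for col in columns)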

  • Beach says:

    Luca,

    Thanks for trying Monary, and for the benchmarks. I’m glad Monary is working for you, but I am not surprised that it isn’t entirely competitive with MySQL for this use case. MySQL is at quite an advantage, since the entire SUM operation can be performed in the database process, without transmitting all 76k rows to the client. The fact that Monary is transferring all of this data and is only 2-4x slower tells me that it’s doing a pretty impressive job. And what if the operation were something more involved than summing, say a matrix multiplication or some other operation that required having all the data available on the client? How long does it take to actually retrieve 1-5 fields for all 76k records from MySQL?

    Could you try setting the option “select_fields=True” when you perform the monary.query? I’d be interested to know if this has a performance impact. Setting this option causes the Mongo database to do more work, but uses less bandwidth to the client, so it’s conceivable that the timings could move in either direction.

    Also, I’m considering adding a sort of “cursor” concept to Monary, allowing results of large collections to be handled in blocks using arrays that are smaller than the total collection size. Currently, doing this requires using the limit/offset options, and may become slow for large offsets. The more efficient way to do this is to keep a reference to the underlying Mongo cursor (at the C level), and to continue using that same cursor when loading data into subsequent blocks. It won’t make the queries any faster, but it would certainly cut down on memory usage for this kind of operation. (The arrays might only need to be 1-10k elements, instead of 76k.) Would this be useful to you?
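
    As a very rough sketch of how block processing with limit/offset might look today (the limit= and offset= keyword argument names are my assumption; only the existence of such options is mentioned above):

    from monary import Monary
    import numpy

    BLOCK = 10000   # rows per block, much smaller than the full collection
    TOTAL = 76000   # total number of rows (as in the example above)

    # Process the collection in blocks; the limit= and offset= keyword names
    # are assumptions, and large offsets may become slow, as noted above.
    m = Monary("127.0.0.1")
    running_sum = 0.0
    for offset in xrange(0, TOTAL, BLOCK):
        (col,) = m.query("mydb", "table", {}, ["field1"], ["float64"],
                         limit=BLOCK, offset=offset)
        running_sum += numpy.sum(col)
    m.close()
    print running_sum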

  • Luca says:

    Hi David,
    Two things I forgot in the above post. Your setup.py script is broken (it tries to import monary even when I’m just running “build”/“install”; I managed to install monary by commenting out the two imports in monary/__init__.py, building cmongo, building cmonary, building, installing, uncommenting the two imports, rebuilding, and reinstalling, which was quite painful :) ).
    Also, you don’t support Python 2.6, which ships by default on Debian; you should fall back to the “pure Python” implementation of OrderedDict, which you can find at http://code.activestate.com/recipes/576693/, if you can’t load the C one that comes with Python 2.7 (to remain at least backward compatible).

    I think that monary is the natural substitute for the aggregation commands in Mongo, so using an array as a buffer might be quite interesting, mainly for two reasons:
    1) You could reduce memory usage drastically (and on less powerful systems this will also increase performance, as the OS won’t swap).
    2) You could sum/count/min/max/avg directly in the C code (which I think would lead to much more efficient results than passing everything to Python via numpy). Just ask the user for another list of desired aggregations and answer the call with only the result. I guess that Python, being a scripting language, won’t handle massive memory usage well.

    As for performance, I have to say that monary itself is quite unstable (when I wrote 1.6 sec for a query, that was actually the average; the variance between the experiments was extremely high [1.3, 1.7, 1.5, 1.9, ...], which makes me think the memory issue could be terrible in production when all the clients try to get their results at the same time), even if it’s a great tool right now. Using select_fields=True reduces the average by 0.1 sec, and I have the impression that it also reduces the variance of the experiments (as more work is handled by mongo itself). My idea right now is that the bottleneck is when Python is trying to get the huge array from monary, but it might just as well be in the C code when you fetch all the mongo records. I guess it will all become clearer if you decide to implement the cursor ;) .

    Right now we haven’t implemented very complex formulas, as 99.9% of clients won’t need them (also, I don’t think that MySQL supports things like matrix multiplication natively, but I’ve never delved into it). The complex thing for me would be to use threading in my engine (as I have many nested formulas) to increase performance (multiple reads on the db/monary). Obviously I’ve already implemented a sort of cache, but I won’t be using it right now, as it wouldn’t make any sense for benchmarking.

    Great work man.

    • Beach says:

      A few responses:

      > My idea right now is that the bottleneck is when Python is trying to get the huge array from monary…

      There is no cost of “passing” the array. The memory is allocated from within Python using NumPy. The C extension then writes directly onto this memory. So, the only memory overhead is the allocation itself.

      The tool itself was intended for analytics work, not necessarily something that you would run many instances of at once. Of course, it could be used this way, but the programmer needs to take responsibility for the total memory usage. In this case, one must do some math, and use limit/skip or the soon-to-be-available block processing mechanism to keep the memory for each query bounded.

      I’m not sure I want to get into the business of writing sum/count/min/max/avg functions into the C portion of Monary. Numpy supports all of this (and much, much more) and I would rather leverage what can be done with Numpy. Once the memory is allocated, the cost of populating it should be very cheap — particularly so if the memory is a block that fits in L2 cache. If this is being done correctly (and I believe it is), then Monary should run more or less at the speed of Mongo.
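
      For reference, a tiny sketch of those aggregations done on the client side with numpy (the array here is just a stand-in for a column returned by Monary):

      import numpy

      # Stand-in for a column returned by Monary.
      col = numpy.random.uniform(0, 1, size=76000)

      # The aggregations discussed above, computed client-side with numpy:
      print col.sum(), col.size, col.min(), col.max(), col.mean()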

      I now have a version of Monary which supports block processing, and, when tuned, I’m seeing a modest but reliable improvement in performance. I’ll be pushing this to BitBucket soon.

      Regarding the problems with Python 2.6 support and the setup.py script, I will look into these and try to fix them.
