Wednesday, December 29, 2010

Debugging MapReduce in MongoDB

On a project that I am working on, we are doing some pretty intense MapReduce work inside of MongoDB. One of the things we've run up against is the lack of solid debugging tools. Some Googling basically tells you that print() is all you've got.

We've decided to take a different approach and debug our MapReduce code in the browser. Since the code is JavaScript and modern browsers have excellent debugging support (breakpoints, variable inspection, etc.), it's pretty easy to do.

All you need is a web app (or even static HTML file) that will:

  1. Load up one of your documents that you would like to map in the browser. Since the documents are JSON, this is easy. In our project, we have JSON fixture files and a small web app that allows you to choose which fixture to use for testing.
  2. Mock the emit() method. You can just have it write to a plain object (a hash) that you can inspect later.
  3. Load up the Map and Reduce functions. If you keep these in separate .js files, you can pull them in with a simple script tag.
  4. Bind the map function to the document so that it has the correct context. In a MongoDB mapper function "this" is set to the document that you are mapping. You can easily do this with the bind() function in Underscore.js. I'm sure that other JavaScript frameworks provide a similar function.
  5. Put a link on the page that will let you run the bound function.
This emulates the MongoDB MapReduce environment while letting you use the browser's debugging tools.
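
The steps above can be sketched roughly as follows. This is only a minimal illustration, not the harness from our project: the fixture documents, mapFn, reduceFn, and runMapReduce are all hypothetical names, and it uses the native Function.prototype.bind where Underscore's _.bind(mapFn, doc) would do the same job.

```javascript
// Hypothetical harness: none of these names come from the original project.

// Step 2: a mock emit() that records every key/value pair for later inspection.
var emitted = {};
function emit(key, value) {
  if (!Object.prototype.hasOwnProperty.call(emitted, key)) {
    emitted[key] = [];
  }
  emitted[key].push(value);
}

// Step 3: stand-ins for the map and reduce functions that would normally be
// pulled in from separate .js files with a script tag.
function mapFn() {
  // Inside MongoDB, "this" is the document being mapped.
  emit(this.make, 1);
}

function reduceFn(key, values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}

// Step 1: fixture documents (assumed sample data).
var fixtures = [
  { make: 'Honda', model: 'Civic' },
  { make: 'Honda', model: 'Accord' },
  { make: 'Ford', model: 'Focus' }
];

// Steps 4 and 5: bind the map function to each document, run it, then
// reduce the values collected for each key. Set a breakpoint anywhere here.
function runMapReduce(docs) {
  emitted = {};
  for (var i = 0; i < docs.length; i++) {
    var boundMap = mapFn.bind(docs[i]); // same idea as _.bind(mapFn, docs[i])
    boundMap();
  }
  var results = {};
  for (var key in emitted) {
    if (Object.prototype.hasOwnProperty.call(emitted, key)) {
      results[key] = reduceFn(key, emitted[key]);
    }
  }
  return results;
}

var results = runMapReduce(fixtures);
console.log(results);
```

One simplification to keep in mind: MongoDB does not call reduce for a key that has only a single emitted value, while this sketch calls it for every key.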

Tuesday, December 14, 2010

Using Underscore.js with MongoDB

I've been using MongoDB for a while now and have been really happy with it. I wanted to share something we are doing on one of the projects I work on that makes working with Mongo even better.

MongoDB allows for the use of JavaScript to do lots of work on the server side. This includes running MapReduce jobs on collections, but also can be used in where clauses and for doing grouping. Being able to use JavaScript for these things is handy, but using just the core JavaScript language can be less than ideal. That's why we prime our MongoDB environment with Underscore.js.

The Underscore website describes it as a JavaScript utility belt, and I've found that to be the case. It has functions like any or include that save you the trouble of writing for loops to iterate over arrays. While the MongoDB documentation describes how you can store individual functions for server-side use, it doesn't really touch on how you could load an entire library like Underscore.

It turns out you can load up libraries like this pretty easily using db.eval(). I recall reading (but can't currently find the docs to prove this) that every MongoDB connection has a JavaScript context associated with it. If you create functions in this context, they will exist as long as the connection is around. So if you just eval the Underscore.js library before you do any work with your connection, you will have access to all of its functions to do your work.

Here is an example of how to use Underscore.js with the Ruby driver. In this example, I'll set up the MongoDB connection with Underscore.js, create a sample dataset of cars, then use Underscore to group them by make without repeating model.

The only downside to this approach is that db.eval does not seem to work with sharding. That is OK for me right now, but YMMV. Also note that I am using the awesome_print gem to pretty print the results.
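
For a feel of the server-side logic, here is a rough plain-JavaScript sketch of the grouping step: collect the models for each make, listing each model only once. It is not the original example. The cars array and the modelsByMake function are hypothetical, and it deliberately avoids Underscore so it runs anywhere. With Underscore loaded into the connection, _.include could replace the indexOf check.

```javascript
// Hypothetical sample data; not the dataset from the original example.
var cars = [
  { make: 'Honda', model: 'Civic' },
  { make: 'Honda', model: 'Accord' },
  { make: 'Honda', model: 'Civic' },
  { make: 'Ford', model: 'Focus' }
];

// Group cars by make, keeping each model name only once per make.
function modelsByMake(list) {
  var grouped = {};
  list.forEach(function (car) {
    var models = grouped[car.make] || (grouped[car.make] = []);
    // With Underscore available this check could be !_.include(models, car.model).
    if (models.indexOf(car.model) === -1) {
      models.push(car.model);
    }
  });
  return grouped;
}

var grouped = modelsByMake(cars);
console.log(grouped);
```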

Wednesday, September 15, 2010

Git rm may cause insanity

Ran across this today, and wanted to help others avoid the same fate. If you use git rm to remove the last file in a directory, it will remove the directory as well. If you are in that directory, odd things can happen that will potentially drive you insane.

Let's create a git repository with a folder that has a single file in it:
$ cd /tmp
$ mkdir foo
$ cd foo/
$ git init
Initialized empty Git repository in /private/tmp/foo/.git/
$ mkdir bar
$ cd bar/
$ echo 'hello' > splat.txt
$ git add splat.txt
$ git commit -m 'adding a text file'
[master (root-commit) 069f11b] adding a text file
1 files changed, 1 insertions(+), 0 deletions(-)
create mode 100644 bar/splat.txt

So I now have a git repository with the file bar/splat.txt under revision control. Now if I do:
$ git rm splat.txt
rm 'bar/splat.txt'
This will not only remove splat.txt, but it will remove the whole bar directory. I say this will drive you insane, because if you try to move or copy a file into your current directory, you'll get an error that will probably catch you off guard. Like:
$ cp ~/.gitignore .
cp: ./.gitignore: No such file or directory
There is a file called .gitignore in my home directory; it's just that my current directory no longer exists. It took me about five minutes to realize what was going on... and I was starting to wonder if I knew how to use the cp command.

The reason I ran into this is that I was rearranging my .vim folder to use pathogen. I keep all of my dot files under source control and stumbled upon this while clearing out my vim autoload folder.

Monday, May 17, 2010

Gwibber on Ubuntu 10.04 issues with FiOS

I just upgraded one of my machines to the latest and greatest Ubuntu because I plan on taking it on the road later this week. After I got everything set up, I fired up Gwibber, my favorite Twitter client on Linux. Immediately, I started running into problems. I couldn't get Gwibber to load any new tweets. There seem to be several people who are experiencing this issue with Gwibber, but their troubles are related to the language settings. That was not the case for me.

I did some digging by firing up Gwibber in a terminal:

$> gwibber-service -o -d

This is what I got:

Gwibber wasn't refreshing because it was timing out on the DNS lookup. I have Verizon FiOS as an ISP, and slow DNS on FiOS seems to be a known problem. I switched over to Google DNS and everything is snappy and working properly. If you're using Gwibber on FiOS and having issues, try this out. It may save you an hour or three.

Gwibber, or whatever library it uses for network communication, picked a pretty short timeout, but DNS this slow is still pretty lame. Verizon really needs to step it up here. People will perceive FiOS as slow because it takes forever to look up an IP, even though the network is pretty quick in my experience.

Thursday, April 1, 2010

Using xargs with git

Sometimes, when I'm working on a project, I'll create a bunch of new files and realize that I have a ton of untracked stuff that I need to add to my git repository. Since I generally only use git on the command line, it would be painful to copy and paste all of the untracked file names from the output of git status into separate git add commands.

The two commands I have found handy for dealing with this situation are git ls-files and xargs.

If you run the command:
git ls-files -o
It will show you all of the untracked files in your working directory, one file per line. A problem that you will run into here is that it also shows files that your .gitignore excludes. To get around this issue, you just need another argument to specify your .gitignore:
git ls-files -o --exclude-per-directory=.gitignore
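Now that you have all of the files you want to add, you just need to run git add on all of them. This is where xargs comes in handy. It reads from standard input, splits it on whitespace (spaces as well as line endings), and passes the resulting items as arguments to another command. That means file names containing spaces will be split apart unless you use the null-delimited form (git ls-files -z piped to xargs -0). Putting it all together, you get: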
git ls-files -o --exclude-per-directory=.gitignore | xargs git add
That last command will add any untracked files to your git repository. The beautiful thing here is that we can also leverage some UNIX-y goodness if we want to as well. Let's say we're working on a project and we only want to commit some XSLT we have been working on. You can do this by throwing grep into the command chain:
git ls-files -o --exclude-per-directory=.gitignore | grep xslt | xargs git add
This will only add files that contain "xslt" in their paths. The same approach comes in handy when you remove files from your working copy but forget to run git rm: git ls-files -d lists the deleted files, and you can pipe that list to xargs git rm.