Saturday, November 22, 2008

Glassfish is looking speedy

Let me start by saying that most benchmarks should be taken with a grain of salt, so please do the same here.

I work on Project Laika, and for our deployments we are looking to switch to JRuby. We already have Java code hanging around, and it looks like we will need some more soon (I'm thinking it will be easier to deal with SOAP web services in Java and call the classes from JRuby). I wanted to run some numbers to make sure that our performance wouldn't fall through the floor.
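To give a flavor of what I mean by calling Java from JRuby, here's a minimal sketch. The SoapClient class and its package are hypothetical stand-ins for the kind of wrapper we'd write; the JDK call at the top is just there to show that the Java world is directly available.

```ruby
# A minimal JRuby sketch: calling Java classes from Ruby.
require 'java'

# Standard JDK classes are available immediately...
puts java.lang.System.get_property('java.version')

# ...and our own jars would work the same way. SoapClient and its package
# are hypothetical stand-ins for the wrapper we'd actually write.
# require 'laika_soap_client.jar'
# client = Java::OrgProjectlaika::SoapClient.new('http://example.org/endpoint')
# puts client.fetch_document('patient-123')
```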

The Setup

I decided to pit Mongrel 1.1.5 running on MRI 1.8.6 against Glassfish v3 Prelude 1 and JRuby 1.1.5 on Java 5. I'm running Rails 2.0.2 in both setups (I know we are behind the times). I ran them both on OS X 10.5.5 and had the Rails apps hit the same MySQL database.

I used ab to grab the numbers. I had it hit the site 1000 times with 10 concurrent requests. I hacked the Laika app slightly so that you didn't have to log in.
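If you want to reproduce this kind of run, something like the following does the trick. It's a rough sketch: the URLs are placeholders for the Laika pages I actually hit, and it assumes ab is on your PATH.

```ruby
# Rough sketch of the benchmark runs: 1000 requests, 10 at a time.
# The URLs are placeholders, not the real Laika paths.
urls = {
  "dynamic" => "http://localhost:3000/templates",
  "static"  => "http://localhost:3000/404.html"
}

urls.each do |label, url|
  output = `ab -n 1000 -c 10 #{url}`
  if output =~ /Time per request:\s+([\d.]+) \[ms\] \(mean\)/
    puts "#{label}: #{$1} ms per request (mean)"
  end
end
```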

The Test

I wanted to get a feel for how the app would perform, so I did two simple tests: dynamic content from a typical page and static content. The dynamic content was the patient template library in Laika, which contains the kind of code you'd expect in a Rails app: ActiveRecord pulling info from the DB and feeding it into ERB templates. I also pulled down the Rails 404 page to get a feel for serving static content. This is less meaningful, since you'll probably have Apache or Nginx serve your static stuff.

The Results

Glassfish, hands down. The results are the average number of milliseconds it took to serve each request.


It beat Mongrel in both static and dynamic content easily. Glassfish v3 also makes it ridiculously easy to deploy Rails apps. You can use the Glassfish gem and serve up your app with a single command. I installed the full Glassfish server so I could run JEE apps alongside my Rails stuff, and even there a single command pointing the app server at the root directory of your app is all it takes.

With Rails 2.2 now out and offering thread safety, and JRuby being the only interpreter that can take advantage of it... Glassfish and JRuby are really worth checking out.

Friday, September 5, 2008

SezHoo and WikiSym 2008

I'm proud to announce the release of SezHoo! It's a tool we created at MITRE to help establish reputations for authors in wiki sites... specifically those run on MediaWiki (the same software that runs Wikipedia).

The way that SezHoo works is that it goes through a wiki article's history and tracks the authorship of each word in the article. We use a pretty popular technique in text analysis called shingling to help us out.

For each revision of an article, we break it up into shingles. We're using shingles that are 6 words long, so each word is part of up to 6 shingles. We credit authorship of the word to the author of the earliest shingle that contains it.
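Here's a rough sketch of the idea in Ruby. It's not the actual SezHoo code, just the shingle-and-earliest-author logic boiled down, assuming revisions come in as ordered author/text pairs.

```ruby
# A boiled-down sketch of the shingling idea, not the real SezHoo code.
# Revisions are assumed to be [author, text] pairs in chronological order.
SHINGLE_SIZE = 6

def shingles(text)
  words = text.split
  (0..words.length - SHINGLE_SIZE).map { |i| words[i, SHINGLE_SIZE].join(" ") }
end

# Remember which revision (and author) introduced each shingle.
def first_seen(revisions)
  seen = {}
  revisions.each_with_index do |(author, text), rev|
    shingles(text).each { |s| seen[s] ||= [rev, author] }
  end
  seen
end

# Credit each word in the current revision to the author of the earliest
# shingle that contains it.
def word_authorship(revisions)
  seen = first_seen(revisions)
  current_author, current_text = revisions.last
  words = current_text.split
  words.each_with_index.map do |word, i|
    containing = shingles(current_text).each_with_index.select do |_, j|
      (j..j + SHINGLE_SIZE - 1).cover?(i)
    end
    earliest = containing.map { |s, _| seen[s] }.min_by(&:first)
    [word, earliest ? earliest.last : current_author]
  end
end
```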

This allows us to add some information to a MediaWiki installation:

Pages
We show the authors of a page and the percentage each has contributed. We also offer this information at the section level.

The information we compute is better than what you get from diffs in a standard MediaWiki install. Diffs are line based, so changing one word in a paragraph that sits on a single line can make it hard to tell who wrote the rest of the paragraph. In addition, since we are looking for authorship of a word across all revisions at once (as opposed to a revision-vs-revision comparison with a diff), this technique is robust to copied-and-pasted text and to reverted vandalism.

Authors
We create a reputation for authors along 3 dimensions: quantity, quality and value. For each metric, we assign a 5-star rating. This is done by computing a raw score for every author and then ranking them. Authors in the top 20% get a 5-star rating in that dimension.
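As a quick sketch of the ranking step (not the actual SezHoo implementation), with made-up raw scores:

```ruby
# Rank authors by raw score and hand out stars by 20% band:
# top 20% get 5 stars, next 20% get 4, and so on.
def star_ratings(raw_scores)
  ranked = raw_scores.sort_by { |_, score| -score }.map(&:first)
  ranked.each_with_index.map do |author, i|
    percentile = i.to_f / ranked.length   # 0.0 is the top-ranked author
    [author, 5 - (percentile * 5).floor]
  end.to_h
end

star_ratings("alice" => 1200, "bob" => 300, "carol" => 80)
# => {"alice"=>5, "bob"=>4, "carol"=>2}
```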

For quantity, we simply count the number of words contributed by each author. This includes words that may no longer be in the current revision of the article. Value takes the percentage authorship of a page and multiplies it by the number of page views (it's valuable if it's being read).

Quality is a lot trickier. We assume that wiki articles are a Darwinian environment: quality text will "survive", in that as the article is edited, other authors will leave it alone, while poor quality text will be "killed" off (deleted) by other authors. We determine quality by looking at the percentage of an author's text that is expected to still be alive after 8 revisions (you can change 8 to whatever you like, but it works well for our internal wiki). To calculate that percentage, we use a Kaplan-Meier estimate. This is necessary because authors will have phrases that are still "alive" (they are in the current version of an article). If a word has survived 3 edits and is in the current version, you don't want to say it "died" at 3 revisions when calculating survival probabilities, but you don't want to throw the data point away either. Kaplan-Meier handles that censoring for us.
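For the curious, here's a bare-bones Kaplan-Meier estimate in Ruby. It's a sketch of the statistic, not SezHoo's code; each observation is the number of revisions a phrase lasted plus a flag for whether it's still alive (i.e. censored).

```ruby
# Bare-bones Kaplan-Meier survival estimate.
# observations: [revisions_survived, still_alive?] pairs; still-alive
# phrases are censored rather than counted as deaths.
def km_survival(observations, at_time)
  survival = 1.0
  observations.map(&:first).uniq.sort.each do |t|
    break if t > at_time
    at_risk = observations.count { |time, _| time >= t }
    deaths  = observations.count { |time, alive| time == t && !alive }
    survival *= 1.0 - deaths.to_f / at_risk if at_risk > 0
  end
  survival
end

obs = [[2, false], [3, true], [5, false], [8, true], [10, true], [12, false]]
puts km_survival(obs, 8)   # => 0.625
```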

In the end, you get a 5-star rating for each author in the dimensions of quantity, quality and value.



I'll be presenting this work at WikiSym 2008. We're doing some other interesting stuff which we haven't rolled into SezHoo yet, so it's worth coming to my talk even if you've read this blog post.

Wednesday, August 13, 2008

Using Hadoop

Since I've gotten back from OSCON, I've had a chance to use Hadoop at work. For those who aren't familiar with it, Hadoop is an open source framework for implementing MapReduce jobs.

There are plenty of tutorials on Hadoop around the web, so I won't do any of the basic intro stuff. I wanted to write about some of the things that weren't all that easy to find.

Most of the Hadoop documentation talks about input where your records sit on a single line (like an Apache access log). Between Google, the documentation and our own experience, we have found that Hadoop works just fine with multi-line records. We're using Hadoop to process XML, specifically a dump from Wikipedia. The dump is a single 33 GB file with one root tag and then several million child elements (one per Wikipedia page). Using this code I found on the Hadoop core user mailing list, we can arrange it so that the mapper gets the XML for one child node (that is, one Wikipedia page). This is nice, because the XML for a single page is relatively small. We then use JDOM to deal with the contents of the individual pages.
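To show what a single record looks like by the time it reaches the mapper, here's the gist of the per-page parsing step, sketched in Ruby with REXML rather than the JDOM we actually use on the Java side; the <page> snippet follows the MediaWiki export format.

```ruby
# Ruby/REXML stand-in for the JDOM parsing we do inside the mapper.
# Each mapper call gets one <page> element's worth of XML.
require 'rexml/document'

page_xml = <<XML
<page>
  <title>Hadoop</title>
  <revision>
    <contributor><username>ExampleUser</username></contributor>
    <text>Hadoop is an open source MapReduce framework...</text>
  </revision>
</page>
XML

page   = REXML::Document.new(page_xml).root
title  = page.elements['title'].text
author = page.elements['revision/contributor/username'].text
puts "#{title} last edited by #{author}"
```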

We are using HDFS to store our input and output. By default, it chops files into 64 MB blocks, which get shipped to the mappers so that the file can be processed in parallel. One thing I was concerned about was how records that span the splits would be handled. So far we haven't seen any issues, and this answer in the Hadoop FAQ seems to indicate that records spanning a split will be handled. It's possible that some of those records get dropped, and it would be hard for us to tell at this point... but the good news for us is that it wouldn't affect the data we're trying to collect much.

As for our setup and usage: we have 5 machines in our cluster, most of which are run-of-the-mill dual-core Intel machines running Linux. The jobs we're running on the 33 GB XML file take around 45 minutes, which seems pretty fast to me.

Monday, July 28, 2008

OSCON 2008: Wrap Up

So I've returned from Portland after all of the OSCON activities. The conference was good, but I definitely didn't feel like it was as good as in years past. The keynotes were OK, but none of them were spectacular. I hit a few sessions that were bombs, but I didn't get to one that rocked. Many were good, but nothing was off the charts.

Hadoop was big on Thursday. Derek Gottfrid of the New York Times talked about how they used Hadoop and Amazon EC2 to process tons of data. Derek's presentation style is great, which made the talk entertaining. Some folks from Yahoo also got into the nitty-gritty details of how the whole thing works.

The MySQL Proxy talk was good. It seems like a pretty handy tool for performance tuning and all sorts of SQL trickery.

The last talk that stood out to me was Selectricity. The project is both a site to run elections and software you can use to run elections wherever you want. One point Benjamin Mako Hill made that I thought was interesting is that most election research goes into government elections... and those are the least likely to change. By building a tool that lets folks conduct elections for simple things (what movie to see, who will lead the coffee club, etc.) using methods other than plurality, it's a good way to sneak alternative voting methods in front of the masses. That way, if people get familiar with Condorcet when voting for the next American Idol, they may be more likely to push for election reform in government elections.

I'm not sure if I'll hit OSCON again next year. I like going because it's nice to get an overview of a lot of different technologies, as opposed to something like RailsConf. But things did feel pretty shallow this year.

Thursday, July 24, 2008

OSCON 2008: Day 3

Today, I gave my first OSCON talk on Laika. I think that the talk went pretty well. I had plenty of good questions from the audience, and I think I may have been able to snag a few people who were interested in contributing.

The speaking experience was pretty cool. I was in a fairly small room, and probably had about 20 to 30 people in the audience, which was pretty non-threatening. I would have been more nervous in one of the more cavernous rooms with 200 people.

As for the rest of the conference today...

XMPP has a lot of buzz for communicating in the cloud. There were a few talks on that today.

There is a lot of Ruby stuff going on outside of web apps. RAD seems to have a lot of buzz for using Ruby to program the Arduino. Ruby's also behind Adhearsion, a tool for building IVRs.

I missed the talk on CouchDB, but some of the folks I'm out here with said it was great.

On a conference logistics note... I was kinda bummed that some of the talks had filled, so I couldn't get in. I wound up missing the talk on Hypertable as well as one on XMPP in the cloud.

Wednesday, July 23, 2008

OSCON 2008: Day 2

Day 2 at OSCON.... Some of the highlights...

Practical Erlang Programming was great. Francesco Cesarini is an excellent speaker and delivered a polished tutorial. While Erlang does make you look at things differently, I can see how it makes writing concurrent code a lot easier.

While I was psyched to see Mark Shuttleworth give a keynote (given my fondness for Ubuntu), the best keynotes tonight were definitely Robert (r0ml) Lefkowitz and Damian Conway.

R0ml's talk compared various software development methodologies to Quintilian's 1st-century works on rhetoric. My takeaway was that open source software has a good development methodology because it doesn't really have a requirements phase. Code gets released early and often. Bugs are filed and patches are submitted. Then users and developers can look at the bugs and patches that are there to determine what goes in the next release. This is different from a typical development methodology, where you need to decide what you want up front. In this model, people do what they want, and you take what you like in the end.

On the other hand, Damian Conway is somewhere between insane and brilliant. His talks are hilarious, but the stuff he is actually able to implement is crazy... I'm sure we'll be seeing some talk of positronic variables on the tubes in the coming days.

Tuesday, July 22, 2008

OSCON 2008: Day 1

The first day at OSCON 2008 has come and gone... This year looks to be another good one. Here's what I saw on my first day.

The first session I went to was Python in 3 Hours. While I do most of my work in Ruby, I do try to keep an eye on Python. It seems like a pretty clean scripting language, and quite speedy when compared to Ruby.

The tutorial was good. The material is kinda dry (it's language syntax, after all, which is pretty hard to spice up), but Steve Holden's presentation was clear and well thought out. I walked away feeling like I could approach Python code now without too much fear. However, I still have some pretty mixed feelings about Python... There are a lot of little things that bother me: having to add a self parameter to instance methods, the double-underscore naming conventions and the whole significant whitespace thing. At any rate, I thought the tutorial was informative.

The second tutorial I did was Making Things Blink: An Introduction to Arduino. This was a lot of fun. I haven't played with a microcontroller since college... but I've always loved working at the place where software meets hardware.

In the session, we worked through coding for the Arduino as well as some basic circuits. The class culminated in building an Etch-a-Sketch. This is accomplished by hooking up two potentiometers to the Arduino, which reads their values and passes them to your computer via USB. We then used Processing to read the data and visualize it on the screen. That meant you could turn the knobs on the pots and draw on the screen. Pretty cool stuff.
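Our class did the drawing side in Processing, but just to sketch the data flow, here's roughly what reading those pot values looks like from Ruby instead (using the serialport gem). The device path and the comma-separated "x,y" line format are assumptions about how the Arduino sketch writes to the serial port, not what we ran in class.

```ruby
# Rough Ruby stand-in for the Processing side: read the two pot values the
# Arduino writes over USB serial. Device path and "x,y" line format are
# assumptions about the Arduino sketch.
require 'serialport'   # gem install serialport

port = SerialPort.new('/dev/tty.usbserial', 9600)
loop do
  line = port.gets or next
  x, y = line.strip.split(',').map(&:to_i)
  puts "pen position: #{x}, #{y}"
end
```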

Overall, one of the vibes I'm getting from the conference this year is big data: how to deal with really big databases and how to process tons of data in parallel. We'll see if this continues throughout the conference.

Thursday, June 5, 2008

Speaking at OSCON

I'll be speaking about the work I'm doing on Project Laika at OSCON. Come check out the session to see how we're using Rails to test electronic health record systems.



Wednesday, May 21, 2008

DBSlayer is taking JSON to the next level

I ran across DBSlayer on the Rails Envy podcast, and I was struck by a couple of things. One is that it's really cool that the New York Times has its own open source code repository. Another is how far JSON has come as the format of choice for data exchange.

I remember reading a few posts on the tubes months ago saying that JSON was going to give XML a run for its money, and I thought folks were crazy. I don't see JSON totally replacing XML, but I'm definitely seeing it used in many more places where XML would have been the natural choice a couple of years ago.

DBSlayer also intrigues me because I can see writing some pretty interesting web apps without a traditional app server. I know DBSlayer is intended to help with scaling your DB layer, but I think it would be cool to hit it directly from the browser. There would be problems with this approach... you'd pretty much only be able to build read-only apps where you don't care who sees the data... but I can see plenty of apps fitting that mold (corporate directories, stock data, sports data, etc.).
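To make that concrete, here's roughly what a query looks like. I'm doing it from Ruby here, but the same GET would work from browser JavaScript; the /db endpoint and the {"SQL": ...} query document are my recollection of the DBSlayer interface, so treat the details as assumptions.

```ruby
# Rough sketch of querying DBSlayer over plain HTTP. The endpoint and the
# JSON query format are from memory of the DBSlayer docs, so double-check
# them before relying on this.
require 'net/http'
require 'uri'
require 'json'

query = { 'SQL' => 'SELECT symbol, price FROM quotes LIMIT 10' }.to_json
uri = URI("http://localhost:9090/db?#{URI.encode_www_form_component(query)}")

puts Net::HTTP.get(uri)   # DBSlayer responds with a JSON document
```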