Where Should You Sort Your Data?

The obvious answer to the question of where sorting operations should reside appears to be “in the database, stupid!” RDBMSs are built specifically for this kind of activity. They’re good at it, they’re fast, and if properly indexed, they can sort arbitrarily large sets of data better than just about anything on the planet.

But the answer isn’t always quite so straightforward.

If your dataset is small enough, if your sort criteria are sufficiently complex, if your sorting is over columns that you would not otherwise index, or if your application has a sufficiently large user base, you may actually be better off sorting your data in the application.

The primary problem with today’s relational databases is that they don’t scale out easily. Yes, there are strategies to help deal with this problem, but it’s not nearly as simple as “drop another database server into the farm”. Application servers, on the other hand, are built to do just that. Can’t handle the load? Stand up another server. Often, this is less costly than hiring a developer to try to make the application more efficient. And in any event, unless you are using one of the free database servers, adding application servers will be less expensive than adding database servers.

Indexing is supposed to solve our data retrieval performance issues – and for the most part, it does. There is a downside, however: each index makes inserts and updates slower. Again, there are strategies for working around this, but they usually entail segregating the tables that are written to from those that are read from. The result is that the user is potentially left waiting for an update process to complete before they can read data they just saved. I say, if your only purpose for indexing a column is to improve the performance of a sort operation, move the sort to the application.

Some sort operations are sufficiently complex to warrant keeping out of the database. For example, if you wanted to sort a list of stores within a certain zip code by their distance from a certain competitor, you are most likely better off doing this in the application layer. There is virtually no way to do this more efficiently in an RDBMS; all you are really doing is ensuring that fewer simultaneous queries can be executed.
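To make the store example concrete, here is a minimal sketch of that sort done in the application layer. Everything in it is an assumption for illustration: the Store class, its latitude/longitude fields, and the choice of the haversine formula as the distance measure are mine, not something your data model necessarily provides. It assumes the stores for the zip code have already been fetched, unsorted, from the database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical example: Store, distanceKm, and sortByDistanceFrom are illustrative names.
final class StoreDistanceSort {

    // Minimal store: a name plus latitude/longitude in degrees.
    static final class Store {
        final String name;
        final double lat;
        final double lon;

        Store(String name, double lat, double lon) {
            this.name = name;
            this.lat = lat;
            this.lon = lon;
        }
    }

    private static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance between two points, using the haversine formula.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // Returns a new list ordered by distance from the competitor, nearest first.
    // The input list is left untouched.
    static List<Store> sortByDistanceFrom(Store competitor, List<Store> stores) {
        List<Store> sorted = new ArrayList<>(stores);
        sorted.sort(Comparator.comparingDouble(
                (Store s) -> distanceKm(competitor.lat, competitor.lon, s.lat, s.lon)));
        return sorted;
    }

    public static void main(String[] args) {
        Store competitor = new Store("Competitor", 0.0, 0.0);
        List<Store> stores = Arrays.asList(
                new Store("Near", 0.0, 1.0),
                new Store("Far", 0.0, 3.0),
                new Store("Middle", 0.0, 2.0));
        for (Store s : sortByDistanceFrom(competitor, stores)) {
            System.out.println(s.name);
        }
    }
}
```

The database still does what it is good at (filtering by zip code), while the per-row trigonometry runs on a tier that is cheap to scale out.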

One final note: while the RDBMS is currently de rigueur, the industry is moving toward a more agnostic data storage model. It will not always be possible to know that the data is stored relationally, and therefore it may not always be possible to know the sorting characteristics of the data provider. Find this out before making any decision.

Posted in Uncategorized | Comments Off

“Self-Documenting” Does Not Apply to the API

Developers hate writing documentation.  Not all of them, but most do.  I don’t necessarily hate it, but neither am I exactly fond of it. I’d much rather be hammering out some sweet code than toiling over a paragraph about what that code is supposed to do.  But I am deeply troubled by a growing trend among software developers of eliminating all code commentary, including the comments that document the API (e.g., JavaDoc, Doxygen).

The “Good Code is Self-documenting” Argument

I agree that most code commentary can be effectively eliminated through judicious application of good variable and method names.  As a general rule, if I encounter a bit of code that requires comments for me to understand what it does, it’s a pretty good bet that that bit of code is a candidate for refactoring; it’s also a good bet that the refactoring will eliminate the need for the comments.  But “eliminate the API docs”?  Come on, now…

There is a school of thought that says, “if you name your classes and methods appropriately, you don’t have to document them.”  The only place where this might be true is getter/setter methods that have no other side-effects.  While it is true that a properly named method can go a long way toward describing the core behavior, it is virtually impossible for the method name alone to describe all of its behavior.

Take for example the case of the method V put(K key, V value) in the java.util.Map interface. This should be a fairly straightforward method to understand: you supply a key and a value associated with that key, and the map stores the object and key.  Simple, right?  But what about that return value? Would you know just by the method signature that the Map is supposed to return the previous value associated with that key? What if I pass null in either the key or the value parameters? I assume that’s bad for the key, but I might also assume that null is OK for value — does that fail, remove the old key/value pair, or store null under the key?  
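Here is the kind of experimental code a client developer ends up writing to answer those questions. It is only a sketch: the MapPutBehavior class and its rejects helper are names I made up for the experiment, while HashMap and Hashtable are the real java.util classes. Notice that the answers differ between implementations, which is exactly what a signature cannot tell you.

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

// Hypothetical probe class; only HashMap, Hashtable, and Map come from the JDK.
final class MapPutBehavior {

    // Returns true if put(key, value) on the given map throws NullPointerException.
    static boolean rejects(Map<String, String> map, String key, String value) {
        try {
            map.put(key, value);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        System.out.println(map.put("a", "first"));   // null: there was no previous mapping
        System.out.println(map.put("a", "second"));  // first: the value that was replaced
        System.out.println(map.put("a", null));      // second: HashMap stores null under "a"
        System.out.println(map.containsKey("a"));    // true: the entry was not removed

        // Null handling is implementation-specific.
        System.out.println(rejects(new HashMap<>(), null, "v"));    // false: HashMap allows a null key
        System.out.println(rejects(new Hashtable<>(), "k", null));  // true: Hashtable rejects null values
    }
}
```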

Without decent API documentation, the client developer is left to either guess or write experimental code to figure out what to expect. This is no way to treat your users.

The point is this: the method signature is the contract; the documentation is the fine print.
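As an illustration of what that fine print can look like, here is a minimal sketch of a documented put method. SimpleRegistry is a hypothetical class I invented for the example, and its null policy is an assumption chosen to show the documentation at work; it is not how java.util.Map itself is specified.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical class used only to show a documentation comment on an API method.
final class SimpleRegistry<K, V> {

    private final Map<K, V> entries = new HashMap<>();

    /**
     * Associates {@code value} with {@code key}, replacing any existing association.
     *
     * @param key the key to register; must not be {@code null}
     * @param value the value to store; must not be {@code null}
     * @return the value previously associated with {@code key},
     *         or {@code null} if there was none
     * @throws NullPointerException if {@code key} or {@code value} is {@code null}
     */
    V put(K key, V value) {
        Objects.requireNonNull(key, "key");
        Objects.requireNonNull(value, "value");
        return entries.put(key, value);
    }

    public static void main(String[] args) {
        SimpleRegistry<String, Integer> registry = new SimpleRegistry<>();
        System.out.println(registry.put("answer", 41)); // null: nothing was registered before
        System.out.println(registry.put("answer", 42)); // 41: the replaced value
    }
}
```

Every question raised about the return value and about null arguments is answered in four lines of comment, without the caller running a single experiment.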

The “Tests as Documentation” Argument

The follow-up argument to “the code is self-documenting” is usually something like, “There’s no need to document the API, because we have a full suite of unit tests that detail the behavior”.

Let me be blunt about this one: if your position is that you don’t have to tell me how your API works because you’ve made the unit tests available to me, then you’re not fit to be in polite society.

Unit tests are there to verify the behavior, not describe it to potential users of your API. Forcing your users to pore through unit tests is one step above telling them to read the source code. This is no way to treat your users, and should result in having your developer privileges revoked.

Not All Comments are Created the Same

Developers who claim that there should never be a need to write a single comment tend to view the need for comments as some sort of failure of the code. In the case of algorithm implementation, I tend to agree. But this argument loses steam for me in the case of API comments.

There is a fundamental difference between code comments and documentation comments. Code comments are oriented toward the developers/maintainers of the commented code. Documentation comments are oriented toward the users of your library.
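A small sketch may make the distinction easier to see. The Percentages class and its roundedPercent method are hypothetical; the point is only that the JavaDoc block speaks to the caller, while the inline comment speaks to whoever maintains the method body.

```java
// Hypothetical utility class showing both kinds of comment side by side.
final class Percentages {

    /**
     * Returns {@code part} as a whole-number percentage of {@code total},
     * rounding halves up (12.5 becomes 13).
     *
     * @param part the portion being measured; must be zero or positive
     * @param total the whole; must be positive
     * @return the rounded percentage, for example 33 for 1 of 3
     * @throws IllegalArgumentException if {@code part} is negative
     *         or {@code total} is not positive
     */
    static long roundedPercent(int part, int total) {
        if (part < 0 || total <= 0) {
            throw new IllegalArgumentException("part must be >= 0 and total must be > 0");
        }
        // Code comment, for maintainers: adding half the divisor before dividing
        // rounds to the nearest integer without floating point. The 100L forces
        // long arithmetic so large values of part cannot overflow.
        return (part * 100L + total / 2) / total;
    }

    public static void main(String[] args) {
        System.out.println(roundedPercent(1, 3)); // 33
        System.out.println(roundedPercent(2, 3)); // 67
        System.out.println(roundedPercent(1, 8)); // 13
    }
}
```

Refactoring could plausibly remove the need for the inline comment. Nothing about the method name could replace the documentation comment.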

The “But the Only Users of My API are On My Team” Argument

Just because your users are coworkers, you think you can treat them like crap? Set the tone. Be a good citizen. Treat them the way you want to be treated. Just because they have access to the source code does not mean they should be forced to read it in order to understand what it does.

Let’s assume for a minute that you aren’t the altruistic team player that I believe you are… there is another, more selfish reason that you don’t want your users to read your source code: Once they understand the internals of the implementation, they will optimize their code according to your implementation, and you will never be able to change it without breaking applications that depend on that behavior. I’ve seen it happen hundreds of times, and it is always a painful lesson for those involved.
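Here is a contrived sketch of how that coupling happens. All of the names (TagSource, the two implementations, and the two callers) are hypothetical. The documented contract promises no ordering, but the first implementation happens to return sorted results, and a caller who has read the source starts to rely on it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical example of a caller depending on undocumented behavior.
final class ImplementationCoupling {

    /** Returns the known tags. The order of the returned list is not guaranteed. */
    interface TagSource {
        List<String> tags();
    }

    // Version 1 happens to return sorted tags, though the contract never promised it.
    static final class SortedTagSource implements TagSource {
        public List<String> tags() {
            return new ArrayList<>(Arrays.asList("alpha", "beta", "gamma"));
        }
    }

    // Version 2 changes the internals and still honours the documented contract.
    static final class ReorderedTagSource implements TagSource {
        public List<String> tags() {
            return new ArrayList<>(Arrays.asList("gamma", "alpha", "beta"));
        }
    }

    // Fragile caller: assumes the first element is the alphabetically smallest tag.
    static String firstTagFragile(TagSource source) {
        return source.tags().get(0);
    }

    // Robust caller: relies only on the documented contract and finds the minimum itself.
    static String firstTagRobust(TagSource source) {
        return Collections.min(source.tags());
    }

    public static void main(String[] args) {
        System.out.println(firstTagFragile(new SortedTagSource()));    // alpha
        System.out.println(firstTagFragile(new ReorderedTagSource())); // gamma: silently wrong
        System.out.println(firstTagRobust(new ReorderedTagSource()));  // alpha
    }
}
```

The second implementation broke nothing it promised, yet the fragile caller now returns the wrong answer without so much as an exception.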

Save yourself the pain and spend the extra three minutes it takes to just write the documentation comment. Everyone will be happier for it.
