Friday, May 25, 2012

A data communication standard - OData vs. GData vs. homebrew

Today I discovered OData.  Then I discovered GData.  Then I discovered that I don't like either one.

The idea of a data standardization protocol is appealing on the surface, but the two "official" implementations of a common web protocol for data consumption are not my cup of tea.  I do like tea, however.  But I'm picky about my tea and I'm picky about how I communicate with web servers.

First off, I don't like complicated protocols.  I find them annoying as do most other programmers.  This is one of the reasons why SOAP slipped on itself and fell on its derriere at the starting gate.  /hahagreatjoke

I've been working with JSON a lot lately.  It benefits greatly from having native support in every popular scripting language and easy-to-integrate libraries in the few languages that don't.  So, if we want to develop a standard protocol, JSON is the starting and ending point for modern development.

For some strange, bizarre reason, when a company builds a "standardization protocol", they also waste a lot of time building into said protocol a "discovery service".  Let's face it:  The concept of a "discovery service" is just broken.  It is a LOT easier to skip that mess and just apply a healthy amount of documentation, which works equally well.  "Works equally well" (aka Laziness) trumps innovation.

One of the things I do is build a success vs. failure indicator into my JSON replies:

{
  "success" : false,
  "error" : "The API key is invalid."
}

Or:

{
  "success" : true,
  "results" : [ ... ]
}

This makes it easy to test, from any programming language, whether or not a specific reply was successful without having to check for the existence of an error message.
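
For example, here's roughly what the client-side check looks like in PHP (the URL, the api_key parameter, and the action are made up for illustration):

// Hypothetical client-side check of the success/failure indicator.
$response = file_get_contents("https://api.example.com/endpoint?api_key=...&action=search");
$data = json_decode($response, true);

if (!is_array($data))  echo "The reply was not valid JSON.";
else if (!$data["success"])  echo "Error from the server:  " . $data["error"];
else
{
 // Do something useful with $data["results"].
}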

My JSON isn't very complex, but I imagine that if I were pulling a ton of data down, I'd break it up with pagination.  However, OData's "__next" implementation is not what I'd do.  I'd return the LIMIT-ed set of result IDs and offer a mechanism to select 'x' results based on a subset of those IDs with another query.  A server has no business dictating to a client how it should paginate results, and the server benefits from the performance gains:  most pagination reruns the same expensive query multiple times, whereas selecting a batch of IDs will always use an index.
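
In JSON terms, the first reply might look something like this (the "ids" field name is just what I'd pick, not part of any standard):

{
  "success" : true,
  "ids" : [ 101, 102, 105, 110, ... ]
}

The client then issues a second query - something like ?ids=102,105,110 - to pull the full rows for only the IDs it actually wants to display, which is an index-friendly lookup on the server.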

Another thing I do is version many of the APIs that I write.  "RESTful" interfaces are overrated and are kind of annoying when a query string works just as well, perhaps better.  So I just stick the version of the client API into my query string.  I generally do version breakage - if I upgrade the version of the server, I immediately break all clients using the older version.  If I were a large organization like Google, I might keep old API versions around for a few months.  However, I'm just one programmer, so trying to support multiple versions of an API would be a waste of my time.  I want you on the latest and greatest version, not some old version.  But I did just come up with a good idea:  Add an "api_expires" field to every response from the server.  Set it to "never" initially.  When you want to deploy a new version of an API or terminate an obsolete version, change it to some date/time in the near future.  This gives application developers fair warning and lets them programmatically do something about it (e.g. send an e-mail to themselves).
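
Something along these lines in every reply would do the trick (the date format is whatever you settle on - this is just the idea):

{
  "success" : true,
  "api_expires" : "never",
  "results" : [ ... ]
}

When a new API version is about to be deployed, that value changes to, say, "2012-08-01 00:00:00", and a client that checks it on every call can warn its developer well before things actually break.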

The last thing I do is encrypt some of my JSON requests and replies using shared secrets.  This allows for more secure communication between client and server even over plain-ol' HTTP.  I'll save this discussion for a separate post.

So there you have it - my homebrew protocol.  Simple is better.

Saturday, May 12, 2012

Creating Flexible Relational Database Tables via Data Serialization

I thought about titling this post "How to make a DBA's head explode" or "The blog post from hell:  A DBA's worst nightmare come true" or something like that.  Of the database administrators (DBAs) I've met - admittedly not that many - none of them liked my radical database design ideas.

Traditional database design says to create a field (column) for every little fragment of data you want to store.  I've found that this approach causes all sorts of immediate complications both at the database design layer and the application layer, and you have to nail the table design right off the bat or there will be big problems later on.  A lot of people get lost in the details of creating the database table and then lost in the details of linking the application to the database.  Good database design is perceived to be hard and therefore a good portion of the DBA's job is secure.

Over the past 1 1/2 years, I've been playing with an alternate approach that is in line with some of the pros of NoSQL data storage but applied to relational databases like MySQL and SQL Server.  By the way, I'm humorously watching the NoSQL crowd slowly being dragged kicking and screaming back into RDBMS-land.  Anyway, I've made many small- to medium-sized applications using my alternate approach and, from my software developer perspective, it has held up quite well under real-world use.

As a bonus, I have also discovered that I can crank out entire web applications in a matter of days where it would have taken me weeks to create the same applications using a traditional approach AND the applications are much more easily maintained.  Since "time is money" to most business owners, the potential time savings alone are sufficient reason to read on - plus you may even be able to get rid of a few DBAs.

Many years ago, I learned about the PHP functions serialize() and unserialize().  Basically, construct an array of whatever, call serialize() to convert it to a string, and then store the result somewhere.  To get the original data back, load the stored data and call unserialize().  PHP also has json_encode() and json_decode() functions for a more language-agnostic approach, but those functions are more annoying to use for various reasons.
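
If you haven't used them, the round trip looks roughly like this:

// Build an array of whatever and flatten it to a string for storage.
$info = array("favoritecolor" => "blue", "newsletter" => true);
$str = serialize($info);
// $str is now:  a:2:{s:13:"favoritecolor";s:4:"blue";s:10:"newsletter";b:1;}

// Later on, load the stored string and get the original array back.
$info2 = unserialize($str);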

Let's look at a traditional table:

id - Auto-incrementing integer.
rid - An integer that is used in another table (i.e. foreign key for join operations).
name - A user-defined string relating to the table row.
created - A date/time object.
lastupdated - Another date/time object.
misc_field_1 - A column.
misc_field_2 - A column.
misc_field_3 - A searchable column.
misc_field_4 - A column.
...
misc_field_25 - A semi-searchable column.
misc_field_26 - A column.
misc_field_27 - Seriously, how many columns are there?
...

Creating and maintaining such tables is a nightmare.  Want to add or remove a column?  That becomes a major modification of the application, rife with mistakes and subsequent bugs waiting to happen.  And let's say not every column is used for every row - now the data technically isn't "normalized" and a lot of DBAs will totally freak out over that.  It is fun to watch them freak out as long as you aren't the recipient of their ire.

Somewhere along the line, it dawned on me that I could merge serialized data and databases together.  It took a few tries to get a combination that worked well, but that's the point of this article:  to share a radical paradigm shift in relational database table design that I've found works really well.

So now let's look at a modified schema:

id - Automatically incrementing integer.
rid - An integer that is used in another table (i.e. foreign key for join operations).
name - A user-defined string relating to the table row.
created - A date/time object.
lastupdated - Another date/time object.
misc_field_3 - A searchable column.
info - A very large string (e.g. 'mediumtext' or 'longtext').

Every extraneous field that isn't going to be heavily searched on goes into 'info'.  That is, less-frequently-used data gets serialized and stored in the 'info' column.  Fields that are heavily searched on and/or need indexes for performance reasons keep their usual column structure.  The 'info' field can still be searched, but such searches shouldn't happen very often.

This approach also introduces a significant level of flexibility to the application design.  Want to add or remove a field?  Just add the field to or remove the field from the application.  The change can be made in a matter of minutes and doesn't affect existing data in the database.

If you are already toying with this idea in your mind, I want to impart one last bit of requisite knowledge:  Write a loader function.  Instead of calling unserialize()/json_decode(), make a couple of functions that follow this sort of approach:

// Fill in any missing keys so the application can always
// rely on every expected field being defined.
function ProcessInfoDefaults($info, $defaults)
{
 foreach ($defaults as $key => $val)
 {
  if (!isset($info[$key]))  $info[$key] = $val;
 }

 return $info;
}

// Declare every field the application expects along with its default value.
function InitUserInfo($info)
{
 $defaults = array("somefield" => "", "somefield2" => "", "somefield3" => "",
  ...
  "somefield20" => "");

 return ProcessInfoDefaults($info, $defaults);
}

// Decode the serialized 'info' column from a database row and apply defaults.
function LoadUserInfo($row)
{
 $info = @unserialize($row->info);

 // Guard against a corrupt or empty 'info' column.
 if (!is_array($info))  $info = array();

 return InitUserInfo($info);
}

That way, the initial insertion of a row into the database can just use 'serialize(array())' and you are guaranteed that the fields the application expects are always defined, even if you add a new field later on.

When updating the data, don't create a new array; just overwrite the existing values in the array with the new values.  This lets removed fields keep their data in the database with no data loss.  Traditionally, a removed column would either be dropped and its data lost, or kept around with new rows filling it with empty/null data.
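
Here's a rough sketch of that lifecycle - an initial insert with an empty serialized array, then an update that only overwrites what changed - against a hypothetical 'users' table built on the modified schema above.  PDO is just my choice for the sketch and the field values are made up; LoadUserInfo() is the loader from earlier:

// Initial insert:  every row starts with an empty serialized array.
$db->prepare("INSERT INTO users (rid, name, created, info) VALUES (?, ?, NOW(), ?)")
   ->execute(array($rid, $name, serialize(array())));

// Update:  load the row, overwrite only the values that changed, save it back.
$row = $db->query("SELECT * FROM users WHERE id = " . (int)$id)->fetchObject();
$info = LoadUserInfo($row);
$info["somefield"] = "new value";

$db->prepare("UPDATE users SET info = ?, lastupdated = NOW() WHERE id = ?")
   ->execute(array(serialize($info), $id));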

This approach makes tables "flexible".  Easily add and remove "columns" without making major application changes.  I've also found that I can add new fields and modify an application already in production at my leisure.  This is something that is nearly impossible with the traditional method of table design.  I've been there, done that, had the virtual heart-attack.

This approach also allows for "concept tables" - something that is traditionally very hard to do in a relational database and would require multiple tables to accomplish.  Use a field named 'type' to determine what 'info' looks like to the application.  An example of this could be an 'asset' table that stores detailed information about related but vastly different asset types.  Of course, this is ripe for abuse by bad programmers who would store everything for an application in one table and create a nightmare of a loader function.  Designing a concept table requires serious thought and careful analysis of alternative approaches.  When done right, application performance is impressive and the application becomes incredibly flexible with future growth options.
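
To make that concrete, here's a minimal sketch of a type-based loader for a hypothetical 'asset' table (the asset types and their fields are invented for illustration; ProcessInfoDefaults() is the same helper as before):

// Pick the set of expected fields based on the row's 'type' column.
function LoadAssetInfo($row)
{
 $info = @unserialize($row->info);
 if (!is_array($info))  $info = array();

 if ($row->type == "vehicle")  $defaults = array("make" => "", "model" => "", "mileage" => 0);
 else if ($row->type == "computer")  $defaults = array("cpu" => "", "ram" => "", "os" => "");
 else  $defaults = array();

 return ProcessInfoDefaults($info, $defaults);
}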

At this point, you might still be on the fence about whether or not designing a table like this is a good idea.  And I fully understand your concern because you are intelligently asking, "What if I want to search 'info' for specific data - isn't that going to be slow?"  Well, yes and no.  If you only filter a query by 'info', then that will be a full table scan and therefore will slow down over time.  However, if you filter first by an indexed field, you will be left with a much smaller result set and that part of the search will be fast because index-based searches are fast.  Then filter on 'info' looking for the string (or strings) you want to find.  This will probably return more rows than it should, but the idea is to let the database do a bunch of fast filtering of results and only return "LIKE-ly" results (like-ly - get it?  /nerdgasm).  Then, unserialize the 'info' column with the loader function and do a final round of filtering in the application before displaying the results.  Also, I like to apply a reasonable limit on the number of results at the database level (e.g. 300 rows).  A combination of these things will severely restrict the performance hit against the database.
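
Sketched out, the whole sequence might look like this (the 'rid' filter, the LIKE pattern, and the favoritecolor field are placeholders for whatever your application actually searches on):

// Let the database do the fast, index-backed narrowing plus a crude LIKE pass,
// capped at a reasonable number of rows.
$stmt = $db->prepare("SELECT * FROM users WHERE rid = ? AND info LIKE ? LIMIT 300");
$stmt->execute(array($rid, "%blue%"));

$matches = array();
while (($row = $stmt->fetchObject()) !== false)
{
 $info = LoadUserInfo($row);

 // Final, exact filter in the application before displaying results.
 if ($info["favoritecolor"] === "blue")  $matches[] = $row;
}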

I'd love to hear your thoughts on the idea of storing serialized data in an RDBMS.  (All off-topic, insulting, flame war worthy comments along the lines of "You are stupid, NoSQL is 'better'" will be rejected.)  The approach is a vast departure from traditional database and application designs and it does bring some NoSQL-ness to the RDBMS party.  The resulting applications are fast enough and, more importantly to me, easy to construct and maintain.  I applied these concepts to my latest ambitious project, a Single Sign-On Server/Client.  In fact, that software project would likely never have left the ground had I not used serialized data - even with it, the application took many months to develop.

Saturday, May 05, 2012

Dear Web Developers, Stop making login systems

It seems to be a rite of passage to author a login system - or several login systems - over the course of one's life as a web developer.  Some systems are more elegant than others but I have to ask:  Is it really necessary?

Let me provide some background.  I was mulling one day and realized that every website I've ever come across has one of two problems:
  1. Multiple login systems all over the place with multiple registration mechanisms (the result of using open source products or just bad programming habits).
  2. Some homegrown, half-baked solution riddled with security issues and poor UI design choices that breaks in half the browsers out there and irritates every user of the system.
Or some mix of both problems.  And don't think it is just limited to cheesy little one-off implementations.  Major big-name websites that people actually use every day are all in the same boat with the little dinky one-off sites.


Then I got to thinking about how many man-hours are wasted every year dealing with login systems.  It takes about half an hour to build a login system (one of the cheesy one-off varieties that have zero security - salting and hashing passwords isn't good enough, BTW).  But that's where the problems start - you discover shortly after that there are spammers, hackers, script kiddies, trolls, etc.  So if you deal with those folks, that's about three months of development time over the lifetime of the system.  But if you don't deal with those folks, your websites will be hacked or spammed to death in short order.  If every programmer, approximately 12 million people in the world (and growing), constructs at least one login system every couple of years, that's roughly:

12,000,000 programmers * (3 months * 31 days * 24 hours) / 2 years = 13,392,000,000 hours

This says to me that approximately 13.4 billion man-hours are wasted each year dealing with login systems.  Okay, so not every programmer writes for the web or authors a login system, but we're still talking about an estimated minimum of two billion hours wasted each year.

This utter waste of time ends now.

Over the past 6 1/2 months, I've been working on a brand new platform:

Single Sign-On Server/Client

This offers a complete, generic login system as an SSO server and client pair.  This system is so much more than just your average login system:  under the hood is a powerful core that scales out with high availability and offers all sorts of useful security features.  Usually with scalability comes end-user complexity, but I've done my best to hide the complexities behind a nice administration interface and fairly simple installers.  Whatever I was not able to deal with in the admin interface, I thoroughly documented.

For your next project, consider using this SSO server and client instead of rolling your own login system.  You'll save time, money, and hair.