I thought about titling this post "How to make a DBA's head explode" or "The blog post from hell: A DBA's worst nightmare come true" or something like that. Of the database administrators (DBAs) I've met, which haven't been that many, none of them liked my radical database design ideas.
Traditional database design says to create a field (column) for every little fragment of data you want to store. I've found that this approach causes all sorts of immediate complications both at the database design layer and the application layer, and you have to nail the table design right off the bat or there will be big problems later on. A lot of people get lost in the details of creating the database table and then lost in the details of linking the application to the database. Good database design is perceived to be hard and therefore a good portion of the DBA's job is secure.
Over the past 1 1/2 years, I've been playing with an alternate approach that is in line with some of the pros of NoSQL data storage but applied to relational databases like MySQL and SQL Server. By the way, I'm humorously watching the NoSQL crowd slowly being dragged kicking and screaming back into RDBMS-land. Anyway, I've made many small- to medium-sized applications using my alternate approach and, from my software developer perspective, has held up quite well under real-world use.
As a bonus, I have also discovered that I can crank out entire web applications in a matter of days where it would have taken me weeks to create the same applications using a traditional approach AND the applications are much more easily maintained. Since "time is money" to most business owners, the potential time savings alone is sufficient to read on - plus you may even be able to get rid of a few DBAs.
Many years ago, I learned about the PHP functions serialize() and unserialize(). Basically, construct an array of whatever, call serialize() to convert it to a string, and then store the result somewhere. To get the original data back, load the stored data and call unserialize(). PHP also has json_encode() and json_decode() functions for a more language-agnostic approach, but those functions are more annoying to use for various reasons.
Let's look at a traditional table:
id - Auto-incrementing integer.
rid - An integer that is used in another table (i.e. foreign key for join operations).
name - A user-defined string relating to the table row.
created - A date/time object.
lastupdated - Another date/time object.
misc_field_1 - A column.
misc_field_2 - A column.
misc_field_3 - A searchable column.
misc_field_4 - A column.
...
misc_field_25 - A semi-searchable column.
misc_field_26 - A column.
misc_field_27 - Seriously, how many columns are there?
...
Creating and maintaining such tables are a nightmare. Want to add or remove a column? That becomes a major modification of the application rife with mistakes and subsequent bugs waiting to happen. And let's say not every column is used for every row, so now the data technically isn't "normalized" and a lot of DBAs will totally freak out over that. It is fun to watch them freak out as long as you aren't the recipient of their ire.
Somewhere along the line, it dawned on me that I could merge serialized data and databases together. It took a few tries to get a combination that worked well, but that's the point of this article - share a radical paradigm shift in relational database table design that I've found works really well.
So now let's look at a modified schema:
id - Automatically incrementing integer.
rid - An integer that is used in another table (i.e. foreign key for join operations).
name - A user-defined string relating to the table row.
created - A date/time object.lastupdated - Another date/time object.
misc_field_3 - A searchable column.
info - A very large string (e.g. 'mediumtext' or 'longtext').
Every extraneous field that isn't going to be heavily searched on goes into 'info'. The idea here is to serialize less-frequently-used data and store it into an 'info' column. Fields that are heavily searched on and/or need indexes for performance reasons still have their usual field structure. The 'info' field can be searched on but typically such searches won't occur very often.
This approach also introduces a significant level of flexibility to the application design. Want to add or remove a field? Just add the field to or remove the field from the application. The change can be made in a matter of minutes and doesn't affect existing data in the database.
If you are already toying with this idea in your mind, I want to impart one last bit of requisite knowledge: Write a loader function. Instead of calling unserialize()/json_decode(), make a couple of functions that follow this sort of approach:
That way, the initial insertion of a row into the database can just be 'serialize(array())' and you are guaranteed to always have the fields defined that the application expects even if you add a new field later on.
When updating the data, don't create a new array, just overwrite the existing values in the array with the new values. This allows removed fields to keep their data in the database with no data loss. Traditionally, the columns would be dropped and the data lost or the column kept around but new rows fill in the column with empty/null data.
This approach makes tables "flexible". Easily add and remove "columns" without making major application changes. I've also found that I can add new fields and modify an application already in production at my leisure. This is something that is nearly impossible with the traditional method of table design. I've been there, done that, had the virtual heart-attack.
This approach also allows for "concept tables" - something that is traditionally very hard to do in a relational database and would require multiple tables to accomplish. Use a field named 'type' to determine what 'info' looks like to the application. An example of this could be an 'asset' table that stores detailed information about related but vastly different asset types. Of course, this is ripe for abuse by bad programmers who would store everything for an application in one table and create a nightmare of a loader function. Designing a concept table requires serious thought and careful analysis of alternative approaches. When done right, application performance is impressive and the application becomes incredibly flexible with future growth options.
At this point, you might still be on the fence about whether or not designing a table like this is a good idea. And I fully understand your concern because you are intelligently asking, "What if I want to search 'info' for specific data - isn't that going to be slow?" Well, yes and no. If you only filter a query by 'info', then that will be a full table scan and therefore will slow down over time. However, if you filter first by an indexed field, you will be left with a much smaller result set and that part of the search will be fast because index-based searches are fast. Then filter on 'info' looking for the string (or strings) you want to find. This will probably return more rows than it should, but the idea is to let the database do a bunch of fast filtering of results and only return "LIKE-ly" results (like-ly - get it? /nerdgasm). Then, unserialize the 'info' column with the loader function and do a final round of filtering in the application before displaying the results. Also, I like to apply a reasonable limit on the number of results at the database level (e.g. 300 rows). A combination of these things will severely restrict the performance hit against the database.
I'd love to hear your thoughts on the idea of storing serialized data in a RDBMS system. (All off-topic, insulting, flame war worthy comments along the lines of "You are stupid, NoSQL is 'better'" will be rejected). The approach is a vast departure from traditional database and application designs and it does bring some NoSQL-ness to the RDBMS party. The resulting applications are fast enough and, more importantly to me, are easy to construct and maintain. I applied these concepts to my latest ambitious project, which is a Single Sign-On Server/Client. In fact, that software project would likely have never left the ground if I did not use serialized data because it still took many months to develop the application.
Traditional database design says to create a field (column) for every little fragment of data you want to store. I've found that this approach causes all sorts of immediate complications both at the database design layer and the application layer, and you have to nail the table design right off the bat or there will be big problems later on. A lot of people get lost in the details of creating the database table and then lost in the details of linking the application to the database. Good database design is perceived to be hard and therefore a good portion of the DBA's job is secure.
Over the past 1 1/2 years, I've been playing with an alternate approach that is in line with some of the pros of NoSQL data storage but applied to relational databases like MySQL and SQL Server. By the way, I'm humorously watching the NoSQL crowd slowly being dragged kicking and screaming back into RDBMS-land. Anyway, I've made many small- to medium-sized applications using my alternate approach and, from my software developer perspective, has held up quite well under real-world use.
As a bonus, I have also discovered that I can crank out entire web applications in a matter of days where it would have taken me weeks to create the same applications using a traditional approach AND the applications are much more easily maintained. Since "time is money" to most business owners, the potential time savings alone is sufficient to read on - plus you may even be able to get rid of a few DBAs.
Many years ago, I learned about the PHP functions serialize() and unserialize(). Basically, construct an array of whatever, call serialize() to convert it to a string, and then store the result somewhere. To get the original data back, load the stored data and call unserialize(). PHP also has json_encode() and json_decode() functions for a more language-agnostic approach, but those functions are more annoying to use for various reasons.
Let's look at a traditional table:
id - Auto-incrementing integer.
rid - An integer that is used in another table (i.e. foreign key for join operations).
name - A user-defined string relating to the table row.
created - A date/time object.
lastupdated - Another date/time object.
misc_field_1 - A column.
misc_field_2 - A column.
misc_field_3 - A searchable column.
misc_field_4 - A column.
...
misc_field_25 - A semi-searchable column.
misc_field_26 - A column.
misc_field_27 - Seriously, how many columns are there?
...
Creating and maintaining such tables are a nightmare. Want to add or remove a column? That becomes a major modification of the application rife with mistakes and subsequent bugs waiting to happen. And let's say not every column is used for every row, so now the data technically isn't "normalized" and a lot of DBAs will totally freak out over that. It is fun to watch them freak out as long as you aren't the recipient of their ire.
Somewhere along the line, it dawned on me that I could merge serialized data and databases together. It took a few tries to get a combination that worked well, but that's the point of this article - share a radical paradigm shift in relational database table design that I've found works really well.
So now let's look at a modified schema:
id - Automatically incrementing integer.
rid - An integer that is used in another table (i.e. foreign key for join operations).
name - A user-defined string relating to the table row.
created - A date/time object.lastupdated - Another date/time object.
misc_field_3 - A searchable column.
info - A very large string (e.g. 'mediumtext' or 'longtext').
Every extraneous field that isn't going to be heavily searched on goes into 'info'. The idea here is to serialize less-frequently-used data and store it into an 'info' column. Fields that are heavily searched on and/or need indexes for performance reasons still have their usual field structure. The 'info' field can be searched on but typically such searches won't occur very often.
This approach also introduces a significant level of flexibility to the application design. Want to add or remove a field? Just add the field to or remove the field from the application. The change can be made in a matter of minutes and doesn't affect existing data in the database.
If you are already toying with this idea in your mind, I want to impart one last bit of requisite knowledge: Write a loader function. Instead of calling unserialize()/json_decode(), make a couple of functions that follow this sort of approach:
function ProcessInfoDefaults($info, $defaults) { foreach ($defaults as $key => $val) { if (!isset($info[$key])) $info[$key] = $val; } return $info; } function InitUserInfo($info) { $defaults = array("somefield" => "", "somefield2" => "", "somefield3" => "", ... "somefield20" => ""); return ProcessInfoDefaults($info, $defaults); } function LoadUserInfo($row) { $info = unserialize($row->info); return InitUserInfo($info); }
That way, the initial insertion of a row into the database can just be 'serialize(array())' and you are guaranteed to always have the fields defined that the application expects even if you add a new field later on.
When updating the data, don't create a new array, just overwrite the existing values in the array with the new values. This allows removed fields to keep their data in the database with no data loss. Traditionally, the columns would be dropped and the data lost or the column kept around but new rows fill in the column with empty/null data.
This approach makes tables "flexible". Easily add and remove "columns" without making major application changes. I've also found that I can add new fields and modify an application already in production at my leisure. This is something that is nearly impossible with the traditional method of table design. I've been there, done that, had the virtual heart-attack.
This approach also allows for "concept tables" - something that is traditionally very hard to do in a relational database and would require multiple tables to accomplish. Use a field named 'type' to determine what 'info' looks like to the application. An example of this could be an 'asset' table that stores detailed information about related but vastly different asset types. Of course, this is ripe for abuse by bad programmers who would store everything for an application in one table and create a nightmare of a loader function. Designing a concept table requires serious thought and careful analysis of alternative approaches. When done right, application performance is impressive and the application becomes incredibly flexible with future growth options.
At this point, you might still be on the fence about whether or not designing a table like this is a good idea. And I fully understand your concern because you are intelligently asking, "What if I want to search 'info' for specific data - isn't that going to be slow?" Well, yes and no. If you only filter a query by 'info', then that will be a full table scan and therefore will slow down over time. However, if you filter first by an indexed field, you will be left with a much smaller result set and that part of the search will be fast because index-based searches are fast. Then filter on 'info' looking for the string (or strings) you want to find. This will probably return more rows than it should, but the idea is to let the database do a bunch of fast filtering of results and only return "LIKE-ly" results (like-ly - get it? /nerdgasm). Then, unserialize the 'info' column with the loader function and do a final round of filtering in the application before displaying the results. Also, I like to apply a reasonable limit on the number of results at the database level (e.g. 300 rows). A combination of these things will severely restrict the performance hit against the database.
I'd love to hear your thoughts on the idea of storing serialized data in a RDBMS system. (All off-topic, insulting, flame war worthy comments along the lines of "You are stupid, NoSQL is 'better'" will be rejected). The approach is a vast departure from traditional database and application designs and it does bring some NoSQL-ness to the RDBMS party. The resulting applications are fast enough and, more importantly to me, are easy to construct and maintain. I applied these concepts to my latest ambitious project, which is a Single Sign-On Server/Client. In fact, that software project would likely have never left the ground if I did not use serialized data because it still took many months to develop the application.
this is a good approach unless you have complex applications with reporting functions or common schemas shared by many applications, if each one of those apps write and read data you end with a mess.
ReplyDeleteRegards
If a single database is shared between two or more applications, it will always be a mess. That's why a single, well-documented API to access each database that your reporting tools need is a better approach. Application isolation via an API also helps breed application security.
DeleteWordpress database stores a lot of serialized data in a few of its core tables. This has been already used for a good decade, however as the database grows so does the number of calls that need to be made to unserialize and serialize the data with some Wordpress sites that use a dozen or two plugins can have over 1000 database calls on every single page. So there is a +/- for everything.
ReplyDeleteWordPress plugins are generally poorly written to begin with. For the vast majority, they are written by people without a solid background in software development and, as an unfortunate side-effect of not vetting plugins and plugin authors before allowing them to publish on the WP website, nearly all of them introduce security vulnerabilities into the infrastructure. So the issues with serialized data are moot in the face of more serious problems that plague that product.
DeleteHi there! I've been working with this concept myself! So nice to read what you wrote! Have a great day!
ReplyDelete