April 17th, 2008
Is Your Storage Bloated?
I read an InfoWorld article a few weeks back talking about the problem with database bloat in enterprise environments and it got me thinking about how much I miss working with massive datasets and complex SQL queries. There was a time when I would dive head first into a huge pile of digits in an effort to make sense of sales trends, marketing effectiveness, inventory control and a host of other meaningful bits of data for several different employers, but alas, those days are gone. Instead, I get to manage five smaller databases accessed by 14 sites. These databases tend to grow by a laughable 12 megabytes per month.
The majority of the information in my databases is stats-collection related. This means that, if I were to rely exclusively on services like Google Analytics, I could effectively eliminate a good amount of data from my personal storage engines. This would free up quite a bit of space and, at the same time, force me to make use of another company’s much larger servers and databases to make decisions about content and site placement. While I’m not 100% confident in the numbers reported by Google Analytics, it would give me the gist of the information I regularly ask for.
But why not go even further?
Data Expiration Dates?
In the past, I argued quite often with some employers about the need to archive and remove older data from the main database. This was usually met with mixed feelings and has been rewarded with the standard “thanks, but no” response from every manager and contract employer I’ve ever had. We can certainly understand why a company wouldn’t want to archive and remove data from a central location, as you never know when you’ll need to know the detailed inventory transactions for a specific box of doo-dads that you had purchased in 1997. The main argument wasn’t really about eliminating the seemingly superfluous data that existed in some of these 40+ gigabyte SQL Server databases, but instead about speeding up the existing systems that would read, write and update these medium-sized databases.
SQL Server is fast and, with the right equipment, can easily handle multi-terabyte databases with trillions of records. However, while the database engine can certainly handle such a complex role, most software applications I’ve seen written are not well suited to dealing with this bloat, or with its removal. If I were to remove all data over a year old, it would simply be gone as far as those applications are concerned: there are no options to search through archived data, or even send a request to an administrator to look for that information. On the other hand, it’s completely possible to move the information to another database, which might introduce some performance benefits on the main server, but wouldn’t do anything about the data that we’re still creating and accumulating.
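For the curious, the "move it to another database" idea can be sketched in a few lines. This is only an illustration under some loud assumptions: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the table names (`transactions`, `transactions_archive`), the columns, and the sample rows are all made up for the example.

```python
import sqlite3
from datetime import datetime, timedelta


def archive_old_rows(conn, cutoff):
    """Copy rows older than `cutoff` into the archive table, then delete them."""
    with conn:  # one transaction: both statements commit, or neither does
        conn.execute(
            "INSERT INTO transactions_archive (id, created_at, detail) "
            "SELECT id, created_at, detail FROM transactions WHERE created_at < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM transactions WHERE created_at < ?", (cutoff,)
        ).rowcount
    return moved


def demo():
    # Hypothetical schema and sample data, purely for illustration.
    conn = sqlite3.connect(":memory:")
    for table in ("transactions", "transactions_archive"):
        conn.execute(
            f"CREATE TABLE {table} "
            "(id INTEGER PRIMARY KEY, created_at TEXT, detail TEXT)"
        )
    now = datetime(2008, 4, 17)
    rows = [
        (1, (now - timedelta(days=4000)).isoformat(), "box of doo-dads"),
        (2, (now - timedelta(days=400)).isoformat(), "last spring's order"),
        (3, (now - timedelta(days=30)).isoformat(), "recent order"),
    ]
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)

    # ISO-8601 timestamps sort correctly as plain text, so a string
    # comparison is enough to find everything over a year old.
    cutoff = (now - timedelta(days=365)).isoformat()
    moved = archive_old_rows(conn, cutoff)
    live = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
    archived = conn.execute(
        "SELECT COUNT(*) FROM transactions_archive"
    ).fetchone()[0]
    return moved, live, archived
```

Calling `demo()` moves the two rows that are more than a year old and leaves the recent one in the live table. Note that the archive here lives in the same database file; putting it on a separate server would take more plumbing, and the application would still need some way to query it.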
How does this relate to blogs? For most of us, the size of our database is something we couldn’t care less about. We’ll never have millions of blog posts, hundreds of categories, hundreds of thousands of comments and several hundred million stat-collection records. The only way we could succeed in building such a colossal database would be to start a popular multi-user blog that stores everyone’s data in the very same data tables.
However, while bloat may not be an issue for us, something far more sinister is eating away at our sites: relevance.
How many people visit our older posts? When I examine the data, the last time someone visited a post written before April of last year on j2fi.net was three weeks ago. The visit before that came two weeks earlier. In both cases, the visitor came from Google and, in both cases, my page was on the 20th result page. The people who came (and stayed a total of 8 seconds each) were not really interested in what I had to say about over-hyping terrorist threats on public WiFi networks.
And who can blame them? The article is not quite full of the frustration I felt on that day, nor is it as wordy as some of the more recent posts about nothing in particular.
So why keep it?
This idea is hardly new, as there are plugins for WordPress that will delete posts when they reach a certain age. If people aren’t going to read our older posts, we might as well ditch them, right? Not only would this keep our system’s database small and manageable, but it would give us a better opportunity to revisit a specific topic and write about it again. There are problems with this approach, but for anyone running their site from an under-powered web server, this could be a viable solution to the speed issues.
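As a rough sketch of how such an expiry rule might look, here is a small Python function that flags posts that are both past a certain age and unvisited lately. This is not how any particular WordPress plugin works; the visit check is my own addition, and the function name, the thresholds, the slugs and the dates are all hypothetical.

```python
from datetime import date, timedelta


def stale_posts(posts, today, max_age_days=365, idle_days=14):
    """Return slugs of posts that are both old and unvisited lately."""
    age_cutoff = today - timedelta(days=max_age_days)
    visit_cutoff = today - timedelta(days=idle_days)
    stale = []
    for post in posts:
        too_old = post["published"] < age_cutoff
        last_visit = post["last_visit"]  # None means never visited
        ignored = last_visit is None or last_visit < visit_cutoff
        if too_old and ignored:
            stale.append(post["slug"])
    return stale


# Made-up sample data: one old and ignored post, one old but still
# visited post, and one brand new post nobody has read yet.
posts = [
    {"slug": "wifi-terror-hype", "published": date(2007, 2, 10),
     "last_visit": date(2008, 3, 27)},
    {"slug": "old-but-popular", "published": date(2006, 11, 1),
     "last_visit": date(2008, 4, 16)},
    {"slug": "storage-bloat", "published": date(2008, 4, 17),
     "last_visit": None},
]
candidates = stale_posts(posts, today=date(2008, 4, 17))
```

Only the old, ignored post ends up in `candidates`. Returning a list of candidates rather than deleting anything outright leaves the final decision with a human, which seems safer given the problems mentioned above.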
I think we do keep too much data, but I also believe that there is not enough of a cost involved for most of us to sit down and determine what should stay and what should be deleted. It’s the same problem that we see with email, in that it’s not always in our best interest to delete a message. Heck, I’m a nobody, and I still have every non-spam email sent to me since 1998! So unless there is a clear and definite cost involved with managing our excessive amounts of data, most of us will opt to keep our historical data because it’s just easier than going back through our records and eliminating the things that are no longer relevant.
Is the email I received from a high school friend on July 12th, 1998 telling me only “Get on IRC, now” worth keeping? Not particularly. Is it worth deleting? Well … if I delete that one, I might as well delete everything from that person right up until we lost touch with each other in 2006. Either way, the storage space is cheap, so I’m not reaching for the delete key.
So … ?
Though the InfoWorld article was talking about enterprise-level databases and recommending companies use expensive software packages to manage, construct, and maintain auxiliary databases for archive query purposes, we can see how this might start to play a role in our lives in the next decade or so.
We are putting more and more information online every year. As it sits, most of us store this data on huge corporate servers that are sitting somewhere on the planet with a (seemingly) endless number of hard drives available for our use. With the rise of home servers, we’ll soon begin to see far more of them double as simple web servers with rich content hosting capabilities.
Streaming our audio and video collections from home will soon become a normal everyday occurrence. Hosting an image gallery with the 10,000 high-resolution digital pictures we’ve taken over the last few years will no longer be an issue. Setting up a blog on these little machines will also be pretty simple, as 99% of all blogs in the world do not attract enough attention to warrant the massive servers they currently sit on with the various shared-hosting services.
All of these things are possible now, and some of us started doing much of it three years ago on some pretty archaic hardware. As more and more of us build multi-terabyte home storage solutions, we’ll need to give more consideration to what can stay, and what should go. That said, some of us should decide in the near future whether we’ll be digital pack rats, or transient data savers.
Regardless of which direction each of us chooses, we’ll soon see that the transient data savers are the people who learn early on the value of current and relevant information, while the digital pack rats learn the value of a Google Mini 1U server in their data closet.
What do you think of all the outdated or irrelevant information we’re storing on our personal computers and web spaces? Is it time we start paying more attention to the content of these systems?

For business, I agree with the idea of moving data that’s over a certain age to a secondary database to be called upon infrequently, but for blogging I would never delete an old post.
Many of my visitors that have come through a search engine do so on old posts. The last thing I’d want them to find is a 404 because I’d been cleaning up old posts.
Plus each old post is a single page as far as the almighty Google is concerned and thus counts as a small increment towards increasing my page rank and the value of doing a link exchange with me.
We can certainly understand why a company wouldn’t want to archive and remove data from a central location, as you never know when you’ll need to know the detailed inventory transactions for a specific box of doo-dads that you had purchased in 1997.
Most of the time the need to retain this data isn’t to get at the record so much as for legal purposes. Between contractual agreements with customers, legal requirements for various alphabet letter legislation and so on … some of our data needs to be retained for up to a decade.
Other times it’s just plain lack of foresight. I used to own a label printing application. Now this is a high-powered beast that gets data from the ERP system and sends jobs to hundreds of printers world-wide. Yet .. they’re just _labels_ with extracted data from elsewhere.
We were going to get rid of the data as soon as it was printed. Then the users dropped in a requirement for label re-print without referring to ERP.
At that point the PM sort of threw up his hands and decided we’d revisit the get rid of the data problem later. It’s been five years now and there are records in that db that date from the very beginning of that system.
But .. blog entries. As long as Google has no problem with me it’s not a biggie. When I run it on my own server I’ll worry about it. Or maybe not - how many MB does a blog post take up in MySQL?