To Date or not to Date

Disclaimer: No, i’m not talking relationship advice.

When I began looking at possible database schemas for storing NetFlow data in MySQL, I was worried about space. I was working under different assumptions back then (the numbers I had for data in was 1/100th what it is now), and I was thinking that perhaps I could keep the whole database under the 32-bit boundary (for old filesystems, portability, etc.). Now that the 2GB barrier is clearly broken with the sheer amount of data I’m dealing with, I’m not concerned at all with disk space – hard drives are cheap, or so says my boss. What does this mean for the schema? No more CPU-disk tradeoffs in favor of disk space. But does that also mean I can drop additional disk space, and a bit of CPU, for coding convenience, maintainability, extensibility, and ease of use?

In the short-term NetFlow tables, data lives for only a month or two at a time. Any more than that and insert times go down the drain — indices get too large for MySQL to cache adequately, even when there’s only two indices on the whole table. One of those indices is on a SMALLINT field (that’s a 16-bit integer for you non-MySQL types) corresponding to data in a SMALLINT-DATETIME lookup table.

Obviously there’s CPU overhead involved in converting the smallint back into a date, or vice versa, whenever the database is accessed. Furthermore, there’s a code inconvenience/overhead involved. Lastly, but certainly an important consideration, is that the system becomes less scalable: when places like NYU have the resources (such as a system with 32GB of RAM, like some of the Sun systems they’ve got there for network security) to keep a couple years’ worth or NetFlow data in MySQL, the system will break because 16-bit integers only allow for just over 1.8 years’ worth of 15-minute reports. Or consider an institution that wished to have more granular data, suppose 5- or 1-minute reports; not even considering the extensive code changes required to move from 15-minute granularity to some other level of granularity, the system comes crashing down in a mere 45 days. This is unacceptable. Even though my superiors at the workplace currently don’t intend on using on using the system in this way, chances are that their demands will change over the next year, and I want to be able to adapt the system to accommodate for those changes.

So now what I have a motivation for change, the question is on the direction in which to make the change. There are a few good options.

Bump the SMALLINT to a MEDIUMINT
Adding 8 bits would get me somewhere a lot more safe in terms of maximum capacity. But again we have the problem of scalability/maintainability if we want to change the granularity of the system. Also, like before, the system uses different date metrics for different tables, and things get confusing as well as difficult to maintain.

Drop the lookup table and move to DATE and TIME fields
Allright, it sounds a lot like overkill to store 48 bits where I could deal easily with 24. I’ll admit that. My logic is that since storage is no longer an object, things like this are possible. CPU-wise, I’m not sure exactly what the tradeoff will look like numerically; there should be a slight increase (okay, maybe not slight) in time taken to insert into the table because indices are larger, but total query time will be less because the number of queries is lower (no need for an extra lookup query on the date field). An added benefit, though, is that the entire backend could work under the same date metric. The huge plus here is that the system is completely scalable and 100% granularity-agnostic. Half minute, even ten second, reports? Sure, you got it. I can’t fathom needing anything less than one-minute reports, but I’m a stupid programmer/engineer, so I’ll quarter my minimum expectations and settle on 15-second reports as an absolute minimum (after which I would expect the system to just be real-time and not periodical in nature, which would require a restructure of the system anyhow).

I’m leaning towards the latter option above, but am open to suggestions and whatever my benchmarks report. I’ll make the final decision on this hopefully by the end of the day, definitely by the end of tomorrow.

1 Responses to “To Date or not to Date”


  • I did not get any of it but I like your style. You write so well. Sounds like you know what you are talking about too.
    I am so proud of you.
    Mom

Leave a Reply