Webinar “Create secure LAMP applications”

August 20th, 2008

Last week, noted PHP security expert Stefan Esser from SektionEins fame held a webinar together with MySQL regarding the topic "Create secure LAMP applications". Unfortunately, this Webinar was in German, but if you understand German you might be interested in the Webinar recording which is now available at MySQL's website. If you have further questions regarding Web Application Security or MySQL security, don't hesitate to contact us.

The missing pieces in the protobuf binary log

August 20th, 2008
Protobuf comes with a minor problem: it does not have support for handling "type tagged structures", that is, something reminiscent of objects in OOP lingo, so if one is going to have a heterogeneous sequences of messages, you have to roll it yourself. For that reason, I added a transport frame for the messages in the binary log that wraps each with some extra information. In addition to allowing the binary log to be a sequence of messages, it also adds some integrity-checking data and simplifies some administrative tasks.

Transport frame with message
Length
Type Tag
Message
Checksum
The format of each message in the sequences is given in the table in the margin. where the length is a specially encoded length that we will go through below, type is a single byte being the type tag, message being one of the messages given in the specification, and checksum being a checksum to ensure the integrity of the transport.

Checksum. As checksum, the plan is to use a CRC-32. We don't want it to be too large to affect performance, and we want it to catch reasonable losses of integrity. I'm considering storing this as a varint after the actual message, but for the time being, it is given as 4 raw bytes (it is not implemented at all yet). Please give me feedback on this: if we make it a varint, we can stuff the checksum in there, but that will also run the risk of not being able to read the checksum due to corruption of the checksum, so offhand, I would say that a fixed number of bytes is preferable.

Type Tag. The type tag is a single byte giving the type of the event. This means that we are limited to 255 events, but considering that we don't even have 26 events in 5.1 right now, I don't see that we will run into that limit very soon. It is possible to put the type tag in the message as well, and specifying it as an enum inside the protobuf specification, but that will just provide the information in two places, so it is better to keep it separately.

Length. The length is the length of the message, that is, it does not include the type tag, length, nor the length itself. This simplifies the normal processing, and in the event that one needs to skip an event, it is easy to compute the next position of a transport frame by just decoding the length (see below).

Length encoding

The length is encoded using a special scheme to allow for very little overhead for small events while still leaving room for giving the length for very large events. This scheme currently allows for a compact representation of lengths from 2 bytes 4 GiB (Gigabytes) and, if you don't need to have a compact representation of 2, you can represent lengths in the range 3 bytes to to 16 EiB (Exabytes) [sic].

The basic idea is to note that the length of a message can never be zero, and the minimal length in this case is actually 16 bytes. Since we will never have a length that is less than 16 for an event, that leaves the lengths 0-16 available for denoting other information. The obvious solution is to let the first byte denote either the length, if it is, say greater than 8, or the number of bytes following the byte that gives the length if it is less than or equal to 8, but we can actually do better.

Storing Length requires ceil(log256(Length)) bytes, so if we let the first byte L be

ceil(log2(ceil(log256(Length)))) - 1
(that is, the smallest power of two that is greater than the number of bytes needed for the length, minus 1), we can get away with reserving significantly fewer values. So, for example:

Bytes Value
2A 42
00 FF 02 767
01 FF 01 01 00 66047

Computing the number of bytes can be done by computing 1 << (L + 1), but computing the inverse is a little more involved. The following two functions does the job. Although it looks like a lot of code, length_encode() is actually only 29 instructions on my machine (no function calls), while length_decode() is about 7 instructions.

inline unsigned char *
length_encode(size_t length, unsigned char *buf)
{
  unsigned char *ptr= buf;
  assert(length > 1);
  if (length < 256)
    *ptr++= length & 0xFF;
  else {
    int_fast8_t log2m1= -1;        // ceil(log2(ptr - buf)) - 1
    uint_fast8_t pow2= 1;          // pow2(log2m1 + 1)
    while (length > 0) {
      // Check the invariants
      assert(pow2 == (1 << (log2m1 + 1)));
      assert((ptr - buf) <= (1 << (log2m1 + 1)));

      // Write the least significant byte of the current
      // length. Prefix increment is used to make space for the first
      // byte that will hold log2m1.
      *++ptr= length & 0xFF;
      length >>= 8;

      // Ensure the invariant holds by correcting it if it doesn't,
      // that is, the number of bytes written is greater than the
      // nearest power of two.
      if (ptr - buf > pow2) {
        ++log2m1;
        pow2 <<= 1;
      }
    }
    // Clear the remaining bytes up to the next power of two
    while (++ptr < buf + pow2 + 1)
      *ptr= 0;
    *buf= log2m1;
    assert(ptr == buf + pow2 + 1);
  }
  return ptr;
}

inline const unsigned char *
length_decode(const unsigned char *buf, size_t *plen)
{
  if (*buf > 1) {
    *plen = *buf;
    return buf + 1;
  }

  size_t bytes= 1U << (*buf + 1);
  const unsigned char *ptr= buf + 1;
  size_t length= 0;
  for (unsigned int i = 0 ; i < bytes ; ++i)
    length |= *ptr++ << (8 * i);
  *plen= length;
  return ptr;
}

How To Use the Analysis Services Execute DDL Task in SSIS 2005

August 20th, 2008
We are looking to automate some tasks to be performed on our SQL Server Analysis Services Servers. Can you give us the details on how the Analysis Services Execute DDL Task can be used in a SQL Server Integration Services (SSIS) package?

Investigating SQL Server 2008 Wait Events with XEVENTS

August 20th, 2008
Some reasons for the slow-running of database applications aren't obvious. Occasionally, even the profiler won't tell you enough to remedy a problem, especially when a SQL Statement is being forced to wait. Now, in SQL Server 2008, come XEvents, which allow you to look at those waits that are slowing your SQL Statements. Mario Broodbakker continues his series about SQL Server Wait Events

Using XML to Enhance the Performance of String Manipulations

August 20th, 2008
This article from Wayne Sheffield shows how the use of XML can speed up those string manipulations in your T-SQL code.

SQL Know How - Best Practices in Performance Seminar

August 20th, 2008
SQL Know How is holding a 3 day seminar in Hatfield, Hertfordshire in the UK this September 1-3.

Green & Nutty: Soap nuts - laundry, shampoo, etc

August 20th, 2008
disclaimer: this is a blatant plug for a friend of mine, Kristy Bennett.

Kristy runs (among many things ;-) a company called Green & Nutty, selling so-called soap nuts (sapindus mukorossi). Actually, only the shells are used and no actual nuts should be part of what you get as per Australian quarantine regulations (the nuts themselves don't contain the soap compound).

So anyway, you stick one half soap nut shell per kilo of washing in a little bag (provided) and add to washing, and that's all you need - forget about the old laundry powder/liquid or any other blah. After a handful of washes (depends on washing temp), you add a few more shells into the bag. You can also make shampoo by simply boiling a few shells for a bit; info is on the website.

Anyway, I brought back a bag from Adelaide last week, and have been using them since; works just fine! If you like being a little bit serious about environmentally non-harmful compounds and methods, give this a try.

How to unit-test code that interacts with a database

August 20th, 2008

I got some interesting comments on my previous article about unit testing Maatkit, including echoes of my own conversion to the unit-testing religion. One of the objections I’ve heard a lot about unit-testing is how it’s impossible to test code that talks to a database. “It’s too hard,” they say. “Oh, it’s easy to test a module that calculates a square root, but a database? Way too much work!”

Is it really impossible or even hard?

I disagree. In one of my previous articles I said The Rimm-Kaufman Group, my previous employer, has a comprehensive unit-test suite. When I say comprehensive I mean it: database interaction is fully tested, too. I know because I was heavily involved in building it. Even extremely complex things like big reports that are generated from lots of data are tested. And believe me, sharding the databases would have been much harder without complete code coverage. It’s really not that complicated to unit-test against a database, and it’s so worth it. Here are some hints about how you can do this.

There are many ways to do it, but I’ll just describe the basics of the system I helped build. There are several moving parts to the test suite (”smoke“), but one of them sets a magical environment variable. And then, all code that connects to a database server magically gets back a different database connection from the create_me_a_connection() function. This is because there is a database connection abstraction library that respects the environment variable. It’s really pretty simple for the most part; instead of doing DBI->connect(…) you just call this function, which is a thin wrapper that hands back a connection object.

This wrapper is itself unit-tested thoroughly, too. This ensures that when some code is being run from a test, it cannot (I mean cannot!) connect to a production database, and vice versa. There are some conventions about production and test servers that make sure the abstraction library can tell for sure. If there’s any confusion, of course, it will die in a non-recoverable way. Safety first.

Building a good development environment

Just as each developer has their own copy of the code from version control, each developer has their own private database server running on the dev machine. There are some simple conventions that make this possible: Unix user ID plus a constant for the port number, etc. It’s really quite easy. The private database server is a slightly modified version of Giuseppe Maxia’s MySQL Sandbox tool. It can be torn down and set up afresh as desired. It is wiped clean and re-filled at the start of every test, with a small, tightly focused dataset carefully chosen to represent the conditions the code is supposed to work with. (Each test has its own dataset).

If this sounds like a system that can’t work on a large scale, well, it does. That’s the secret sauce that I won’t reveal in this post. (It’s my past employer after all, and I can’t go revealing everything about them can I?) You just have to be smart about it. When a database is central to your business, you either figure out how to get this right, or you pay the consequences in lost time and poor code quality.

I and the other developers there (another secret: it’s a small team; small teams build great things) built several quick utilities to help develop unit tests against a database. There are utilities to get a minimal necessary dataset for testing and dump it into a file that can be loaded by the test. There are utilities that can migrate schemas and update the tests to match the schema changes. And so on, and so on. This is possible because of careful planning for testability, and really smart things like super-consistent and sensible naming conventions for database objects. (Ruby On Rails owes a lot of its success to simple things like this, too. Conventions are really powerful.) Maybe I’ll write about the database naming conventions some other time — I have to credit Alan Rimm-Kaufman a lot for designing those conventions. It was a stroke of genius.

Things to avoid

There are several things I do not recommend doing when you unit-test code that talks to a database. I’ll just mention a couple:

<ul
  • Don’t mock anything! In general I think mocking is the devil. Most of the mock objects I’ve ever seen reflected a propensity to test an implementation instead of a behavior, which is also the devil. Write all your code to test a test instance of something real, and do not mock up a database to test against. It is a rabbit-hole that you will not emerge from easily.
  • Never let a test connect to a production database. Never, ever. Worlds of hurt will follow. Not only are you risking your production data, but what about the risk to your code? You’re testing against things that will almost certainly change and break your tests; and you’re possibly polluting your live data with testing data and/or changing live data from the tests.
  • I also recommend developing unit tests for your current database functionality if you’re thinking about changing it much. Don’t like MySQL’s lax error handling? Plan to set the SQL_MODE to something stricter? Dive into that database abstraction library and make your tests run in strict mode first by setting SQL_MODE on every new connection that’s created when running inside a test; fix all the breakage in the test suite; feel sure that your code isn’t going to break in production. That was easy!
  • Summary

    Once your creative juices get flowing, you’ll see tons of places your unit test suite can help you out.

    If you’re in the Oracle or SQL Server world, or any other world where you can’t just set up and discard database instances at will due to licensing problems, you’re going to have to be a little more inventive. But you can still do it. (Don’t you wish you’d chosen Freedom?) And unit tests are just as beneficial for apps based on Oracle as they are for MySQL.

    Have fun! Go forth and test some more!

    , , , ,

    Finding the Needle in your MySQL Haystack

    August 20th, 2008
    (First in what will probably be a series of blogs as we move all our projects to a replicated, MySQL 5.0 environment, and I finally get to start playing with all the useful features that come with it)Say you've got a client, Mystery Client A. Mystery Client A has hired a marketing consultant. As a part of their rebranding efforts, they have decided to refactor their company spelling convention to MysteryClientA!, for whatever reason. That's fine for replacing a few logos, but MCA has a database-driven content management system, and their name is riddled throughout the database in page content, event description, news headlines and so forth. Your job is now to sift through the entire system and apply the newly crafted spelling to the entire database. So what do you do? ...

    Finding the needle in your MySQL haystack

    August 20th, 2008
    (first in what will probably be a series of blogs as we move all our projects to a replicated, MySQL 5.0 environment, and I finally get to start playing with all the useful features that come with it)Say you've got a client, Mystery Client A. Mystery Client A has hired a marketing consultant. As a part of their rebranding efforts, they have decided to refactor their company spelling convention to MysteryClientA!, for whatever reason. That's fine for replacing a few logos, but MCA has a database-driven content management system, and their name is riddled throughout the database in page content, event description, news headlines and so forth. Your job is now to sift through the entire system and apply the newly crafted spelling to the entire database. So what do you do? ...