No matter what tool we use to automate tests, the tests need data to process. Ideally, they need realistic data that matches production data. However, production databases usually contain huge amounts of data, and they can be highly complex. Also, database access slows tests down dramatically. Like so much of agile testing, it’s a balancing act.
Data Generation Tools
As we write this book, several cool tools are available to generate test data for all kinds of input fields and boundary conditions. Open source and commercial tools such as Data Generator, databene benerator, testgen, Datatect, and Turbo Data can generate flat files or write data directly to database tables. These tools can generate a huge variety of data types, such as names and addresses.
It’s also fairly easy to generate test data with a home-grown script, using a scripting language such as Ruby or Python, a tool such as Fit or FitNesse, or a shell script.
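For example, here’s a minimal sketch of a home-grown generator in Ruby, using only the standard library. The field names, value ranges, and helper methods are invented for illustration, not taken from any particular project:

require 'csv'

# Generate a CSV file of random employee records (hypothetical fields).
FIRST_NAMES = %w[Ana Ben Chris Dana Eli].freeze
LAST_NAMES  = %w[Garcia Lee Novak Patel Smith].freeze

def random_ssn
  format('%03d-%02d-%04d', rand(1..899), rand(1..99), rand(1..9999))
end

def generate_employees(count, file_name)
  CSV.open(file_name, 'w') do |csv|
    csv << %w[ssn first_name last_name salary]
    count.times do
      csv << [random_ssn, FIRST_NAMES.sample, LAST_NAMES.sample,
              rand(30_000..120_000)]
    end
  end
end

generate_employees(1000, 'employees.csv')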
Lisa’s Story
Our Watir scripts create randomized test data inputs, both to ensure they are rerunnable (they’re unlikely to create an employee with the same SSN twice), and to provide a variety of data and scenarios. The script that creates new retirement plans produces plans with about 200 different combinations of options. The script that tests taking out a loan randomly generates the frequency, reason, and term of the loan, and verifies that the expected payment is correct.
We have utility scripts to create comma-separated files for testing uploads. For example, there are several places in the system that upload census files with new employee information. If I need a test file with 1,000 new employees with random investment allocations to a retirement plan, I can simply run the script and specify the number of employees, the mutual funds they’re investing in, and the file name. Each record will have a randomly generated Social Security Number, name, address, beneficiaries, salary deferral amounts, and investment fund allocations. Here’s a snippet of the code that generates the investment allocations.
# 33% of the time maximize the number of funds chosen, 33% of the time
# select a single fund, and 33% of the time select from 2-4 funds
fund_hash = case rand(3)
            when 0 then a.get_random_allocations(@fund_list.clone)
            when 1 then a.get_random_allocations(@fund_list.clone, 1)
            when 2
              a.min_percent = 8
              a.get_random_allocations(@fund_list.clone, rand(3) + 2)
            end
emp['fund_allocations'] = fund_hash_to_string(fund_hash)
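The helper object a and its get_random_allocations method aren’t shown in the snippet. As a rough idea of what such a helper might do, here is a hypothetical reconstruction (ours, not Lisa’s actual code): it picks the requested number of funds, defaulting to all of them, and splits 100% among them in random chunks no smaller than min_percent.

class AllocationBuilder
  attr_accessor :min_percent

  def initialize
    @min_percent = 1
  end

  # Returns a hash of fund name => whole percentage, summing to 100.
  def get_random_allocations(fund_list, count = nil)
    count ||= fund_list.size            # nil means "use every fund"
    funds = fund_list.shuffle.first(count)
    remaining = 100
    allocations = {}
    funds.each_with_index do |fund, i|
      # Leave enough room to give every remaining fund its minimum.
      reserve = (funds.size - i - 1) * @min_percent
      pct = (i == funds.size - 1) ? remaining : rand(@min_percent..(remaining - reserve))
      allocations[fund] = pct
      remaining -= pct
    end
    allocations
  end
end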
Scripts like these have dual uses: they serve as regression tests that cover many different scenarios, and as exploratory testing tools that create test data and build test scenarios. They aren’t hard to learn to write (see the section “Learning by Doing” earlier in this chapter).
—Lisa
Scripts and tools to generate test data don’t have to be complex. For example, PerlClip simply generates text into the Windows clipboard so it can be pasted wherever it’s needed. Any solution that removes enough tedium to let you discover potential issues with the application is worth trying. “The simplest thing that could possibly work” definitely applies to creating data for tests. You want to keep your tests as simple and fast as possible.
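In that spirit, one trick PerlClip popularized is the “counterstring,” a string that reports its own length at each * marker, which is handy for pasting into input fields to probe length limits. Here’s a minimal Ruby version (our sketch, not PerlClip itself):

# Each '*' sits at the position named by the digits just before it,
# e.g., counterstring(10) => "*3*5*7*10*"
def counterstring(length, marker = '*')
  result = ''
  while length > 0
    chunk = "#{length}#{marker}"
    chunk = chunk[-length..-1] if chunk.length > length  # trim the final chunk
    result = chunk + result
    length -= chunk.length
  end
  result
end

puts counterstring(25)  # paste the output into a field to find its limit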
Avoid Database Access
Your first choice should be tests that run completely in memory. They still need to set up and tear down test data, but the data is never stored in a database. Each test is independent and runs as quickly as a test can. Database access means I/O, and disks are inherently slow, so every trip to the database slows down your test run. If your goal is to give fast feedback to the team, you want your tests to run as quickly as possible. A fake object, such as an in-memory database, lets the test do what it needs to do and still gives instant feedback.
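To illustrate the fake-object idea, here is a minimal Ruby sketch; the repository interface and class names are invented for this example:

# Hypothetical in-memory stand-in for a database-backed repository.
# It honors the same interface the production code expects, so tests
# never touch a real database.
class InMemoryEmployeeRepository
  def initialize
    @rows = {}
  end

  def save(employee)
    @rows[employee[:ssn]] = employee
  end

  def find_by_ssn(ssn)
    @rows[ssn]
  end
end

# A test can set up and tear down its data in microseconds:
repo = InMemoryEmployeeRepository.new
repo.save(ssn: '123-45-6789', name: 'Pat Example', salary: 50_000)
raise 'lookup failed' unless repo.find_by_ssn('123-45-6789')[:name] == 'Pat Example'

Because the production code depends only on the repository interface, the fake and the real database-backed implementation are interchangeable.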
Lisa’s Story
One of our build processes runs only unit-level tests, and we try to keep its runtime under eight minutes for optimum feedback. The tests substitute fake objects for the real database in most cases. Tests that actually exercise the database layer, such as those that persist data to the database, use a small schema with canonical data originally copied from the production database. The data is realistic, but the small amount keeps access fast.