Guidelines for dataset design

Created 22 Aug 2016 • Last modified 19 Apr 2017

Some hints on how to design data-entry and data-storage systems for social science to make using the data as painless as possible.

Generalities

Don't ask a subject a single question more than once at the same timepoint, unless you're specifically trying to examine the consistency of responses. Don't ask subjects logically redundant questions (e.g. if you ask a subject whether they've drunk coffee in the past 24 hours and they say yes, don't then ask them if they've drunk coffee in the past week), or permit logically inconsistent answers (e.g., a subject answering "Monthly" to "How often do you drink alcohol?" and "Weekly" to "How often do you drink 6 or more alcoholic drinks on one occassion?"), with the same exception.

Never use paper for data collection, unless you have no choice (e.g., you have to run a lot of subjects at once but there aren't enough computers for all of them, not even when you count their own cellphones as computers). In order for anybody to do anything with paper data, it needs to be transcribed to a digital form, which can take time and money and introduces another source of error. And storing the paper can be massively inconvenient. Furthermore, paper forms prevent you from checking responses as they're entered (as suggested in detail below). If you must use paper, scan or transcribe all of it and then shred the originals as soon as possible.

If the data-collection or data-storage software you're using doesn't do what you need, use different software, or edit the software (hire a programmer for this purpose if necessary). Don't let the whims of outside programmers limit your data quality.

To the degree that variables (or entire datasets) are parallel in meaning or function, they should also be parallel in form and structure. For example, if a subject is asked to recall how much they ate 3 days ago and this variable is called food_recall_3_days, then a similar question asking about 2 days should use the variable name food_recall_2_days, not food_recall_2days or food_recall_2_days_ago or 2_day_food_recall. Observing the principle "Don't repeat yourself" can help you to avoid accidentally creating inconsistencies like this.

Input

How missing values are represented should be uniform across a dataset. Whenever there's more than one reason a value can be missing, and the reason for each case can be determined (e.g., the question was skipped due to skip logic vs. the interviewee refused to answer a question), use different missing-value codes to indicate the different reasons for missingness. Whenever possible, don't represent missing values as numbers, because then they're too easy to accidentally treat as non-missing. Even using an obviously bogus number, such as −1 for a subject's number of siblings, should be avoided when possible, because a careless analyst might, e.g., compute the mean of this variable without checking the unique values first.

When you need to encode the answer to a yes-or-no question as a number, use 0 for "no" and 1 for "yes". (This is how most programming languages perform Boolean operations on numbers.) In these cases, it's useful to name the variable after the "yes" answer (e.g., "female" rather than "gender"; this is a hint to users of the dataset that "1" means "female").

Input that's expected to conform to a particular format (as nearly all input does) should have that format automatically enforced at the point of data entry. For example, if subject identification codes can only have capital letters, entering a lowercase letter should either cause it to be upcased or cause the data-entry program to produce an error message. Likewise, if there are only four legal responses to a question, then those are the only responses that should be accepted.

Where important but easily confused identification codes, such as subject identification numbers, must be entered, a check digit should be employed and its conformance automatically checked at the time of data entry. This scheme allows the person entering the data to notice and correct their own typo.

Text

Encode all text in UTF-8, or simply ASCII. (ASCII is a subset of UTF-8.) If you replace non-ASCII characters with ASCII, do this everywhere and in a consistent fashion. For example, you might transliterate "ü" to ASCII as "u" or as "ue", but be consistent.

All text strings (whether subject responses or researcher-written variable names) should have all whitespace characters at the beginning or end of the string removed, and unless the string is expected to span several paragraphs, all other continuous runs of whitespace characters should be squashed to a single space each. It may sound obvious not to type spaces at the end of a variable name, but you might be surprised how easy it is to do this by accident.

Newlines, if there are any, should be in either Unix style (0x0A) or DOS style (0x0D 0x0A). Use a consistent style throughout the dataset.

Identifiers (e.g., variable names, response codes)

For the most part, identifiers should be in English.

Identifiers should contain only ASCII letters, digits, and underscores, and begin with a letter. This maximizes the odds that your identifiers will be natively representable in any given program, programming language, or data format. It's also a good idea to keep identifiers less than 256 characters long.

In identifiers, use capitalization with consistent rules, or not at all. What you want to avoid is two questions which are constructed in close parallel except one happens to have a certain letter in the variable name capitalized and one doesn't (e.g., one variable named Drinks_milk and another named Drinks_Soda rather than the parallel form Drinks_soda). Also, if you do use capitalization, capitalize the same identifier in the same way everywhere even if a case-insensitive system permits you to be inconsistent (e.g., Windows filesystems, on which C:\Program Files and C:\program files are the same thing), and don't use two distinct identifiers that are distinguished only by capitalization (or else case-insensitive software won't be able to tell them apart).

Check the spelling of identifiers. Abbreviations of various kinds are fine and expected, in order to keep identifiers from getting inconveniently long, but misspellings, like "anxios" for "anxious", are dangerous. Use a single regional style of spelling (e.g., American) throughout a dataset, and always spell the same word the same way. Try to match the style of spelling with the vocabulary: in the US, "cookies" have "flavor", whereas in the UK, "biscuits" have "flavour".

Dates and times

A wide variety of ways to input or display dates and times can be useful, but only a few different representations should be used for storage. These representations are unambiguous and are widely supported by existing software, or at least are easy to program support for.

Dates (that is, combinations of a particular year, month, and day of month, with no time-zone information) should be represented as one of:

  • A text string in the form 2016-09-03, per ISO 8601. Notice the leading zeroes.
  • An integer of the form 20160903. This is just the digits of the ISO 8601 form concatenated together to form a numeral.

Absolute times (that is, moments in time) should be represented as one of:

  • A text string in the form 2016-09-03T16:57:09Z or 2016-09-03T10:41:44-07:00, per ISO 8601. Use Coordinated Universal Time (UTC) unless the local time zone is relevant somehow. For example, if you're stopping Americans on the street and asking how hungry they are, people will probably be hungrier at 11 am local time (before lunch) than 2 pm local time (after lunch). (Avoid the term "Greenwich Mean Time" (GMT). It is ambiguous because it can refer to UTC or to UT1, which is an astronomical rather than atomic timescale that can differ from UTC by up to 0.9 seconds.)
  • A POSIX time (also called a "Unix time" or "epoch time"). This should be an integer if your timing information has 1-second precision and a rational number (probably stored internally as a floating-point or fixed-point number) if it's more precise than 1 second. (Although POSIX time is ambiguous around leap seconds, this is unlikely to matter in social science.)

High-precision times since the beginning of a task or session (as might be useful in, e.g., a psychophysics experiment) can be represented as just a rational number of seconds, or an integer of the highest-precision time unit you have.

Occasionally, you may need to represent some time-related information that's neither a date nor an absolute time nor a time since the beginning of a task. For example, you might ask subjects what their birthday is without asking for their age or year of birth, or you might ask subjects what their favorite day of the week is. In these cases, try to avoid using one of the above formats unaltered, because that will lead to confusion. Follow ISO 8601 when possible.

Four common problems with date representation that you really want to avoid (and which you will avoid if you follow the above rules) are:

  • Missing time zone information for absolute times.
  • Incorrect time zone information (e.g., "EST" when daylight-saving time is in effect; EST is different from EDT).
  • Ambiguity between the American numeric date style MM/DD/YYYY and the European style DD/MM/YYYY.
  • Two-digit years. (Remember back in the 90s when organizations had to goad COBOL programmers out of retirement with lots of money to fix year-2000 bugs? Try not to contribute to year-2100 bugs.)

File formats

Data should be stored, or available to be completely exported, in an unambiguous format for which there is at least one complete free-software implementation, such as JSON, HDF5, SQLite databases, XML, or RFC 4180-compliant CSV. (Of these, I recommend JSON, and HDF5 is a good alternative for datasets that are too large for JSON to be practical.) Using such file formats ensures that anybody can access the data and that it will still be accessible decades in the future when fashions in software have changed and Excel 2007 must be run in emulation if it can be run on conventional PCs at all.

Avoid other forms of CSV, which are easy to confuse with each other. Avoid formats that are proprietary or are tied tightly to a single application, such as Excel spreadsheets, SAS data files, R data dumps, and Python pickle files. Especially avoid inventing your own binary format (see Jamie Zawinski's critique of the Mork file format for an example of how badly this can go wrong). All of these more ephemeral or ill-defined file formats are fine for caching or otherwise holding intermediate results, but there needs to be an authoritative source of the data in a good file format.