Step 2: Prep and QC Your Data

In this step you organize, clean, and quality-control (QC) your data so that it is accurate, consistent, and well documented before you move on.


Stage 1: Organize & Clean Your Data

Describe and Preserve Your Data

  • Backup Raw Data: Always save a copy of the untouched, raw data in a separate, secure location as a backup.
  • Document Changes: Keep a log or script of all changes you make to your data during the cleaning process. This is crucial for reproducibility.
  • Understand Requirements: Be aware of when and where your data will need to be submitted for long-term storage.

Create a Consistent File Structure

  • Organize Folders: Use a logical folder structure to easily locate raw data, processed data, scripts, and final outputs.
  • Use Standard Naming Conventions:
    • No Special Characters: Avoid !, @, #, $, %, etc. in filenames.
    • No Spaces: Use underscores _ or hyphens - instead of spaces (e.g., Raw_Data_2022.csv).
    • Be Descriptive: Name files in a way that you’ll understand in the future.
    • Be Consistent: Use the same case (e.g., UPPERCASE, lowercase, CamelCase) for all your files.
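The naming rules above can be checked automatically before files are shared. The following is a minimal sketch, not an official tool: the function name `filename_problems` and the allowed character set (letters, digits, underscores, hyphens, and a single extension) are assumptions made for illustration.

```python
import re

# Assumption: a "safe" name is letters, digits, underscores, or hyphens,
# followed by exactly one extension (so "archive.tar.gz" would be flagged).
SAFE_NAME = re.compile(r"^[A-Za-z0-9_-]+\.[A-Za-z0-9]+$")


def filename_problems(name):
    """Return a list of naming-convention problems found in a filename."""
    problems = []
    if " " in name:
        problems.append("contains spaces")
    # Ignore spaces here so they are not reported twice.
    if not SAFE_NAME.match(name.replace(" ", "_")):
        problems.append("contains special characters")
    return problems


for candidate in ["Raw_Data_2022.csv", "raw data!.csv"]:
    print(candidate, filename_problems(candidate))
```

An empty list means the name passed both checks. Case consistency is easier to judge across a whole folder than for one file, so it is left out of this sketch.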

Clean Your Data Columns

  • One Data Type per Column: Ensure each column contains only one type of data (e.g., all numbers, all text, or all dates).
  • Separate Units from Data: Put the unit in the column header (e.g., Depth_Meters) and store the values as plain numbers (e.g., 10, not 10m).
  • Check for Uniqueness: If you have a UNIQUE_ID column, verify that every ID is truly unique.
  • Check for Typos: For text columns, create a sorted list of all unique values to easily spot and correct misspellings or inconsistent entries (e.g., “USA” vs “U.S.A.”).
  • Define Nulls: Be clear about what empty cells mean. Define in your data dictionary whether a blank cell means 0, NA, or something else.
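Several of the column checks above are easy to script. The sketch below assumes the data has already been read into a list of dictionaries (for example with `csv.DictReader`). The helper names and the `Country` column are hypothetical, while `UNIQUE_ID` and `Depth_Meters` are taken from the examples above.

```python
from collections import Counter


def duplicate_ids(rows, id_column="UNIQUE_ID"):
    """Return the ID values that appear more than once."""
    counts = Counter(row[id_column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def sorted_unique_values(rows, column):
    """Return a sorted list of distinct values, handy for spotting typos."""
    return sorted({row[column] for row in rows})


def non_numeric_values(rows, column):
    """Return values that cannot be read as numbers (e.g. '10m')."""
    bad = []
    for row in rows:
        try:
            float(row[column])
        except ValueError:
            bad.append(row[column])
    return bad


# Hypothetical sample rows, as csv.DictReader would produce them.
rows = [
    {"UNIQUE_ID": "A1", "Country": "USA", "Depth_Meters": "10"},
    {"UNIQUE_ID": "A2", "Country": "U.S.A.", "Depth_Meters": "10m"},
    {"UNIQUE_ID": "A2", "Country": "USA", "Depth_Meters": "12.5"},
]

print(duplicate_ids(rows))                     # repeated IDs
print(sorted_unique_values(rows, "Country"))   # inconsistent spellings
print(non_numeric_values(rows, "Depth_Meters"))
```

Note that blank cells are also reported by `non_numeric_values`, since an empty string is not a number. Whether that is an error depends on how nulls are defined in your data dictionary.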

Stage 2: Perform Data Integrity & QC Checks

Once the data is organized, perform these specific quality control checks on the data itself.

Verify Standard Metadata Fields

Ensure all standard descriptive fields are included for every record to describe the “where, when, who, and what” of the data.

Required metadata fields:
  • Mission
  • Region/Island
  • Site
  • Latitude/Longitude
  • Depth
  • Date (preferred format `YYYY-MM-DD`), plus time if relevant
  • Diver
  • Method/Instrument type
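A quick way to confirm that the fields listed above are present is to compare them against the file's header row. This is only a sketch: the exact column names in `REQUIRED_FIELDS` are assumptions, so substitute the names your dataset actually uses. The date check accepts only the preferred `YYYY-MM-DD` format.

```python
import re
from datetime import datetime

# Assumption: hypothetical column names standing in for the listed fields.
REQUIRED_FIELDS = [
    "MISSION", "REGION", "SITE", "LATITUDE", "LONGITUDE",
    "DEPTH", "DATE", "DIVER", "METHOD",
]

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def missing_fields(header):
    """Return the required fields that are absent from a header row."""
    return [field for field in REQUIRED_FIELDS if field not in header]


def is_iso_date(text):
    """Return True only for a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(text):
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    return True


print(missing_fields(["MISSION", "SITE", "DATE"]))
print(is_iso_date("2022-07-04"), is_iso_date("07/04/2022"))
```

The pattern check comes first because `strptime` on its own also accepts unpadded values such as `2022-7-4`.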

Check for Accuracy and Consistency

  • Check for Unexpected Missing Values: Look for empty cells in columns where data should always be present, such as Latitude/Longitude, Site, Depth, etc.
  • Check the Range of Values: Look for outliers. Does the range of values make sense for the location and metric? (e.g., is a depth greater than 30 m or a fish length of 1000 cm plausible?)
  • Check for Consistency: Do all SITE_IDs follow the same format? Are there any codes used that don’t apply to the dataset?
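The three checks above (missing values, value ranges, and format consistency) can be sketched as small functions that return the row numbers needing review. The column names, the 0 to 30 m depth range, and the `SITE_ID` pattern below are illustrative assumptions, not rules from any particular survey protocol.

```python
import re


def rows_missing_value(rows, column):
    """Return 1-based row numbers where the column is blank or absent."""
    return [
        i for i, row in enumerate(rows, start=1)
        if not (row.get(column) or "").strip()
    ]


def rows_out_of_range(rows, column, low, high):
    """Return 1-based row numbers whose value is non-numeric or outside [low, high]."""
    flagged = []
    for i, row in enumerate(rows, start=1):
        text = (row.get(column) or "").strip()
        if not text:
            continue  # blanks are reported by rows_missing_value
        try:
            value = float(text)
        except ValueError:
            flagged.append(i)
            continue
        if not (low <= value <= high):
            flagged.append(i)
    return flagged


def rows_bad_format(rows, column, pattern):
    """Return 1-based row numbers whose value does not fully match the pattern."""
    regex = re.compile(pattern)
    return [
        i for i, row in enumerate(rows, start=1)
        if not regex.fullmatch(row.get(column) or "")
    ]


# Hypothetical sample rows.
rows = [
    {"SITE_ID": "OAH-0012", "DEPTH": "12.5"},
    {"SITE_ID": "oah12", "DEPTH": "45"},
    {"SITE_ID": "OAH-0013", "DEPTH": ""},
]

print(rows_missing_value(rows, "DEPTH"))
print(rows_out_of_range(rows, "DEPTH", 0, 30))
print(rows_bad_format(rows, "SITE_ID", r"[A-Z]{3}-\d{4}"))
```

Flagged rows are candidates for review rather than automatic errors: a value outside the expected range may still be correct.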

Verify Method-Specific Rules

Ensure the data follows all rules required by the scientific collection method.

Examples of method rules:
  • Required Value: e.g., `RECENT DEAD %` must be between 0 and 100.
  • If/Then Logic: e.g., If `CONDITION` is "Bleaching," then the `SEVERITY` column must have a value.
  • Formula Check: e.g., `OLD_DEAD` + `RECENT_DEAD` must be ≤ 100%.
  • Constraint: e.g., `LENGTH` must be ≥ 5 cm for adult coral surveys.
  • Exception: e.g., If `LENGTH` is < 5 cm, is the colony type listed as a `FRAGMENT`?
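Rules like these can be written as one function that returns the problems found in a single record. The sketch below encodes the example rules above. It assumes the columns are named `RECENT_DEAD`, `OLD_DEAD`, `CONDITION`, `SEVERITY`, and `LENGTH`, and it introduces a hypothetical `COLONY_TYPE` column for the fragment exception. Real method rules should come from the survey protocol itself.

```python
def to_number(text):
    """Convert text to float, or return None if it is blank or not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def rule_violations(row):
    """Return a list of method-rule problems for one record (a dict)."""
    problems = []
    recent = to_number(row.get("RECENT_DEAD"))
    old = to_number(row.get("OLD_DEAD"))
    length = to_number(row.get("LENGTH"))

    # Required value: a percentage between 0 and 100.
    if recent is None or not 0 <= recent <= 100:
        problems.append("RECENT_DEAD must be between 0 and 100")
    # Formula check: the two percentages cannot exceed 100 together.
    if recent is not None and old is not None and old + recent > 100:
        problems.append("OLD_DEAD + RECENT_DEAD exceeds 100")
    # If/then logic: bleaching records need a severity.
    if row.get("CONDITION") == "Bleaching" and not (row.get("SEVERITY") or "").strip():
        problems.append("SEVERITY is required when CONDITION is Bleaching")
    # Constraint with exception: small colonies must be marked as fragments.
    if length is not None and length < 5 and row.get("COLONY_TYPE") != "FRAGMENT":
        problems.append("LENGTH under 5 cm but colony is not a FRAGMENT")
    return problems


# Hypothetical record that breaks three of the rules.
record = {
    "RECENT_DEAD": "60", "OLD_DEAD": "50", "CONDITION": "Bleaching",
    "SEVERITY": "", "LENGTH": "3", "COLONY_TYPE": "COLONY",
}
for problem in rule_violations(record):
    print(problem)
```

Returning every problem, rather than stopping at the first one, lets each record be corrected in a single pass.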

Next Step: Register Your Data in InPort


Questions? Reach out to the Data Services Team!