Abstract. Preparing data for medical research can be challenging, detail-oriented, and time-consuming. Transcription errors, missing or nonsensical data, and records not applicable to the study population may hamper progress and, if unaddressed, can lead to erroneous conclusions. In addition, study data may be housed in multiple disparate databases and in complex formats, and merging methods may be inadequate for obtaining temporally synchronized data elements. We created a comprehensive database to explore the general hypothesis that environmental and occupational factors influence health outcomes and risk-taking behavior among active-duty Air Force personnel. Several databases containing demographics, medical records, health survey responses, and safety incident reports were cleaned, validated, and linked to form a comprehensive relational database. The final step involved removing and transforming personally identifiable information to form a limited database compliant with the Health Insurance Portability and Accountability Act. The initial data consisted of over 62.8 million records containing 221 variables. After processing, approximately 23.9 million clean and valid records with 214 variables remained. With a clean, robust database, future analyses aim to identify high-risk career fields for targeted interventions or to uncover potential protective factors in low-risk career fields.
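The clean, validate, link, and de-identify sequence described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the record layout, the field names (`person_id`, `career_field`, `birth_date`, `incident_date`), the validity rules, and the salted-hash pseudonym are all hypothetical choices made for illustration, and the toy data are invented. Replacing identifiers with a hash and reducing a birth date to a year does not by itself establish compliance with any privacy regulation.

```python
import hashlib
from datetime import date

# Hypothetical toy records; the real source databases are not described at this level.
demographics = [
    {"person_id": "A001", "career_field": "maintenance", "birth_date": date(1985, 3, 14)},
    {"person_id": "A002", "career_field": "medical", "birth_date": None},                # missing value
    {"person_id": "A003", "career_field": "security", "birth_date": date(2090, 1, 1)},   # nonsensical value
]
incidents = [
    {"person_id": "A001", "incident_date": date(2010, 6, 1)},
    {"person_id": "A001", "incident_date": date(1980, 1, 1)},  # precedes birth: temporally inconsistent
    {"person_id": "A999", "incident_date": date(2011, 2, 2)},  # no matching person in the study population
]


def is_valid_person(record, today):
    """Assumed validity rule: birth date must be present and in the past."""
    birth = record["birth_date"]
    return birth is not None and birth < today


def pseudonymize(person_id, salt):
    """Replace a direct identifier with a truncated salted SHA-256 digest (illustrative only)."""
    digest = hashlib.sha256((salt + person_id).encode("utf-8")).hexdigest()
    return digest[:16]


def build_limited_dataset(demographics, incidents, salt, today):
    """Clean demographics, link incidents to them, and strip direct identifiers."""
    # Step 1: keep only demographic records that pass validation, indexed by identifier.
    people = {r["person_id"]: r for r in demographics if is_valid_person(r, today)}

    linked = []
    for incident in incidents:
        # Step 2: link each incident to a valid person; drop records outside the population.
        person = people.get(incident["person_id"])
        if person is None:
            continue
        # Step 3: drop temporally inconsistent records (incident before birth).
        if incident["incident_date"] < person["birth_date"]:
            continue
        # Step 4: emit a record without the original identifier or full birth date.
        linked.append({
            "pseudo_id": pseudonymize(person["person_id"], salt),
            "career_field": person["career_field"],
            "birth_year": person["birth_date"].year,
            "incident_date": incident["incident_date"],
        })
    return linked


result = build_limited_dataset(demographics, incidents, salt="demo-salt", today=date(2020, 1, 1))
for row in result:
    print(row)
```

With these toy inputs, only one of the three incident records survives: the others are dropped because they fail linkage or the temporal check, which mirrors on a small scale how record counts shrink during cleaning.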