1. I2B2 ONTOLOGY CURATION LEVERAGING CLINICAL NOTES
Matteo Gabetta [1,2], Lorenzo Chiudinelli, Natalia Viani, Alberto Zambelli, Mauro Bucalo, Eleonora Sfreddo, Lucia Sacchi, Carlo Tondini, Riccardo Bellazzi
University of Pavia; Biomeris; Hospital Papa Giovanni XXIII of Bergamo
i2b2 has been shown to be an efficient tool for stratifying subgroups of patients based on structured data. Exploiting unstructured data, such as clinical notes and reports, remains an open issue. Many natural language processing (NLP) algorithms have been developed to access these data. In this work we used a general pipeline that can process free clinical text in various domains. The pipeline analyzes Italian textual clinical reports and extracts data from them, with the aim of storing the data in i2b2. Information extraction is based on a domain ontology that is translated into a customized i2b2 ontology. The NLP algorithm has already been used to process clinical notes in the cardiology domain. In this work, a corpus of anatomic pathology reports is given as input to the pipeline. These reports were provided by the Hospital Papa Giovanni XXIII of Bergamo, Italy.
2. An Analysis of Public Phenotype/Genotype Data with Arvados
Kevin Fang [1,3], Abram Connelly [1,2], Sarah Wait Zaranek [1,2], Alexander Wait Zaranek [1,2]
[1] Curoverse, Somerville, MA; [2] Personal Genome Project, Harvard Medical School, Boston, MA; [3] Commonwealth School, Boston, MA
It can be difficult for a researcher, and especially for a student, to gain the credentials needed to analyze sensitive data. Furthermore, genomic data is potentially identifiable, so many individuals who have their genome sequenced do not wish to make it available to bioinformaticians, which makes such data difficult to acquire. The Harvard Personal Genome Project (PGP) and the 1000 Genomes Project curate the genomes of volunteers who are willing to share them publicly with biomedical researchers for analysis to aid the future of biology and genetics.
Curoverse develops an open-source data analysis tool called Arvados, which allows one to run complex analyses on large datasets. In addition, a team at the Università Degli Studi Di Padova in Italy developed a tool titled “BOOGIE” [BOOGIE: Predicting Blood Groups from High Throughput Sequencing Data, Giollo, M. et al.] that analyzes genomes and predicts a blood type, with a claimed accuracy of 94%. The goal of this research was to use Arvados to run BOOGIE on 606 genomes from the Personal Genome Project and the 1000 Genomes Project, compare the predicted blood types with the race/ethnicity data provided in the accompanying surveys, and see whether the result matched readily available race vs. blood type information, such as that on Wikipedia. Arvados ran a pipeline incorporating BOOGIE, packaged as a Docker image, to analyze the datasets. In under 10 hours, the tool ran BOOGIE on all 606 available genomes: 173 genomes from the Personal Genome Project and 433 genomes from the 1000 Genomes Project. After downloading the results from Arvados and comparing them with the survey data provided by the Personal Genome Project using a Python script, BOOGIE was found to be 86.67% accurate, having correctly predicted 39 of 45 blood types from the PGP. In the end, each person was matched with a blood type and a race, and these pairs were counted to give the number of people with each blood type for each race featured in the survey data.
The Personal Genome Project and the 1000 Genomes Project make genomic data accessible and easily available for everyone to use. The Arvados Project records the work performed on that data and simplifies the process through Docker images and pipelines. In addition, the Arvados Project supports analysis of massive datasets, from gigabytes to petabytes of information, aiming to create an efficient, common solution for data management across many platforms.
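The comparison step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the function names and the dictionary layout (participant ID mapped to blood type or race) are assumptions made for the example.

```python
from collections import Counter

def accuracy(predicted, reported):
    """Fraction of participants whose predicted blood type equals the
    self-reported one. Both arguments map participant ID -> blood type
    (a hypothetical layout); only IDs present in both are compared."""
    shared = [pid for pid in predicted if pid in reported]
    if not shared:
        raise ValueError("no participants in common")
    correct = sum(1 for pid in shared if predicted[pid] == reported[pid])
    return correct / len(shared)

def count_by_group(predicted, group):
    """Count (group, blood type) pairs, e.g. group = self-reported race."""
    return Counter((group[pid], predicted[pid]) for pid in predicted if pid in group)

# The figure reported in the abstract: 39 correct predictions out of 45.
print(f"{39 / 45:.2%}")  # prints 86.67%
```

Restricting the comparison to participants who appear in both sources matters here, because only 45 PGP participants had a self-reported blood type to check against.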
3. i2FHIRb2 – FHIR in i2b2
Harold Solbrig, Guoqian Jiang, PhD, Na Hong, PhD
FHIR® – Fast Healthcare Interoperability Resources (hl7.org/fhir) – is a next-generation standards framework created by HL7. FHIR solutions are built from a set of modular components called “Resources” that can easily be assembled into working systems that solve real-world clinical and administrative problems. The latest version of FHIR (STU3) includes a proposed specification for how FHIR information should be represented in RDF, and includes the FHIR “ontology”, a standard set of identifiers for FHIR resources and their components. This poster describes a proposal for transforming the FHIR ontology into an i2b2 equivalent and the implications for the representation of FHIR resource data as i2b2 CRC tables.
4. Visualizing healthcare system dynamics in biomedical big data
Griffin Weber, Nick Benik, Daniel Bolin, Katy Borner, Nick Brown, Scott Christopher Hutcheson
Harvard Medical School, Boston, MA 02115 USA; Indiana University, Bloomington, IN 47405 USA
Electronic health records (EHR) and administrative claims databases are transforming medical research by giving investigators access to data on millions of individual patients. Compared to manual paper chart review, these databases reduce the time and cost of clinical studies by orders of magnitude, enabling types of research that were unfeasible in the past. However, EHRs and claims are observational databases, which reflect not only the health of the patients, but also their interactions with the healthcare system. For example, the date associated with a code for diabetes is when the physician made the diagnosis, not when the patient first developed the disease. These observations are influenced by the dynamics of the healthcare system–when physicians schedule visits with their patients, which tests physicians decide to order, what codes need to be recorded to get reimbursed for procedures, etc. By ignoring this dimension of the data or naively treating it as noise, investigators risk both misinterpreting the actual patient pathophysiology and losing valuable information content.
METHODS AND RESULTS: Using 8 years of data for more than 600,000 patients at two hospitals, we show how information about patient health can be obtained through large-scale analysis and visualization of “healthcare system dynamics” (HSD). A simple example illustrates how it works. Consider the laboratory test white blood cell (WBC) count. The value of the result is an observation of the patient’s pathophysiology. The time of day when the test is ordered is an observation of HSD–a physician decided a WBC count was needed. It is counterintuitive, but we have found in several hospital EHRs that patients with an abnormal WBC value at 4pm are actually more likely to be alive after 3 years than patients with a normal WBC value at 4am. This is because physicians only order blood counts at 4am when they believe a patient is very sick. Surprisingly, for most laboratory tests (233 of 272 common tests), the temporal HSD dimension of the observation, which reflects physicians’ experience and intuition, is actually more predictive of survival than the value of the test result. As part of this project, we have developed several HSD visualization plugins for the open source clinical research software platform i2b2 (Informatics for Integrating Biology and the Bedside). These visualizations introduce investigators to the concept of HSD, first showing them a traditional graph based only on patient pathophysiology (e.g., age vs survival), and then demonstrating how adding an HSD dimension can separate patients into subpopulations with very different outcomes.
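The WBC example amounts to stratifying outcomes by an HSD variable (hour of the order) as well as by the test value. A minimal sketch of that stratification follows; the record layout and function name are hypothetical, and this is not the code behind the i2b2 plugins.

```python
from collections import defaultdict

def survival_by_stratum(records):
    """Compute the fraction of patients alive at 3 years per stratum.

    records: iterable of (order_hour, is_abnormal, alive_at_3y) tuples,
    a hypothetical layout. The stratum key combines the HSD dimension
    (hour the test was ordered) with the pathophysiology dimension
    (whether the value was abnormal)."""
    totals = defaultdict(lambda: [0, 0])  # stratum -> [alive, total]
    for hour, abnormal, alive in records:
        key = (hour, abnormal)
        totals[key][0] += int(alive)
        totals[key][1] += 1
    return {key: alive / total for key, (alive, total) in totals.items()}
```

Comparing the entry for (16, True) with the entry for (4, False) in the returned dictionary corresponds to the "abnormal at 4pm vs. normal at 4am" comparison in the text.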
CONCLUSIONS: In observational databases, each data point combines patient pathophysiology and HSD, and both dimensions provide valuable information for research. In this poster we present the software tools and data visualizations we developed to explain HSD to investigators and to help them incorporate it into their research.
5. Data curation framework for tranSMART
Ward Weistra, Jochem Bijlard, Stefan Payralbe
Preparing data for loading into tranSMART has always required extensive knowledge of tranSMART-specific formats and rules. With the Python- and Jupyter-based tranSMART Toolkit (tmtk; http://tmtk.readthedocs.io) and the customer-facing Arborist collaboration tool, TraIT/BBMRI and The Hyve have made data ingestion, transformation, collaborative structuring and loading a pleasant experience. These tools let you focus on modelling the data rather than worrying about tranSMART-specific technical details.
6. Procedures to Ensure Baseline Data Quality for a SHRINE Network
Philip Trevvett, M.A., Isha Johri, B.S., Bhanu Bahl, Ph.D.
Harvard Medical School
SHRINE (Shared Health Research Information Network) is a networked application that enables users to query for aggregated counts of patient data stored in i2b2 (Informatics for Integrating Biology and the Bedside) repositories at multiple hospitals. While the SHRINE ontology and site mappings enable federated query execution, local i2b2 instances may still vary in the data available. The date range of available data may differ between sites, as may the data update frequency, and the local organization of demographics data may also affect results. In addition, local site changes may affect an institution’s data. The HMS SHRINE network has implemented semi-annual data update procedures, including data update validation, as well as weekly network query tests to help ensure baseline quality.
HMS SHRINE has defined date ranges for networked i2b2 sites and a regular network-wide data update cycle. Because there is a lag in encoding patient data, HMS SHRINE instituted a 6-month buffer between the current date and the most recent data available. Updates at HMS SHRINE are coordinated in July and January, and bring site data up to 6 months before the update. Updates are validated through test queries covering all data elements (demographics, diagnoses, medications, and lab tests), with review and comparison of results: total results from before and after the data update, and results for specified date ranges. Total patient sets are expected to increase. Patient sets limited to previous dates should remain unchanged. New dates should show data resulting from the update. In addition, weekly queries are run on the SHRINE network using arbitrarily selected terms, varying the data elements queried. This ensures that testing is not limited to terms queried in data updates or discoverable through expected usage patterns.
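The three validation expectations above (totals increase, prior date ranges are unchanged, new dates contain data) can be expressed as simple checks on query counts. The sketch below is illustrative only; the dictionary keys and the function name are assumptions, not part of SHRINE.

```python
def validate_update(before, after):
    """Compare patient counts from test queries run before and after a
    data update. Each argument is a dict with hypothetical keys:
      'total'       - count with no date constraint
      'prior_range' - count limited to dates loaded before the update
      'new_range'   - count limited to dates added by the update
    Returns a list of human-readable problems (empty if all checks pass)."""
    problems = []
    if after["total"] <= before["total"]:
        problems.append("total patient count did not increase")
    if after["prior_range"] != before["prior_range"]:
        problems.append("count for previously loaded dates changed")
    if after["new_range"] <= 0:
        problems.append("no patients found in newly loaded dates")
    return problems
```

Returning a list of problems rather than a single pass/fail flag lets a reviewer see every failed expectation for a site in one pass.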
Results and Discussion
By tracking and comparing results with each data update, and running weekly queries on the network, the SHRINE team centrally monitors the general status and data quality across all sites. Weekly query tests have led to the discovery of several issues, including a number of false mappings caused by a local site’s data migration, which had changed a small number of local IDs. The data update process has helped maintain consistent representation of data in the HMS SHRINE network, and has highlighted site inconsistencies due to differing data models: demographics data stored in the patient_dimension table cannot be limited by date range in queries involving only those concepts, whereas demographics data stored in the observation_fact table can be queried with date ranges. Moving forward, it is critical that SHRINE clearly shows users what data is available and positions them to understand their query results. A next step will be to identify how SHRINE can highlight data representation differences through the SHRINE ontology and webclient.
7. Protecting Patient Identity in SHRINE Without Lockout
David F. Walend, Benjamin D. Carmen, Marc Ciriello, Isha Johri, William Simons, Bhanu Bahl, Ty P. Coghlan
Harvard Clinical and Translational Science Center
Remote and broad access to distributed i2b2 patient databases via a SHRINE network raises concerns about enforcement of proper use, especially re-identification of a specific patient. We examined how researchers use SHRINE, SHRINE 1.21’s data protection mechanisms, and SHRINE’s distributed trust model. We decided to stop relying on automated lockout at the adapter, to increase the Gaussian noise SHRINE injects in its results, and to enhance the capabilities of the existing Data Steward Application (DSA) in SHRINE 1.22 to support audits of terms used in queries.
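As an illustration of the noise-injection idea, the sketch below adds zero-mean Gaussian noise to an aggregate count. It is not SHRINE's implementation: the function name, the `sigma` parameter, and the clamping at zero are assumptions made for the example.

```python
import random

def obfuscate_count(true_count, sigma, rng=None):
    """Return true_count plus zero-mean Gaussian noise, rounded to an
    integer and clamped at zero. sigma (the standard deviation) is a
    hypothetical tuning parameter: a larger value gives stronger
    protection against re-identification but a less precise count."""
    rng = rng or random.Random()
    noisy = true_count + rng.gauss(0.0, sigma)
    return max(0, round(noisy))
```

Increasing `sigma` is the code-level counterpart of the decision to raise the amount of noise instead of relying on lockout: repeated queries yield different counts, so small differences between results carry less information about any one patient.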
8. i2b2 on OMOP
Lori Phillips, Chris Herrick, Shawn Murphy
The i2b2 data model consists of a central fact table (observation_fact) surrounded by multiple dimension tables (star schema). The Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) has, rather than a central fact table, a collection of fact tables distinguished by domain: procedure, condition, drug, measurement, observation, etc. In this exercise we investigate how to modify the i2b2 CRC cell to query against a collection of fact tables as dictated by vocabulary domain.
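The routing idea can be sketched as a lookup from vocabulary domain to fact table. This is an illustration, not the CRC cell's actual logic: the table names follow the OMOP CDM, while the function name, the input layout, and the fallback to observation_fact are assumptions.

```python
from collections import defaultdict

# OMOP CDM clinical event tables keyed by vocabulary domain.
DOMAIN_TO_FACT_TABLE = {
    "Procedure": "procedure_occurrence",
    "Condition": "condition_occurrence",
    "Drug": "drug_exposure",
    "Measurement": "measurement",
    "Observation": "observation",
}

def group_by_fact_table(concepts, default="observation_fact"):
    """Group query concepts by the fact table they should be run against.

    concepts: iterable of (concept_code, domain) pairs (hypothetical
    layout). Domains without a mapping fall back to `default`."""
    grouped = defaultdict(list)
    for code, domain in concepts:
        grouped[DOMAIN_TO_FACT_TABLE.get(domain, default)].append(code)
    return dict(grouped)
```

A query mixing diagnoses and medications would then be split into one sub-query per returned table, with the resulting patient sets combined afterwards.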
9. Role of Basic Summary Statistics in Data Quality Assessment of Real-World Data
Laura Evans, Matvey B. Palchuk, MD, MS
TriNetX Live is a platform that provides cohort identification, site selection, patient recruitment and other analytics services based on a federated database of clinical data provided and continually refreshed by member healthcare organizations. The data that underpins all analytic functions originates primarily in electronic health record systems and other source systems. For use by TriNetX, the data is extracted from the primary source systems and undergoes transformation, clean-up, de-duplication, de-identification, optional obfuscation, semantic mapping, etc. Any consumer faces the question of data quality, that is, whether the data is “fit for purpose” given the requirements of secondary uses. Data quality is an important issue that must be addressed for every combination of a dataset and the purpose for which it is intended to be used. We will show that basic summary statistics demonstrate overall data consistency across multiple institutions and can clearly identify outliers.
10. HERON (i2b2)
Maren Wennberg, Tamara M. McMahon, Li Huang, Lemuel R. Waitman
Center for Medical Informatics and Enterprise Analytics, Department of Internal Medicine, University of Kansas Medical Center, Kansas City, KS
HERON (i2b2) training helps researchers accurately search for cohorts in i2b2. Some training approaches at The University of Kansas Medical Center include in-person trainings, online trainings, training within HERON (i2b2), and training on data requests.
Training resources are found at http://frontiersresearch.org/frontiers/HERON-introduction
11. Integration of Knowledge Extracted from Clinical Notes with Patient Reported Outcomes and Genetic Reports for Advancing Research into Phelan-McDermid Syndrome
Cartik Kothari, Maxime Wack, Claire-Hassan Khodja, Sean Finan, Guergana Savova, Megan O’Boyle [3], Geraldine Bliss, Andria Cornell, Elizabeth Horn, Rebecca Davis, Jacquelyn Jacobs, Isaac Kohane, and Paul Avillach [1,2]
[1] Department of Biomedical Informatics, Harvard Medical School, Boston MA; [2] Boston Children’s Hospital, Boston MA; [3] Phelan-McDermid Syndrome Foundation, Venice FL
Phelan-McDermid Syndrome (PMS), also known as 22q13 Syndrome, is a rare genetic disease (~1400 diagnoses worldwide) primarily diagnosed in children. PMS is caused by terminal deletions in the long arm of chromosome 22, leading to loss of function of the SHANK3 gene. Deletions range in size from a few kilobases to several megabases, affecting between 1 and 140 genes. Diagnosis of PMS is only possible with genetic testing and is often delayed. Knowledge extracted from clinical notes and Patient Reported Outcomes (PRO), two underused sources of detailed information about patient conditions, could significantly augment our understanding of PMS.
12. Programmatically Generating c_metadataxml Metadata Fields
Essex Management, LLC
The poster will outline the programmatic generation of c_metadataxml entries for i2b2 metadata. The approach combines LOINC concept information with analysis of the data available in the source system to generate a c_metadataxml entry tailored to the local site. The tool may be run in batch mode to process codes en masse.
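A rough sketch of the idea follows, assuming a simplified subset of the ValueMetadata elements found in i2b2 c_metadataxml entries (TestID, TestName, DataType, UnitValues/NormalUnits). The function names and the inference rules are hypothetical and are not the poster's tool; a real entry carries more fields and should be checked against the local i2b2 version.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def infer_data_type(values):
    """Guess a DataType from values observed in the source system
    (simplified assumption: all-numeric -> Float, otherwise Enum)."""
    if not values:
        return "String"
    try:
        for value in values:
            float(value)
    except (TypeError, ValueError):
        return "Enum"
    return "Float"

def build_value_metadata(loinc_code, test_name, values, units):
    """Build a simplified ValueMetadata XML string for one LOINC code.
    The most frequently observed local unit is used as NormalUnits."""
    root = ET.Element("ValueMetadata")
    ET.SubElement(root, "TestID").text = loinc_code
    ET.SubElement(root, "TestName").text = test_name
    ET.SubElement(root, "DataType").text = infer_data_type(values)
    unit_values = ET.SubElement(root, "UnitValues")
    if units:
        most_common_unit = Counter(units).most_common(1)[0][0]
        ET.SubElement(unit_values, "NormalUnits").text = most_common_unit
    return ET.tostring(root, encoding="unicode")
```

Batch mode then reduces to calling `build_value_metadata` once per code and writing each result to the c_metadataxml column of the matching ontology row.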