Archive for the ‘Data model’ Tag

Why are hierarchical databases like MUMPS still popular in Healthcare?

For my class on Healthcare IT Infrastructure at GGU (ITM 351) I had to explain to my students why hierarchical databases are not only still popular, but also practical, and show the limitations at the same time. It is amazing how little material there is comparing hierarchical and relational models – maybe because outside of Healthcare M is not used much and in computer science students only learn about RDBMS. But Healthcare is different. From Meditech to Epic, many EMR systems still use MUMPS or M. And actually for a good reason. Here is an excerpt from my GGU ITM 351 class:

Hierarchical database models continue to play a very important role in Healthcare IT, we need to review this aspect a lit tle bit more. First, let me explain what a normalized RDBMS is. Based on Codd (1971), the pioneer of modern Database technology, a database in normalized when it is at least in third normal form. Third normal form is required to prevent update and data inconsistency issues.

In figure 1 you find a very simple normalized relational database model:

Figure 1 Relational Model

In this example we have three entities (Patient, Encounter, and Procedure) to reflect a core concept of medical records. Every time a patient enters a Hospital, a new Encounter is created. One Patient can have multiple encounters over a period of time, so this is a one-to-many relationship. Each Encounter will require one or more Procedures, so again we have a one-to-many relationship, and each of the procedures can require one or many orders – all parent-child like relationships, all one to many and hence perfect for a hierarchical model. But in a relational model,  it is necessary to normalize to the 3rd normal form (NF), and in order to do this, the data architect needs to satisfy the requirements of the first normal form (1NF) first:

  • No repeating groups within or across columns

That means if a patient can have multiple phone numbers or multiple encounters, we cannot store these in the same table. Multiple Phone Numbers in one column would create problems when a program needs to update a single Phone number, because a single query would return the content of the field, which would be multiple phone numbers, not one in particular. And having multiple columns of phone numbers would violate 1NF, because we could have a lot of empty columns (i.e. if a patient has only one, but we anticipate up to three), or, if a patient has more phone numbers than we anticipate, we wouldn’t have enough. So that’s why we create a new table called “Encounter”, or “Phone Number”. We use MRN, the primary key of our Patient table, as a foreign key in the Encounter table (that establishes the link between the two tables), and we create an encounter ID. Encounter ID and MRN together identify Procedures, and we can add as many procedures as we like for each encounter, and as many encounters as we like, without ever violating 1NF.

Second normal form (2NF) requires:

  • All non-key attributes must be fully dependent on the key

The encounter table meets this requirement, because the table has three attributes (MRN, Encounter-ID, and Procedure Code). The non-key attribute “Procedure Code” is fully dependent on the composite key. Why do we need a composite key? One patient can have multiple Encounters, so I could have the same patient (= same MRN) with multiple encounters, and in each of those encounters a different set of procedures.

Third normal form is achieved if

  • No functional dependencies of non-key attributes

Orders are dependent and specific to each encounter, so we could not have orders and encounters in one table with MRN and Encounter ID as key. So the above simple diagram is a database schema in 3NF.

Now let us look at the same data organized hierarchical:

Figure 2 Hierarchical Data Model

In a hierarchical data model we have one restriction, which is that we can only model one-to-many relationships. But in the previous normalization exercise we discovered that that applies to all relationships. One patient -> many encounters. One encounter -> many orders. And when you think about this, medical records are always organized like this – they are logically hierarchical tree structures, which lend themselves to the database models like MUMPS, which not coincidentally was developed in Healthcare. The other advantage of hierarchical databases is that they do not have to be in 1NF. I could list multiple encounters and multiple phone numbers in the Patient table, and then link from there to the child, so for example encounter 1 links to a table with details about encounter 1, which contains many orders etc.

Because simple operations, like looking up a phone number, require costly table joins, some database designers purposely violate 3nf and design some redundancies in their database schema for the benefit of efficiency.

What is more efficient – RDBMS or M?

The Codd model of an RDBMS is very elegant, and a great implementation of mathematical set theory, which allows us to relate data in ways that were not necessarily predefined. If I would like to know all patients in the Hospital that have had an Appendectomy, I could simple formulate a query such as:

Select MRN from Procedure where Procedure Code = “47.19”

And if I wanted to know the Name of patients that had had appendectomies, my SQL query would be:

Select FNAME, LNAME from Patient where MRN =( Select MRN from Procedure where Procedure Code = 47.19”)

This is called a JOINT operation, in which two tables are joint on a common field (in this case MRN), which is great for all kinds of queries. It is the reason why RDBMS are so popular. But at the same time, every time I want to know a simple thing like which phone numbers a patient has, I have to use a joint also:

Select FNAME, LNAME from Patient AND Phone Number from Phone_Numbers where Patient MRN = Phone_Number MRN

Now, joints are computationally very costly, because in a joint first all data elements are brought together into an intermediary table, and then the joint condition or constraint is applied (in this case matching MRNs). If a join requires combining a database table with 500,000 rows with another table of 500,000 rows, the intermediary table will have 500,000 x 500,000 entries before the constraints are applied. In contemporary information systems that is not a big deal, but ten, twenty or forty years ago, it certainly was. And still today, RDBMS response time can be an issue if the database schema has not been designed with efficiency in mind. So while a relational model is very elegant, and allows all kinds of queries, a hierarchical model is very efficient BECAUSE of redundancies.

Downside of hierarchical models

It is hard to believe, but it is difficult for a hospital using a hierarchical database (and that is the majority of Hospitals in the US) to answer a simple question like “how many patients do you currently have with H1N1 diagnosis”. Hierarchical databases are built to structure data hierarchical, so if I Look up a patient X, I can most certainly find out if he has a diagnosis of “y”. But if I want to know all patients that have a diagnosis of “y”, which would be an unusual query for a regular Hospital process, but not so unusual for public health purposes, I would have to look up each and every patient tree to retrieve that information.

In interoperability the standard defined by the Department of Health and Human Services is the Continuity of Care Document (CCD) or Continuity of Care Record (CCR). Both CCD and CCR are XML schemas, which are built hierarchical. But XML uses metadata tags, so the query of diagnosis Y is a little easier, because I can look for a particular metadata tag, and select only entries where the metadata tag is non-empty. Still, the system would have to parse all records and then count the ones that match.

Another common way to overcome the disadvantage of a hierarchical model for non-standard queries is to load the data in data warehouses or data marts. The latter approach becomes very common for quality reporting and public health requirements, which are part of the ARRA/HITECH Meaningful use guidelines, and represent obviously a challenge for all the EMR systems based on MUMPS.

Codd, E.F (1971), “Further Normalization of the Data Base Relational Model.” (Presented at Courant Computer Science Symposia Series 6, “Data Base Systems,” New York City, May 24th-25th, 1971.) IBM Research Report RJ909 (August 31st, 1971). Republished in Randall J. Rustin (ed.), Data Base Systems: Courant Computer Science Symposia Series 6. Prentice-Hall, 1972

Advertisements