Why are hierarchical databases like MUMPS still popular in Healthcare?

For my class on Healthcare IT Infrastructure at GGU (ITM 351) I had to explain to my students why hierarchical databases are not only still popular, but also practical, and show the limitations at the same time. It is amazing how little material there is comparing hierarchical and relational models – maybe because outside of Healthcare M is not used much and in computer science students only learn about RDBMS. But Healthcare is different. From Meditech to Epic, many EMR systems still use MUMPS or M. And actually for a good reason. Here is an excerpt from my GGU ITM 351 class:

Hierarchical database models continue to play a very important role in Healthcare IT, we need to review this aspect a lit tle bit more. First, let me explain what a normalized RDBMS is. Based on Codd (1971), the pioneer of modern Database technology, a database in normalized when it is at least in third normal form. Third normal form is required to prevent update and data inconsistency issues.

In figure 1 you find a very simple normalized relational database model:

Figure 1 Relational Model

In this example we have three entities (Patient, Encounter, and Procedure) to reflect a core concept of medical records. Every time a patient enters a Hospital, a new Encounter is created. One Patient can have multiple encounters over a period of time, so this is a one-to-many relationship. Each Encounter will require one or more Procedures, so again we have a one-to-many relationship, and each of the procedures can require one or many orders – all parent-child like relationships, all one to many and hence perfect for a hierarchical model. But in a relational model,  it is necessary to normalize to the 3rd normal form (NF), and in order to do this, the data architect needs to satisfy the requirements of the first normal form (1NF) first:

  • No repeating groups within or across columns

That means if a patient can have multiple phone numbers or multiple encounters, we cannot store these in the same table. Multiple Phone Numbers in one column would create problems when a program needs to update a single Phone number, because a single query would return the content of the field, which would be multiple phone numbers, not one in particular. And having multiple columns of phone numbers would violate 1NF, because we could have a lot of empty columns (i.e. if a patient has only one, but we anticipate up to three), or, if a patient has more phone numbers than we anticipate, we wouldn’t have enough. So that’s why we create a new table called “Encounter”, or “Phone Number”. We use MRN, the primary key of our Patient table, as a foreign key in the Encounter table (that establishes the link between the two tables), and we create an encounter ID. Encounter ID and MRN together identify Procedures, and we can add as many procedures as we like for each encounter, and as many encounters as we like, without ever violating 1NF.

Second normal form (2NF) requires:

  • All non-key attributes must be fully dependent on the key

The encounter table meets this requirement, because the table has three attributes (MRN, Encounter-ID, and Procedure Code). The non-key attribute “Procedure Code” is fully dependent on the composite key. Why do we need a composite key? One patient can have multiple Encounters, so I could have the same patient (= same MRN) with multiple encounters, and in each of those encounters a different set of procedures.

Third normal form is achieved if

  • No functional dependencies of non-key attributes

Orders are dependent and specific to each encounter, so we could not have orders and encounters in one table with MRN and Encounter ID as key. So the above simple diagram is a database schema in 3NF.

Now let us look at the same data organized hierarchical:

Figure 2 Hierarchical Data Model

In a hierarchical data model we have one restriction, which is that we can only model one-to-many relationships. But in the previous normalization exercise we discovered that that applies to all relationships. One patient -> many encounters. One encounter -> many orders. And when you think about this, medical records are always organized like this – they are logically hierarchical tree structures, which lend themselves to the database models like MUMPS, which not coincidentally was developed in Healthcare. The other advantage of hierarchical databases is that they do not have to be in 1NF. I could list multiple encounters and multiple phone numbers in the Patient table, and then link from there to the child, so for example encounter 1 links to a table with details about encounter 1, which contains many orders etc.

Because simple operations, like looking up a phone number, require costly table joins, some database designers purposely violate 3nf and design some redundancies in their database schema for the benefit of efficiency.

What is more efficient – RDBMS or M?

The Codd model of an RDBMS is very elegant, and a great implementation of mathematical set theory, which allows us to relate data in ways that were not necessarily predefined. If I would like to know all patients in the Hospital that have had an Appendectomy, I could simple formulate a query such as:

Select MRN from Procedure where Procedure Code = “47.19”

And if I wanted to know the Name of patients that had had appendectomies, my SQL query would be:

Select FNAME, LNAME from Patient where MRN =( Select MRN from Procedure where Procedure Code = 47.19”)

This is called a JOINT operation, in which two tables are joint on a common field (in this case MRN), which is great for all kinds of queries. It is the reason why RDBMS are so popular. But at the same time, every time I want to know a simple thing like which phone numbers a patient has, I have to use a joint also:

Select FNAME, LNAME from Patient AND Phone Number from Phone_Numbers where Patient MRN = Phone_Number MRN

Now, joints are computationally very costly, because in a joint first all data elements are brought together into an intermediary table, and then the joint condition or constraint is applied (in this case matching MRNs). If a join requires combining a database table with 500,000 rows with another table of 500,000 rows, the intermediary table will have 500,000 x 500,000 entries before the constraints are applied. In contemporary information systems that is not a big deal, but ten, twenty or forty years ago, it certainly was. And still today, RDBMS response time can be an issue if the database schema has not been designed with efficiency in mind. So while a relational model is very elegant, and allows all kinds of queries, a hierarchical model is very efficient BECAUSE of redundancies.

Downside of hierarchical models

It is hard to believe, but it is difficult for a hospital using a hierarchical database (and that is the majority of Hospitals in the US) to answer a simple question like “how many patients do you currently have with H1N1 diagnosis”. Hierarchical databases are built to structure data hierarchical, so if I Look up a patient X, I can most certainly find out if he has a diagnosis of “y”. But if I want to know all patients that have a diagnosis of “y”, which would be an unusual query for a regular Hospital process, but not so unusual for public health purposes, I would have to look up each and every patient tree to retrieve that information.

In interoperability the standard defined by the Department of Health and Human Services is the Continuity of Care Document (CCD) or Continuity of Care Record (CCR). Both CCD and CCR are XML schemas, which are built hierarchical. But XML uses metadata tags, so the query of diagnosis Y is a little easier, because I can look for a particular metadata tag, and select only entries where the metadata tag is non-empty. Still, the system would have to parse all records and then count the ones that match.

Another common way to overcome the disadvantage of a hierarchical model for non-standard queries is to load the data in data warehouses or data marts. The latter approach becomes very common for quality reporting and public health requirements, which are part of the ARRA/HITECH Meaningful use guidelines, and represent obviously a challenge for all the EMR systems based on MUMPS.

Codd, E.F (1971), “Further Normalization of the Data Base Relational Model.” (Presented at Courant Computer Science Symposia Series 6, “Data Base Systems,” New York City, May 24th-25th, 1971.) IBM Research Report RJ909 (August 31st, 1971). Republished in Randall J. Rustin (ed.), Data Base Systems: Courant Computer Science Symposia Series 6. Prentice-Hall, 1972


10 comments so far

  1. Rob on

    You may find this new paper on Globals is a useful addition to your analysis:


    • healthcaresecurity on

      Very much so, very nice paper!

  2. […] Read more. MEDITECH, Recent Items   ARRA, Data model, GGU, Health-IT, Healthcare, M, MUMPS, MUMPS RDBMS Healthcare data models Healthcare IT Infrastructure      UPDATE #2: iPad and MediTech Electronic Medical / Health Records » […]

  3. Caregraf on

    An excellent “in a nutshell”. And I’ve seen exactly what you’re saying: pure hierarchal is great for per-patient queries and rollups (make a CCD) but difficult for cross patient querying.

    One detail to add (if we’re diving down) is that VistA/RPMS, the biggest MUMPS system in healthcare, isn’t all hierarchal. It is for some things. It’s more of a star for others.

    Take a Patient type that would have one entry per patent (see: http://vista.caregraf.org/schema/9000001.html ) Scroll down and see the node types that “hang off” this node – Problem, Visit, Immunization etc. In this case, the data store is star-like. As a result, you can look up “all patients with problem X” (Problem schema: http://vista.caregraf.org/schema/9000011.html. See the fields indexed).

    Now of course, some data is arranged in a hierarchy. For example, pharmacy data ( http://vista.caregraf.org/schema/55.html ) has embeds concrete data. For example, Unit Doses: http://vista.caregraf.org/schema/55_06.html holds in patient medications taken and is a contained type. This makes looking up “all patients who took medication X” difficult.

    So MUMPS systems are a mixed bag. They can be hierarchal or star or a mix of both.

  4. keith on

    Thanks for clarifying why the data I receive from hospitals is in the format it is. I design data warehouses and I was extremely curious as to the flat file layouts.

    I couldn’t believe the original layout. The efforts taken to actually normalize the data going to the dw before creating cubes was surprising.

  5. Keith B HFS on

    Thanks! very nice work

  6. Neil A on

    Don’t sell MUMPS short. While you have to look through every patient visit in your example, performing the join is not that much different from using a cobinations of $Order and $Select statements

    • healthcaresecurity on

      I’m not trying to short sell MUMPS, but just explain what it is good at, and what it is not good at. For hierarchical relationships, its great – better than SQL, and that’s why it’s still around in HC. At the same time – why do you think no EMR can tell you right now how many patients with diagnosis X you currently have in the hospital? Because MUMPS is really not good for that. One approach I see emerging is to use the MUMPS system for production, and then off-load the data into a DWH, where it is easy to create these types of ad-hoc queries. In this context there is a huge movement now to explore no-SQL DWH platforms. Look also for more development akin to Amalga, a SQL data aggregation platform.

      • Neil A on

        Good point. The ability to have MUMPS provide what you are looking for is not a technical obstacle – it can be relatively easy to implement. In the mid-1980’s I worked with a query language that overlaid on a MUMPS based 4GL that provided very near SQL capabilities. With MUMPS alone, you definitely need to get much more involved than you would with SQL.

        On your idea of separating a production environment from a data warehouse, I can’t agree more. It accomplishes a great deal including indirect redundancy.

  7. AmateurScholar on

    Excellent article. Worth noting that datamarts and the like are now also popular even with RDBMS which require special considerations for high performance versus reporting… So, if we need external transformations for reporting anyway….maybe hierarchical core transactions deserve more consideration than they get in mainstream IT.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: