
Establishing Reference Intervals for Clinical Laboratory Test Results
Is There a Better Way?

Alex Katayev MD, Claudiu Balciza, David W. Seccombe MD, PhD
DOI: http://dx.doi.org/10.1309/AJCPN5BMTSF1CDYP; pages 180-186; first published online: 1 February 2010


Reference intervals are essential for clinical laboratory test interpretation and patient care. Methods for estimating them are expensive, difficult to perform, often inaccurate, and nonreproducible. A computerized indirect Hoffmann method was studied for accuracy and reproducibility. The study used data collected retrospectively for 5 analytes, without exclusions or filtering, from a nationwide chain of clinical reference laboratories in the United States. Accuracy was assessed by comparing the reference intervals calculated by the new method with those from published peer-reviewed studies, and reproducibility was assessed by comparing 2 sets of reference intervals derived from 2 different data sets. There was no statistically significant difference between the calculated and published reference intervals or between the 2 sets of intervals derived from different data sets. The computerized Hoffmann method for indirect estimation of reference intervals using stored test results proved to be accurate and reproducible.

Key Words:
  • Reference
  • Interval
  • Estimation
  • Laboratory
  • Test
  • Result
  • Interpretation

It is hard to overestimate the importance of clinical laboratory test results. Nearly 80% of physicians’ medical decisions are based on information provided by laboratory reports.1 A test result by itself is of little value unless it is reported with the appropriate information for its interpretation. Typically, this information is provided in the form of a reference interval (RI) or medical decision limit. An RI as defined by Ceriotti “is an interval that, when applied to the population serviced by the laboratory correctly includes most of the subjects with characteristics similar to the reference group and excludes the others.”2(p115) No RI is completely “right” or “wrong.” The majority of RIs in use today refer to the central 95% of the reference population of subjects. By definition, 5% of all results from “healthy” people will fall outside of the reported RI and, as such, will be flagged as being “abnormal.”

There are many problems associated with the calculation of RI. The latest edition of the Clinical and Laboratory Standards Institute–approved guideline, “Defining, Establishing, and Verifying Reference Intervals in the Clinical Laboratory,” recognizes the difficulties and controversies surrounding the establishment of RIs and the verification process: “…the working group recognizes the reality that, in practice, very few laboratories perform their own reference interval studies,” “…instead of performing a new reference interval study, laboratories and manufacturers refer to studies done many decades ago, when both the methods and the population were very different.”3(p1)

It has been recommended that an RI be established by selecting a statistically sufficient group (a minimum of 120) of healthy reference subjects. However, it is noted in the guideline that “Health is a relative condition lacking a universal definition. Defining what is considered healthy becomes the initial problem in any study….”3(p8) In reality, there will always be some level of uncertainty with a given selection protocol not only because of the definition of health that was selected but also because of the very real possibility that some of the selected subjects may, in fact, have subclinical disease.

Recruiting a valid group of reference subjects and obtaining informed consent in today’s environment is costly, time-intensive, and virtually an impossible task for most laboratories. The challenge is further magnified in establishing RIs for different age groups (eg, pediatric patients and geriatric patients), uncommon sample types (eg, cerebrospinal fluid and aspirations), timed collections, challenge tests, and serial measurements.

In light of these difficulties, most laboratories elect not to establish their own RIs but rather to verify RIs that have been reported by the manufacturer or established by another laboratory. This is a relatively simple study that requires recruiting only 20 healthy subjects.3 The underlying assumption is that the laboratory’s analytic system is calibrated and produces results similar to those of the method used in the original RI study. However, this assumption may not hold because, in many cases, the details of the reference study, such as its design, the inclusion and exclusion criteria used for selecting the healthy recruits, and preanalytic sources of variation, are lacking.

A laboratory can elect to “transfer” the RIs that were in use with an older method (or from another laboratory) to a new method. To do this, the laboratory must first demonstrate that the 2 methods produce comparable results. It is well known that analytic systems drift over time, and there is no guarantee that the method of today is producing results that are comparable to those that were produced at the time of the original RI study. This technique is the main reason why many laboratories today are using RIs that were established decades ago and are out-of-date.

Even if a laboratory were able to obtain a rigorously selected, statistically sufficient number of healthy subjects and perform all the necessary testing, the next step would require statistical analysis of the data. What statistical technique should be used: parametric, transformed parametric, nonparametric, or one of the many others described? In most cases, the choice will be made subjectively because there are no clear guidelines, and the resultant intervals will differ depending on the method used.

Laboratories are often faced with test data that exhibit a multimodal or an asymmetric distribution. This may reflect a large prevalence of subclinical disease within the selected population or subgroup-related differences in normal ranges. The latter requires partitioning of test subjects by sex, age, race, and other factors. Partitioning by sex is relatively easy (select a minimum of 120 males and 120 females). However, partitioning by age groups is not a simple matter. What age cutoffs should be used? How many groups should be studied? There are some complex statistical techniques available, but none seem ideal for solving partitioning problems.3

The last major challenge is cost. In the modern environment when laboratories are struggling to stay profitable, not everyone is willing to budget the appropriate resources for a lengthy and expensive RI study.

An alternative approach for establishing RIs is to do an indirect, so-called a posteriori, study of the patient data already collected and stored in the laboratory database. This is appealing because the data are readily available and will result in time and cost savings. A number of publications discuss this approach.4–8 Most of these studies were able to report clinically relevant and meaningful RIs. All of them used various sophisticated filters to exclude results from “unhealthy” subjects; some used data from hospital laboratories and some used data from outpatient care settings or noninstitutionalized population study databases. Most of these studies used complex statistical algorithms to derive the final intervals. However, current guidelines do not endorse these methods as a primary approach for establishing RIs, mainly out of concern that most of the data may not come from reference or healthy subjects.3 This position may be justified for test results collected from hospitalized patients but is questionable when considering a very large number of results that have been collected in outpatient settings. Indeed, there is no disease with prevalence close to 50%. On the other hand, as discussed, the recommended direct sampling techniques are not without their own assumptions.

The reliability of an RI study should be a function of its accuracy and reproducibility and have a direct relationship with the number of observations used and method standardization. Statistically, it is more robust to analyze thousands of measurements that may include some unhealthy subjects than 120 measurements that are assumed to be from healthy subjects. The main problem with most of the reported indirect studies is that they used statistical analyses designed for a direct sampling technique. Hoffmann, in his classic JAMA article from 1963, described a technique designed for indirect estimation of RIs using all available test results from a laboratory’s database: “This statistical technique can be used for obtaining any normal values in medicine where a group of measurements are available and the mathematical assumptions are reasonable.”9(p868) Although his work has been widely cited, few authors have actually applied the Hoffmann method in their calculations.

A notable exception is the manual of pediatric RIs by Soldin et al10 that is now in its sixth edition and published by American Association for Clinical Chemistry. This fundamental work was limited by the relatively small number of observations (typically 50–100) that were used and by the semimanual application of Hoffmann analysis of data, which added subjectivity to the calculations.10

Our goal in this study was to assess the reliability of the Hoffmann approach using a newly developed computer program designed to remove subjectivity from RI calculations.

Materials and Methods


The study was carried out using the data from Laboratory Corporation of America, one of the largest providers of laboratory testing in the United States. Data were collected from 6 regional laboratories: Burlington, NC; Columbus, OH; Houston, TX; Birmingham, AL; Raritan, NJ; and Tampa, FL. The average ordering source by test volume for this group of laboratories was as follows: 87% from outpatient general practice facilities and 13% from acute care facilities. The patient test results stored in the laboratory information system were queried for a given time frame without any filtering, no results were excluded, and all of these results constituted the original database for this study. The protocol for this study was determined to be exempt under existing regulations by the institutional review board.

Accuracy Study

The following 5 analytes were evaluated: mean corpuscular volume (MCV), hemoglobin, creatinine, calcium, and thyroid-stimulating hormone (TSH). Each of these analytes is known to have peer-reviewed RIs as calculated from direct sampling of volunteers, multicentered studies, or national survey databases. The following numbers of test results were collected retrospectively from the laboratory information system: MCV, 40,744 results (men and women); hemoglobin, men, 16,463; hemoglobin, women, 24,809; creatinine, men, 24,068; creatinine, women, 29,471; calcium, 51,492 (men and women); and TSH, 129,443 (men and women). All results originated from tests performed in June 2008 and were from adult patients (18 years or older).

Reproducibility Study

The reproducibility of this method was assessed using TSH as an analyte that historically has been problematic for RI studies. Some authors have suggested that “TSH reference range cannot be determined from population data, because occult thyroid dysfunction skews the TSH upper limit.”11(p4239) A comparison of 2 TSH RIs as calculated from 2 different sets of test results collected 6 months apart was considered to be a good challenge for testing the reproducibility of this method. The second set of TSH results (n = 151,235), also from adult patients (18 years or older), originated from tests performed in December 2008.

Technical Information

Standardization to the same methods, reagent lot numbers, calibrators, controls, and standard operating procedures significantly reduces the between-laboratory variability in test results. All 6 laboratories that participated in this study had been standardized in this manner. The methods were as follows: MCV and hemoglobin, LH750/GEN*S (Beckman Coulter, Miami, FL); creatinine and calcium, Roche/Hitachi MODULAR (Roche Diagnostics, Indianapolis, IN); TSH, ADVIA Centaur (Siemens Medical Solutions Diagnostics, Tarrytown, NY). All methods are US Food and Drug Administration–approved and were operated according to the manufacturer’s specifications. Centralized comparative method performance monitoring for each laboratory location was conducted on a monthly basis using quality control data and patient mean data.

Statistical Analyses

The Hoffmann indirect method for the derivation of RIs was programmed as originally described.9 Chauvenet criteria were used for the detection and elimination of outliers. With Chauvenet criteria, a measurement (result) is eliminated if the probability of its occurrence is less than 1/(2N), where N is the number of measurements in the data pool and is greater than 4.
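The outlier rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: it assumes the results are roughly normally distributed around the sample mean, uses the sample standard deviation, and applies the criterion in a single pass. The function name `chauvenet_filter` is hypothetical.

```python
import math
from statistics import mean, stdev


def chauvenet_filter(values):
    """Split values into (kept, rejected) using one pass of Chauvenet's criterion.

    Assumption: a normal model with the sample mean and sample SD.
    A value is rejected when the two-sided probability of a deviation
    at least as large as its own is less than 1/(2N).
    """
    n = len(values)
    if n <= 4:
        raise ValueError("Chauvenet's criterion needs more than 4 measurements")
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return list(values), []  # all values identical, nothing to reject
    threshold = 1.0 / (2 * n)
    kept, rejected = [], []
    for v in values:
        z = abs(v - mu) / sigma
        # Two-sided tail probability P(|Z| >= z) for a standard normal variable.
        p = math.erfc(z / math.sqrt(2))
        if p < threshold:
            rejected.append(v)
        else:
            kept.append(v)
    return kept, rejected


sample = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 50.0]
kept, rejected = chauvenet_filter(sample)
```

Whether the criterion should be reapplied after each rejection is not stated in the text, so the single pass here is a simplifying assumption.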

Following the elimination of outliers, the cumulative frequency for each test result was determined. The frequency of a test result was taken as the number of times a result occurs in the data set divided by the total number of results times 100%:

$$F_{X_i} = \frac{\mathrm{Count}_{X_i}}{\mathrm{Count}_{\text{data-pool}}} \times 100\%$$

The cumulative frequency, with results ordered by $X_i$, is

$$CF_{X_i} = \sum_{k=1}^{i} F_{X_k}$$
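As a small illustration of these two definitions, the sketch below tabulates the percentage frequency and the running cumulative frequency for each distinct result value. The function name `cumulative_frequencies` and the tuple layout are assumptions made for this example.

```python
from collections import Counter


def cumulative_frequencies(results):
    """Return a list of (value, frequency %, cumulative frequency %) tuples.

    Values are ordered from smallest to largest, and the cumulative
    frequency is the running sum of the per-value frequencies.
    """
    counts = Counter(results)
    total = len(results)
    table = []
    running = 0.0
    for x in sorted(counts):
        freq = 100.0 * counts[x] / total
        running += freq
        table.append((x, freq, running))
    return table


table = cumulative_frequencies([1, 1, 2, 3])
for value, freq, cum in table:
    print(value, freq, cum)
```

With the four toy results above, the value 1 accounts for half of the data pool, and the last cumulative frequency reaches 100%.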

Once the outliers had been eliminated, the data were refined so that only values from the linear portion of the cumulative frequency graph were used.

Visual analysis provides a good approximation of the linear portion of the data. By computing the maximum deviation of the data from the regression line in this interval, the acceptable linearity error may be defined. This approach was taken because it is the best way to translate human fuzzy logic into computer precision. Once the acceptable linearity error had been defined, the portion of the data pool that was deemed to be “linear” was selected and used in computing the associated regression line (Figure 1).

Subsequently, the best-fitting linear regression equation, $y_i = \alpha x_i + \beta + \varepsilon_i$, was determined by least-squares analysis ($\alpha$ is the slope, $\beta$ is the intercept of the line, and $\varepsilon_i$ is the error). The line with the minimum sum of squared residual values was identified accordingly. A residual value ($r_i$) was taken as the difference between the measured value ($y_i$) and the approximated one as determined by the linear regression function [$f(x_i)$]:

$$r_i = y_i - f(x_i)$$
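The text does not spell out how the program chooses the linear portion, so the following is only one possible reading, offered as a sketch. It assumes that x is the cumulative frequency (%) and y is the test result value, fits an ordinary least-squares line to every contiguous window of points, and keeps the widest window whose maximum absolute residual stays within the acceptable linearity error. The brute-force search and the names `fit_line`, `max_deviation`, and `widest_linear_window` are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def max_deviation(xs, ys, slope, intercept):
    """Largest absolute residual |y_i - f(x_i)| over the given points."""
    return max(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))


def widest_linear_window(cum_freq, values, max_error, min_points=3):
    """Return (start, end) indices, end exclusive, of the widest contiguous
    window whose fitted line deviates from every point by at most max_error.

    Returns None when no window qualifies. Brute force, for illustration only.
    """
    best = None
    n = len(values)
    for start in range(n):
        for end in range(start + min_points, n + 1):
            xs, ys = cum_freq[start:end], values[start:end]
            slope, intercept = fit_line(xs, ys)
            if max_deviation(xs, ys, slope, intercept) <= max_error:
                if best is None or (end - start) > (best[1] - best[0]):
                    best = (start, end)
    return best


# Toy data: the middle points lie on y = 0.1x + 10; the two tails do not.
x = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
y = [5.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 30.0]
window = widest_linear_window(x, y, max_error=0.05)
```

In the toy data, the search discards the first and last points and keeps the nine central ones. The tolerance of 0.05 is borrowed from the "maximum error" reported in the Figure 2 caption.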

The RIs were then determined from the linear regression equation following extrapolation of the preceding curve. The RI was calculated (for x = 2.5% and 97.5%):

$$RI_{\min} = \alpha \times 2.5 + \beta$$

$$RI_{\max} = \alpha \times 97.5 + \beta$$
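The extrapolation step can be checked against the regression function reported in the Figure 2 caption for hemoglobin in men (y = 0.03594x + 12.93457). The helper name `reference_interval` is hypothetical; the arithmetic is just the two formulas above.

```python
def reference_interval(slope, intercept, lower_pct=2.5, upper_pct=97.5):
    """Evaluate the regression line at the 2.5% and 97.5% cumulative frequencies."""
    return slope * lower_pct + intercept, slope * upper_pct + intercept


# Slope and intercept taken from the Figure 2 caption (hemoglobin, men, g/dL).
ri_min, ri_max = reference_interval(0.03594, 12.93457)
print(f"{ri_min:.5f} to {ri_max:.5f}")  # 13.02442 to 16.43872
```

The result agrees with the reference interval stated in the same caption, 13.02442 to 16.43872 g/dL.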

The reference change value (RCV) was selected for determining the statistical significance of observed differences between the calculated RI and the published RI. If the observed difference is less than the RCV, the difference is not significant.

RCV was calculated as described by Fraser et al12:

$$RCV = 2^{1/2} \times Z \times \left(CV_a^2 + CV_i^2\right)^{1/2}$$

where $Z$ is the probability selected for significance (a $Z$ value of 1.96 was selected for 95% probability corresponding to a significant change), $CV_a$ is the analytic variation, and $CV_i$ is the within-subject biologic variation. The $CV_a$ and $CV_i$ values were derived from Ricos et al.13
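A short sketch of the RCV calculation follows. The CV values in the example are hypothetical placeholders, not the values taken from Ricos et al. The comparison helper also assumes that the observed difference is expressed as a percentage of the published limit, which the text does not state explicitly.

```python
import math


def reference_change_value(cv_a, cv_i, z=1.96):
    """RCV (%) = sqrt(2) * Z * sqrt(CVa^2 + CVi^2), with CVs given in percent."""
    return math.sqrt(2) * z * math.sqrt(cv_a ** 2 + cv_i ** 2)


def difference_is_significant(calculated, published, cv_a, cv_i, z=1.96):
    """Compare the percentage difference between two limits with the RCV.

    Assumption: the difference is taken relative to the published limit.
    """
    percent_diff = abs(calculated - published) / published * 100.0
    return percent_diff >= reference_change_value(cv_a, cv_i, z)


# Hypothetical CVs chosen only to make the arithmetic easy to follow.
rcv = reference_change_value(cv_a=3.0, cv_i=4.0)
print(round(rcv, 2))
```

With these placeholder CVs the RCV is close to 13.9%, so two limits that differ by a few percent would be treated as not significantly different.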


The representative output graph is given in Figure 2, and the derived regression functions and calculated RIs for all 5 analytes are given in Table 1. The comparison of calculated RIs using the computerized Hoffmann method with peer-reviewed published intervals is given in Table 2 (accuracy study).14–18 The comparison of 2 sets of TSH RIs derived from 2 separate data sets collected 6 months apart is given in Table 3 (reproducibility study).

Figure 1

A typical plot of cumulative frequencies (hatched line) with the linear portion of the cumulative frequency graph (solid line).

Figure 2

A representative output graph for hemoglobin (g/dL) for men. A, Cumulative frequencies (dots) and regression line. y = 0.03594x + 12.93457. B, Scatter graph showing good data (black dots) and outliers (gray dots). Linear range (%), 23.69302–90.67008; N value, 16,457 (without outliers); regression function, y = 0.03594x + 12.93457; reference interval, 13.02442–16.43872; maximum error, 0.05000. To convert hemoglobin values to Système International units (g/L), multiply by 10.0.


The computerized Hoffmann method for the indirect determination of RIs produced intervals that were remarkably similar to peer-reviewed RIs (Table 2).14–18 None of the calculated lower or upper limits displayed statistically significant differences from the corresponding published limits.

Table 1
Table 2
Table 3

The calculated ranges in most cases were slightly narrower than the ranges obtained from direct sampling techniques. A concern with the use of indirect methods has been that the resulting interval may be wider than it should be or skewed because of inclusion of unhealthy subjects. This was not observed in the present study. It is noteworthy that the choice of the statistical method and the large number of observations are critical components in maintaining the accuracy of RIs derived from indirect methods. The described technique may be the method of choice for this purpose.

It is well recognized that test results may be impacted by a host of different factors, some of which are known and some are not (preanalytic sources of variation). Preanalytic sources of variation should be taken into consideration and controlled when selecting the healthy subjects who will be used for establishing an RI by traditional techniques. Despite this, preanalytic sources of variation that may have been controlled for in establishing the RI are often neglected when it comes to the routine testing of patients. It may be argued that the Hoffmann indirect technique as applied in the present study should provide a more robust estimation of the RI than traditional methods in that the majority (if not all) of the preanalytic sources of variation impacting the test result will be reflected in the outcome.

Another interesting aspect of this study relates to estimates of disease prevalence. It is reasonable to suggest that the percentage of test results above and below the upper and lower limits in the entire data set used for RI calculation for a given analyte may correlate with the prevalence of conditions that cause its concentration to be abnormal. This assumption is less reasonable when using smaller sets of patient results or results from hospital settings because the distribution of results will be skewed by the significant number of repeated tests from the same patient and by results from patients with serious conditions. The large number of observations from outpatient settings will negate those factors. A recent study of the prevalence of chronic kidney disease in the United States reported the prevalence in 1999 to 2004 as 13.1% and raised concerns about its future increase.19 In the present study, the filtering technique excluded from the healthy subgroup a total of 14.4% of test results for creatinine as being above the calculated upper limit (men and women). This is close to the reported prevalence of kidney disease in the United States.

The American Thyroid Association has estimated the prevalence of thyroid dysfunction in the United States for subclinical hypothyroidism to be up to 17% and for clinical hypothyroidism to be up to 2%, which makes a combined prevalence of both hypothyroid conditions of up to 19%. The prevalence of subclinical hyperthyroidism was reported to be up to 6% and clinical hyperthyroidism to be up to 0.2%, which makes a combined prevalence of both hyperthyroid conditions of up to approximately 6%.20 The described method excluded from the healthy subgroup a total of 18.6% of test results for TSH that were above the calculated upper limit and 5.9% of test results for TSH that were below the calculated lower limit of the RI. Again, these values are very close to the reported prevalence of thyroid diseases.
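The percentages quoted in this comparison are simple tail fractions of the full data pool. A minimal sketch of that calculation, with a hypothetical function name and toy data, is shown here.

```python
def percent_outside(results, lower, upper):
    """Return (% of results below lower, % of results above upper)."""
    total = len(results)
    if total == 0:
        raise ValueError("no results supplied")
    below = sum(1 for r in results if r < lower)
    above = sum(1 for r in results if r > upper)
    return 100.0 * below / total, 100.0 * above / total


# Toy data only; the limits stand in for a calculated RI.
below_pct, above_pct = percent_outside([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], lower=2, upper=8)
print(below_pct, above_pct)
```

Applied to the full set of stored results and the calculated RI limits, the same counts would give figures comparable to those quoted above for creatinine and TSH.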

As expected, the reported RIs for TSH showed the greatest degree of variation with the reference studies. However, the observed difference was well below the RCV. It is interesting that both of the reference studies for TSH reported higher upper limits than the upper limit that was generated in the present study. This may be explained in part by the observations of Spencer et al11 regarding the difficulties in the estimation of an accurate TSH upper limit using the existing techniques. The presented method seems to accurately reflect the actual distribution of TSH in the thyroid disease–free population, and the new calculated TSH upper limit is in complete agreement with a recently suggested upper limit of 3.0 mIU/L and a treatment goal for hypothyroidism.21,22

The usefulness of this method in a clinical laboratory may be limited when any of the following is present: a large proportion of results from hospitalized patients; a limited number of observations, especially for subject groups such as pediatric or geriatric patients or for rare sample types such as synovial fluid; or a lack of standardization between the methods in use.

This limitation could be minimized by linking laboratories that are operating similar instrumentation and methods into peer group–based operational networks. The performance of the peer group within the network would be monitored to ensure that standardization goals are being met. Under such circumstances, each of the participating laboratories could contribute its patient test results to a central network data bank from which the RIs for the network could be established.


The described computerized Hoffmann method for the indirect estimation of the RIs from the existing laboratory database of test results has proven to be reliable and reproducible. The method produced RIs that were statistically not different from the ones reported in peer-reviewed publications. This method was able to derive RIs for a problematic analyte (TSH), producing a result that was in strong agreement with recent scientific observations and clinical recommendations. In addition to its reliability and reproducibility, this technique has a number of advantages over the conventional direct sampling techniques. Notably, it provides a mechanism for deriving RIs for difficult-to-study populations like pediatric and geriatric, as well as for rare sample types like aspirations, for calculating RIs for timed collections, and for challenge tests while providing overall laboratory resources savings. In addition, it may be used on an ongoing basis as part of a quality management program by providing an auditing tool for confirming the appropriateness of the current RIs for the existing methods. The caveat for this method is that it may be used only with large numbers of observations, with test results that are mostly from outpatient settings, and when standardization of methods has been implemented and monitored.


We thank Mark Sharp for the collection of data and Mark Brecher, MD, for review of the manuscript.


  • Authors’ contributions were as follows: study concept, design, and drafting the manuscript, Drs Katayev and Seccombe; software programming, Mr Balciza; statistical analysis, Drs Katayev and Seccombe and Mr Balciza. Drs Katayev and Seccombe had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.

