Our Process

We transform raw CMS hospital cost report data into user-friendly panel datasets to make data analysis easier.

Gather Data
Select Hospitals
Standardize Cost Reporting Periods
Outlier Correction
Geographic Summary Files
Variable Derivation
Panel Dataset Distribution

Gather Data

The RAND Hospital Data covers 1996 onwards and contains raw and derived variables drawn from the following source datasets:

Medicare Hospital Cost Report (CMS Forms 2552-10 and 2552-96)
Centers for Medicare and Medicaid Services (CMS) Provider of Services File
Inpatient Prospective Payment System (IPPS) Impact File
Area Health Resources File (AHRF)
Agency for Healthcare Research and Quality (AHRQ) Compendium of US Health Systems – 2016-2023

Select Hospitals

The first step of our data processing is to use the CMS Provider of Services File to select providers located in the United States and Puerto Rico that are required to submit Medicare cost reports.

Standardize Cost Reporting Periods

CMS allows hospitals to select their own reporting periods for cost reports, so the time periods covered by submitted cost reports vary. For example, Hospital A may elect to submit a cost report based on the calendar year, and Hospital B may elect to report based on a July 1 – June 30 year. These differing reporting periods can make it difficult to compare patterns across facilities.

The second step of our data processing therefore takes raw values from cost reports that cover different lengths and periods of time and standardizes them to three time periods:

Calendar year: Jan 1 to Dec 31
Federal fiscal year: Oct 1 to Sept 30
Hospital fiscal year: all hospital cost reports with start dates falling within a given Federal fiscal year
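
To make the idea of standardizing reporting periods concrete, here is a minimal sketch in Python. It assumes a simple day-weighted proration, in which each cost report contributes to a calendar year in proportion to the number of days it overlaps that year. This is a common approach and an assumption on our part for illustration; it is not a description of the actual RAND procedure. The function names and the sample values are hypothetical.

```python
from datetime import date


def overlap_days(start_a, end_a, start_b, end_b):
    """Number of days shared by two inclusive date ranges."""
    latest_start = max(start_a, start_b)
    earliest_end = min(end_a, end_b)
    return max(0, (earliest_end - latest_start).days + 1)


def prorate_to_calendar_year(reports, year):
    """Estimate a calendar-year total from (start, end, value) cost reports.

    Assumption: the reported value accrues evenly over the reporting period,
    so each report is weighted by its share of days inside the target year.
    """
    year_start, year_end = date(year, 1, 1), date(year, 12, 31)
    total = 0.0
    for start, end, value in reports:
        report_days = (end - start).days + 1
        shared = overlap_days(start, end, year_start, year_end)
        total += value * shared / report_days
    return total


def federal_fiscal_year(d):
    """Federal fiscal year containing a date (FFY N runs Oct 1, N-1 to Sep 30, N)."""
    return d.year + 1 if d.month >= 10 else d.year


# Hospital B reports on a July 1 - June 30 year (values are made up).
hospital_b = [
    (date(2018, 7, 1), date(2019, 6, 30), 365.0),
    (date(2019, 7, 1), date(2020, 6, 30), 732.0),
]
estimate_2019 = prorate_to_calendar_year(hospital_b, 2019)

# Hospital fiscal year: group each report by the FFY its start date falls in.
ffy_of_first_report = federal_fiscal_year(hospital_b[0][0])
```

The same proration can target a Federal fiscal year by swapping the year boundaries for Oct 1 and Sep 30. The hospital fiscal year needs no proration, because each report is assigned whole to the Federal fiscal year in which it starts.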

Outlier Correction

RAND applies an algorithm that identifies numeric values that fall far outside the normal range of variation ("outliers") and replaces them with interpolated values. In general, the data are allowed a very wide range of variation before being corrected, and the degree of variation is adjusted based on two factors:

The degree of observed variation within a given hospital over time.
The degree of variation across hospitals for a given variable.

Subscribers can choose datasets with or without outlier correction.
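
The general shape of such a correction can be sketched as follows. This is an illustrative stand-in, not the RAND algorithm, whose details are not given here. The sketch assumes that a value is flagged when it lies further from the hospital's own median than a generous multiple of a spread measure, that the spread measure is the larger of the hospital's median absolute deviation over time and a cross-hospital spread supplied by the caller, and that flagged values are replaced by linear interpolation between the nearest unflagged years. The function name, the `k` multiplier, and the sample series are all hypothetical.

```python
from statistics import median


def correct_outliers(series, cross_hospital_spread, k=10.0):
    """Replace extreme values in one hospital's yearly series with interpolated ones.

    series: values for one variable and one hospital, ordered by year.
    cross_hospital_spread: typical spread of the variable across hospitals
        (hypothetical input, computed elsewhere).
    k: how many spreads a value may deviate before it is corrected.
    """
    center = median(series)
    within_spread = median(abs(v - center) for v in series)
    tolerance = k * max(within_spread, cross_hospital_spread)
    ok = [abs(v - center) <= tolerance for v in series]

    corrected = list(series)
    for i, good in enumerate(ok):
        if good:
            continue
        # Nearest unflagged neighbours on each side, if any.
        left = next((j for j in range(i - 1, -1, -1) if ok[j]), None)
        right = next((j for j in range(i + 1, len(series)) if ok[j]), None)
        if left is not None and right is not None:
            weight = (i - left) / (right - left)
            corrected[i] = series[left] + weight * (series[right] - series[left])
        elif left is not None:
            corrected[i] = series[left]   # no later good value: carry forward
        elif right is not None:
            corrected[i] = series[right]  # no earlier good value: carry back
    return corrected


# Made-up series with one implausible spike in the fourth year.
raw = [100.0, 104.0, 98.0, 9800.0, 102.0, 101.0]
cleaned = correct_outliers(raw, cross_hospital_spread=5.0)
```

With a large `k`, ordinary year-to-year movement is left alone and only values far outside the normal range are replaced, which mirrors the wide tolerance described above.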

Geographic Summary Files

In addition to hospital-level files, we create summary datasets at various geographic levels of observation:

County-level files contain one observation for each county for each year, summarizing all hospitals in that county.
CBSA-level files contain one observation for each core-based statistical area (CBSA) for each year. See "What is the Core-Based Statistical Area?" in the FAQ for more information.
State-level files contain one observation for each US state, as well as DC and Puerto Rico, for each year.
National-level files contain one observation for each year.
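
The roll-up from hospital-level records to a geographic level can be sketched as below. The sketch assumes that the summary is a simple sum of a hospital-level value within each area and year; the actual aggregation rule may differ by variable (for example, ratios would not be summed). The field names `county_fips`, `state`, `year`, and `beds`, and the sample records, are hypothetical.

```python
from collections import defaultdict


def summarize(records, level_field=None, value_field="beds"):
    """Sum a hospital-level value into one observation per area per year.

    level_field: record field naming the area (e.g. "county_fips" or "state");
        None produces a single national observation per year.
    """
    totals = defaultdict(float)
    for rec in records:
        area = rec[level_field] if level_field else "US"
        totals[(area, rec["year"])] += rec[value_field]
    return dict(totals)


# Made-up hospital-level records.
hospitals = [
    {"county_fips": "06037", "state": "CA", "year": 2020, "beds": 300.0},
    {"county_fips": "06037", "state": "CA", "year": 2020, "beds": 150.0},
    {"county_fips": "06073", "state": "CA", "year": 2020, "beds": 200.0},
    {"county_fips": "36061", "state": "NY", "year": 2020, "beds": 400.0},
]

by_county = summarize(hospitals, "county_fips")
by_state = summarize(hospitals, "state")
national = summarize(hospitals)
```

A CBSA-level summary would follow the same pattern once each hospital record carries a CBSA code.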

Variable Derivation

In addition to approximately 1,500 raw variables pulled directly from the cost report, the RAND Hospital Data contains around 400 derived variables. In the outlier-corrected datasets, derived variables are calculated from corrected values; in the uncorrected datasets, they are calculated from raw values.

Formulas for each derived variable are included in the provided documentation. See "How do I get detailed documentation on each variable in the raw CMS cost report?" in the FAQ for more information.

Panel Dataset Distribution

The RAND Hospital Data is distributed as a series of panel datasets, with each dataset containing one observation for each entity per year from 1996 onwards.
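
For readers loading a panel file, a quick structural check is to confirm that each entity appears at most once per year. The sketch below shows one way to do that in Python; the field names `provider_id` and `year` and the sample rows are hypothetical and should be replaced with the identifiers given in the provided documentation.

```python
from collections import Counter


def duplicate_keys(rows, id_field="provider_id", year_field="year"):
    """Return the (entity, year) pairs that occur more than once."""
    counts = Counter((row[id_field], row[year_field]) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Made-up rows: the second and third rows share an entity and year.
panel = [
    {"provider_id": "010001", "year": 2019},
    {"provider_id": "010001", "year": 2020},
    {"provider_id": "010001", "year": 2020},
    {"provider_id": "010005", "year": 2020},
]
problems = duplicate_keys(panel)
```

An empty result means the rows form a valid one-observation-per-entity-per-year panel.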

Ready to get started?

Subscribe for access to all of RAND Hospital Data with customization options.