- What's New?
Updates rolled out in the 2019_08_01 vintage:
- Updated raw HCRIS source data to most recent (uploaded by CMS on July 9, 2019), updated CMS Provider of Services file to 2018, and updated the Impact File to the 2020 proposed rule. This brings the number of hospitals reporting in calendar 2017 up to 4709 and number of hospital-years in calendar 2017 up to 4678.2, so data for calendar 2017 can be considered complete. This update also includes partial data for calendar 2018: 4165 hospitals reporting and 2909.4 hospital-years.
- We added "mdcr_outpat_margin", i.e. the margin on Medicare outpatient services. This margin equals Medicare outpatient revenues minus Medicare outpatient costs, divided by Medicare outpatient revenues.
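As a quick illustration of the margin formula above, here is a minimal Python sketch. The function name and the dollar figures are made up for illustration and do not come from the data.

```python
def outpatient_margin(revenues, costs):
    """Return (revenues - costs) / revenues, or None when the margin is undefined."""
    if revenues is None or costs is None or revenues == 0:
        return None
    return (revenues - costs) / revenues


# Hypothetical hospital: $8.0 million in Medicare outpatient revenues, $9.2 million in costs.
margin = outpatient_margin(8_000_000, 9_200_000)
print(f"{margin:.3f}")  # a negative margin: costs exceed revenues
```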
Updates rolled out in the 2019_05_01 vintage:
- Updated raw HCRIS source data to most recent (uploaded by CMS on April 8, 2019), which brings the number of hospitals reporting in calendar 2017 up to 4692 and number of hospital-years in calendar 2017 up to 3437.0 (about three-quarters complete).
- We added "employees_FTEs", i.e. a count of full-time equivalent employees on payroll, from Worksheet S-3. Fun fact: hospital employee FTEs grew fairly steadily from 3.3 million in 1996 to around 4.5 million in 2016 (2017 is a bit lower, but that is due to incomplete data).
- We added several measures of efficiency and labor costs:
- "costs_per_discharge_equivalent", which equals "total_costs" divided by "discharge_equivalents". This is a simple, comprehensive metric (i.e. includes all payers, and includes inpatient and outpatient services) of hospital operating costs per unit of output. It is useful for tracking trends over time within hospitals or within geographic aggregates, but note that it is not casemix-adjusted and, so, is probably not a good metric for comparing hospitals or geographic areas with different levels of patient acuity.
- "discharge_equivalents_per_FTE", which measures output per employee, not casemix-adjusted.
- "mdcr_inpat_costs_per_disc", which measures Medicare inpatient costs (from Worksheet D-1, Part II) per Medicare inpatient discharge (from Worksheet S-3), not casemix-adjusted.
- "mdcr_inpat_costs_per_disc_CMIadj", which measures casemix-adjusted Medicare inpatient costs per discharge. This only includes Medicare fee-for-service and only includes inpatient care, but it includes casemix adjustment and is a key efficiency metric. Comparisons across hospitals and geographic aggregates are meaningful. Trends over time are also meaningful, with the caveat that casemix creep (or, "upcoding") may bias trends, and Medicare made a major shift to its casemix adjustment (switching from DRGs to MS-DRGs) beginning in October, 2007. Unfortunately, the casemix index (and, therefore, this measure) is only available for hospitals paid under Medicare's inpatient prospective payment system. (The casemix index comes from CMS's "Impact File," which only includes IPPS hospitals.)
- "total_salaries_bens", total salaries and employee benefits. Fun fact: total salaries and benefits paid to hospital employees rose from around $150 billion in 1997 to $380 billion in 2016.
- "salaries_bens_per_FTE", salaries plus employee benefits per employee FTE.
- We added fourteen measures of special Medicare payment adjustments, all from Worksheet E, Part A; most are only available in 2552-10 reports. If you want to use these fields, please consult the Provider Reimbursement Manual (P152_36.zip and P152_40.zip, included in RAND_hospital_data_documentation_2019_05_01.zip) and review the instructions provided to hospitals on how to fill them out. These adjustments include:
- sequestration adjustments ("sequestration_adjustment"),
- hospital readmission reduction program penalties ("HRRP_adjust_only10"),
- hospital value-based payment adjustments ("HVBP_adjust_only10"), and others.
- Both the Medicare program and Medicaid have increasingly shifted beneficiaries into managed care, and these inpatient day counts are important for a more-complete measure of payer mix. We expanded our measure of payer mix by adding:
- inpatient days provided to enrollees in Medicare HMO plans (generally synonymous with Medicare Advantage) ("mdcr_hmo_inpat_days"), and
- inpatient days provided to enrollees in Medicaid HMO plans (generally synonymous with Medicaid managed care organizations) ("mdcd_hmo_inpat_days").
- Using the Medicare and Medicaid HMO inpatient day counts, we now calculate:
- "mdcrinclHMO_inpat_day_share", which is the Medicare inpatient day share including Medicare fee-for-service and Medicare HMO days,
- "mdcdinclHMO_inpat_day_share", which is the Medicaid inpatient day share including Medicaid fee-for-service and Medicaid HMO days, and
- "nonmdcrmdcdinHMO_inpat_day_share", which is the share of inpatient days that are non-Medicare (including Medicare HMO) and non-Medicaid (including Medicaid HMO). This non-Medicare/non-Medicaid inpatient day share includes commercially insured plus self-pay/uninsured plus odds and ends like workers compensation. Fun fact: at the national level this share has fallen fairly steadily over time and, as of 2017, was below 30%.
- We added several "Interim variables." These are variables that are required for calculating some of the new ratios that have been added, but that are not inherently interesting by themselves. For example, "mdcr_inpat_costs_per_disc_CMIadj" is missing for critical access hospitals because they do not have casemix index values. When calculating aggregate measures of that ratio, we do not want to include Medicare inpatient costs for critical access hospitals (or, any hospital with a missing casemix index) in the numerator. So, we calculate "mdcr_inpat_costs_CMI_nonmissing", which, when summed, gives us the correct numerator for the aggregate calculations. (You may reasonably choose not to worry about or even look at these interim variables; they are put at the very end of the "List of Variables" for a reason.)
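The role of an interim variable can be sketched in Python. This is only an illustration: the records and field names are made up, and the denominator used here (discharges multiplied by the casemix index) is an assumption about how the casemix adjustment enters the ratio, not a formula stated in this FAQ.

```python
# Illustrative records; every number here is made up.
hospitals = [
    {"name": "A (IPPS)", "mdcr_inpat_costs": 50_000_000.0, "discharges": 4_000, "cmi": 1.6},
    {"name": "B (IPPS)", "mdcr_inpat_costs": 24_000_000.0, "discharges": 3_000, "cmi": 1.0},
    {"name": "C (critical access)", "mdcr_inpat_costs": 3_000_000.0, "discharges": 300, "cmi": None},
]


def costs_cmi_nonmissing(h):
    """Interim variable: costs are counted only when a casemix index is present."""
    return h["mdcr_inpat_costs"] if h["cmi"] is not None else 0.0


def cmi_weighted_discharges(h):
    """Assumed denominator: discharges times casemix index, zero when the index is missing."""
    return h["discharges"] * h["cmi"] if h["cmi"] is not None else 0.0


numerator = sum(costs_cmi_nonmissing(h) for h in hospitals)
denominator = sum(cmi_weighted_discharges(h) for h in hospitals)
aggregate = numerator / denominator

# Summing raw costs instead would let hospital C inflate the numerator.
naive = sum(h["mdcr_inpat_costs"] for h in hospitals) / denominator
print(round(aggregate, 2), round(naive, 2))
```

Summing the interim variable keeps the numerator and denominator restricted to the same set of hospitals, which is the point of carrying it in the data.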
Updates first rolled out in the 2019_02_01 vintage:
- Updated raw HCRIS source data to most recent (uploaded by CMS on Jan. 18, 2019), which brings the number of hospitals reporting in calendar 2017 up to 4689 and number of hospital-years in calendar 2017 up to 3435.7 (about three-quarters complete).
- For hospitals that are members of chains, we added the name and address of the home office. (Previous vintages only included a 6-character code for the home office.)
Updates first rolled out in the 2018_12_01 vintage (this was a big upgrade):
- Updated raw HCRIS source data (uploaded by CMS on Oct. 9, 2018), which brings the number of hospitals reporting in calendar 2017 up to 4683 and number of hospital-years in calendar 2017 up to 3428.1.
- We added a new time period ("Hospital Cost Report," see FAQ below) and a full set of datasets using that time period.
- We added a set of measures of administrative expenses, and administrative expenses as a share of overall expenses.
- We added measures of different types of capital assets, including beginning and ending values and purchases.
- We added estimates of hospital revenues from commercial payers. These revenues are, unfortunately, not reported directly in the cost reports. We estimated commercial revenues by starting with net patient revenues and subtracting off Medicare (including revenues from fee-for-service and an estimate of revenues from Medicare Advantage), Medicaid, CHIP, state and local indigent care revenues, government grants, and partial payments by patients approved for charity care. These commercial revenues should be analyzed with some caution, as they are vulnerable to misreporting on any of the fields that feed into the calculation.
- We added estimates of the ratio of commercial prices to Medicare prices ("commercial_to_mdcr_est"), using commercial revenues and charges and Medicare revenues and charges. This price ratio should also be analyzed with some caution, particularly at the hospital level, but it can provide some insights into geographic variations and trends. For two states that I know well (Maryland, where I live, and Indiana, where I have an ongoing hospital price study) the commercial-to-Medicare estimates align well with reality. (Hint: Maryland uses all-payer rate setting, and Indiana has some of the higher commercial hospital prices in the country.)
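The two estimates above can be sketched as follows. The subtraction follows the description of commercial revenues; the price ratio is written as a ratio of payment-to-charge ratios, which is one reading of "using commercial revenues and charges and Medicare revenues and charges" and is an assumption rather than the documented formula. All names and dollar figures (in millions) are hypothetical.

```python
def estimate_commercial_revenue(net_patient_revenue, non_commercial_revenues):
    """Net patient revenue minus every non-commercial revenue component."""
    return net_patient_revenue - sum(non_commercial_revenues.values())


def commercial_to_medicare_ratio(comm_rev, comm_charges, mdcr_rev, mdcr_charges):
    """Assumed form: (commercial revenue / charges) over (Medicare revenue / charges)."""
    return (comm_rev / comm_charges) / (mdcr_rev / mdcr_charges)


# Hypothetical hospital, $ millions.
non_commercial = {
    "medicare_ffs": 60.0,
    "medicare_advantage_est": 25.0,
    "medicaid": 30.0,
    "chip": 2.0,
    "state_local_indigent_care": 3.0,
    "government_grants": 1.0,
    "charity_care_patient_payments": 1.0,
}
commercial_rev = estimate_commercial_revenue(200.0, non_commercial)

# Medicare revenue here is fee-for-service plus the Medicare Advantage estimate.
medicare_rev = non_commercial["medicare_ffs"] + non_commercial["medicare_advantage_est"]
ratio = commercial_to_medicare_ratio(commercial_rev, 195.0, medicare_rev, 425.0)
print(commercial_rev, round(ratio, 2))
```

Because the commercial figure is a residual, an error in any one of the subtracted components flows directly into both the revenue estimate and the price ratio.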
- What Are Hospital Cost Reports?
Every Medicare-certified hospital is required to submit an annual financial report to the Centers for Medicare & Medicaid Services (CMS). These financial reports are a bit like tax returns: they include basic information on the facility (its name, address, and so on) and the services it provided, but the bulk of the cost report is devoted to information on the hospital's finances. This financial information includes a huge range of detail on revenues, expenses, profits, assets, liabilities, wages, etc.
- Subscription via PayPal
When you subscribe using the PayPal option, their system will add a corresponding subscription under your PayPal account, set to auto-renew by default.
If you do not wish to renew, you can cancel the PayPal subscription associated with https://www.hospitaldatasets.org/ before your annual subscription expires; your account will remain active until the expiration date.
- What Do "2552-10" And "2552-96" Refer To?
CMS requires that hospitals and many other types of health care providers fill out and submit annual financial reports, and different reports are used by different types of providers. The "2552" refers to the forms that hospitals are required to fill out. Skilled nursing facilities fill out form "2540," home health agencies fill out form "1728," and so on.
CMS has made several major updates to the forms hospitals fill out, and the "96" and "10" indicate the version the hospital filled out and submitted. One of those updates to the hospital forms occurred in 1996, and another in 2010. Some new worksheets were added in 2010, and many worksheets were overhauled. From 1996 to 2010, hospitals had to fill out form "2552-96," and from 2010 on hospitals have had to fill out form "2552-10."
- How Do I Get Detailed Documentation On Each Variable In The Cost Reports?
The most detailed documentation is the full set of instructions that CMS provides to hospitals on how to fill out the cost reports. These instructions are part of the CMS "Provider Reimbursement Manual." The Provider Reimbursement Manual is made freely available online by CMS, but its intended audience is finance professionals working in the health care industry and initially it can be overwhelming for researchers. The Manual comes in 2 parts, the first part (Publication 15-1) consisting of 31 chapters on accounting topics spanning multiple provider types, and the second part (Publication 15-2) consisting of 44 chapters each relating to a specific type of provider and set of forms.
For researchers interested in the hospital cost reports, the key chapters are Part 2, Chapter 36 ("Hospital and Healthcare Complex, Form CMS-2552-96"), and Part 2, Chapter 40 ("Hospital and Hospital Health Care Complex Cost Report, Form CMS-2552-10") (see FAQ "What Do '2552-10' And '2552-96' Refer To?").
Each of those two chapters includes a detailed set of instructions for hospitals, and a full set of the forms (in pdf format) hospitals fill out.
- Can I Get A Data Dictionary For The Hospital Cost Reports?
The data sets available for download from the RAND Hospital Data website are flat files, similar to the data sets that researchers are used to working with. We provide a spreadsheet data dictionary with a description of each variable in those flat files; some of those variables are taken directly from the cost reports, and others are processed and added by RAND. We also provide spreadsheets in the documentation zip file (one for 2552-10, and another for 2552-96) showing where each raw variable comes from, and how it fits into the cost report forms.
If you download and work with the raw cost report data provided by CMS, those data are not structured as flat files, and the Provider Reimbursement Manual (see ‘How Do I Get Detailed Documentation On Each Variable In The Cost Reports?’) is the closest thing to a data dictionary.
CMS provides "rollup" SAS data sets, which are flat files with each observation representing a cost report—CMS also provides a spreadsheet record layout that can be helpful in identifying and interpreting fields in the cost reports.
- What Do The Different Time Periods ("Hospital Cost Report," "Hospital Fiscal Year," "Calendar Year," And "Federal Fiscal Year") Mean?
First, a little background may be helpful. Each hospital can select its own cost reporting period, which is also referred to as the hospital's fiscal year. In most cases, those fiscal years run either from October 1 through September 30 (which aligns with the federal government's fiscal year), or January 1 through December 31, or July 1 through June 30. But, hospitals can and do have other cost reporting periods. And, sometimes, hospitals will change their cost reporting period in the middle of a year and have a cost reporting period shorter than a year.
Researchers and analysts have many options for how to deal with the lack of uniformity in hospital cost reporting periods. For research purposes, it is useful to have the cost report data processed so that each value represents a time period that is consistent across hospitals. In preparing the "Calendar Year" data sets, and the "Federal Fiscal Year" data sets, RAND uses an allocation method so that each record represents the same time period for all hospitals in the data.
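This FAQ does not spell out the allocation method, so the following Python sketch shows only one plausible approach: prorating a cost report across calendar years by the number of days that fall in each year. It is an assumption for illustration and should not be read as a description of RAND's processing code.

```python
from datetime import date


def calendar_year_shares(begin, end):
    """Map each calendar year to the share of the reporting period's days that fall in it."""
    total_days = (end - begin).days + 1  # both endpoints are inclusive
    shares = {}
    for year in range(begin.year, end.year + 1):
        overlap_start = max(begin, date(year, 1, 1))
        overlap_end = min(end, date(year, 12, 31))
        shares[year] = ((overlap_end - overlap_start).days + 1) / total_days
    return shares


# Reporting period July 1, 2016 through June 30, 2017 (365 days).
shares = calendar_year_shares(date(2016, 7, 1), date(2017, 6, 30))

# Prorate a hypothetical $100 million in total costs across the two calendar years.
allocated_costs = {year: 100_000_000 * share for year, share in shares.items()}
print(shares)
```

Under a day-weighted approach like this one, a hospital that has reported only part of a calendar year contributes a fraction of a hospital-year, which would be consistent with the fractional hospital-year counts reported elsewhere on this page.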
In the datasets that use the "Hospital Cost Report" as the time period, each record corresponds to one hospital cost reporting period. So, the start and end dates will differ by hospital and will not necessarily be one year long.
When CMS distributes hospital cost report data, they group cost reports based on the federal fiscal year of the beginning of the hospital cost reporting period. For example, if a hospital's cost reporting period begins on December 1, 2016, that day falls in federal fiscal 2017, and so that cost report would be distributed with the "2017" data by CMS--that grouping of cost reports into time periods is what we refer to as "Hospital Fiscal Year" (though it would be more accurate to call it "the federal fiscal year in which the hospital cost reporting period begins").
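The grouping rule in the example above can be written as a short Python function. The function name is ours; the rule relies only on the federal fiscal year running from October 1 through September 30.

```python
from datetime import date


def federal_fiscal_year(d):
    """Federal fiscal year N runs from October 1 of year N-1 through September 30 of year N."""
    return d.year + 1 if d.month >= 10 else d.year


# A cost reporting period beginning December 1, 2016 is grouped with the "2017" data.
print(federal_fiscal_year(date(2016, 12, 1)))
```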
Which time period do we recommend? The "Calendar Year" datasets are the most straightforward, and, for most users, the best place to start. If you plan to merge RAND Hospital Data with hospital cost report records that you have read in or obtained from another source, you may want to use the "Hospital Cost Report" datasets--those datasets include a report record number ("rpt_rec_num") that allows one-to-one merging. The "Federal Fiscal Year" datasets are appropriate for merging with other data sources (such as inpatient hospital provider utilization files from CMS) that use the federal fiscal year. We do not recommend using the "Hospital Fiscal Year" datasets unless you have previously processed hospital cost reports using CMS's time periods and want to be consistent.
- Do the Data Sets Available from RAND Hospital Data Have Every Field?
No, RAND Hospital Data only includes a small, but useful, subset of HCRIS data. The full HCRIS 2552-10 data set includes an overwhelming number of fields: around 38 thousand separate alphanumeric fields and 327 thousand separate numeric fields (this count treats "sublines" as separate fields).
If there is a field that you need for your analysis and that is not included in RAND Hospital Data, please contact us and let us know the worksheet, column, and line(s) that interest you. In future vintages, we will accommodate as many of these requests and upgrades as possible.
- What Does "Level Of Aggregation" Mean?
If you download a data set with hospital as the level of aggregation, you will get a data set where each record represents a single hospital in a single year. If you download a data set with "County," "Market," "State," or "National" as the level of aggregation, each record represents the aggregate for that geographic area in a single year. For some variables, such as number of beds or number of inpatient stays, the aggregate is the simple sum of the hospital-level values. For other variables, such as occupancy, the aggregate value is calculated based on aggregate sums (e.g., aggregate occupancy equals the sum of inpatient days divided by the sum of bed-days available).
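The difference between the two kinds of aggregate can be seen in a small Python sketch. The two hospitals and their values are made up, and the field names other than "beds" are placeholders.

```python
# Two hypothetical hospitals in the same geographic area and year.
hospitals = [
    {"beds": 400, "inpatient_days": 116_800, "bed_days_available": 146_000},
    {"beds": 25, "inpatient_days": 3_650, "bed_days_available": 9_125},
]

# Count variables: the aggregate is the simple sum.
total_beds = sum(h["beds"] for h in hospitals)

# Ratio variables: divide the summed numerator by the summed denominator.
aggregate_occupancy = (
    sum(h["inpatient_days"] for h in hospitals)
    / sum(h["bed_days_available"] for h in hospitals)
)

# For contrast only: averaging the hospital-level ratios weights both hospitals equally.
mean_of_ratios = sum(
    h["inpatient_days"] / h["bed_days_available"] for h in hospitals
) / len(hospitals)

print(total_beds, round(aggregate_occupancy, 3), round(mean_of_ratios, 3))
```

The ratio of sums weights each hospital by its size, so the large hospital dominates the aggregate occupancy, while the unweighted mean of ratios does not.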
- What Does "Market" Mean?
“Market” refers to core-based statistical areas (CBSAs), which are defined by the U.S. Census Bureau. CBSAs “consist of the county or counties or equivalent entities associated with at least one core (urbanized area or urban cluster) of at least 10,000 population, plus adjacent counties having a high degree of social and economic integration with the core as measured through commuting ties with the counties associated with the core.” CBSAs include metropolitan areas and micropolitan areas. When processing and creating RAND Hospital Data, all rural counties within a state (i.e., those counties that are not included in a metropolitan area or a micropolitan area) are grouped into one market, coded as “XX999” where “XX” is the 2-character postal abbreviation for the state.
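The coding rule for rural markets can be sketched in Python. The function name and the way a missing CBSA is represented (None or an empty string) are assumptions made for this illustration.

```python
def market_code(state_abbrev, cbsa_code):
    """Return the CBSA code, or the state's rural market code when the county has no CBSA."""
    if cbsa_code:
        return cbsa_code
    return f"{state_abbrev.upper()}999"


print(market_code("MD", "12580"))  # a county inside a CBSA keeps its five-digit CBSA code
print(market_code("in", None))     # a rural Indiana county falls into the statewide rural market
```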
- What Does "Vintage" Mean?
RAND periodically updates the raw HCRIS source data and the code used to process and prepare the data sets. Vintage represents the version or, more specifically, the date on which RAND created the data sets.
Different vintages of the same data set will differ for several reasons:
- in the more-recent vintage, the processed data sets will use more-recent raw data with additional hospital-years;
- in the more-recent vintage, some cost reports will have been audited, revised, and resubmitted, and some data values in the processed data sets will differ as a result;
- the more-recent vintage may include additional variables that were not included in the earlier vintage.
- What Does "Data Errors Corrected" Mean?
Premium subscribers, when they are selecting data to download, can choose between data with errors corrected versus without errors corrected. (Registered users only have access to data with errors corrected.) To correct data errors, RAND applies an algorithm that identifies numeric values that fall far outside the normal range of variation ("outliers"), and replaces them with interpolated values. In general, the data is allowed a very wide range of variation before being corrected, and the degree of variation is adjusted based on the degree of observed variation within a given hospital over time (so that hospitals that typically exhibit wider-than-normal variation are given more latitude), and the typical degree of variation for a given variable.
For each hospital and for each numeric variable, the allowable range of values is calculated following these eight steps:
- First, for each hospital we calculate the minimum value reported over all the years, the maximum value, the 25th percentile, the 50th percentile (median), and the 75th percentile.
- Second, for each hospital we calculate three measures of variability over time: the interquartile range (i.e., the difference between the 75th percentile and the 25th percentile, which is always nonnegative), the difference between the maximum value and the median ("max-to-median," which is always nonnegative), and the difference between the minimum value and the median ("min-to-median," which is always nonpositive).
- Third, for each hospital we calculate two ratios: the ratio of the max-to-median over the interquartile range ("max-to-IQR ratio"), and the ratio of the min-to-median over the interquartile range ("min-to-IQR ratio," which takes on negative values).
- Fourth, among all hospitals we calculate the 5th percentile and 50th percentile (median) of the min-to-IQR ratios, and the 50th percentile (median) and 95th percentile of the max-to-IQR ratio.
- Fifth, for each hospital we calculate a lower bound ratio as the median min-to-IQR ratio plus 4 times the difference between the 5th percentile min-to-IQR ratio and the median min-to-IQR ratio.
- Sixth, for each hospital we calculate an upper bound ratio as the median max-to-IQR ratio plus 4 times the difference between the 95th percentile max-to-IQR ratio and the median max-to-IQR ratio.
- Seventh, for each hospital we calculate the lower bound as the median (from the first step) plus the product of the IQR (from the second step) and the lower bound ratio (from the fifth step). Because the lower bound ratio is negative, the lower bound falls below the median.
- Eighth, we calculate the upper bound as the median (from the first step) plus the product of the IQR (from the second step) and the upper bound ratio (from the sixth step).
In the data sets with data errors corrected, values that exceed the upper bound or fall below the lower bound are replaced with linearly interpolated values. (If the values to be replaced are the first or the last in the time series, then they are replaced with linearly extrapolated values.) The interpolation and extrapolation are based only on values that fall within the upper and lower bounds (i.e. if two consecutive data points fall outside those bounds, then they will both be replaced).
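The eight steps and the replacement rule can be sketched in Python for a single variable. This is a reading of the description above, not RAND's code, and several details are assumptions: the percentile method (linear interpolation between order statistics), skipping hospitals whose interquartile range is zero, building the upper bound ratio from the max-to-IQR ratios, adding the (negative) lower bound ratio times the IQR to the median so that the lower bound falls below it, and extrapolating from the two nearest in-bounds points.

```python
def percentile(values, p):
    """p-th percentile (0-100) by linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def hospital_stats(series):
    """Steps 1-3 for one hospital's time series of one variable."""
    med = percentile(series, 50)
    iqr = percentile(series, 75) - percentile(series, 25)
    if iqr == 0:
        return None  # ratios are undefined; the text does not say how this case is handled
    return {"median": med, "iqr": iqr,
            "max_to_iqr": (max(series) - med) / iqr,
            "min_to_iqr": (min(series) - med) / iqr}


def bound_ratios(all_stats, k=4):
    """Steps 4-6: percentiles of the ratios across hospitals, widened by a factor of k."""
    mins = [s["min_to_iqr"] for s in all_stats]
    maxs = [s["max_to_iqr"] for s in all_stats]
    med_min, med_max = percentile(mins, 50), percentile(maxs, 50)
    lower_ratio = med_min + k * (percentile(mins, 5) - med_min)
    upper_ratio = med_max + k * (percentile(maxs, 95) - med_max)
    return lower_ratio, upper_ratio


def bounds(stats, lower_ratio, upper_ratio):
    """Steps 7-8: lower_ratio is negative, so the lower bound lands below the median."""
    return (stats["median"] + stats["iqr"] * lower_ratio,
            stats["median"] + stats["iqr"] * upper_ratio)


def correct(series, lower, upper):
    """Replace out-of-bounds values by linear inter- or extrapolation from in-bounds points."""
    good = [i for i, v in enumerate(series) if lower <= v <= upper]
    if len(good) < 2:
        return list(series)  # not enough in-bounds points to fit a line
    out = list(series)
    for i, v in enumerate(series):
        if lower <= v <= upper:
            continue
        before = [g for g in good if g < i]
        after = [g for g in good if g > i]
        if before and after:
            a, b = before[-1], after[0]    # interpolate between the nearest neighbors
        elif after:
            a, b = after[0], after[1]      # extrapolate backward from the first two
        else:
            a, b = before[-2], before[-1]  # extrapolate forward from the last two
        slope = (series[b] - series[a]) / (b - a)
        out[i] = series[a] + slope * (i - a)
    return out


# Made-up panel: one variable, three hospitals, six years each.
panel = {"H1": [100, 102, 104, 900, 108, 110],
         "H2": [50, 52, 51, 53, 54, 52],
         "H3": [200, 210, 190, 205, 195, 0]}
stats = {h: hospital_stats(v) for h, v in panel.items()}
lower_ratio, upper_ratio = bound_ratios([s for s in stats.values() if s])
h2_lower, h2_upper = bounds(stats["H2"], lower_ratio, upper_ratio)

# The replacement rule on its own, with bounds supplied by hand.
print(correct([100, 102, 104, 900, 108, 110], 50, 200))
```

With only three hospitals the pooled percentiles are dominated by the extreme values themselves, so the panel above only shows that the steps chain together; in practice the percentiles in the fourth step would be taken across thousands of hospitals.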
- What Do The Raw Healthcare Cost Report Information System (HCRIS) Data Look Like?
The raw HCRIS data are stored in a relational database that includes three file types: the report record, numeric values, and alphanumeric values.
The report record includes a report record number (rpt_rec_num), which is the key that uniquely identifies a single cost report and that links to the numeric values file and to the alphanumeric values file. The report record includes the provider number of the hospital that submitted the report (prvdr_num), the beginning and end dates of the cost reporting period (fy_bgn_dt and fy_end_dt), the status of the cost report (rpt_stus_cd), the identity of the fiscal intermediary that processed the cost report, and a handful of other fields. In the 2015 HCRIS data released by CMS on July 15, 2017, there were 6216 report records, and the first 5 of those records are shown here:
The numeric values include the report record number (rpt_rec_num), the worksheet, line and column, and the numeric value reported by the hospital. In the 2015 HCRIS data released by CMS on July 15, 2017, there were 19.5 million numeric values, and the first 5 of those values are shown here (note that the report record number for these 5 values corresponds to the first report record shown above):
The alphanumeric values also include the report record number (rpt_rec_num), the worksheet, line and column, and the alphanumeric value. In the 2015 HCRIS data released by CMS on July 15, 2017, there were 3.6 million alphanumeric values, and the first 5 of those values are shown here (note that the report record number for these 5 values corresponds to the first report record shown above):
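Separately from the listings referred to above, the following Python sketch illustrates how the three file types link together through rpt_rec_num. The rows are made up and are not actual HCRIS records. The field names wksht_cd, line_num, clmn_num, and itm_val_num, and the coded worksheet, line, and column values, are assumptions used for illustration; the cell chosen is the beds field, which this FAQ places on Worksheet S-3, Part 1, column 2, line 14.

```python
# Made-up rows in the shape described above; these are not actual HCRIS records.
report_records = [
    {"rpt_rec_num": 1001, "prvdr_num": "999991",
     "fy_bgn_dt": "2015-01-01", "fy_end_dt": "2015-12-31"},
    {"rpt_rec_num": 1002, "prvdr_num": "999992",
     "fy_bgn_dt": "2014-10-01", "fy_end_dt": "2015-09-30"},
]
numeric_values = [
    {"rpt_rec_num": 1001, "wksht_cd": "S300001", "line_num": "01400",
     "clmn_num": "00200", "itm_val_num": 250.0},
    {"rpt_rec_num": 1001, "wksht_cd": "S300001", "line_num": "01400",
     "clmn_num": "00300", "itm_val_num": 91250.0},
    {"rpt_rec_num": 1002, "wksht_cd": "S300001", "line_num": "01400",
     "clmn_num": "00200", "itm_val_num": 40.0},
]


def extract_field(numeric_rows, wksht_cd, line_num, clmn_num):
    """Map rpt_rec_num -> value for one worksheet/line/column cell."""
    wanted = (wksht_cd, line_num, clmn_num)
    return {r["rpt_rec_num"]: r["itm_val_num"] for r in numeric_rows
            if (r["wksht_cd"], r["line_num"], r["clmn_num"]) == wanted}


# Pull one cell out of the long numeric file and attach it to each report record.
beds_by_report = extract_field(numeric_values, "S300001", "01400", "00200")
flat = [{**rec, "beds": beds_by_report.get(rec["rpt_rec_num"])} for rec in report_records]

for row in flat:
    print(row["prvdr_num"], row["beds"])
```

Repeating the extraction for each cell of interest and attaching the results to the report records is what turns the long relational files into a flat file with one row per cost report.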
- How Do I Know If The Cost Report Data Are Complete For A Given Year Or Time Period?
Users need to be aware that there are lags and checks in data processing that will affect the completeness of the cost report data, particularly for more-recent years. Hospitals have up to five months after the end of their cost reporting period to submit their data to CMS, and the Medicare Administrative Contractor then checks and applies edits to the data, which takes some time. CMS then updates the raw HCRIS data quarterly, and RAND downloads and processes the raw data.
The best way to check completeness is to open the documentation zip file, then open the "RAND_hospital_data_contents_[vintage].xlsx" file and look at the "Number of hospitals" worksheet. That worksheet reports the number of hospitals and the number of hospital-years of data available for each year (on the rows) and for each combination of vintage and time period (on the columns). As you go from the older vintages (columns to the right) to the more-recent vintages (columns to the left) you can see the number of hospitals and hospital-years increasing, and looking at previous years (on the rows) gives us a sense of the total number of hospital-years to expect once data are complete. For example, using the RAND_hospital_data_contents_2019_02_01.xlsx spreadsheet, we can see that for calendar year 2017 (row 23) the number of hospital-years reporting increases from 2796.9 (column G) in vintage 2018_09_01 to 3428.1 in vintage 2018_12_01, and increases again to 3435.7 in vintage 2019_02_01. In calendar year 2016 (row 22), the number of hospital-years reporting was 4700.6 in the most recent vintage (2019_02_01). So, if we want to assess the completeness of data for calendar year 2017 in the most recent vintage (2019_02_01), it is roughly three-quarters complete (3435.7/4700.6), and in the next vintage (2019_05_01) we expect it will be close to fully complete.
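The completeness check in the example above is simple arithmetic, shown here in Python with the hospital-year counts quoted in the text. Treating calendar 2016 as the fully reported benchmark follows the example; the variable names are ours.

```python
# Calendar 2017 hospital-years by vintage, as quoted above.
hospital_years_2017 = {"2018_09_01": 2796.9, "2018_12_01": 3428.1, "2019_02_01": 3435.7}

# Calendar 2016 in the most recent vintage, used as the benchmark for a complete year.
hospital_years_2016 = 4700.6

completeness = {vintage: n / hospital_years_2016 for vintage, n in hospital_years_2017.items()}
for vintage, share in completeness.items():
    print(vintage, f"{share:.1%}")
```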
- Where Can I Find More Information On Medicare Cost Reports, And Examples Of How They Have Been Used?
- The BlueCross BlueShield Association (BCBSA) released a series of commentaries and case studies on Medicare Cost Reporting Forms—these reports are not publicly available, but if they can be obtained from BCBSA or another source they are extremely helpful.
- The National Bureau of Economic Research (NBER) provides HCRIS data and documentation, including the relational databases (in original text format, as well as SAS and Stata formats), and processed flat files containing select variables.
- The Sheps Center at the University of North Carolina has published several studies that use Medicare hospital cost reports to measure the financial performance of rural hospitals.
- CMS answers a set of Frequently Asked Questions relating to Medicare cost reports.
- In 2001, Nancy Kane and Stephen Magnus discussed limitations of the Medicare cost reports, some of which were addressed in the update from 2552-96 to 2552-10.
- In June, 2004, the Medicare Payment Advisory Commission (MedPAC) compared several sources of data on hospital financial performance, including Medicare cost reports and audited financial statements.
- In 2012, a team of researchers at the University of North Carolina compared selected items from Medicare cost reports with "gold standard" audited financial statements.
- How Do the RAND Hospital Data Differ from CMS "Rollup" Files and NBER Cost Report Files?
CMS provides rollup cost report files in SAS format. These files are referred to as "rollups" because within a single cost report from a single hospital they sum (roll up) into one variable the values reported on separate sublines. The rollup files from CMS are similar to the RAND Hospital Data in that they are flat files containing selected fields from the Medicare hospital cost reports, and the values on sublines are rolled up.
The CMS rollups differ from RAND Hospital Data in the following ways:
- The rollup files are only available for hospital fiscal years (i.e., calendar year files are not available).
- The rollup files do not include any processed variables (such as operating margins, or occupancy).
- The variable names in the rollup files are non-intuitive (e.g. the variable "beds" in RAND Hospital Data corresponds to "S3_1_C2_14" in the rollup files--the rollup variable name reflects the fact that the values come from Worksheet S-3, Part 1, column 2, line 14).
- The variable names are inconsistent across 2552-96 and 2552-10 (because variable names reflect worksheets, columns, and lines).
- The rollup files are available from CMS only in SAS format; there is no error correction algorithm applied.
- There are no geographic summary files available.
NBER provides Medicare cost report files for researchers in a flat format, available in SAS, Stata, or csv format. NBER offers the CMS rollup files, plus "select variables" data sets that are created by NBER and that only include a subset of frequently used fields. NBER has also created custom data sets only containing a handful of variables for specific purposes, such as calculating cost-to-charge ratios or calculating Medicare add-on payments for medical education. But, like the CMS rollup files, the NBER files do not contain any preprocessed variables, the variable names are non-intuitive, the files are only available for hospital fiscal years (not calendar years), there is no error correction algorithm, and there are no geographic summary files available.