Analyzing Hospital Prices in California using GridDB¶

Introduction¶

As part of the 'Hospital Price Transparency Law' passed in 2021 in the United States, hospitals are required to submit their medical service costs and associated offerings in a machine readable format such as csv, json or xml; as well as a consumer-friendly display. While the transparency helps patients compare prices and researchers to study healthcare costs effectively; they also help regulate healthcare prices across the country. For our study, we take up the Hospital Price dataset published in the California Open Data Portal. We use CKAN’s API to programmatically download the datasets from 2012 to 2024. We then ingest the data into GridDB, an in-memory, distributed NoSQL database and analyze the data using GridDB and Python.

The main objective here is to demonstrate the capabilities of GridDB in terms of data ingestion and querying capabilities. GridDB supports two types of containers namely Collections and Timeseries. Collections are used to store lookup or transactional data whereas Timeseries containers are used for time-sensitive data collected in real-time or near real-time. Time series containers are commonly used in IoT-based applications where data collection is time-sensitive. In the case of hospital pricing data, we go with a collections container. Although a NoSQL database, GridDB supports SQL-like querying which is useful for analyzing transactional data such as hospital pricing. Furthermore, it allows developers familiar with SQL to easily adapt to GridDB’s query model while taking advantage of its high-performance architecture.

Importing pre-requisites¶

In [12]:

#File system and I/O operations
import os
from io import BytesIO
# Data manipulation 
import pandas as pd
import numpy as np
from IPython.display import Image

# Time functions
import math
import time
from datetime import datetime, timedelta
from collections import Counter
import re


# Data Gathering
import requests
import zipfile

#Encoding and Serialization
import base64
import json

#Debugging
import http
http.client.HTTPConnection.debuglevel = 0

#Visualization Libraries
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff

Downloading the Dataset¶

We download the data from CKAN's API using the dataset ID 'hospital-annual-financial-data-selected-data-pivot-tables'. Note that the dataset had a labeling error with an XLS file being mislabeled as XLSX. This error has been accounted for in the below logic using the nested if else logic shown below. The data is licensed under a Creative Commons Attribution License (cc-by). The dataset description can be accessed here .

In [73]:

# Step 1: Get resources from the CKAN API
url = "https://data.ca.gov/api/3/action/package_show"
params = {"id": "hospital-annual-financial-data-selected-data-pivot-tables"}

r = requests.get(url, params=params)
r.raise_for_status()
data = r.json()

resources = data["result"]["resources"]

# Step 2: Filter the files you want (Excel files with "CY Hospital Annual Selected File")
files = [
    r for r in resources
    if "CY Hospital Annual Selected File" in r["name"] and r["format"].lower() in ("xlsx", "xls")
]

# Step 3: Download and read each Excel file
dfs = []

for f in files:
    download_url = f["url"]
    print("Downloading:", f["name"])

    resp = requests.get(download_url)
    resp.raise_for_status()

    # Determine engine based on file type
    if download_url.endswith(".xls"): # This is to account for an error in the site where a 2013 xls data file has been mislabelled as xlsx.
        df = pd.read_excel(BytesIO(resp.content), engine="xlrd")
        #print('written',f)
    elif f["format"].lower() == "xlsx":
        df = pd.read_excel(BytesIO(resp.content), engine="openpyxl")
    else:  # xls
        df = pd.read_excel(BytesIO(resp.content), engine="xlrd")
    dfs.append(df)

Downloading: 2024 CY Hospital Annual Selected File (October 2025 Extract)
Downloading: 2023 CY Hospital Annual Selected File (October 2025 Extract)
Downloading: 2022 CY Hospital Annual Selected File (October 2024 Extract)
Downloading: 2021 CY Hospital Annual Selected File (October 2023 Extract)
Downloading: 2020 CY Hospital Annual Selected File (October 2022 Extract)
Downloading: 2019 CY Hospital Annual Selected File (April 2022 Extract)
Downloading: 2018 CY Hospital Annual Selected File (November 2020 Extract)
Downloading: 2017 CY Hospital Annual Selected File (October 2019 Extract)
Downloading: 2016 CY Hospital Annual Selected File (September 2018 Extract)
Downloading: 2015 CY Hospital Annual Selected File (March 2018 Extract)
Downloading: 2014 CY Hospital Annual Selected File (April 2017 Extract)
Downloading: 2013 CY Hospital Annual Selected File (September 2014 Extract)
Downloading: 2012 CY Hospital Annual Selected File (October 2014 Extract)

In [96]:

# Step 4: Combine all DataFrames
if dfs:
    california_hospital_pricing_data = pd.concat(dfs, ignore_index=True)
    print("Combined DataFrame - california_hospital_pricing_data: shape:", california_hospital_pricing_data.shape)
else:
    print("No Excel files found.")

Combined DataFrame - california_hospital_pricing_data: shape: (5810, 273)

In [97]:

# Columns with NaN
cols_with_nan = california_hospital_pricing_data.columns[california_hospital_pricing_data.isna().any()].tolist()
print(cols_with_nan)

['COUNTY', 'HSA', 'HFPA', 'TEACH_RURL', 'CITY', 'CEO_TITLE', 'WEB_SITE', 'ORG_NAME', 'ER_DESIG', 'MCAR_PRO#', 'MCAL_PRO#', 'REG_MCAL#', 'DAY_MCAR_TR', 'DAY_MCAL_MC', 'DAY_OTH_IND', 'DAYS_PIPS', 'NAT_0BIRTHS', 'GR_OP_OTH_IND', 'C_ADJ_MCAR_TR', 'C_ADJ_THRD_TR', 'BAD_DEBT', 'CHAR_OTH', 'EXP_RES', 'CUR_ASST', 'ASST_LIMTD', 'CUR_LIAB', 'ACCTS_0REC', 'ALLOW_UNCOLL', 'BLDGS', 'INTER-REC', 'NAT_ BIRTHS', 'ACCTS_ REC', 'HSA ', 'CITY        ', 'CEO_TITLE               ', 'C_ADJ_MCAR_TR   ', 'BAD_DEBT                   ', 'CHAR_OTH                    ', 'EXP_RES        ', 'CUR_ASST ', 'ASST_LIMTD ', 'BLDGS \n', 'WEB-SITE', 'ORG-NAME', 'DAY_PIPS', 'NAT-BIRTHS', 'ACCTC_REC', 'ALLOW-UNCOL', 'INTER_REC', 'DAY_ MCAR_TR', 'DAY_ MCAL_ MC', 'DAY_OTH_ IND', 'NAT_BIRTHS', 'GR_ OP_ OTH_IND', 'C_ ADJ_THRD_TR', 'CURR_LIAB', 'ACCTS_REC']

In [168]:

#len(california_hospital_pricing_data)

In [98]:

# Counting the number of NaNs in each column
california_hospital_pricing_data.isna().sum()[california_hospital_pricing_data.isna().sum() > 0]

# Dropping columns where the missing entries are about 80%` of the length of the dataframe
california_hospital_pricing_data = california_hospital_pricing_data.dropna(axis=1, thresh=int(0.2 * len(california_hospital_pricing_data)))

In [99]:

# convert NaN to Null
california_hospital_pricing_data = california_hospital_pricing_data.replace({np.nan: None})

Sanitizing Column Headers¶

Some of the column headers have a '#' and newline characters which would lead to errors during container creation. GridDB does not support the use of special characters in column names.

In [100]:

def make_columns_unique(columns, suffix="_ALT"):
    counts = Counter()
    new_columns = []

    for col in columns:
        if counts[col] == 0:
            # First occurrence, keep as is
            new_columns.append(col)
        else:
            # Duplicate found, add suffix
            new_col = f"{col}{suffix}" if counts[col] == 1 else f"{col}{suffix}{counts[col]}"
            new_columns.append(new_col)
        counts[col] += 1

    return new_columns

In [101]:

# Looking for column namws having a # and newlines
def clean_column(col_name):
    # Remove leading/trailing spaces and newlines
    col = col_name.strip()
    # Replace # with _ first
    col = col.replace('#', '_')
    # Replace any other non-alphanumeric characters (except underscore) with _
    col = re.sub(r'\W+', '_', col)
    # Collapse multiple underscores
    col = re.sub(r'_+', '_', col)
    col = col.strip('_')
    return col.upper()

# Applying to all column headers
california_hospital_pricing_data.columns = [
    clean_column(col) for col in california_hospital_pricing_data.columns
]

california_hospital_pricing_data.columns = make_columns_unique(california_hospital_pricing_data.columns)

In [102]:

# Convert start/end date columns
for col in ['BEG_DATE', 'END_DATE']:
    # Convert to datetime, coerce errors to NaT
    california_hospital_pricing_data[col] = pd.to_datetime(
        california_hospital_pricing_data[col], errors='coerce'
    )

# Replace NaT with None so GridDB sees it as null
california_hospital_pricing_data[col] = california_hospital_pricing_data[col].where(
    california_hospital_pricing_data[col].notna(), None
)

In [103]:

#YYYY-MM-DDThh:mm:ss.SSSZ
# Converting the timestamp value to ISO 8601 format
california_hospital_pricing_data['BEG_DATE'] = california_hospital_pricing_data['BEG_DATE'].dt.strftime('%Y-%m-%dT%H:%M:%S.%f')  # Format with microseconds
california_hospital_pricing_data['BEG_DATE'] = california_hospital_pricing_data['BEG_DATE'].str[:-3] + 'Z'  # Adjust to milliseconds and append 'Z'
california_hospital_pricing_data['END_DATE'] = california_hospital_pricing_data['END_DATE'].dt.strftime('%Y-%m-%dT%H:%M:%S.%f')  # Format with microseconds
california_hospital_pricing_data['END_DATE'] = california_hospital_pricing_data['END_DATE'].str[:-3] + 'Z'  # Adjust to milliseconds and append 'Z'
california_hospital_pricing_data.head()

Out[103]:

	FAC_NO	FAC_NAME	BEG_DATE	END_DATE	DAY_PER	DATA_IND	AUDIT_IND	COUNTY	HSA	HFPA	...	PD_HR_AMB	PD_HR_ANC	PD_HR_ED	PD_HR_GEN	PD_HR_FIS	PD_HR_ADM	NAT_BIRTHS	ACCTS_REC	INTER_REC_ALT
0	106580996	ADVENTIST HEALTH AND RIDEOUT	2024-01-01T00:00:00.000Z	2024-12-31T00:00:00.000Z	366	In Process	Incl. Ind. Audit Adj.	Yuba	2.0	227.0	...	445195	858174	0	295839	22716	322762	None	None	None
1	106150788	ADVENTIST HEALTH BAKERSFIELD	2024-01-01T00:00:00.000Z	2024-12-31T00:00:00.000Z	366	In Process	Incl. Ind. Audit Adj.	Kern	9.0	617.0	...	681866	1300483	7442	536756	49673	397028	None	None	None
2	106171049	ADVENTIST HEALTH CLEARLAKE	2024-01-01T00:00:00.000Z	2024-12-31T00:00:00.000Z	366	In Process	Incl. Ind. Audit Adj.	Lake	1.0	115.0	...	766075	242818	0	112941	69085	128270	None	None	None
3	106150706	ADVENTIST HEALTH DELANO	2024-01-01T00:00:00.000Z	2024-12-31T00:00:00.000Z	366	In Process	Incl. Ind. Audit Adj.	Kern	9.0	617.0	...	156907	230166	0	63358	38767	54724	None	None	None
4	106190323	ADVENTIST HEALTH GLENDALE	2024-01-01T00:00:00.000Z	2024-12-31T00:00:00.000Z	366	In Process	Incl. Ind. Audit Adj.	Los Angeles	11.0	909.0	...	601709	1012577	136108	748485	18815	407171	None	None	None

5 rows × 247 columns

In [82]:

"""
important_columns = [
'FAC_NO','FAC_NAME','COUNTY','HSA','HFPA','TYPE_CNTRL','TYPE_CARE','TYPE_HOSP','TEACH_RURL',
'BEG_DATE','END_DATE','DAY_PER','BED_LIC','BED_AVL','BED_STF','BED_ACUTE','BED_PSYCH','BED_REHAB','BED_LTC',
'DAY_TOT','DIS_TOT','DAY_ACUTE','DIS_ACUTE','VIS_TOT','VIS_ER','SURG_IP','SURG_OP','C_SECTIONS',
'OCC_LIC','OCC_AVL','ALOS_ALL','ALOS_EXLTC',
'GR_PT_REV','NET_PT_REV','TOT_OP_EXP','NET_FRM_OP','NET_INCOME','NONOP_REV','NONOP_EXP',
'TOT_ASST','CUR_ASST','CUR_LIAB','NET_LTDEBT','EQUITY',
'HOSP_FTE','NURS_FTE','PROD_HRS','PAID_HRS','MED_STAFF','STDNT_FTE'
]
"""

In [83]:

#california_hospital_pricing_subset = california_hospital_pricing_data[important_columns]

Connecting to GridDB¶

In [7]:

username = "XXX" # Provide your GridDB credentials here
password =   "XXX"
credentials = f"{username}:{password}"
encoded_credentials = base64.b64encode(credentials.encode()).decode()
#print(f"Encoded credentials: Basic {encoded_credentials}")

In [8]:

#Construct an object to hold the request headers (ensure that you replace the XXX placeholder with the correct value that matches the credentials for your GridDB instance)
#header_obj = {"Authorization":"Basic {EncodedAuthorizationToken}","Content-Type":"application/json; charset=UTF-8","User-Agent":"PostmanRuntime/7.29.0"}

header_obj = {
    "Authorization": f"Basic {encoded_credentials}",  # Add encoded credentials here
    "Content-Type": "application/json", 
    "charset": "UTF-8",
    "User-Agent":"PostmanRuntime/7.29.0"
}

#Construct the base URL based on your GridDB cluster you'd like to connect to (ensure that you replace the placeholders in the URL below with the correct values that correspond to your GridDB instance)
base_url = 'https://[host]:[port]/griddb/v2/[clustername]/dbs/[database_name]/'

Creating the Container¶

Here are some utility functions for facilitating container creation in GridDB. The function 'map_dtype_to_griddb()' maps the Pandas Dataframe data types to the corresponding data types in GridDB. The function 'generate_griddb_data_obj()' helps generate the container structure for container creation. GridDB supports two types of containers namely Collections and TimeSeries containers. To learn more on GridDB's key-container model and the container types, refer to this official GridDB resource .

In [254]:

INT32_MIN = -2147483648
INT32_MAX = 2147483647

def map_dtype_to_griddb(series):
    """
    Maps Pandas series to GridDB data types.
    Detects INTEGER overflow and upgrades to LONG.
    """

    if pd.api.types.is_integer_dtype(series):

        # Check integer overflow
        if series.dropna().between(INT32_MIN, INT32_MAX).all():
            return "INTEGER"
        else:
            return "LONG"

    elif pd.api.types.is_float_dtype(series):
        return "DOUBLE"

    elif pd.api.types.is_bool_dtype(series):
        return "BOOL"

    elif pd.api.types.is_datetime64_any_dtype(series):
        return "TIMESTAMP"

    else:
        return "STRING"

In [255]:

def generate_griddb_data_obj(df, container_name="MyContainer", container_type="COLLECTION", rowkey=False):
    columns = []
    for col in df.columns:
        griddb_type = map_dtype_to_griddb(df[col])
        columns.append({"name": col, "type": griddb_type})
    
    data_obj = {
        "container_name": container_name,
        "container_type": container_type,
        "rowkey": rowkey,
        "columns": columns
    }
    
    return data_obj

In [ ]:

#Set up the GridDB WebAPI URL
url = base_url + 'containers'

#Invoke the GridDB WebAPI with the headers and the request body
x = requests.post(url, json = data_obj, headers = header_obj)
#x.text

Ingesting Data into the Container in GridDB¶

In [ ]:

#Setup the URL to be used to invoke the GridDB WebAPI to register rows in the container created previously
url = base_url + 'containers/hospital_pricing_data/rows'
print(url)

In [ ]:

batch_size = 5000 
total_rows = len(california_hospital_pricing_data)
total_batches = math.ceil(total_rows / batch_size)

print(f"Starting ingestion of {total_rows:,} rows in {total_batches} batches of {batch_size} each...")
start_time = time.time()

for i in range(total_batches):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, total_rows)
    batch_df = california_hospital_pricing_data.iloc[start_idx:end_idx]
    batch_json = batch_df.to_json(orient='values')
    #print(batch_json)
    #break
    try:
        print(url)
        response = requests.put(url, headers=header_obj, data=batch_json)
        elapsed = time.time() - start_time
        #print(batch_json)
        if response.status_code in [200, 201]:
            print(f" Batch {i+1}/{total_batches} | Rows {start_idx}-{end_idx-1} | "
                  f"Time elapsed: {elapsed:.1f}s")
        else:
            print(f" Batch {i+1}/{total_batches} FAILED "
                  f"(HTTP {response.status_code}) - {response.text[:150]}")

    except Exception as e:
        print(f" Batch {i+1}/{total_batches} ERROR: {e}")
print(f"\n Ingestion completed in {time.time() - start_time:.1f} seconds.")

Analyzing Data in GridDB using SQL¶

GridDB supports both a NoSQL interface and a NewSQL interface. The NoSQL interface is the native API of GridDB, used for direct container operations and high-performance data ingestion. The NewSQL interface provides a SQL layer built on top of GridDB, enabling standard SQL queries and integration with analytics and BI tools. The NoSQL interface is useful for IoT applications where real-time data operations are crucial.

In [304]:

from IPython.display import Image, display, Markdown

display(Markdown("### GridDB Architecture Overview"))

# Path to your PNG file
image_path = "GridDB_Image_Overall_Idea.png"

# Display the image
display(Image(filename=image_path, width = 600))

GridDB Architecture Overview¶

No description has been provided for this image

Revenue level above which hospitals fall in the top 10% each year¶

NET_PT_REV (Net Patient Revenue) represents the actual revenue a hospital receives from patient care after deductions. We use GridDB's PERCENTILE_CONT() function to calculate the value below which a given percentage of observations fall. To learn more on the PERCENTILE_CONT() function, refer to this GridDB resource. We use a subquery approach for this. The inner query fetches the Net Patient Revenue, which is at the 90th percentile of the data distribution. As we know, if a hospital is at the 90th percentile, it means that 90% of hospitals have revenue lower than this value, and only 10% have revenue equal to or higher than this value. GridDB supports correlated subqueries, such as the one below. Refer to this resource to learn more.

In [43]:

sql_query1 = """
                SELECT DISTINCT EXTRACT(YEAR, TIMESTAMP(beg_date)) AS Cal_Year,
                (
                    SELECT PERCENTILE_CONT(0.9) 
                    WITHIN GROUP (ORDER BY NET_PT_REV)
                    FROM hospital_pricing_data
                    WHERE EXTRACT(YEAR,TIMESTAMP(beg_date)) = EXTRACT(YEAR, TIMESTAMP(hosp_data.beg_date))
                ) Percentile_Value
                FROM 
                hospital_pricing_data hosp_data;
                """

In [44]:

#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'

# Construct the request body
request_body = json.dumps([
    {
        "type": "sql-select",
        "stmt": sql_query1
    }
])

# Disable HTTP debugging
http.client.HTTPConnection.debuglevel = 0

#Invoke the GridDB WebAPI
data_req1 = requests.post(url, data=request_body, headers=header_obj)

#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req1.json()
top_10_perc_hosp_rev = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"]])

In [45]:

top_10_perc_hosp_rev['Percentile_Value'] = top_10_perc_hosp_rev['Percentile_Value'].apply(lambda x: f"{x/1e6:.1f} M")
top_10_perc_hosp_rev = top_10_perc_hosp_rev.sort_values(by="Percentile_Value", ascending=False)

In [46]:

fig = ff.create_table(top_10_perc_hosp_rev)
fig.show()

Insight(s): We see that as of 2023, the top 10% hospitals have had a Net Patient Revenue of USD 889 Million. We can clearly see that the Revenue increased by a USD 100 Million every five years or so.

Counties with More Hospitals than the Average County¶

GridDB supports standard subqueries, date functions, and aggregations similar to standard relational databases.

In [67]:

sql_query2 = """SELECT COUNTY, COUNT(*) AS hospital_count
                FROM hospital_pricing_data
                WHERE EXTRACT(YEAR, TIMESTAMP(beg_date)) = 2024
                GROUP BY COUNTY
                HAVING COUNT(*) >
                (
                    SELECT AVG(hospital_count)
                    FROM (
                        SELECT subset.COUNTY, COUNT(*) AS hospital_count
                        FROM hospital_pricing_data subset
                        WHERE EXTRACT(YEAR, TIMESTAMP(subset.beg_date)) = 2024
                        GROUP BY subset.COUNTY
                    ) t
                )
                ORDER BY hospital_count DESC;
                """

In [68]:

#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'

# Construct the request body
request_body = json.dumps([
    {
        "type": "sql-select",
        "stmt": sql_query2
    }
])

# Disable HTTP debugging
http.client.HTTPConnection.debuglevel = 0

#Invoke the GridDB WebAPI
data_req2 = requests.post(url, data=request_body, headers=header_obj)

#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req2.json()
counties_hosp_more_avg = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"]])

In [70]:

# Ensure hospital_count is numeric
counties_hosp_more_avg['hospital_count'] = counties_hosp_more_avg['hospital_count'].astype(int)

fig = px.bar(
    counties_hosp_more_avg,
    x="COUNTY",
    y="hospital_count",
    text="hospital_count"  # Display values
)

# Make values appear above the bars
fig.update_traces(textposition='outside')

# Optional: tidy layout
fig.update_layout(
    yaxis_title='Number of Hospitals',
    xaxis_title='County',
    title='Counties with Above-Average Number of Hospitals',
    xaxis_tickangle=-45,
    uniformtext_minsize=8,
    uniformtext_mode='hide'
)

fig.show()

Facilities with the Most Number of Surgeries¶

We combine the inpatient and outpatient surgeries for each facility and compare them against the median. The reason we consider the Median is that if the Average were used, it would be affected by outliers and, hence, give a skewed estimate of the average number of surgeries. To learn more about the Median function in GridDB, check out this resource .

In [117]:

sql_query3 = """SELECT FAC_NAME, EXTRACT(YEAR, TIMESTAMP(beg_date)) As Cal_Year, AVG(SURG_IP + SURG_OP) AS total_surgeries
                FROM hospital_pricing_data
                WHERE (SURG_IP + SURG_OP) >
                (
                SELECT MEDIAN(SURG_IP + SURG_OP) as average_surg
                FROM hospital_pricing_data
                )
                GROUP BY 1,2
                ORDER BY 3 DESC
                LIMIT 10
                """

In [118]:

#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'

# Construct the request body
request_body = json.dumps([
    {
        "type": "sql-select",
        "stmt": sql_query3
    }
])

# Disable HTTP debugging
http.client.HTTPConnection.debuglevel = 0

#Invoke the GridDB WebAPI
data_req3 = requests.post(url, data=request_body, headers=header_obj)

#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req3.json()
high_vol_surgical_centers = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"], myJson[0]["columns"][2]["name"]])
high_vol_surgical_center_sorted = high_vol_surgical_centers.sort_values(by='total_surgeries', ascending=False)

In [133]:

fig = ff.create_table(high_vol_surgical_center_sorted)
fig.show()

Insight(s): In terms of high-volume surgical centers, UNIVERSITY OF CALIFORNIA SAN DIEGO MEDICAL CENTER has consistently had the most surgeries done that outweigh the average surgery count.

Most Efficient Hospitals in terms of Revenue per Bed¶

In [ ]:

We calculate the revenue per bed by dividing the Gross Patient Revenue by the number of licensed beds.

In [13]:

sql_query4 = """
            SELECT FAC_NAME, EXTRACT(YEAR, TIMESTAMP(beg_date)) As Cal_Year, AVG(GR_PT_REV / BED_LIC) AS revenue_per_bed
            FROM hospital_pricing_data
            WHERE BED_LIC > 0
              AND (GR_PT_REV / BED_LIC) > (
                  SELECT AVG(GR_PT_REV / BED_LIC)
                  FROM hospital_pricing_data
                  WHERE BED_LIC > 0
              )
              GROUP BY 1,2
              ORDER BY 3 DESC
              LIMIT 10;
            """

In [14]:

#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'

# Construct the request body
request_body = json.dumps([
    {
        "type": "sql-select",
        "stmt": sql_query4
    }
])

# Disable HTTP debugging
http.client.HTTPConnection.debuglevel = 0

#Invoke the GridDB WebAPI
data_req4 = requests.post(url, data=request_body, headers=header_obj)

#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req4.json()
hosp_rev_per_bed = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"], myJson[0]["columns"][2]["name"]])
hosp_rev_per_bed_sorted = hosp_rev_per_bed.sort_values(by='revenue_per_bed', ascending=False)
hosp_rev_per_bed_sorted['revenue_per_bed'] = hosp_rev_per_bed_sorted['revenue_per_bed'].apply(lambda x: f"${x:,.0f}")

In [ ]:

fig = ff.create_table(hosp_rev_per_bed_sorted)
fig.show()

In [15]:

C:\Users\mg_su\conda\anaconda3\Lib\site-packages\kaleido\_sync_server.py:11: UserWarning:



Warning: You have Plotly version 5.22.0, which is not compatible with this version of Kaleido (1.2.0).

This means that static image generation (e.g. `fig.write_image()`) will not work.

Please upgrade Plotly to version 6.1.1 or greater, or downgrade Kaleido to version 0.2.1.

Insight(s): The revenue per bed at Stanford Health Care seems to be the highest across multiple years. This is followed by City of Hope Helford and USC Kenneth Norris Jr. Cancer Hospital.

Average Stay Length by Hospital and Calendar Year¶

In [17]:

sql_query5 = """SELECT FAC_NAME, EXTRACT(YEAR, TIMESTAMP(beg_date)) AS Calendar_Year, 
                AVG(ALOS_ALL) AS "Average_Stay_Length"
                FROM hospital_pricing_data
                GROUP BY 1,2
                HAVING AVG(ALOS_ALL) > 0
                ORDER BY 3 DESC
                LIMIT 10"""

In [18]:

#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'

# Construct the request body
request_body = json.dumps([
    {
        "type": "sql-select",
        "stmt": sql_query5
    }
])

# Disable HTTP debugging
http.client.HTTPConnection.debuglevel = 0

#Invoke the GridDB WebAPI
data_req5 = requests.post(url, data=request_body, headers=header_obj)

#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req5.json()
alos_all_df = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"], myJson[0]["columns"][2]["name"]])
alos_all_df = alos_all_df.sort_values(by='Average_Stay_Length', ascending=False)

In [19]:

fig = ff.create_table(alos_all_df)
fig.show()

Insight(s): Sonoma Developmental Center and Coalinga State Hospital have the longest days in terms of stay. This could mean that they may be treating complex and long-term conditions.

Concluding Remarks¶

We were able to effectively analyze a large dataset using GridDB's powerful query engine and NewSQL interface. Almost all functions available in a standard relational database are also made available in GridDB. This makes the transition to GridDB easier, as Business Intelligence developers can effectively leverage their SQL skills to query the containers and then gradually transition to the NoSQL interface for their real-time data processing needs. For our analysis, we used aggregation functions, subqueries, and correlated subqueries to answer specific questions efficiently. Additionally, GridDB's seamless integration with Python allowed us to perform complex analytical tasks without compromising performance. Overall, GridDB proved to be a versatile platform that bridges the gap between traditional relational databases and modern NoSQL solutions, empowering teams to scale their analytics while maintaining familiarity with SQL-based workflows and helping aide decision making.

Also available on Github Last updated: 18/04/2026 05:57:14

Analyzing Hospital Prices in California using GridDB¶

Introduction¶

Importing pre-requisites¶

Downloading the Dataset¶

Sanitizing Column Headers¶

Connecting to GridDB¶

Creating the Container¶

Ingesting Data into the Container in GridDB¶

Analyzing Data in GridDB using SQL¶

GridDB Architecture Overview¶

Revenue level above which hospitals fall in the top 10% each year¶

Counties with More Hospitals than the Average County¶

Facilities with the Most Number of Surgeries¶

Most Efficient Hospitals in terms of Revenue per Bed¶

Average Stay Length by Hospital and Calendar Year¶

Concluding Remarks¶

Leave a Reply Cancel reply