Analyzing Hospital Prices in California using GridDBΒΆ
IntroductionΒΆ
As part of the 'Hospital Price Transparency Law' passed in 2021 in the United States, hospitals are required to submit their medical service costs and associated offerings in a machine readable format such as csv, json or xml; as well as a consumer-friendly display. While the transparency helps patients compare prices and researchers to study healthcare costs effectively; they also help regulate healthcare prices across the country. For our study, we take up the Hospital Price dataset published in the California Open Data Portal. We use CKANβs API to programmatically download the datasets from 2012 to 2024. We then ingest the data into GridDB, an in-memory, distributed NoSQL database and analyze the data using GridDB and Python.
The main objective here is to demonstrate the capabilities of GridDB in terms of data ingestion and querying capabilities. GridDB supports two types of containers namely Collections and Timeseries. Collections are used to store lookup or transactional data whereas Timeseries containers are used for time-sensitive data collected in real-time or near real-time. Time series containers are commonly used in IoT-based applications where data collection is time-sensitive. In the case of hospital pricing data, we go with a collections container. Although a NoSQL database, GridDB supports SQL-like querying which is useful for analyzing transactional data such as hospital pricing. Furthermore, it allows developers familiar with SQL to easily adapt to GridDBβs query model while taking advantage of its high-performance architecture.
Importing pre-requisitesΒΆ
#File system and I/O operations
import os
from io import BytesIO
# Data manipulation
import pandas as pd
import numpy as np
from IPython.display import Image
# Time functions
import math
import time
from datetime import datetime, timedelta
from collections import Counter
import re
# Data Gathering
import requests
import zipfile
#Encoding and Serialization
import base64
import json
#Debugging
import http
http.client.HTTPConnection.debuglevel = 0
#Visualization Libraries
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff
Downloading the DatasetΒΆ
We download the data from CKAN's API using the dataset ID 'hospital-annual-financial-data-selected-data-pivot-tables'. Note that the dataset had a labeling error with an XLS file being mislabeled as XLSX. This error has been accounted for in the below logic using the nested if else logic shown below. The data is licensed under a Creative Commons Attribution License (cc-by). The dataset description can be accessed here .
# Step 1: Get resources from the CKAN API
url = "https://data.ca.gov/api/3/action/package_show"
params = {"id": "hospital-annual-financial-data-selected-data-pivot-tables"}
r = requests.get(url, params=params)
r.raise_for_status()
data = r.json()
resources = data["result"]["resources"]
# Step 2: Filter the files you want (Excel files with "CY Hospital Annual Selected File")
files = [
r for r in resources
if "CY Hospital Annual Selected File" in r["name"] and r["format"].lower() in ("xlsx", "xls")
]
# Step 3: Download and read each Excel file
dfs = []
for f in files:
download_url = f["url"]
print("Downloading:", f["name"])
resp = requests.get(download_url)
resp.raise_for_status()
# Determine engine based on file type
if download_url.endswith(".xls"): # This is to account for an error in the site where a 2013 xls data file has been mislabelled as xlsx.
df = pd.read_excel(BytesIO(resp.content), engine="xlrd")
#print('written',f)
elif f["format"].lower() == "xlsx":
df = pd.read_excel(BytesIO(resp.content), engine="openpyxl")
else: # xls
df = pd.read_excel(BytesIO(resp.content), engine="xlrd")
dfs.append(df)
Downloading: 2024 CY Hospital Annual Selected File (October 2025 Extract) Downloading: 2023 CY Hospital Annual Selected File (October 2025 Extract) Downloading: 2022 CY Hospital Annual Selected File (October 2024 Extract) Downloading: 2021 CY Hospital Annual Selected File (October 2023 Extract) Downloading: 2020 CY Hospital Annual Selected File (October 2022 Extract) Downloading: 2019 CY Hospital Annual Selected File (April 2022 Extract) Downloading: 2018 CY Hospital Annual Selected File (November 2020 Extract) Downloading: 2017 CY Hospital Annual Selected File (October 2019 Extract) Downloading: 2016 CY Hospital Annual Selected File (September 2018 Extract) Downloading: 2015 CY Hospital Annual Selected File (March 2018 Extract) Downloading: 2014 CY Hospital Annual Selected File (April 2017 Extract) Downloading: 2013 CY Hospital Annual Selected File (September 2014 Extract) Downloading: 2012 CY Hospital Annual Selected File (October 2014 Extract)
# Step 4: Combine all DataFrames
if dfs:
california_hospital_pricing_data = pd.concat(dfs, ignore_index=True)
print("Combined DataFrame - california_hospital_pricing_data: shape:", california_hospital_pricing_data.shape)
else:
print("No Excel files found.")
Combined DataFrame - california_hospital_pricing_data: shape: (5810, 273)
# Columns with NaN
cols_with_nan = california_hospital_pricing_data.columns[california_hospital_pricing_data.isna().any()].tolist()
print(cols_with_nan)
['COUNTY', 'HSA', 'HFPA', 'TEACH_RURL', 'CITY', 'CEO_TITLE', 'WEB_SITE', 'ORG_NAME', 'ER_DESIG', 'MCAR_PRO#', 'MCAL_PRO#', 'REG_MCAL#', 'DAY_MCAR_TR', 'DAY_MCAL_MC', 'DAY_OTH_IND', 'DAYS_PIPS', 'NAT_0BIRTHS', 'GR_OP_OTH_IND', 'C_ADJ_MCAR_TR', 'C_ADJ_THRD_TR', 'BAD_DEBT', 'CHAR_OTH', 'EXP_RES', 'CUR_ASST', 'ASST_LIMTD', 'CUR_LIAB', 'ACCTS_0REC', 'ALLOW_UNCOLL', 'BLDGS', 'INTER-REC', 'NAT_ BIRTHS', 'ACCTS_ REC', 'HSA ', 'CITY ', 'CEO_TITLE ', 'C_ADJ_MCAR_TR ', 'BAD_DEBT ', 'CHAR_OTH ', 'EXP_RES ', 'CUR_ASST ', 'ASST_LIMTD ', 'BLDGS \n', 'WEB-SITE', 'ORG-NAME', 'DAY_PIPS', 'NAT-BIRTHS', 'ACCTC_REC', 'ALLOW-UNCOL', 'INTER_REC', 'DAY_ MCAR_TR', 'DAY_ MCAL_ MC', 'DAY_OTH_ IND', 'NAT_BIRTHS', 'GR_ OP_ OTH_IND', 'C_ ADJ_THRD_TR', 'CURR_LIAB', 'ACCTS_REC']
#len(california_hospital_pricing_data)
# Counting the number of NaNs in each column
california_hospital_pricing_data.isna().sum()[california_hospital_pricing_data.isna().sum() > 0]
# Dropping columns where the missing entries are about 80%` of the length of the dataframe
california_hospital_pricing_data = california_hospital_pricing_data.dropna(axis=1, thresh=int(0.2 * len(california_hospital_pricing_data)))
# convert NaN to Null
california_hospital_pricing_data = california_hospital_pricing_data.replace({np.nan: None})
Sanitizing Column HeadersΒΆ
Some of the column headers have a '#' and newline characters which would lead to errors during container creation. GridDB does not support the use of special characters in column names.
def make_columns_unique(columns, suffix="_ALT"):
counts = Counter()
new_columns = []
for col in columns:
if counts[col] == 0:
# First occurrence, keep as is
new_columns.append(col)
else:
# Duplicate found, add suffix
new_col = f"{col}{suffix}" if counts[col] == 1 else f"{col}{suffix}{counts[col]}"
new_columns.append(new_col)
counts[col] += 1
return new_columns
# Looking for column namws having a # and newlines
def clean_column(col_name):
# Remove leading/trailing spaces and newlines
col = col_name.strip()
# Replace # with _ first
col = col.replace('#', '_')
# Replace any other non-alphanumeric characters (except underscore) with _
col = re.sub(r'\W+', '_', col)
# Collapse multiple underscores
col = re.sub(r'_+', '_', col)
col = col.strip('_')
return col.upper()
# Applying to all column headers
california_hospital_pricing_data.columns = [
clean_column(col) for col in california_hospital_pricing_data.columns
]
california_hospital_pricing_data.columns = make_columns_unique(california_hospital_pricing_data.columns)
# Convert start/end date columns
for col in ['BEG_DATE', 'END_DATE']:
# Convert to datetime, coerce errors to NaT
california_hospital_pricing_data[col] = pd.to_datetime(
california_hospital_pricing_data[col], errors='coerce'
)
# Replace NaT with None so GridDB sees it as null
california_hospital_pricing_data[col] = california_hospital_pricing_data[col].where(
california_hospital_pricing_data[col].notna(), None
)
#YYYY-MM-DDThh:mm:ss.SSSZ
# Converting the timestamp value to ISO 8601 format
california_hospital_pricing_data['BEG_DATE'] = california_hospital_pricing_data['BEG_DATE'].dt.strftime('%Y-%m-%dT%H:%M:%S.%f') # Format with microseconds
california_hospital_pricing_data['BEG_DATE'] = california_hospital_pricing_data['BEG_DATE'].str[:-3] + 'Z' # Adjust to milliseconds and append 'Z'
california_hospital_pricing_data['END_DATE'] = california_hospital_pricing_data['END_DATE'].dt.strftime('%Y-%m-%dT%H:%M:%S.%f') # Format with microseconds
california_hospital_pricing_data['END_DATE'] = california_hospital_pricing_data['END_DATE'].str[:-3] + 'Z' # Adjust to milliseconds and append 'Z'
california_hospital_pricing_data.head()
| FAC_NO | FAC_NAME | BEG_DATE | END_DATE | DAY_PER | DATA_IND | AUDIT_IND | COUNTY | HSA | HFPA | ... | PD_HR_AMB | PD_HR_ANC | PD_HR_ED | PD_HR_GEN | PD_HR_FIS | PD_HR_ADM | PD_HR_NON | NAT_BIRTHS | ACCTS_REC | INTER_REC_ALT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 106580996 | ADVENTIST HEALTH AND RIDEOUT | 2024-01-01T00:00:00.000Z | 2024-12-31T00:00:00.000Z | 366 | In Process | Incl. Ind. Audit Adj. | Yuba | 2.0 | 227.0 | ... | 445195 | 858174 | 0 | 295839 | 22716 | 322762 | 0 | None | None | None |
| 1 | 106150788 | ADVENTIST HEALTH BAKERSFIELD | 2024-01-01T00:00:00.000Z | 2024-12-31T00:00:00.000Z | 366 | In Process | Incl. Ind. Audit Adj. | Kern | 9.0 | 617.0 | ... | 681866 | 1300483 | 7442 | 536756 | 49673 | 397028 | 0 | None | None | None |
| 2 | 106171049 | ADVENTIST HEALTH CLEARLAKE | 2024-01-01T00:00:00.000Z | 2024-12-31T00:00:00.000Z | 366 | In Process | Incl. Ind. Audit Adj. | Lake | 1.0 | 115.0 | ... | 766075 | 242818 | 0 | 112941 | 69085 | 128270 | 0 | None | None | None |
| 3 | 106150706 | ADVENTIST HEALTH DELANO | 2024-01-01T00:00:00.000Z | 2024-12-31T00:00:00.000Z | 366 | In Process | Incl. Ind. Audit Adj. | Kern | 9.0 | 617.0 | ... | 156907 | 230166 | 0 | 63358 | 38767 | 54724 | 0 | None | None | None |
| 4 | 106190323 | ADVENTIST HEALTH GLENDALE | 2024-01-01T00:00:00.000Z | 2024-12-31T00:00:00.000Z | 366 | In Process | Incl. Ind. Audit Adj. | Los Angeles | 11.0 | 909.0 | ... | 601709 | 1012577 | 136108 | 748485 | 18815 | 407171 | 0 | None | None | None |
5 rows Γ 247 columns
"""
important_columns = [
'FAC_NO','FAC_NAME','COUNTY','HSA','HFPA','TYPE_CNTRL','TYPE_CARE','TYPE_HOSP','TEACH_RURL',
'BEG_DATE','END_DATE','DAY_PER','BED_LIC','BED_AVL','BED_STF','BED_ACUTE','BED_PSYCH','BED_REHAB','BED_LTC',
'DAY_TOT','DIS_TOT','DAY_ACUTE','DIS_ACUTE','VIS_TOT','VIS_ER','SURG_IP','SURG_OP','C_SECTIONS',
'OCC_LIC','OCC_AVL','ALOS_ALL','ALOS_EXLTC',
'GR_PT_REV','NET_PT_REV','TOT_OP_EXP','NET_FRM_OP','NET_INCOME','NONOP_REV','NONOP_EXP',
'TOT_ASST','CUR_ASST','CUR_LIAB','NET_LTDEBT','EQUITY',
'HOSP_FTE','NURS_FTE','PROD_HRS','PAID_HRS','MED_STAFF','STDNT_FTE'
]
"""
#california_hospital_pricing_subset = california_hospital_pricing_data[important_columns]
Connecting to GridDBΒΆ
username = "XXX" # Provide your GridDB credentials here
password = "XXX"
credentials = f"{username}:{password}"
encoded_credentials = base64.b64encode(credentials.encode()).decode()
#print(f"Encoded credentials: Basic {encoded_credentials}")
#Construct an object to hold the request headers (ensure that you replace the XXX placeholder with the correct value that matches the credentials for your GridDB instance)
#header_obj = {"Authorization":"Basic {EncodedAuthorizationToken}","Content-Type":"application/json; charset=UTF-8","User-Agent":"PostmanRuntime/7.29.0"}
header_obj = {
"Authorization": f"Basic {encoded_credentials}", # Add encoded credentials here
"Content-Type": "application/json",
"charset": "UTF-8",
"User-Agent":"PostmanRuntime/7.29.0"
}
#Construct the base URL based on your GridDB cluster you'd like to connect to (ensure that you replace the placeholders in the URL below with the correct values that correspond to your GridDB instance)
base_url = 'https://[host]:[port]/griddb/v2/[clustername]/dbs/[database_name]/'
Creating the ContainerΒΆ
Here are some utility functions for facilitating container creation in GridDB. The function 'map_dtype_to_griddb()' maps the Pandas Dataframe data types to the corresponding data types in GridDB. The function 'generate_griddb_data_obj()' helps generate the container structure for container creation. GridDB supports two types of containers namely Collections and TimeSeries containers. To learn more on GridDB's key-container model and the container types, refer to this official GridDB resource .
INT32_MIN = -2147483648
INT32_MAX = 2147483647
def map_dtype_to_griddb(series):
"""
Maps Pandas series to GridDB data types.
Detects INTEGER overflow and upgrades to LONG.
"""
if pd.api.types.is_integer_dtype(series):
# Check integer overflow
if series.dropna().between(INT32_MIN, INT32_MAX).all():
return "INTEGER"
else:
return "LONG"
elif pd.api.types.is_float_dtype(series):
return "DOUBLE"
elif pd.api.types.is_bool_dtype(series):
return "BOOL"
elif pd.api.types.is_datetime64_any_dtype(series):
return "TIMESTAMP"
else:
return "STRING"
def generate_griddb_data_obj(df, container_name="MyContainer", container_type="COLLECTION", rowkey=False):
columns = []
for col in df.columns:
griddb_type = map_dtype_to_griddb(df[col])
columns.append({"name": col, "type": griddb_type})
data_obj = {
"container_name": container_name,
"container_type": container_type,
"rowkey": rowkey,
"columns": columns
}
return data_obj
#Set up the GridDB WebAPI URL
url = base_url + 'containers'
#Invoke the GridDB WebAPI with the headers and the request body
x = requests.post(url, json = data_obj, headers = header_obj)
#x.text
Ingesting Data into the Container in GridDBΒΆ
#Setup the URL to be used to invoke the GridDB WebAPI to register rows in the container created previously
url = base_url + 'containers/hospital_pricing_data/rows'
print(url)
batch_size = 5000
total_rows = len(california_hospital_pricing_data)
total_batches = math.ceil(total_rows / batch_size)
print(f"Starting ingestion of {total_rows:,} rows in {total_batches} batches of {batch_size} each...")
start_time = time.time()
for i in range(total_batches):
start_idx = i * batch_size
end_idx = min((i + 1) * batch_size, total_rows)
batch_df = california_hospital_pricing_data.iloc[start_idx:end_idx]
batch_json = batch_df.to_json(orient='values')
#print(batch_json)
#break
try:
print(url)
response = requests.put(url, headers=header_obj, data=batch_json)
elapsed = time.time() - start_time
#print(batch_json)
if response.status_code in [200, 201]:
print(f" Batch {i+1}/{total_batches} | Rows {start_idx}-{end_idx-1} | "
f"Time elapsed: {elapsed:.1f}s")
else:
print(f" Batch {i+1}/{total_batches} FAILED "
f"(HTTP {response.status_code}) - {response.text[:150]}")
except Exception as e:
print(f" Batch {i+1}/{total_batches} ERROR: {e}")
print(f"\n Ingestion completed in {time.time() - start_time:.1f} seconds.")
Analyzing Data in GridDB using SQLΒΆ
GridDB supports both a NoSQL interface and a NewSQL interface. The NoSQL interface is the native API of GridDB, used for direct container operations and high-performance data ingestion. The NewSQL interface provides a SQL layer built on top of GridDB, enabling standard SQL queries and integration with analytics and BI tools. The NoSQL interface is useful for IoT applications where real-time data operations are crucial.
from IPython.display import Image, display, Markdown
display(Markdown("### GridDB Architecture Overview"))
# Path to your PNG file
image_path = "GridDB_Image_Overall_Idea.png"
# Display the image
display(Image(filename=image_path, width = 600))
GridDB Architecture OverviewΒΆ
Revenue level above which hospitals fall in the top 10% each yearΒΆ
NET_PT_REV (Net Patient Revenue) represents the actual revenue a hospital receives from patient care after deductions. We use GridDB's PERCENTILE_CONT() function to calculate the value below which a given percentage of observations fall. To learn more on the PERCENTILE_CONT() function, refer to this GridDB resource. We use a subquery approach for this. The inner query fetches the Net Patient Revenue, which is at the 90th percentile of the data distribution. As we know, if a hospital is at the 90th percentile, it means that 90% of hospitals have revenue lower than this value, and only 10% have revenue equal to or higher than this value. GridDB supports correlated subqueries, such as the one below. Refer to this resource to learn more.
sql_query1 = """
SELECT DISTINCT EXTRACT(YEAR, TIMESTAMP(beg_date)) AS Cal_Year,
(
SELECT PERCENTILE_CONT(0.9)
WITHIN GROUP (ORDER BY NET_PT_REV)
FROM hospital_pricing_data
WHERE EXTRACT(YEAR,TIMESTAMP(beg_date)) = EXTRACT(YEAR, TIMESTAMP(hosp_data.beg_date))
) Percentile_Value
FROM
hospital_pricing_data hosp_data;
"""
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
# Construct the request body
request_body = json.dumps([
{
"type": "sql-select",
"stmt": sql_query1
}
])
# Disable HTTP debugging
http.client.HTTPConnection.debuglevel = 0
#Invoke the GridDB WebAPI
data_req1 = requests.post(url, data=request_body, headers=header_obj)
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req1.json()
top_10_perc_hosp_rev = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"]])
top_10_perc_hosp_rev['Percentile_Value'] = top_10_perc_hosp_rev['Percentile_Value'].apply(lambda x: f"{x/1e6:.1f} M")
top_10_perc_hosp_rev = top_10_perc_hosp_rev.sort_values(by="Percentile_Value", ascending=False)
fig = ff.create_table(top_10_perc_hosp_rev)
fig.show()
Insight(s): We see that as of 2023, the top 10% hospitals have had a Net Patient Revenue of USD 889 Million. We can clearly see that the Revenue increased by a USD 100 Million every five years or so.
Counties with More Hospitals than the Average CountyΒΆ
GridDB supports standard subqueries, date functions, and aggregations similar to standard relational databases.
sql_query2 = """SELECT COUNTY, COUNT(*) AS hospital_count
FROM hospital_pricing_data
WHERE EXTRACT(YEAR, TIMESTAMP(beg_date)) = 2024
GROUP BY COUNTY
HAVING COUNT(*) >
(
SELECT AVG(hospital_count)
FROM (
SELECT subset.COUNTY, COUNT(*) AS hospital_count
FROM hospital_pricing_data subset
WHERE EXTRACT(YEAR, TIMESTAMP(subset.beg_date)) = 2024
GROUP BY subset.COUNTY
) t
)
ORDER BY hospital_count DESC;
"""
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
# Construct the request body
request_body = json.dumps([
{
"type": "sql-select",
"stmt": sql_query2
}
])
# Disable HTTP debugging
http.client.HTTPConnection.debuglevel = 0
#Invoke the GridDB WebAPI
data_req2 = requests.post(url, data=request_body, headers=header_obj)
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req2.json()
counties_hosp_more_avg = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"]])
# Ensure hospital_count is numeric
counties_hosp_more_avg['hospital_count'] = counties_hosp_more_avg['hospital_count'].astype(int)
fig = px.bar(
counties_hosp_more_avg,
x="COUNTY",
y="hospital_count",
text="hospital_count" # Display values
)
# Make values appear above the bars
fig.update_traces(textposition='outside')
# Optional: tidy layout
fig.update_layout(
yaxis_title='Number of Hospitals',
xaxis_title='County',
title='Counties with Above-Average Number of Hospitals',
xaxis_tickangle=-45,
uniformtext_minsize=8,
uniformtext_mode='hide'
)
fig.show()
Facilities with the Most Number of SurgeriesΒΆ
We combine the inpatient and outpatient surgeries for each facility and compare them against the median. The reason we consider the Median is that if the Average were used, it would be affected by outliers and, hence, give a skewed estimate of the average number of surgeries. To learn more about the Median function in GridDB, check out this resource .
sql_query3 = """SELECT FAC_NAME, EXTRACT(YEAR, TIMESTAMP(beg_date)) As Cal_Year, AVG(SURG_IP + SURG_OP) AS total_surgeries
FROM hospital_pricing_data
WHERE (SURG_IP + SURG_OP) >
(
SELECT MEDIAN(SURG_IP + SURG_OP) as average_surg
FROM hospital_pricing_data
)
GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 10
"""
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
# Construct the request body
request_body = json.dumps([
{
"type": "sql-select",
"stmt": sql_query3
}
])
# Disable HTTP debugging
http.client.HTTPConnection.debuglevel = 0
#Invoke the GridDB WebAPI
data_req3 = requests.post(url, data=request_body, headers=header_obj)
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req3.json()
high_vol_surgical_centers = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"], myJson[0]["columns"][2]["name"]])
high_vol_surgical_center_sorted = high_vol_surgical_centers.sort_values(by='total_surgeries', ascending=False)
fig = ff.create_table(high_vol_surgical_center_sorted)
fig.show()
Insight(s): In terms of high-volume surgical centers, UNIVERSITY OF CALIFORNIA SAN DIEGO MEDICAL CENTER has consistently had the most surgeries done that outweigh the average surgery count.
Most Efficient Hospitals in terms of Revenue per BedΒΆ
We calculate the revenue per bed by dividing the Gross Patient Revenue by the number of licensed beds.
sql_query4 = """
SELECT FAC_NAME, EXTRACT(YEAR, TIMESTAMP(beg_date)) As Cal_Year, AVG(GR_PT_REV / BED_LIC) AS revenue_per_bed
FROM hospital_pricing_data
WHERE BED_LIC > 0
AND (GR_PT_REV / BED_LIC) > (
SELECT AVG(GR_PT_REV / BED_LIC)
FROM hospital_pricing_data
WHERE BED_LIC > 0
)
GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 10;
"""
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
# Construct the request body
request_body = json.dumps([
{
"type": "sql-select",
"stmt": sql_query4
}
])
# Disable HTTP debugging
http.client.HTTPConnection.debuglevel = 0
#Invoke the GridDB WebAPI
data_req4 = requests.post(url, data=request_body, headers=header_obj)
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req4.json()
hosp_rev_per_bed = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"], myJson[0]["columns"][2]["name"]])
hosp_rev_per_bed_sorted = hosp_rev_per_bed.sort_values(by='revenue_per_bed', ascending=False)
hosp_rev_per_bed_sorted['revenue_per_bed'] = hosp_rev_per_bed_sorted['revenue_per_bed'].apply(lambda x: f"${x:,.0f}")
fig = ff.create_table(hosp_rev_per_bed_sorted)
fig.show()
C:\Users\mg_su\conda\anaconda3\Lib\site-packages\kaleido\_sync_server.py:11: UserWarning: Warning: You have Plotly version 5.22.0, which is not compatible with this version of Kaleido (1.2.0). This means that static image generation (e.g. `fig.write_image()`) will not work. Please upgrade Plotly to version 6.1.1 or greater, or downgrade Kaleido to version 0.2.1.
Insight(s): The revenue per bed at Stanford Health Care seems to be the highest across multiple years. This is followed by City of Hope Helford and USC Kenneth Norris Jr. Cancer Hospital.
Average Stay Length by Hospital and Calendar YearΒΆ
sql_query5 = """SELECT FAC_NAME, EXTRACT(YEAR, TIMESTAMP(beg_date)) AS Calendar_Year,
AVG(ALOS_ALL) AS "Average_Stay_Length"
FROM hospital_pricing_data
GROUP BY 1,2
HAVING AVG(ALOS_ALL) > 0
ORDER BY 3 DESC
LIMIT 10"""
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
# Construct the request body
request_body = json.dumps([
{
"type": "sql-select",
"stmt": sql_query5
}
])
# Disable HTTP debugging
http.client.HTTPConnection.debuglevel = 0
#Invoke the GridDB WebAPI
data_req5 = requests.post(url, data=request_body, headers=header_obj)
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req5.json()
alos_all_df = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"], myJson[0]["columns"][2]["name"]])
alos_all_df = alos_all_df.sort_values(by='Average_Stay_Length', ascending=False)
fig = ff.create_table(alos_all_df)
fig.show()
Insight(s): Sonoma Developmental Center and Coalinga State Hospital have the longest days in terms of stay. This could mean that they may be treating complex and long-term conditions.
Concluding RemarksΒΆ
We were able to effectively analyze a large dataset using GridDB's powerful query engine and NewSQL interface. Almost all functions available in a standard relational database are also made available in GridDB. This makes the transition to GridDB easier, as Business Intelligence developers can effectively leverage their SQL skills to query the containers and then gradually transition to the NoSQL interface for their real-time data processing needs. For our analysis, we used aggregation functions, subqueries, and correlated subqueries to answer specific questions efficiently. Additionally, GridDB's seamless integration with Python allowed us to perform complex analytical tasks without compromising performance. Overall, GridDB proved to be a versatile platform that bridges the gap between traditional relational databases and modern NoSQL solutions, empowering teams to scale their analytics while maintaining familiarity with SQL-based workflows and helping aide decision making.
