{"id":52076,"date":"2024-12-11T00:00:00","date_gmt":"2024-12-11T08:00:00","guid":{"rendered":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/blog\/llm-generate-sql\/"},"modified":"2024-12-11T00:00:00","modified_gmt":"2024-12-11T08:00:00","slug":"llm-generate-sql","status":"publish","type":"post","link":"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/","title":{"rendered":"Creating an LLM to Generate SQL Queries for GridDB"},"content":{"rendered":"<p><style>\n  table, tr, td {\n  width: 100%;\n  border: 1px solid black;\n  margin: 30px;\n}\n<\/style>\n<\/p>\n<h2>Introduction<\/h2>\n<p>GridDB is a time series database available both on-premise and on the cloud optimized for IoT and Big Data that was developed by Toshiba Digital Solutions Corporation. It features a unique key-container model designed specifically to handle both metadata and time series data. Its in-memory architecture allows incredible ingestion and query performance. GridDB is also horizontally scalable to improve both performance and reliability.<\/p>\n<p>With the rise in popularity of using Large Language Models (LLMs), general purpose models like GPT-4 have attracted the most interest and media coverage, but many other specialized models exist for the purpose of code generation. While many of these models are extremely large and require immense compute power to train and execute, some LLMs can be trained and executed on platforms that are practical and cost effective.<\/p>\n<p>Every database uses a slightly different form of SQL and GridDB is no different. In particular, GridDB has different time functions compared to other SQL databases and also uses a unique Key-Container data model where it is encouraged to store homogeneous data in multiple tables. 
With these differences in mind, the LLM must be fine-tuned for GridDB; experience tells us that an off-the-shelf LLM would not produce suitable queries.<\/p>\n<p>There are both consumer and business use cases for using an LLM to query GridDB. For consumers, the LLM would enable the end user to ask simple questions about their own data such as \u201cWhen do I consume the most electricity?\u201d For business analysts and managers, it extends their business intelligence tools, allowing for more ad-hoc queries to dive deeper into their organization\u2019s data.<\/p>\n<p>In this technical report, we demonstrate how software developers can utilize an LLM to generate queries for use with GridDB\u2019s SQL interface to enhance their application. The process used to create and use an LLM for GridDB is as follows:<\/p>\n<ol>\n<li>Determine the SQL generation performance of existing models and the feasibility of fine-tuning them.<\/li>\n<li>Find the datasets that were used to train the models selected in Step 1.<\/li>\n<li>Filter out any queries that are not supported by GridDB and fine-tune the model to ensure accuracy is still reasonable.<\/li>\n<li>Create a dataset that uses GridDB-specific features: time series range selection and Key-Container.<\/li>\n<li>Fine-tune the model with our new GridDB-specific dataset and evaluate accuracy, measured as the percentage of responses that matched the human answer in the evaluation data split.<\/li>\n<li>Demonstrate inference within a Python Flask application.<\/li>\n<\/ol>\n<p><a href=\"https:\/\/griddb.net\/wp-content\/uploads\/2024\/12\/demo.gif\"><img fetchpriority=\"high\" decoding=\"async\" src=\"https:\/\/griddb.net\/wp-content\/uploads\/2024\/12\/demo.gif\" alt=\"\" width=\"1396\" height=\"616\" class=\"aligncenter size-full wp-image-31196\" \/><\/a><\/p>\n<h3>Source Code<\/h3>\n<p>This project&#8217;s code can be found on the griddbnet GitHub page: <a 
href=\"https:\/\/github.com\/griddbnet\/sql_llm_model\">https:\/\/github.com\/griddbnet\/sql_llm_model<\/a><\/p>\n<h2>Existing Model Evaluation<\/h2>\n<p>Closed source, general purpose models such as GPT-4 were immediately dismissed as the bulk of SQL related material they consumed for training would have been for mainstream databases such as Oracle, Postgres, and SQLite. Not only that, but their closed source nature would make it difficult to train or fine-tune the models for GridDB.<\/p>\n<p>DeFog.AI\u2019s SQLCoder model based on LLAMA was tested and performed well, but the original models did not support GridDB\u2019s time series SQL semantics. Furthermore, the hardware and time requirements to fine-tune or run inference with SQLCoder was not feasible. Likewise, StarCoder was examined and its reported accuracy was deemed to be significantly poorer than SQLCoder while being just as difficult to fine-tune.<\/p>\n<p>The last two models left for consideration were OpenAI\u2019s GPT-2 and Google\u2019s T5-Small models. After fine tuning both with our selected datasets, T5-Small was generally more accurate and trouble-free while performing fine tuning. Other parties had already used GPT-2 and T5-Small to create SQL generating LLMs and between the two, accuracy was better with T5-Small.<\/p>\n<h2>Dataset Selection<\/h2>\n<p>Three individual datasets with similar characteristics that had been used to train text to SQL models were found:<\/p>\n<ul>\n<li><a href=\"https:\/\/huggingface.co\/datasets\/b-mc2\/sql-create-context\">https:\/\/huggingface.co\/datasets\/b-mc2\/sql-create-context<\/a><\/li>\n<li><a href=\"https:\/\/huggingface.co\/datasets\/Clinton\/Text-to-sql-v1\">https:\/\/huggingface.co\/datasets\/Clinton\/Text-to-sql-v1<\/a><\/li>\n<li><a href=\"https:\/\/huggingface.co\/datasets\/knowrohit07\/know_sql\">https:\/\/huggingface.co\/datasets\/knowrohit07\/know_sql<\/a><\/li>\n<\/ul>\n<p>Between the three datasets, they have nearly 400,000 rows of data. 
Each row contains:<\/p>\n<ul>\n<li>The context, or SQL schema.<\/li>\n<li>The question to be converted into a SQL query.<\/li>\n<li>The answer: the SQL query based on the question and context.<\/li>\n<\/ul>\n<p>For example:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">{ \n  \"answer\" : \"SELECT COUNT(*) FROM head WHERE age > 56\",\n  \"question\" : \"How many heads of the departments are older than 56 ?\",\n  \"context\" : \"CREATE TABLE head (age INTEGER)\"\n}<\/code><\/pre>\n<\/div>\n<h2>Dataset Filtering<\/h2>\n<p>The first problem with these third-party datasets is that some of their queries do not work with GridDB. A simple script was created to execute the context statement and then the query statement for each row in the dataset; if the query executed successfully, the row was saved to the filtered dataset.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">for line in fd.readlines():\n    data = json.loads(line)\n    # Replace double quotes with single quotes so the statement parses in GridDB\n    data[answer_name] = re.sub('\"', \"'\", data[answer_name])\n    try:\n        for stmt in data[context_name].split(\";\"):\n            stmt = stmt.strip()\n            table = stmt.split(\" \")[2]\n            curs.execute(\"DROP TABLE IF EXISTS \"+table)\n            curs.execute(stmt)\n    except:\n        pass\n\n    try:\n        curs.execute(data[answer_name])\n        good=good+1\n        print(json.dumps({\"question\": data[question_name], \"context\": data[context_name], \"answer\": data[answer_name]}))\n\n    except:\n        bad=bad+1<\/code><\/pre>\n<\/div>\n<p>Of the nearly 400,000 queries in the original datasets, 170,000 functioned in GridDB and were used for initial model fine-tuning. The dataset was split 80\/10\/10 for training, validation, and testing, and fine-tuning ran on top of the base T5-small model.<\/p>\n<h2>Dataset Creation<\/h2>\n<p>None of the data in the filtered dataset supports GridDB-specific functionality. 
While GridDB has many unique SQL features not shared with other databases, we wanted to focus on the two most basic.<\/p>\n<p>The first is GridDB\u2019s Key-Container data model. Most relational databases store all data in one table, while GridDB recommends splitting data into multiple tables. For example, data for device #1 would be stored in tsdata_1 and data for device #20 in tsdata_20, so the human question of \u201cWhat is the maximum temperature recorded on device 12?\u201d would become the SQL query <code>SELECT max(temp) FROM tsdata_12<\/code> instead of <code>SELECT max(temp) FROM tsdata WHERE device = '12'<\/code> as would be normal in another database.<\/p>\n<p>The second feature we wanted to support was fetching data, in particular aggregations, for a given time period. GridDB uses the TIMESTAMP() SQL function to compare time series data in a query. For example, to select the average temperature in 2024, you would use the query <code>SELECT avg(temp) from tsdata_12 where ts &gt;= TIMESTAMP('2024-01-01') and ts &lt; TIMESTAMP('2025-01-01')<\/code>.<\/p>\n<p>To do this, a new tool was developed that could generate a human question with a corresponding SQL query answer based on a given template. 
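The two query shapes above can be condensed into one helper; this is our own minimal sketch for illustration (the `tsdata_` prefix, column names, and the function name follow the examples in the text and are not from the repository):

```python
def build_query(device_id: int, aggregate: str, column: str, year: int) -> str:
    # Key-Container: the device id selects the container (tsdata_12),
    # rather than appearing in a WHERE clause as in other databases.
    table = f"tsdata_{device_id}"
    # Time range: GridDB compares timestamps via the TIMESTAMP() SQL function.
    return (f"SELECT {aggregate}({column}) FROM {table} "
            f"WHERE ts >= TIMESTAMP('{year}-01-01') AND ts < TIMESTAMP('{year + 1}-01-01')")

# "What is the average temperature on device 12 in 2024?"
print(build_query(12, "avg", "temp", 2024))
# SELECT avg(temp) FROM tsdata_12 WHERE ts >= TIMESTAMP('2024-01-01') AND ts < TIMESTAMP('2025-01-01')
```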
Iterating on the template with random values for the time period, container identifier, aggregation, etc., builds a reasonably sized dataset for fine-tuning the model.<\/p>\n<p>An example input template:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">{\n    \"context\" : \"CREATE TABLE IF NOT EXISTS devices (device_id INTEGER, ts TIMESTAMP, co DOUBLE, humidity DOUBLE,light BOOL,lpg DOUBLE,motion BOOL,smoke DOUBLE,temp DOUBLE);\",\n    \"queries\" : [\n        {\n            \"columns\" : [\"co\", \"humidity\", \"lpg\", \"smoke\", \"temp\"],\n            \"human\" : \"What is the {HUMAN_AGGREGATE} {COLUMN} ?\",\n            \"sql\" : \"SELECT {AGGREGATE}({COLUMN}) FROM devices;\"\n        }\n     ]\n}<\/code><\/pre>\n<\/div>\n<p>would produce:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">{\"context\": \"CREATE TABLE IF NOT EXISTS devices (device_id INTEGER, ts TIMESTAMP, co DOUBLE, humidity DOUBLE,light BOOL,lpg DOUBLE,motion BOOL,smoke DOUBLE,temp DOUBLE);\", \n\"question\": \"What is the lowest smoke in 2009 for all devices?\", \n\"answer\": \"SELECT MIN(smoke) FROM devices WHERE ts > TIMESTAMP('2009-01-01T00:00:00Z') and ts &lt; TIMESTAMP('2010-01-01T00:00:00Z');\"}\n{\"context\": \"CREATE TABLE IF NOT EXISTS devices (device_id INTEGER, ts TIMESTAMP, co DOUBLE, humidity DOUBLE,light BOOL,lpg DOUBLE,motion BOOL,smoke DOUBLE,temp DOUBLE);\", \n\"question\": \"What is the highest humidity in June 2011 for all devices?\", \n\"answer\": \"SELECT MAX(humidity) FROM devices WHERE ts > TIMESTAMP('2011-06-01T00:00:00Z') and ts &lt; TIMESTAMP('2011-07-01T00:00:00Z');\"}<\/code><\/pre>\n<\/div>\n<p>For five different contexts, some using multiple tables per the Key-Container model and some not, we created five to six templated queries that both did and did not use the TIMESTAMP() function. With the dataset creation tool, we then generated 100 different question\/answer pairs per query, for a total of 3600 queries. 
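A minimal sketch of how such a generator can expand one template (simplified for illustration; the aggregate vocabulary is assumed here, and the actual tool in the repository may differ):

```python
import random

# Assumed mapping from a human phrasing to its SQL aggregate function.
AGGREGATES = {"lowest": "MIN", "highest": "MAX", "average": "AVG"}

def expand(template, columns, rng):
    # Fill the {HUMAN_AGGREGATE}/{AGGREGATE}/{COLUMN} placeholders with
    # random choices to emit one question/answer pair.
    human_agg = rng.choice(sorted(AGGREGATES))
    column = rng.choice(columns)
    question = template["human"].format(HUMAN_AGGREGATE=human_agg, COLUMN=column)
    answer = template["sql"].format(AGGREGATE=AGGREGATES[human_agg], COLUMN=column)
    return {"question": question, "answer": answer}

template = {"human": "What is the {HUMAN_AGGREGATE} {COLUMN} ?",
            "sql": "SELECT {AGGREGATE}({COLUMN}) FROM devices;"}
pair = expand(template, ["co", "humidity", "lpg", "smoke", "temp"], random.Random(0))
print(pair["question"], "->", pair["answer"])
```

Iterating this with random time periods and container names, as the text describes, yields the generated dataset.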
As automatic dataset splitting resulted in a disproportionate share of one context over another in the test dataset, a second test dataset was generated with only a single question\/answer pair for each templated query.<\/p>\n<h2>Fine Tuning<\/h2>\n<p>For both the filtered dataset and the generated GridDB dataset, each training data item was combined into a single string of the format:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-sh\">Tables:\n{context}\n\nQuestion:\n{question}\n\nAnswer:<\/code><\/pre>\n<\/div>\n<p>The above string is then tokenized using the HuggingFace AutoTokenizer and used as the input IDs, while the answer is tokenized as the labels. After the dataset has been tokenized, the model is trained using HuggingFace\u2019s Trainer library.<\/p>\n<p>Additional tokens for <code>&lt;<\/code> and <code>&lt;=<\/code> need to be added to the tokenizer, otherwise those symbols within the SQL statements would be ignored by the model during training.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">import time\n\nimport torch\nfrom transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Trainer, TrainingArguments\n\ndef tokenize_function(example):\n    \n    start_prompt = \"Tables:\\n\"\n    middle_prompt = \"\\n\\nQuestion:\\n\"\n    end_prompt = \"\\n\\nAnswer:\\n\"\n  \n    data_zip = zip(example['context'], example['question'])\n    prompt = [start_prompt + context + middle_prompt + question + end_prompt for context, question in data_zip]\n    example['input_ids'] = tokenizer(prompt, padding='max_length', truncation=True, return_tensors=\"pt\").input_ids\n    example['labels'] = tokenizer(example['answer'], padding='max_length', truncation=True, return_tensors=\"pt\").input_ids\n    \n    return example\n\n\nfinetuned_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)\ntokenizer = AutoTokenizer.from_pretrained(tok_model_name)\ntokenizer.add_tokens(['&lt;=', '&lt;= ', ' &lt;=', ' &lt;', '&lt;', '&lt; ', '>= ', ' >=', '>='])\nfinetuned_model.resize_token_embeddings(len(tokenizer))\n\ntokenized_datasets = 
dataset.map(tokenize_function, batched=True)\n\noutput_dir = f'.\/sql-training-{str(int(time.time()))}'\n\ntraining_args = TrainingArguments(\n    output_dir=output_dir,\n    learning_rate=5e-3,\n    num_train_epochs=2,\n    per_device_train_batch_size=8,     # batch size per device during training\n    per_device_eval_batch_size=8,      # batch size for evaluation\n    weight_decay=0.01,\n    logging_steps=50,\n    evaluation_strategy='steps',        # evaluation strategy to adopt during training\n    eval_steps=500,                     # number of steps between evaluations\n)\n\ntrainer = Trainer(\n    model=finetuned_model,\n    args=training_args,\n    train_dataset=tokenized_datasets['train'],\n    eval_dataset=tokenized_datasets['validation'],\n)\n\ntrainer.train()<\/code><\/pre>\n<\/div>\n<p>Using an AMD Ryzen Threadripper 2990WX with an NVIDIA RTX 4070, training took approximately 3-4 hours to complete for the filtered dataset and under an hour for the generated dataset.<\/p>\n<h2>Evaluation<\/h2>\n<p>For both the 10% test split of the training dataset and the generated test dataset, the same tokenization method was used to build the model input. The output answer was generated for every input and compared to the reference using HuggingFace\u2019s ROUGE evaluation library.<\/p>\n<p>This evaluation was performed for both the original filtered dataset and the generated GridDB-specific dataset, and ROUGE metrics were gathered. 
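To illustrate what this comparison measures, here is a hand-rolled unigram-overlap F1 score in the spirit of ROUGE-1 (our own sketch; the numbers reported below come from HuggingFace's evaluation library, not this code):

```python
from collections import Counter

def rouge1_f1(reference: str, generated: str) -> float:
    # Count overlapping unigrams between the reference and generated answers,
    # then combine precision and recall into an F1 score, as ROUGE-1 does.
    ref, gen = reference.split(), generated.split()
    overlap = sum((Counter(ref) & Counter(gen)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

ref = "SELECT MAX(humidity) FROM devices"
gen = "SELECT MAX(humidity) FROM devices"
print(rouge1_f1(ref, gen))  # 1.0 for an exact match
```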
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate text transformation and summarization models by comparing a human-generated baseline answer against the model-generated response. Each ROUGE metric varies from 0 to 1, with 1 being a perfect match.<\/p>\n<table>\n<tr>\n<td>\n      <strong>Metric<\/strong>\n    <\/td>\n<td>\n      <strong>Filtered Queries<\/strong>\n    <\/td>\n<td>\n      <strong>GridDB Specific Queries<\/strong>\n    <\/td>\n<\/tr>\n<tr>\n<td>\n      ROUGE-1\n    <\/td>\n<td>\n      0.9220\n    <\/td>\n<td>\n      0.8932\n    <\/td>\n<\/tr>\n<tr>\n<td>\n      ROUGE-2\n    <\/td>\n<td>\n      0.8328\n    <\/td>\n<td>\n      0.8557\n    <\/td>\n<\/tr>\n<tr>\n<td>\n      ROUGE-L\n    <\/td>\n<td>\n      0.9040\n    <\/td>\n<td>\n      0.8807\n    <\/td>\n<\/tr>\n<\/table>\n<ul>\n<li>ROUGE-1 measures the overlap of individual words between the reference and inferred answer.<\/li>\n<li>ROUGE-2 measures the overlap of word pairs (bigrams) between the reference and inferred answer.<\/li>\n<li>ROUGE-L measures the longest matching sequence of words between the reference and inferred answer.<\/li>\n<\/ul>\n<h2>Application Implementation<\/h2>\n<p>There are many ways to integrate the LLM into an application. LLM inference could be performed at the edge on the user\u2019s device, which would allow for greater scalability but would also impose much higher end-user system requirements.<\/p>\n<p>If the inference is performed on the server side, it can be bundled into the current application or run as a separate service that communicates with it. 
This would allow inference to run on dedicated high-performance instances so that it has minimal impact on the existing application\u2019s performance.<\/p>\n<p>For simplicity\u2019s sake, the demo bundles the LLM directly into the application.<\/p>\n<p><a href=\"https:\/\/griddb.net\/wp-content\/uploads\/2024\/12\/inference_diagram.png\"><img decoding=\"async\" src=\"https:\/\/griddb.net\/wp-content\/uploads\/2024\/12\/inference_diagram.png\" alt=\"\" width=\"1434\" height=\"428\" class=\"aligncenter size-full wp-image-31198\" srcset=\"\/wp-content\/uploads\/2024\/12\/inference_diagram.png 1434w, \/wp-content\/uploads\/2024\/12\/inference_diagram-300x90.png 300w, \/wp-content\/uploads\/2024\/12\/inference_diagram-1024x306.png 1024w, \/wp-content\/uploads\/2024\/12\/inference_diagram-768x229.png 768w, \/wp-content\/uploads\/2024\/12\/inference_diagram-600x179.png 600w\" sizes=\"(max-width: 1434px) 100vw, 1434px\" \/><\/a><\/p>\n<p>Adding the code to use the model in your application is straightforward. 
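The context-fetch snippet that follows relies on two helpers, `type_mapping` and `create_table_statement`, whose definitions are not shown in the post; a plausible sketch of what they might look like (our assumption, not the repository's actual code, and the type names are illustrative):

```python
# Hypothetical helpers for turning GridDB container schemas into DDL strings.
def type_mapping(col_type) -> str:
    # Map a column type reported by the NoSQL API to a SQL DDL keyword
    # (assumed mapping; real GridDB type codes may differ).
    known = {"INTEGER": "INTEGER", "DOUBLE": "DOUBLE", "BOOL": "BOOL",
             "TIMESTAMP": "TIMESTAMP", "STRING": "STRING"}
    return known.get(str(col_type), "STRING")

def create_table_statement(name: str, schema: dict) -> str:
    # Render {column: sql_type} as a CREATE TABLE statement for the LLM context.
    cols = ", ".join(f"{col} {sql_type}" for col, sql_type in schema.items())
    return f"CREATE TABLE IF NOT EXISTS {name} ({cols});"

print(create_table_statement("devices", {"device_id": "INTEGER", "ts": "TIMESTAMP"}))
# CREATE TABLE IF NOT EXISTS devices (device_id INTEGER, ts TIMESTAMP);
```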
The context can be fetched using GridDB\u2019s NoSQL API:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">containers = []\nx = 0\nwhile x &lt; gridstore.partition_info.partition_count:\n    containers.extend(gridstore.partition_info.get_container_names(x, 0))\n    x=x+1\n\nconts_and_schemas = {}\nfor cont in containers:\n    col_list = gridstore.get_container_info(cont).column_info_list\n    schema = {}\n    for row in col_list:\n        schema[row[0]] = type_mapping(row[1])\n    conts_and_schemas[cont] = schema\n\ncreate_stmts = []\nfor key in conts_and_schemas:\n    create_stmts.append(create_table_statement(key, conts_and_schemas[key]))\nreturn create_stmts<\/code><\/pre>\n<\/div>\n<p>Inference for a single question is performed in a similar fashion to how evaluation was performed.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">model = AutoModelForSeq2SeqLM.from_pretrained(\"griddb_model_2_epoch\")\ntokenizer = AutoTokenizer.from_pretrained(\"t5-small\")\n\ndef translate_to_sql_select(context, question):\n    prompt = f\"\"\"Tables:\n{context}\nQuestion:\n{question}\nAnswer:\n\"\"\"\n    input_ids = tokenizer.encode(prompt, return_tensors=\"pt\")\n    outputs = model.generate(input_ids)\n    sql_query = tokenizer.decode(outputs[0], skip_special_tokens=True)\n    return sql_query<\/code><\/pre>\n<\/div>\n<p>Finally, a Flask route gets the local context, calls the model, executes the query, and returns the response.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">@app.route('\/nlquery')\ndef nlquery():\n    question = request.args.get('question')\n    context = get_local_context() \n    query = translate_to_sql_select(context, question)\n    curs = conn.cursor()\n    try:\n        curs.execute(query)\n        rows = curs.fetchall()\n        return json.dumps(rows)\n    except:\n        abort(400, 'Generated query was not successful')<\/code><\/pre>\n<\/div>\n<p>While the model is easily incorporated into any Flask or other Python application as shown, scalability may be difficult as each LLM 
invocation takes approximately 500 milliseconds using an AMD Ryzen Threadripper and an NVIDIA RTX 4070. Projects such as https:\/\/github.com\/Ki6an\/fastT5 could greatly improve the scalability of the GridDB LLM model.<\/p>\n<h2>Conclusion<\/h2>\n<p>We hope the process of creating a training dataset, performing the training, and using the resulting LLM within an application to query your data was insightful and educational.<\/p>\n<p>Using an LLM, end users including IoT device owners, corporate analysts, managers, customer service, and others are able to query data stored in GridDB without having to know SQL. While the queries used to demonstrate the LLM in this project are relatively simple, the model appears to be extensible to other query types and methods. Furthermore, the T5-small model is efficient to train, not requiring large investments in hardware to train or run inference on.<\/p>\n<p>In the future, with a larger, more diverse training dataset and advancements in the base model, natural language queries will become even more commonplace and accurate. The source code used in the project is available at <a href=\"https:\/\/github.com\/griddbnet\/sql_llm_model\">https:\/\/github.com\/griddbnet\/sql_llm_model<\/a>. The finished model can be downloaded from HuggingFace at <a href=\"https:\/\/huggingface.co\/griddbnet\">https:\/\/huggingface.co\/griddbnet<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction GridDB is a time series database available both on-premise and on the cloud optimized for IoT and Big Data that was developed by Toshiba Digital Solutions Corporation. It features a unique key-container model designed specifically to handle both metadata and time series data. Its in-memory architecture allows incredible ingestion and query performance. 
GridDB is [&hellip;]<\/p>\n","protected":false},"author":71,"featured_media":52077,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[121],"tags":[],"class_list":["post-52076","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.1.1 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Creating an LLM to Generate SQL Queries for GridDB | GridDB: Open Source Time Series Database for IoT<\/title>\n<meta name=\"description\" content=\"Introduction GridDB is a time series database available both on-premise and on the cloud optimized for IoT and Big Data that was developed by Toshiba\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Creating an LLM to Generate SQL Queries for GridDB | GridDB: Open Source Time Series Database for IoT\" \/>\n<meta property=\"og:description\" content=\"Introduction GridDB is a time series database available both on-premise and on the cloud optimized for IoT and Big Data that was developed by Toshiba\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/\" \/>\n<meta property=\"og:site_name\" content=\"GridDB: Open Source Time Series Database for IoT\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/griddbcommunity\/\" \/>\n<meta property=\"article:published_time\" content=\"2024-12-11T08:00:00+00:00\" \/>\n<meta property=\"og:image\" 
content=\"https:\/\/www.griddb.net\/wp-content\/uploads\/2025\/12\/SCR-20241212-ktmh-scaled.png\" \/>\n\t<meta property=\"og:image:width\" content=\"2560\" \/>\n\t<meta property=\"og:image:height\" content=\"934\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Owen\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@GridDBCommunity\" \/>\n<meta name=\"twitter:site\" content=\"@GridDBCommunity\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Owen\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"13 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/\"},\"author\":{\"name\":\"Owen\",\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/person\/0f2f6d4b593adde8c43cf3ea5c794c66\"},\"headline\":\"Creating an LLM to Generate SQL Queries for 
GridDB\",\"datePublished\":\"2024-12-11T08:00:00+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/\"},\"wordCount\":1750,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/#primaryimage\"},\"thumbnailUrl\":\"\/wp-content\/uploads\/2025\/12\/SCR-20241212-ktmh-scaled.png\",\"articleSection\":[\"Blog\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/\",\"url\":\"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/\",\"name\":\"Creating an LLM to Generate SQL Queries for GridDB | GridDB: Open Source Time Series Database for IoT\",\"isPartOf\":{\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/#primaryimage\"},\"thumbnailUrl\":\"\/wp-content\/uploads\/2025\/12\/SCR-20241212-ktmh-scaled.png\",\"datePublished\":\"2024-12-11T08:00:00+00:00\",\"description\":\"Introduction GridDB is a time series database available both on-premise and on the cloud optimized for IoT and Big Data that was developed by 
Toshiba\",\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/#primaryimage\",\"url\":\"\/wp-content\/uploads\/2025\/12\/SCR-20241212-ktmh-scaled.png\",\"contentUrl\":\"\/wp-content\/uploads\/2025\/12\/SCR-20241212-ktmh-scaled.png\",\"width\":2560,\"height\":934},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#website\",\"url\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/\",\"name\":\"GridDB: Open Source Time Series Database for IoT\",\"description\":\"GridDB is an open source time-series database with the performance of NoSQL and convenience of SQL\",\"publisher\":{\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#organization\",\"name\":\"Fixstars\",\"url\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/griddb.net\/wp-content\/uploads\/2019\/04\/fixstars_logo_web_tagline.png\",\"contentUrl\":\"https:\/\/griddb.net\/wp-content\/uploads\/2019\/04\/fixstars_logo_web_tagline.png\",\"width\":200,\"height\":83,\"caption\":\"Fixstars\"},\"image\":{\"@id\"
:\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/griddbcommunity\/\",\"https:\/\/x.com\/GridDBCommunity\",\"https:\/\/www.linkedin.com\/company\/griddb-by-toshiba\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/person\/0f2f6d4b593adde8c43cf3ea5c794c66\",\"name\":\"Owen\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/47438a5c81215c7a9043be1b427e0bbd8dc0f77bd536f147f8495575149e4325?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/47438a5c81215c7a9043be1b427e0bbd8dc0f77bd536f147f8495575149e4325?s=96&d=mm&r=g\",\"caption\":\"Owen\"},\"url\":\"https:\/\/www.griddb.net\/en\/author\/owen\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Creating an LLM to Generate SQL Queries for GridDB | GridDB: Open Source Time Series Database for IoT","description":"Introduction GridDB is a time series database available both on-premise and on the cloud optimized for IoT and Big Data that was developed by Toshiba","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/","og_locale":"en_US","og_type":"article","og_title":"Creating an LLM to Generate SQL Queries for GridDB | GridDB: Open Source Time Series Database for IoT","og_description":"Introduction GridDB is a time series database available both on-premise and on the cloud optimized for IoT and Big Data that was developed by Toshiba","og_url":"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/","og_site_name":"GridDB: Open Source Time Series Database for 
IoT","article_publisher":"https:\/\/www.facebook.com\/griddbcommunity\/","article_published_time":"2024-12-11T08:00:00+00:00","og_image":[{"width":2560,"height":934,"url":"https:\/\/www.griddb.net\/wp-content\/uploads\/2025\/12\/SCR-20241212-ktmh-scaled.png","type":"image\/png"}],"author":"Owen","twitter_card":"summary_large_image","twitter_creator":"@GridDBCommunity","twitter_site":"@GridDBCommunity","twitter_misc":{"Written by":"Owen","Est. reading time":"13 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/#article","isPartOf":{"@id":"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/"},"author":{"name":"Owen","@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/person\/0f2f6d4b593adde8c43cf3ea5c794c66"},"headline":"Creating an LLM to Generate SQL Queries for GridDB","datePublished":"2024-12-11T08:00:00+00:00","mainEntityOfPage":{"@id":"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/"},"wordCount":1750,"commentCount":0,"publisher":{"@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#organization"},"image":{"@id":"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/#primaryimage"},"thumbnailUrl":"\/wp-content\/uploads\/2025\/12\/SCR-20241212-ktmh-scaled.png","articleSection":["Blog"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/","url":"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/","name":"Creating an LLM to Generate SQL Queries for GridDB | GridDB: Open Source Time Series Database for 
IoT","isPartOf":{"@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/#primaryimage"},"image":{"@id":"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/#primaryimage"},"thumbnailUrl":"\/wp-content\/uploads\/2025\/12\/SCR-20241212-ktmh-scaled.png","datePublished":"2024-12-11T08:00:00+00:00","description":"Introduction GridDB is a time series database available both on-premise and on the cloud optimized for IoT and Big Data that was developed by Toshiba","inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.griddb.net\/en\/blog\/llm-generate-sql\/#primaryimage","url":"\/wp-content\/uploads\/2025\/12\/SCR-20241212-ktmh-scaled.png","contentUrl":"\/wp-content\/uploads\/2025\/12\/SCR-20241212-ktmh-scaled.png","width":2560,"height":934},{"@type":"WebSite","@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#website","url":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/","name":"GridDB: Open Source Time Series Database for IoT","description":"GridDB is an open source time-series database with the performance of NoSQL and convenience of 
SQL","publisher":{"@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#organization","name":"Fixstars","url":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/logo\/image\/","url":"https:\/\/griddb.net\/wp-content\/uploads\/2019\/04\/fixstars_logo_web_tagline.png","contentUrl":"https:\/\/griddb.net\/wp-content\/uploads\/2019\/04\/fixstars_logo_web_tagline.png","width":200,"height":83,"caption":"Fixstars"},"image":{"@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/griddbcommunity\/","https:\/\/x.com\/GridDBCommunity","https:\/\/www.linkedin.com\/company\/griddb-by-toshiba"]},{"@type":"Person","@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/person\/0f2f6d4b593adde8c43cf3ea5c794c66","name":"Owen","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/47438a5c81215c7a9043be1b427e0bbd8dc0f77bd536f147f8495575149e4325?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/47438a5c81215c7a9043be1b427e0bbd8dc0f77bd536f147f8495575149e4325?s=96&d=mm&r=g","caption":"Owen"},"url":"https:\/\/www.griddb.net\/en\/author\/owen\/"}]}},"_links":{"self":[{"href":"ht
tps:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/posts\/52076","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/users\/71"}],"replies":[{"embeddable":true,"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/comments?post=52076"}],"version-history":[{"count":0,"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/posts\/52076\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/media\/52077"}],"wp:attachment":[{"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/media?parent=52076"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/categories?post=52076"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/tags?post=52076"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}