{"id":52263,"date":"2025-11-07T00:00:00","date_gmt":"2025-11-07T08:00:00","guid":{"rendered":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/"},"modified":"2026-03-30T11:23:36","modified_gmt":"2026-03-30T18:23:36","slug":"voice-based-image-generation-using-imagen-4-and-elevenlabs","status":"publish","type":"post","link":"https:\/\/www.griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/","title":{"rendered":"Voice-Based Image Generation Using Imagen 4 and ElevenLabs"},"content":{"rendered":"<h2>Project Overview<\/h2>\n<p>A modern web application that transforms spoken descriptions into high-quality images using cutting-edge AI technologies. Users can record their voice describing an image they want to create, and the system will transcribe their speech and generate a corresponding image.<\/p>\n<h2>What Problem We Solved<\/h2>\n<p>Traditional image generation tools require users to type detailed prompts, which can be:<\/p>\n<ul>\n<li>Time-consuming for complex descriptions.<\/li>\n<li>Limiting for users with typing difficulties.<\/li>\n<li>Less natural than speaking.<\/li>\n<\/ul>\n<p>This project solution makes AI image generation more accessible through voice interaction.<\/p>\n<h2>Architecture &amp; Tech Stack<\/h2>\n<p><a href=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/11\/system-arch-scaled.png\"><img fetchpriority=\"high\" decoding=\"async\" src=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/11\/system-arch-scaled.png\" alt=\"\" width=\"2560\" height=\"1627\" class=\"aligncenter size-full wp-image-32560\" \/><\/a><\/p>\n<p>This diagram shows a pipeline for converting speech into images and storing the result in the GridDB database:<\/p>\n<ol>\n<li><strong>User speaks<\/strong> into a microphone.<\/li>\n<li><strong>Speech recording<\/strong> captures the audio.<\/li>\n<li>Audio is sent to <strong>ElevenLabs (Scriber-1)<\/strong> for 
<strong>speech-to-text<\/strong> transcription.<\/li>\n<li>The transcribed text becomes a <strong>prompt<\/strong> for the <strong>Imagen 4 API<\/strong>, which generates an image.<\/li>\n<li>The following data are saved into the database:\n<ul>\n<li>The <strong>audio reference<\/strong>,<\/li>\n<li>The generated <strong>image<\/strong>, and<\/li>\n<li>The <strong>prompt<\/strong> text.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<h3>Frontend Stack<\/h3>\n<p>In this project, we will use <a href=\"https:\/\/nextjs.org\/\">Next.js<\/a> as the frontend framework.<\/p>\n<h3>Backend Stack<\/h3>\n<p>There is no dedicated backend code because all services are consumed through APIs. There are three main APIs:<\/p>\n<h4>1. Speech-to-Text API<\/h4>\n<p>This project utilizes the ElevenLabs API for speech-to-text transcription. The API endpoint is <a href=\"https:\/\/api.elevenlabs.io\/v1\/speech-to-text\">https:\/\/api.elevenlabs.io\/v1\/speech-to-text<\/a>. ElevenLabs also provides a JavaScript SDK for easier API integration. You can see <a href=\"https:\/\/github.com\/elevenlabs\/elevenlabs-js\">the SDK documentation<\/a> for more details.<\/p>\n<h4>2. Image Generation API<\/h4>\n<p>This project uses the Imagen 4 API from <a href=\"https:\/\/fal.ai\">fal<\/a>. The API is hosted at <a href=\"https:\/\/fal.ai\/models\/fal-ai\/imagen4\/preview\">https:\/\/fal.ai\/models\/fal-ai\/imagen4\/preview<\/a>. Fal provides a JavaScript SDK for easier API integration. You can see <a href=\"https:\/\/github.com\/fal-ai\/fal-js\">the SDK documentation<\/a> for more details.<\/p>\n<h4>3. Database API<\/h4>\n<p>We will use GridDB Cloud in this project, so there is no need for a local installation. Please read the next section on how to set up GridDB Cloud.<\/p>\n<h2>Prerequisites<\/h2>\n<h3>Node.js<\/h3>\n<p>This project is built using Next.js, which requires Node.js version 16 or higher. 
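You can verify the installed version with:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-sh\">node --version\n# must be v16.0.0 or newer<\/code><\/pre>\n<\/div>\n<p>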
You can download and install Node.js from <a href=\"https:\/\/nodejs.org\/en\">https:\/\/nodejs.org\/en<\/a>.<\/p>\n<h3>GridDB<\/h3>\n<h4>Sign Up for GridDB Cloud Free Plan<\/h4>\n<p>If you would like to sign up for a GridDB Cloud Free instance, you can do so at the following link: <a href=\"https:\/\/form.ict-toshiba.jp\/download_form_griddb_cloud_freeplan_e\">https:\/\/form.ict-toshiba.jp\/download_form_griddb_cloud_freeplan_e<\/a>.<\/p>\n<p>After successfully signing up, you will receive a free instance along with the necessary details to access the GridDB Cloud Management GUI, including the <strong>GridDB Cloud Portal URL<\/strong>, <strong>Contract ID<\/strong>, <strong>Login<\/strong>, and <strong>Password<\/strong>.<\/p>\n<h4>GridDB WebAPI URL<\/h4>\n<p>Go to the GridDB Cloud Portal and copy the WebAPI URL from the <strong>Clusters<\/strong> section. It should look like this:<\/p>\n<p><a href=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/11\/griddb-cloud-portal-scaled.png\"><img decoding=\"async\" src=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/11\/griddb-cloud-portal-scaled.png\" alt=\"\" width=\"2560\" height=\"1265\" class=\"aligncenter size-full wp-image-32564\" \/><\/a><\/p>\n<h4>GridDB Username and Password<\/h4>\n<p>Go to the <strong>GridDB Users<\/strong> section of the GridDB Cloud portal and create or copy the username for <code>GRIDDB_USERNAME<\/code>. 
The password is set when the user is first created; use this as the <code>GRIDDB_PASSWORD<\/code>.<\/p>\n<p><a href=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/11\/griddb-cloud-users-scaled.png\"><img decoding=\"async\" src=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/11\/griddb-cloud-users-scaled.png\" alt=\"\" width=\"2560\" height=\"1351\" class=\"aligncenter size-full wp-image-32563\" \/><\/a><\/p>\n<p>For more details on getting started with GridDB Cloud, please follow this <a href=\"https:\/\/griddb.net\/en\/blog\/griddb-cloud-quick-start-guide\/\">quick start guide<\/a>.<\/p>\n<h4>IP Whitelist<\/h4>\n<p>When running this project, please ensure that the IP address where the project is running is whitelisted. Failure to do so will result in a 403 (Forbidden) response.<\/p>\n<p>You can use a website like <a href=\"https:\/\/whatismyipaddress.com\/\">What Is My IP Address<\/a> to find your public IP address.<\/p>\n<p>To whitelist the IP, go to the GridDB Cloud Admin and navigate to the <strong>Network Access<\/strong> menu.<\/p>\n<p><a href=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/11\/ip-whitelist-scaled.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/11\/ip-whitelist-scaled.png\" alt=\"\" width=\"2560\" height=\"1095\" class=\"aligncenter size-full wp-image-32562\" \/><\/a><\/p>\n<h3>ElevenLabs<\/h3>\n<p>You need an ElevenLabs account and API key to use this project. 
You can sign up for an account at <a href=\"https:\/\/elevenlabs.io\/app\/sign-up\">https:\/\/elevenlabs.io\/app\/sign-up<\/a>.<\/p>\n<p>After signing up, go to the <strong>Account<\/strong> section, and create and copy your API key.<\/p>\n<p><a href=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/11\/elevenlabs-api-key.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/11\/elevenlabs-api-key.png\" alt=\"\" width=\"2114\" height=\"926\" class=\"aligncenter size-full wp-image-32566\" \/><\/a><\/p>\n<h3>Imagen 4 API<\/h3>\n<p>You need an Imagen 4 API key to use this project. You can sign up for an account at <a href=\"https:\/\/fal.ai\">https:\/\/fal.ai<\/a>.<\/p>\n<p>After signing up, go to the <strong>Account<\/strong> section, and create and copy your API key.<\/p>\n<p><a href=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/11\/fal-imagen-api-key.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/11\/fal-imagen-api-key.png\" alt=\"\" width=\"2166\" height=\"798\" class=\"aligncenter size-full wp-image-32565\" \/><\/a><\/p>\n<h2>How to Run<\/h2>\n<h3>1. Clone the repository<\/h3>\n<p>Clone the repository from <a href=\"https:\/\/github.com\/junwatu\/speech-image-gen\">https:\/\/github.com\/junwatu\/speech-image-gen<\/a> to your local machine.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-sh\">git clone https:\/\/github.com\/junwatu\/speech-image-gen.git\ncd speech-image-gen\ncd apps<\/code><\/pre>\n<\/div>\n<h3>2. Install dependencies<\/h3>\n<p>Install all project dependencies using npm.<\/p>\n<div class=\"clipboard\">\n<pre><code>npm install<\/code><\/pre>\n<\/div>\n<h3>3. 
Set up environment variables<\/h3>\n<p>Copy the file <code>.env.example<\/code> to <code>.env<\/code> and fill in the values:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-sh\"># Copy this file to .env and add your actual API keys\n# Never commit .env to version control\n\n# Fal.ai API Key for Imagen 4\n# Get your key from: https:\/\/fal.ai\/dashboard\nFAL_KEY=\n\n# ElevenLabs API Key for Speech-to-Text\n# Get your key from: https:\/\/elevenlabs.io\/app\/speech-synthesis\nELEVENLABS_API_KEY=\n\nGRIDDB_WEBAPI_URL=\nGRIDDB_PASSWORD=\nGRIDDB_USERNAME=<\/code><\/pre>\n<\/div>\n<p>Please review the <a href=\"#prerequisites\">Prerequisites<\/a> section before running the project.<\/p>\n<h3>4. Run the project<\/h3>\n<p>Run the project using the following command:<\/p>\n<div class=\"clipboard\">\n<pre><code>npm run dev<\/code><\/pre>\n<\/div>\n<h3>5. Open the application<\/h3>\n<p>Open the application in your browser at <a href=\"http:\/\/localhost:3000\">http:\/\/localhost:3000<\/a>. You also need to allow the browser to access your microphone.<\/p>\n<h2>Implementation Details<\/h2>\n<h3>Speech Recording<\/h3>\n<p>The user speaks into the microphone and the audio is recorded, then sent to the ElevenLabs API for speech-to-text transcription. Note that English is the only supported language.<\/p>\n<p>The code to save the recording file is in the main <code>page.tsx<\/code>. It uses the browser's native HTML5 MediaRecorder API to record the audio. 
Below is the code snippet:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-javascript\">const startRecording = useCallback(async () => {\n  try {\n    setError(null);\n    const stream = await navigator.mediaDevices.getUserMedia({\n      audio: {\n        echoCancellation: true,\n        noiseSuppression: true,\n        sampleRate: 44100\n      }\n    });\n\n    \/\/ Try different MIME types based on browser support\n    let mimeType = 'audio\/webm;codecs=opus';\n    if (!MediaRecorder.isTypeSupported(mimeType)) {\n      mimeType = 'audio\/webm';\n      if (!MediaRecorder.isTypeSupported(mimeType)) {\n        mimeType = 'audio\/mp4';\n        if (!MediaRecorder.isTypeSupported(mimeType)) {\n          mimeType = ''; \/\/ Let the browser choose\n        }\n      }\n    }\n\n    const mediaRecorder = new MediaRecorder(stream, {\n      ...(mimeType && { mimeType })\n    });\n\n    mediaRecorderRef.current = mediaRecorder;\n    audioChunksRef.current = [];\n    recordingStartTimeRef.current = Date.now();\n\n    mediaRecorder.ondataavailable = (event) => {\n      if (event.data.size > 0) {\n        audioChunksRef.current.push(event.data);\n      }\n    };\n\n    mediaRecorder.onstop = async () => {\n      const duration = Date.now() - recordingStartTimeRef.current;\n      const audioBlob = new Blob(audioChunksRef.current, {\n        type: mimeType || 'audio\/webm'\n      });\n      const audioUrl = URL.createObjectURL(audioBlob);\n\n      const recording: AudioRecording = {\n        blob: audioBlob,\n        url: audioUrl,\n        duration,\n        timestamp: new Date()\n      };\n\n      setCurrentRecording(recording);\n      await transcribeAudio(recording);\n      stream.getTracks().forEach(track => track.stop());\n    };\n\n    mediaRecorder.start(1000); \/\/ Collect data every second\n    setIsRecording(true);\n  } catch (error) {\n    setError('Failed to access microphone. 
Please check your permissions and try again.');\n  }\n}, []);<\/code><\/pre>\n<\/div>\n<p>The audio processing flow is as follows:<\/p>\n<ol>\n<li>The user clicks the record button \u2192 <code>startRecording()<\/code> is called.<\/li>\n<li>Microphone access is requested via <code>getUserMedia()<\/code>.<\/li>\n<li>A <code>MediaRecorder<\/code> is created with the best supported settings.<\/li>\n<li>Audio data is collected in chunks.<\/li>\n<li>When recording stops, an audio blob is created and transcription is triggered.<\/li>\n<\/ol>\n<p>The audio data is saved in the <code>public\/uploads\/audio<\/code> folder. Below is the code snippet that saves the audio file:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-javascript\">export async function saveAudioToFile(audioBlob: Blob, extension: string = 'webm'): Promise&lt;string&gt; {\n  \/\/ Create the uploads directory if it doesn't exist\n  const uploadsDir = join(process.cwd(), 'public', 'uploads', 'audio');\n  await mkdir(uploadsDir, { recursive: true });\n\n  \/\/ Generate a unique filename\n  const filename = `${generateRandomID()}.${extension}`;\n  const filePath = join(uploadsDir, filename);\n\n  \/\/ Convert the blob to a buffer and save the file\n  const arrayBuffer = await audioBlob.arrayBuffer();\n  const buffer = Buffer.from(arrayBuffer);\n  await writeFile(filePath, buffer);\n\n  \/\/ Return the relative path for storage in the database\n  return `\/uploads\/audio\/${filename}`;\n}<\/code><\/pre>\n<\/div>\n<p>The full code for the <code>saveAudioToFile()<\/code> function is in the <code>app\/lib\/audio-storage.ts<\/code> file.<\/p>\n<h3>Speech-to-Text Transcription<\/h3>\n<p>The recorded audio is sent to the ElevenLabs API for speech-to-text transcription. The code that sends the audio to the ElevenLabs API is in the <code>transcribeAudio()<\/code> function. 
The full code is in the <code>lib\/elevenlabs-client.ts<\/code> file.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-javascript\">\/\/ Main transcription function\nexport async function transcribeAudio(\n  client: ElevenLabsClient,\n  audioBuffer: Buffer,\n  modelId: ElevenLabsModel = ELEVENLABS_MODELS.SCRIBE_V1\n) {\n  try {\n    const result = await client.speechToText.convert({\n      file: audioBuffer,\n      modelId: modelId,\n    }) as TranscriptionResponse;\n\n    return {\n      success: true,\n      text: result.text,\n      language_code: result.language_code,\n      language_probability: result.language_probability,\n      words: result.words || [],\n      additional_formats: result.additional_formats || []\n    };\n  } catch (error) {\n    console.error('ElevenLabs transcription error:', error);\n    return {\n      success: false,\n      error: error instanceof Error ? error.message : 'Unknown error'\n    };\n  }\n}<\/code><\/pre>\n<\/div>\n<h3>Transcription Route<\/h3>\n<p>The <code>transcribeAudio()<\/code> function is called when the <code>\/api\/transcribe<\/code> route is accessed. This route only accepts the <code>POST<\/code> method and processes the audio file sent in the request body. 
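From the browser, the recorded audio can be posted to this route as form data. Below is a minimal sketch (the <code>audio<\/code> field name matches the route code; <code>audioBlob<\/code> stands for the blob produced by the recorder):<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-javascript\">\/\/ Send the recorded audio blob to the transcription route\nconst formData = new FormData();\nformData.append('audio', audioBlob, 'recording.webm');\n\nconst res = await fetch('\/api\/transcribe', {\n  method: 'POST',\n  body: formData,\n});\nconst { transcription } = await res.json();<\/code><\/pre>\n<\/div>\n<p>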
The <code>ELEVENLABS_API_KEY<\/code> environment variable in the <code>.env<\/code> file is used in the route to initialize the ElevenLabs client.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-javascript\">export async function POST(request: NextRequest) {\n  \/\/ Get the audio file from the form data\n  const formData = await request.formData();\n  const audioFile = formData.get('audio') as File;\n\n  \/\/ Convert it to a buffer\n  const arrayBuffer = await audioFile.arrayBuffer();\n  const audioBuffer = Buffer.from(arrayBuffer);\n\n  \/\/ Initialize the ElevenLabs client\n  const apiKey = process.env.ELEVENLABS_API_KEY;\n  const elevenlabs = new ElevenLabsClient({ apiKey: apiKey });\n\n  \/\/ Convert the audio to text\n  const result = await elevenlabs.speechToText.convert({\n    file: audioBuffer,\n    modelId: \"scribe_v1\",\n    languageCode: \"en\",\n    tagAudioEvents: true,\n    diarize: false,\n  });\n\n  return NextResponse.json({\n    transcription: result.text,\n    language_code: result.languageCode,\n    language_probability: result.languageProbability,\n    words: result.words\n  });\n}<\/code><\/pre>\n<\/div>\n<p>The route returns the following JSON object:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-sh\">{\n  \"transcription\": \"Transcribed text from the audio\",\n  \"language_code\": \"en\",\n  \"language_probability\": 0.99,\n  \"words\": [\n    {\n      \"start\": 0.0,\n      \"end\": 1.0,\n      \"word\": \"Transcribed\",\n      \"probability\": 0.99\n    },\n    \/\/ ... more words\n  ]\n}<\/code><\/pre>\n<\/div>\n<p>The transcribed text serves as the input prompt for image generation using Imagen 4 from fal.ai, which creates high-quality images based on the provided text description.<\/p>\n<h3>Image Generation<\/h3>\n<p>The Fal API endpoint used is <code>fal-ai\/imagen4\/preview<\/code>. You must have a Fal API key to use this endpoint and set the <code>FAL_KEY<\/code> in the <code>.env<\/code> file. 
Please see this <a href=\"#imagen-4-api\">section<\/a> for how to get the API key.<\/p>\n<p>The Fal Imagen 4 image generation API is called directly in the <code>\/api\/generate-image<\/code> route. The route creates the image using the <code>subscribe()<\/code> method from the <code>@fal-ai\/client<\/code> SDK package.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-javascript\">export async function POST(request: NextRequest) {\n  const { prompt, style = 'photorealistic' } = await request.json();\n\n  \/\/ Configure the fal client\n  fal.config({\n    credentials: process.env.FAL_KEY || ''\n  });\n\n  \/\/ Append the selected style to the prompt unless it is the default\n  const finalPrompt = style !== 'photorealistic'\n    ? `${prompt}, ${style} style`\n    : prompt;\n\n  \/\/ Generate the image using fal.ai Imagen 4\n  const result = await fal.subscribe(\"fal-ai\/imagen4\/preview\", {\n    input: {\n      prompt: finalPrompt\n    },\n    logs: true,\n    onQueueUpdate: (update) => {\n      if (update.status === \"IN_PROGRESS\") {\n        update.logs.map((log) => log.message).forEach(console.log);\n      }\n    },\n  });\n\n  \/\/ Extract the image URLs from the result\n  const images = result.data?.images || [];\n  const imageUrls = images.map((img: any) => img.url || img);\n\n  return NextResponse.json({\n    images: imageUrls,\n    prompt: prompt,\n    style: style,\n    requestId: result.requestId\n  });\n}<\/code><\/pre>\n<\/div>\n<p>The route returns JSON with the following structure:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-sh\">{\n  \"images\": [\n    \"https:\/\/v3.fal.media\/files\/panda\/YCl2K_C4yG87sDH_riyJl_output.png\"\n  ],\n  \"prompt\": \"Floating red jerry can on the blue sea, wide shot, side view\",\n  \"style\": \"photorealistic\",\n  \"requestId\": \"8a0e13db-5760-48d4-9acd-5c793b14e1ee\"\n}<\/code><\/pre>\n<\/div>\n<p>The image data, along with the prompt and audio file path, will be saved into the GridDB database.<\/p>\n<h2>Database 
Operation<\/h2>\n<p>We use GridDB Cloud to store the generated image, the prompt, and the audio file path. It&#8217;s easy to use and accessible through a Web API. The container name for this project is <code>genvoiceai<\/code>.<\/p>\n<h3>Save Data to GridDB<\/h3>\n<p>Before saving data to the database, we need to define its schema. This project uses the following structure:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-javascript\">export interface GridDBData {\n  id: string | number;\n  images: string;      \/\/ Base64-encoded image string\n  prompts: string;     \/\/ Text prompt\n  audioFiles: string;  \/\/ File path to the audio file\n}<\/code><\/pre>\n<\/div>\n<p>In real-world applications, best practice is to separate binary files from their references. However, for simplicity in this example, we store the image directly in the database as a base64-encoded string. Before saving to the database, the image needs to be converted to base64 format:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-javascript\">\/\/ Convert the image blob to a base64 string for GridDB storage\nconst imageBuffer = await imageBlob.arrayBuffer();\nconst imageBase64 = Buffer.from(imageBuffer).toString('base64');<\/code><\/pre>\n<\/div>\n<p>Please look into the <code>lib\/griddb.ts<\/code> file for the implementation details. 
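Before rows can be inserted, the <code>genvoiceai<\/code> container must exist. As a rough sketch, it can be created through the GridDB WebAPI with curl; the column list follows the schema above, and the exact request shape may differ between GridDB Cloud versions, so verify it against the WebAPI reference for your cluster:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-sh\"># Create the genvoiceai collection container (sketch; verify against your WebAPI version)\ncurl -X POST \"$GRIDDB_WEBAPI_URL\/containers\" \\\n  -u \"$GRIDDB_USERNAME:$GRIDDB_PASSWORD\" \\\n  -H \"Content-Type: application\/json\" \\\n  -d '{\n    \"container_name\": \"genvoiceai\",\n    \"container_type\": \"COLLECTION\",\n    \"rowkey\": true,\n    \"columns\": [\n      {\"name\": \"id\", \"type\": \"INTEGER\"},\n      {\"name\": \"images\", \"type\": \"STRING\"},\n      {\"name\": \"prompts\", \"type\": \"STRING\"},\n      {\"name\": \"audioFiles\", \"type\": \"STRING\"}\n    ]\n  }'<\/code><\/pre>\n<\/div>\n<p>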
The <code>insertData()<\/code> function performs the actual database insertion.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-javascript\">async function insertData({ data, containerName = 'genvoiceai' }) {\n  const row = [\n    parseInt(data.id.toString()),  \/\/ ID as integer\n    data.images,                   \/\/ Base64 image string\n    data.prompts,                  \/\/ Text prompt\n    data.audioFiles                \/\/ Audio file path\n  ];\n\n  const path = `\/containers\/${containerName}\/rows`;\n  return await makeRequest(path, [row], 'PUT');\n}<\/code><\/pre>\n<\/div>\n<h3>Get Data from GridDB<\/h3>\n<p>To get data from the database, send a <code>GET<\/code> request to the <code>\/api\/save-data<\/code> route. This route uses SQL queries to fetch specific entries or all data from the database.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-javascript\">\/\/ For a specific ID\nquery = {\n  type: 'sql-select',\n  stmt: `SELECT * FROM genvoiceai WHERE id = ${parseInt(id)}`\n};\n\n\/\/ For recent entries\nquery = {\n  type: 'sql-select',\n  stmt: `SELECT * FROM genvoiceai ORDER BY id DESC LIMIT ${parseInt(limit)}`\n};<\/code><\/pre>\n<\/div>\n<p>For detailed code implementation, please look into the <code>app\/api\/save-data\/route.ts<\/code> file.<\/p>\n<h2>Server Routes<\/h2>\n<p>This project uses Next.js serverless functions to handle API requests. 
This means there is no separate backend code to handle APIs, as they are integrated directly into the Next.js application.<\/p>\n<p>The routes used by the frontend are as follows:<\/p>\n<table>\n<thead>\n<tr>\n<th>Route<\/th>\n<th>Method<\/th>\n<th>Description<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>\/api\/generate-image<\/code><\/td>\n<td>POST<\/td>\n<td>Generate images using fal.ai Imagen 4<\/td>\n<\/tr>\n<tr>\n<td><code>\/api\/transcribe<\/code><\/td>\n<td>POST<\/td>\n<td>Convert audio to text using ElevenLabs<\/td>\n<\/tr>\n<tr>\n<td><code>\/api\/save-data<\/code><\/td>\n<td>POST<\/td>\n<td>Save image, prompt, and audio data to GridDB<\/td>\n<\/tr>\n<tr>\n<td><code>\/api\/save-data<\/code><\/td>\n<td>GET<\/td>\n<td>Retrieve saved data from GridDB<\/td>\n<\/tr>\n<tr>\n<td><code>\/api\/audio\/[filename]<\/code><\/td>\n<td>GET<\/td>\n<td>Serve audio files from uploads directory<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>User Interface<\/h2>\n<p><a href=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/11\/app-ui.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/11\/app-ui.png\" alt=\"\" width=\"1796\" height=\"1598\" class=\"aligncenter size-full wp-image-32570\" \/><\/a><\/p>\n<p>The main entry point of the frontend is the <code>page.tsx<\/code> file. The user interface is a single-page Next.js application with several key sections:<\/p>\n<ul>\n<li>Voice Recording Section: Large microphone button for audio recording.<\/li>\n<li>Transcribed Text Display: Shows the converted speech-to-text with language detection. 
You can also edit the prompt here before generating the image.<\/li>\n<li>Style Selection: A dropdown menu that allows users to choose different image generation styles, including photorealistic, artistic, anime, and abstract styles.<\/li>\n<li>Generated Images Grid: Displays created images with download\/save options.<\/li>\n<li>Saved Data Viewer: Shows previously saved generations from the database.<\/li>\n<\/ul>\n<p>The saved data is displayed in the <code>Saved Data Viewer<\/code> section; you can show or hide it by clicking the <strong>Show Saved<\/strong> button at the top right. Each saved entry includes the image, the prompt used to generate it, the audio reference, and the request ID. You can also play the audio and download the image.<\/p>\n<p><a href=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/11\/saved-image.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/11\/saved-image.png\" alt=\"\" width=\"481\" height=\"420\" class=\"aligncenter size-full wp-image-32561\" \/><\/a><\/p>\n<h2>Future Enhancements<\/h2>\n<p>This project is a basic demo and can be further enhanced with additional features, such as:<\/p>\n<ul>\n<li>User authentication and authorization for saved data.<\/li>\n<li>Image editing or customization options.<\/li>\n<li>Integration with other AI models for image generation.<\/li>\n<li>Speech recognition support for additional languages (currently only English is supported).<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Project Overview A modern web application that transforms spoken descriptions into high-quality images using cutting-edge AI technologies. Users can record their voice describing an image they want to create, and the system will transcribe their speech and generate a corresponding image. 
What Problem We Solved Traditional image generation tools require users to type detailed prompts, [&hellip;]<\/p>\n","protected":false},"author":41,"featured_media":52264,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[121],"tags":[],"class_list":["post-52263","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.1.1 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Voice-Based Image Generation Using Imagen 4 and ElevenLabs | GridDB: Open Source Time Series Database for IoT<\/title>\n<meta name=\"description\" content=\"Project Overview A modern web application that transforms spoken descriptions into high-quality images using cutting-edge AI technologies. Users can\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Voice-Based Image Generation Using Imagen 4 and ElevenLabs | GridDB: Open Source Time Series Database for IoT\" \/>\n<meta property=\"og:description\" content=\"Project Overview A modern web application that transforms spoken descriptions into high-quality images using cutting-edge AI technologies. 
Users can\" \/>\n<meta property=\"og:url\" content=\"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/\" \/>\n<meta property=\"og:site_name\" content=\"GridDB: Open Source Time Series Database for IoT\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/griddbcommunity\/\" \/>\n<meta property=\"article:published_time\" content=\"2025-11-07T08:00:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-03-30T18:23:36+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.griddb.net\/wp-content\/uploads\/2025\/12\/cover-2.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1408\" \/>\n\t<meta property=\"og:image:height\" content=\"768\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"griddb-admin\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@GridDBCommunity\" \/>\n<meta name=\"twitter:site\" content=\"@GridDBCommunity\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"griddb-admin\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/\"},\"author\":{\"name\":\"griddb-admin\",\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/person\/4fe914ca9576878e82f5e8dd3ba52233\"},\"headline\":\"Voice-Based Image Generation Using Imagen 4 and ElevenLabs\",\"datePublished\":\"2025-11-07T08:00:00+00:00\",\"dateModified\":\"2026-03-30T18:23:36+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/\"},\"wordCount\":1495,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#organization\"},\"image\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/#primaryimage\"},\"thumbnailUrl\":\"\/wp-content\/uploads\/2025\/12\/cover-2.png\",\"articleSection\":[\"Blog\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/\",\"url\":\"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/\",\"name\":\"Voice-Based Image Generation Using Imagen 4 and ElevenLabs | GridDB: Open Source Time Series Database for 
IoT\",\"isPartOf\":{\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/#primaryimage\"},\"thumbnailUrl\":\"\/wp-content\/uploads\/2025\/12\/cover-2.png\",\"datePublished\":\"2025-11-07T08:00:00+00:00\",\"dateModified\":\"2026-03-30T18:23:36+00:00\",\"description\":\"Project Overview A modern web application that transforms spoken descriptions into high-quality images using cutting-edge AI technologies. Users can\",\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/#primaryimage\",\"url\":\"\/wp-content\/uploads\/2025\/12\/cover-2.png\",\"contentUrl\":\"\/wp-content\/uploads\/2025\/12\/cover-2.png\",\"width\":1408,\"height\":768},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#website\",\"url\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/\",\"name\":\"GridDB: Open Source Time Series Database for IoT\",\"description\":\"GridDB is an open source time-series database with the performance of NoSQL and convenience of 
SQL\",\"publisher\":{\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#organization\",\"name\":\"Fixstars\",\"url\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/griddb.net\/wp-content\/uploads\/2019\/04\/fixstars_logo_web_tagline.png\",\"contentUrl\":\"https:\/\/griddb.net\/wp-content\/uploads\/2019\/04\/fixstars_logo_web_tagline.png\",\"width\":200,\"height\":83,\"caption\":\"Fixstars\"},\"image\":{\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/griddbcommunity\/\",\"https:\/\/x.com\/GridDBCommunity\",\"https:\/\/www.linkedin.com\/company\/griddb-by-toshiba\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/person\/4fe914ca9576878e82f5e8dd3ba52233\",\"name\":\"griddb-admin\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/5bceca1cafc06886a7ba873e2f0a28011a1176c4dea59709f735b63ae30d0342?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/5bceca1cafc06886a7ba873e2f0a28011a1176c4dea5
9709f735b63ae30d0342?s=96&d=mm&r=g\",\"caption\":\"griddb-admin\"},\"url\":\"https:\/\/www.griddb.net\/en\/author\/griddb-admin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Voice-Based Image Generation Using Imagen 4 and ElevenLabs | GridDB: Open Source Time Series Database for IoT","description":"Project Overview A modern web application that transforms spoken descriptions into high-quality images using cutting-edge AI technologies. Users can","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/","og_locale":"en_US","og_type":"article","og_title":"Voice-Based Image Generation Using Imagen 4 and ElevenLabs | GridDB: Open Source Time Series Database for IoT","og_description":"Project Overview A modern web application that transforms spoken descriptions into high-quality images using cutting-edge AI technologies. Users can","og_url":"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/","og_site_name":"GridDB: Open Source Time Series Database for IoT","article_publisher":"https:\/\/www.facebook.com\/griddbcommunity\/","article_published_time":"2025-11-07T08:00:00+00:00","article_modified_time":"2026-03-30T18:23:36+00:00","og_image":[{"width":1408,"height":768,"url":"https:\/\/www.griddb.net\/wp-content\/uploads\/2025\/12\/cover-2.png","type":"image\/png"}],"author":"griddb-admin","twitter_card":"summary_large_image","twitter_creator":"@GridDBCommunity","twitter_site":"@GridDBCommunity","twitter_misc":{"Written by":"griddb-admin","Est. 
reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/#article","isPartOf":{"@id":"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/"},"author":{"name":"griddb-admin","@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/person\/4fe914ca9576878e82f5e8dd3ba52233"},"headline":"Voice-Based Image Generation Using Imagen 4 and ElevenLabs","datePublished":"2025-11-07T08:00:00+00:00","dateModified":"2026-03-30T18:23:36+00:00","mainEntityOfPage":{"@id":"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/"},"wordCount":1495,"commentCount":0,"publisher":{"@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#organization"},"image":{"@id":"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/#primaryimage"},"thumbnailUrl":"\/wp-content\/uploads\/2025\/12\/cover-2.png","articleSection":["Blog"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/","url":"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/","name":"Voice-Based Image Generation Using Imagen 4 and ElevenLabs | GridDB: Open Source Time Series Database for 
IoT","isPartOf":{"@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/#primaryimage"},"image":{"@id":"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/#primaryimage"},"thumbnailUrl":"\/wp-content\/uploads\/2025\/12\/cover-2.png","datePublished":"2025-11-07T08:00:00+00:00","dateModified":"2026-03-30T18:23:36+00:00","description":"Project Overview A modern web application that transforms spoken descriptions into high-quality images using cutting-edge AI technologies. Users can","inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/griddb.net\/en\/blog\/voice-based-image-generation-using-imagen-4-and-elevenlabs\/#primaryimage","url":"\/wp-content\/uploads\/2025\/12\/cover-2.png","contentUrl":"\/wp-content\/uploads\/2025\/12\/cover-2.png","width":1408,"height":768},{"@type":"WebSite","@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#website","url":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/","name":"GridDB: Open Source Time Series Database for IoT","description":"GridDB is an open source time-series database with the performance of NoSQL and convenience of 
SQL","publisher":{"@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#organization","name":"Fixstars","url":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/logo\/image\/","url":"https:\/\/griddb.net\/wp-content\/uploads\/2019\/04\/fixstars_logo_web_tagline.png","contentUrl":"https:\/\/griddb.net\/wp-content\/uploads\/2019\/04\/fixstars_logo_web_tagline.png","width":200,"height":83,"caption":"Fixstars"},"image":{"@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/griddbcommunity\/","https:\/\/x.com\/GridDBCommunity","https:\/\/www.linkedin.com\/company\/griddb-by-toshiba"]},{"@type":"Person","@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/person\/4fe914ca9576878e82f5e8dd3ba52233","name":"griddb-admin","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/5bceca1cafc06886a7ba873e2f0a28011a1176c4dea59709f735b63ae30d0342?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/5bceca1cafc06886a7ba873e2f0a28011a1176c4dea59709f735b63ae30d0342?s=96&d=mm&r=g","caption":"griddb-admin"},"url":"https:\/\/www.griddb.net\/en\/author\/griddb-admin\/"}]}},"_lin
ks":{"self":[{"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/posts\/52263","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/users\/41"}],"replies":[{"embeddable":true,"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/comments?post=52263"}],"version-history":[{"count":3,"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/posts\/52263\/revisions"}],"predecessor-version":[{"id":55096,"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/posts\/52263\/revisions\/55096"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/media\/52264"}],"wp:attachment":[{"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/media?parent=52263"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/categories?post=52263"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.griddb.net\/en\/wp-json\/wp\/v2\/tags?post=52263"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}