From RAG to Voice: Building a Shopify Assistant for Indian Customers

Why This Matters for D2C and Shopify Sellers

If you talk to any D2C founder, one problem comes up again and again: customers add items to cart, then they disappear.

Most Shopify sellers try to fix this using:

Automated emails
WhatsApp reminders
Discount nudges
Chat widgets

All of these approaches assume the customer is willing to read, scroll, and process information, usually in English.

But, in India, that assumption often breaks.

India is a voice-native market. People are comfortable asking questions verbally, especially when shopping:

“Is this kurta cotton?”
“What sizes are available?”

When answers are buried inside long descriptions written in English, customers don’t always abandon because they dislike the product. Often, they leave because finding the right information takes effort.

This project started from a simple observation:

If customers can speak their questions, why should stores only reply in text?

What I Set Out to Build

The goal was not to build a fancy chatbot.

The goal was to build a store-aware voice assistant that:

Understands Shopify product and policy data
Answers questions accurately using retrieval (not guessing)
Speaks answers out loud
Supports natural follow-up questions

At the end, the assistant behaves like a helpful store staff member, not an FAQ page.

High-Level Flow of the System

At a conceptual level, the system works like this:

The customer speaks a question
Speech is converted to text (STT)
Question is answered using RAG over Shopify data
Answer is converted back to speech (TTS)
Assistant waits for the next question

This loop continues until the customer says “exit” or “stop”.

The Foundation: RAG Over Shopify Data

The engagement depends on the quality of the responses. And we need to remember that voice only amplifies wrong answers—it does not fix them.

Data the Assistant Knows

The assistant is built using real Shopify store information:

Product names
Prices
Available sizes
Fabric and material details
Shipping timelines
Return and refund policies

This data is fetched from Shopify, cleaned to remove HTML, and converted into plain text.

Chunking and Preparation

Raw product pages are long and messy. Reading them aloud would be useless.

So the content is:

Broken into small, meaningful chunks
Grouped logically (product-level, policy-level)
Stored in JSON

Chunking ensures:

Product questions don’t retrieve policy text
Policy questions don’t retrieve product descriptions
Answers stay relevant and short

def chunk_text(text, size=CHUNK_SIZE):
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size
    return chunks

with open(os.path.join(SCRIPT_DIR, "shopify_data.json"), "r", encoding="utf-8") as f:
    data = json.load(f)

chunks = []

for item in data:
    text_chunks = chunk_text(item["content"])
    for chunk in text_chunks:
        chunks.append({
            "title": item["title"],
            "type": item["type"],
            "content": chunk
        })

with open(os.path.join(SCRIPT_DIR, "rag_chunks.json"), "w", encoding="utf-8") as f:
    json.dump(chunks, f, indent=2)

print("Chunking complete. Total chunks:", len(chunks))

‍

Chunking the data

‍

Embeddings and Vector Search

Each chunk is converted into a vector embedding using a sentence-transformer model.

These embeddings are stored in a FAISS index, which allows semantic search.

When a customer asks:

“What sizes do you have in cotton kurta?”

The system retrieves:

The cotton kurta chunk
Not unrelated products
Not shipping policies

This is the core of the RAG system.

Handling Real Customer Questions (Not Textbook English)

Real customers don’t speak like documentation.

They say things like:

“cotton kurta size?”
“fabric of kurta”

To handle this, the assistant uses:

Keyword-based intent detection (price, size, fabric)
Product name extraction from the spoken text
Attribute-level answer extraction

Instead of reading the entire product description aloud, it extracts only what is relevant.

Example:

Question: “What is the fabric of kurta?”
Answer: “The cotton kurta is made from 100% pure cotton.”

This keeps voice responses short, focused, and natural.

Speech-to-Text (STT): How Voice Becomes Text

The Speech-to-Text module converts user voice input into text using Hugging Face transformer-based speech recognition models.

model = WhisperModel("base", compute_type="int8")
def listen(fs=16000, silence_threshold=0.015, silence_duration=1.5, max_duration=12):
    """
    Record audio dynamically until silence is detected after speech starts.
    - silence_threshold: RMS amplitude below which audio is considered silent
    - silence_duration:  seconds of silence needed to stop recording
    - max_duration:      hard cap on recording length in seconds
    """
    print("Listening...")

    chunk_size = int(fs * 0.1)          # process in 100ms chunks
    max_chunks = int(max_duration / 0.1)
    silence_chunks_needed = int(silence_duration / 0.1)
    audio_chunks = []
    silent_count = 0
    speech_started = False

    with sd.InputStream(samplerate=fs, channels=1, dtype='float32') as stream:
        for _ in range(max_chunks):
            chunk, _ = stream.read(chunk_size)
            audio_chunks.append(chunk.copy())

            rms = np.sqrt(np.mean(chunk ** 2))

            if rms > silence_threshold:
                speech_started = True
                silent_count = 0
            elif speech_started:
                silent_count += 1
                if silent_count >= silence_chunks_needed:
                    break

    if not audio_chunks or not speech_started:
        return ""

    recording = np.concatenate(audio_chunks, axis=0)
    # Convert float32 [-1, 1] → int16 for a standard WAV that Whisper handles reliably
    recording_int16 = (recording * 32767).astype(np.int16)

‍

How It Works

The user speaks into the microphone.
The audio input is captured and processed.
A pretrained Hugging Face ASR (Automatic Speech Recognition) model transcribes the speech into text.
The transcribed text is forwarded to the RAG pipeline for query processing.

Technologies Used

faster-whisper (WhisperModel – base, int8 optimized)
sounddevice (microphone input streaming)
numpy (audio signal processing & RMS calculation)
tempfile (temporary audio storage)

File

scripts/hf_stt.py

Purpose

This module enables natural voice interaction with the AI assistant, eliminating the need for manual typing.

Text-to-Speech (TTS): Turning Answers into Voice

The Text-to-Speech module converts the AI-generated text response back into natural-sounding speech.

pygame.mixer.pre_init(frequency=44100, size=-16, channels=2, buffer=512)
pygame.mixer.init()

# Sweet, natural neural voice — options you can swap:
#   "en-US-JennyNeural"  — warm and friendly (default)
#   "en-US-AriaNeural"   — calm and clear
#   "en-US-SaraNeural"   — bright and polite
VOICE = "en-US-JennyNeural"
async def _generate(text: str, path: str):
    communicate = edge_tts.Communicate(text, voice=VOICE, rate="-5%", pitch="+0Hz")
    await communicate.save(path)

‍

How It Works

The RAG module generates a text response.
The text is passed to a Hugging Face TTS model.
The model synthesizes human-like speech.
The audio is played back to the user.

Technologies Used

edge-tts (Neural voice synthesis – Microsoft Edge voices)
asyncio (asynchronous speech generation)
pygame (audio playback)
os (file cleanup)

File

scripts/hf_tts.py

Purpose

This module provides a fully voice-enabled AI assistant experience by converting responses into speech.

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))

# Load chunked data
with open(os.path.join(SCRIPT_DIR, "rag_chunks.json"), "r", encoding="utf-8") as f:
    data = json.load(f)

texts = [item["content"] for item in data]

# Load FAISS
index = faiss.read_index(os.path.join(SCRIPT_DIR, "faiss.index"))

# Load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")
welcome = "Shopify Voice Assistant started. Ask your question."
print(welcome)
speak(welcome)

‍

Why Voice Matters Specifically in India

Voice changes how customers interact with a store:

It reduces effort
It feels conversational
It works well even when grammar isn’t perfect
It builds trust faster than long text

For Indian Shopify sellers, this can improve:

Product clarity
Customer confidence
Overall shopping experience

Sometimes, a clear spoken answer is all a customer needs to move forward.

Architecture Summary

‍Data Layer

Shopify products and policies

RAG Layer

Chunking
Embeddings
FAISS vector search

Logic Layer

Intent detection (price, size, fabric)
Product name extraction
Attribute-based answers

Voice Layer

Each layer is independent, making the system:

Easier to debug
Easier to extend
Practical for real use

Shopify voice assistant

Possible Extensions Beyond This Project

Although this project focuses on voice-based support, the same architecture can be extended to:

Cart-drop voice follow-ups
WhatsApp voice notes
Multilingual STT/TTS engines
Human handoff for complex cases

Once the store knowledge is solid, voice becomes just another interface.

At Superteams, we help you build voice agents customized to your business needs. To learn more, speak to us.

Project Link

You can explore the project here:

**🔗 Project Repository:
**https://github.com/AbhinayaPinreddy/Shopify-Voice-AI-Assistant