Before we dive into the pipeline, you may want to check out the full code on my GitHub page, as I will be referring to some sections of it.
The figure below explains the workflow of the AI-powered virtual language tutor, which is designed to provide a real-time, voice-based conversational learning experience:
- The user starts the conversation by initiating a recording of their speech, temporarily saving it as a .wav file. This is done by pressing and holding the spacebar, and the recording is stopped when the spacebar is released. The sections of the Python code that enable this press-and-talk functionality are explained below.
The following global variables are used to manage the state of the recording process:
recording = False       # Indicates whether the system is currently recording audio
done_recording = False  # Indicates that the user has completed recording a voice command
stop_recording = False  # Indicates that the user wants to exit the conversation
The listen_for_keys function checks for key presses and releases. It sets the global variables based on the state of the spacebar and the Esc key.
def listen_for_keys():
    # Function to listen for key presses to control recording
    global recording, done_recording, stop_recording
    while True:
        if keyboard.is_pressed('space'):  # Start recording on spacebar press
            stop_recording = False
            recording = True
            done_recording = False
        elif keyboard.is_pressed('esc'):  # Stop the conversation on 'esc' press
            stop_recording = True
            break
        elif recording:  # Stop recording on spacebar release
            recording = False
            done_recording = True
            break
        time.sleep(0.01)
The callback function handles the audio data during recording. It checks the recording flag to determine whether to record the incoming audio data.
def callback(indata, frames, time, status):
    # Function called for each audio block during recording
    if recording:
        if status:
            print(status, file=sys.stderr)
        q.put(indata.copy())
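These snippets rely on some module-level setup that is not shown here: the keyboard, time, and queue imports and the shared queue that callback fills and press2record later drains. A minimal sketch of that setup, with names assumed to match the snippets in this post (the repository may organise it slightly differently):
import queue
import sys
import time
import threading
import tempfile

import keyboard          # for keyboard.is_pressed()
import sounddevice as sd
import soundfile as sf

# Thread-safe buffer that callback() fills and press2record() drains
q = queue.Queue()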
The press2record function is the main function responsible for handling voice recording when the user presses and holds the spacebar. It initialises the global variables that manage the recording state, determines the sample rate, and creates a temporary file to store the recorded audio. The function then opens a SoundFile object to write the audio data and an InputStream object to capture the audio from the microphone, using the previously mentioned callback function. A thread is started to listen for key presses, specifically the spacebar for recording and the 'esc' key to stop. Inside a loop, the function checks the recording flag and writes the audio data to the file while recording is active. If the recording is stopped, the function returns -1; otherwise, it returns the filename of the recorded audio.
def press2record(filename, subtype, channels, samplerate):
    # Function to handle recording when a key is pressed
    global recording, done_recording, stop_recording
    stop_recording = False
    recording = False
    done_recording = False
    try:
        # Determine the samplerate if not provided
        if samplerate is None:
            device_info = sd.query_devices(None, 'input')
            samplerate = int(device_info['default_samplerate'])
            print(int(device_info['default_samplerate']))
        # Create a temporary filename if not provided
        if filename is None:
            filename = tempfile.mktemp(prefix='captured_audio',
                                       suffix='.wav', dir='')
        # Open the sound file for writing
        with sf.SoundFile(filename, mode='x', samplerate=samplerate,
                          channels=channels, subtype=subtype) as file:
            with sd.InputStream(samplerate=samplerate, device=None,
                                channels=channels, callback=callback,
                                blocksize=4096) as stream:
                print('press Spacebar to start recording, release to stop, or press Esc to exit')
                # Start the key listener on a separate thread
                listener_thread = threading.Thread(target=listen_for_keys)
                listener_thread.start()
                # Write the recorded audio to the file
                while not done_recording and not stop_recording:
                    while recording and not q.empty():
                        file.write(q.get())
        # Return -1 if recording is stopped
        if stop_recording:
            return -1
    except KeyboardInterrupt:
        print('Interrupted by user')
    return filename
Finally, the get_voice_command function calls press2record to record the user's voice command.
def get_voice_command():
    # ...
    saved_file = press2record(filename="input_to_gpt.wav", subtype=args.subtype,
                              channels=args.channels, samplerate=args.samplerate)
    # ...
- Having captured and saved the voice command in a temporary .wav file, we now enter the transcription phase. In this stage, the recorded audio is converted into text using Whisper. The corresponding script for simply running a transcription job on a .wav file is given below:
def get_voice_command():
    # ...
    result = audio_model.transcribe(saved_file, fp16=torch.cuda.is_available())
    # ...
This method takes two parameters: the path to the recorded audio file, saved_file, and an optional flag to use FP16 precision if CUDA is available, which enhances performance on compatible hardware. It simply returns the transcribed text.
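The audio_model object is created elsewhere in the script; a minimal sketch of how such a model could be loaded with the openai-whisper package (the "base" model size is my assumption here, the full script may use a different one):
import torch
import whisper

# Load a Whisper checkpoint once at startup ("base" is an assumed size;
# tiny/small/medium/large work the same way)
audio_model = whisper.load_model("base")

# Transcribe the saved .wav file; fp16 only helps when a CUDA GPU is present
result = audio_model.transcribe("input_to_gpt.wav", fp16=torch.cuda.is_available())
print(result["text"])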
- Then, the transcribed text is sent to ChatGPT to generate an appropriate response in the interact_with_tutor() function. The corresponding code segment is as follows:
def interact_with_tutor():
    # Define the system role to set the behaviour of the chat assistant
    messages = [
        {"role": "system", "content": "Du bist Anna, meine deutsche Lernpartnerin. "
            "Du wirst mit mir chatten. Ihre Antworten werden kurz sein. "
            "Mein Niveau ist B1, stell deine Satzkomplexität auf mein Niveau ein. "
            "Versuche immer, mich zum Reden zu bringen, indem du Fragen stellst, und vertiefe den Chat immer."}
    ]
    while True:
        # Get the user's voice command
        command = get_voice_command()
        if command == -1:
            # Save the chat logs and exit if recording is stopped
            save_response_to_pkl(messages)
            return "Chat has been stopped."
        # Add the user's command to the message history
        messages.append({"role": "user", "content": command})
        # Generate a response from the chat assistant
        completion = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=messages
        )
        # Extract the response from the completion
        chat_response = completion.choices[0].message.content
        print(f'ChatGPT: {chat_response}\n')  # Print the assistant's response
        messages.append({"role": "assistant", "content": chat_response})  # Add the assistant's response to the message history
        # ...
The interact_with_tutor function begins by defining the system role of ChatGPT to shape its behaviour throughout the conversation. Since my goal is to practise German, I set the system role accordingly. I called my virtual tutor "Anna" and set my language proficiency level so that she adjusts her responses. Additionally, I instructed her to keep the conversation engaging by asking questions.
Next, the user's transcribed voice command is appended to the message list with the role of "user". This message is then sent to ChatGPT. As the conversation continues within the while loop, the entire history of user commands and GPT responses is logged in the messages list.
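To make the bookkeeping concrete, here is what messages would look like after a single exchange (the two utterances are invented examples):
messages = [
    {"role": "system", "content": "Du bist Anna, meine deutsche Lernpartnerin. ..."},
    {"role": "user", "content": "Hallo Anna, wie geht es dir?"},          # transcribed voice command
    {"role": "assistant", "content": "Mir geht es gut, danke! Und dir?"}  # ChatGPT's reply
]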
- After each ChatGPT response, we convert the text message into speech using gTTS.
def interact_with_tutor():
    # ...
        # Convert the text response to speech
        speech_object = gTTS(text=messages[-1]['content'], tld="de", lang=language, slow=False)
        speech_object.save("GPT_response.wav")
        current_dir = os.getcwd()
        audio_file = "GPT_response.wav"
        # Play the audio response
        play_wav_once(audio_file, args.samplerate, 1.0)
        os.remove(audio_file)  # Remove the temporary audio file
The gTTS() function takes four parameters: text, tld, lang, and slow. The text parameter is assigned the content of the last message in the messages list (indicated by [-1]), which is the text you want to convert into speech. The tld parameter specifies the top-level domain for the Google Translate service; setting it to "de" means the German domain is used, which can be important for ensuring that the pronunciation and intonation are appropriate for German. The lang parameter specifies the language in which the text should be spoken; in this code, the language variable is set to 'de', meaning the text will be spoken in German. Finally, the slow parameter controls the speed of the speech: slow=False means the speech is spoken at normal speed, whereas True would make it slower.
- The converted speech of the ChatGPT response is then saved as a temporary .wav file, played back to the user with the play_wav_once helper (sketched below), and then removed.
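play_wav_once is a small helper from the repository; a minimal sketch of what it could look like with sounddevice and soundfile is given below (the signature matches the call above, but the body is my assumption):
def play_wav_once(filename, samplerate, volume=1.0):
    # Read the whole file into memory (assumes a format soundfile can decode)
    data, file_samplerate = sf.read(filename, dtype='float32')
    # Play it back once at the file's own sample rate and block until done
    sd.play(data * volume, file_samplerate)
    sd.wait()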
- The interact_with_tutor function runs repeatedly as long as the user continues the conversation by pressing the spacebar again.
- If the user presses "esc", the conversation ends and the entire chat is saved to a pickle file, chat_log.pkl, which you can use later for analysis; a minimal sketch of such a helper follows below.
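save_response_to_pkl is another small helper; a sketch using the standard pickle module (the chat_log.pkl filename comes from the text above, the body is my assumption):
import pickle

def save_response_to_pkl(messages):
    # Persist the full message history for later analysis
    with open("chat_log.pkl", "wb") as f:
        pickle.dump(messages, f)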