I use Speech Note for STT/TTS and it works great. You can choose between different models; I use Whisper (more accurate) or Vosk (faster). You don't need a GPU, but one will speed things up greatly.
I was able to quickly set up and use Whisper (base) in Speech Note without issue, and it saved me over 80% of the work I would otherwise have had to do manually. Thank you for the recommendation.
I've had good experiences with whisper.cpp (should be in the AUR). I used the large model on my GPU (3060), and it filled 11.5 of the 12 GB of VRAM, so you might have to settle for a lower-tier model. The speed was pretty much real time on my GPU, so it might be quite a bit slower on your CPU, unless the lower-tier models are also a lot faster (I never tested them because I didn't need to).
The large model had pretty much perfect accuracy (only 5 or so mistakes in ~40 pages of transcriptions), and that was with Dutch audio recorded on a smartphone. If it can handle my pretty horrible conditions, your audio should (hopefully) be no problem to transcribe.
I used the base model and it ran at a very acceptable speed on CPU only. The accuracy was decent considering the recording was mediocre quality at best. Thank you for the suggestion.
Depends on what the audio is. What's the crisis?
Generally, you can use the CPU for anything based on PyTorch; it will just take substantially longer.
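As a rough illustration of the CPU fallback, here is a minimal sketch that assumes the `openai-whisper` Python package (which is PyTorch-based) is installed. The helper names `pick_device` and `transcribe` are made up for this example, and the `base` model is just one reasonable choice for CPU-only machines.

```python
def pick_device(cuda_available: bool) -> str:
    """Return the PyTorch device string: GPU if one is usable, else CPU."""
    return "cuda" if cuda_available else "cpu"


def transcribe(path: str, model_name: str = "base") -> str:
    """Transcribe one audio file (hypothetical helper, not part of whisper)."""
    # Imported lazily so the rest of the script works without these installed.
    import torch
    import whisper

    device = pick_device(torch.cuda.is_available())
    model = whisper.load_model(model_name, device=device)
    # Half precision is only useful on the GPU, so turn it off on CPU.
    result = model.transcribe(path, fp16=(device == "cuda"))
    return result["text"]
```

Calling `transcribe("voicemail.wav")` (a placeholder file name) should work on either kind of machine; on CPU it simply takes longer.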
Transcription of numerous voicemails and phone calls for a legal matter. I would like to supply transcripts with the audio files so we don't have to pay for as much of the lawyer's paralegals' time to review them and decide what is actually going to be useful.
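For a pile of recordings like this, a batch run could look roughly like the sketch below. It assumes whisper.cpp and ffmpeg are installed, that the whisper.cpp binary is called `whisper-cli` (older builds name it `main`), and that a model sits at `models/ggml-base.bin`; those paths, the extension list, and the function names are all placeholders to adjust. whisper.cpp expects 16 kHz mono WAV input, hence the ffmpeg conversion step.

```python
from pathlib import Path
import subprocess

WHISPER_BIN = "whisper-cli"      # assumption: older builds call this "main"
MODEL = "models/ggml-base.bin"   # assumption: wherever your model file lives
AUDIO_EXTS = {".wav", ".mp3", ".m4a", ".amr", ".ogg"}  # adjust as needed


def ffmpeg_cmd(src: Path, wav: Path) -> list[str]:
    """Command to convert any audio file to 16 kHz mono 16-bit PCM WAV."""
    return ["ffmpeg", "-y", "-i", str(src),
            "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", str(wav)]


def whisper_cmd(wav: Path) -> list[str]:
    """Command to transcribe one WAV file and write a .txt next to it."""
    return [WHISPER_BIN, "-m", MODEL, "-f", str(wav), "-otxt"]


def transcribe_folder(folder: Path, out_dir: Path) -> None:
    """Convert and transcribe every audio file found directly in `folder`."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(folder.iterdir()):
        if src.suffix.lower() not in AUDIO_EXTS:
            continue
        wav = out_dir / (src.stem + ".16k.wav")
        subprocess.run(ffmpeg_cmd(src, wav), check=True)
        subprocess.run(whisper_cmd(wav), check=True)
```

Something like `transcribe_folder(Path("voicemails"), Path("transcripts"))` would then leave one text file per recording in the output folder, with the originals untouched so they can still be handed over alongside the transcripts.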
Start with Whisper as someone else mentioned. DeepSpeech by Mozilla is another simple one.
Both are similar in performance and accuracy for normal spoken conversation with no extra auditory noise.
Whisper worked for me. I'll have to go back through and tag speakers and fix a few spots, but you guys have saved me 80-90% of the work. Thank you.
Search GitHub for voice dictation