tbran 10 months ago

To run speech-to-text on my laptop, I've been using Justine Tunney's downloadable single-executable Whisper file.

I use it to transcribe audio, then copy the transcript into an LLM to get notes on whatever it is. It helps me decide whether to watch or listen to something and saves a bunch of time.

Her tweet: https://x.com/JustineTunney/status/1825551821857010143

Instructions from Simon Willison: https://simonwillison.net/2024/Aug/19/whisperfile/

Command line options: https://github.com/Mozilla-Ocho/llamafile/issues/544#issueco...

jwr 10 months ago

Amazing work.

I am also impressed by the advances in technology. 20 years ago, I had severe RSI problems and worked on "vx-mode", a package for interfacing XEmacs to Dragon NaturallySpeaking, the best speech-recognition solution available at the time. My goals were similar, although the result was nowhere near what the OP has done. Also, speech recognition tech was nowhere near what we have now: I still remember buying good microphones, worrying about microphone placement relative to mouth, endless training and re-training…

This kind of software can make a huge difference for many people.

  • Jeff_Brown 10 months ago

    I'm really happy about it but I'm not sure how game changing it would be for a blind person. It seems to require seeing what's on the page.

    • jwr 10 months ago

      Perhaps not for a blind person, but for anyone with RSI or other hand/wrist impairments, this can make a huge difference. I speak from experience, having used dictation to work around RSI issues.

submeta 10 months ago

Year 2080: AGIs help you transcribe, structure, layout your code/text/thoughts. At the same time: HN posts: „New package for Emacs doing xyz“.

  • raverbashing 10 months ago

    And all it requires is some emacs version bump, some dependency upgrades, some external servers and changing the default shortcut in a confusing lisp file to something that doesn't require pressing 8 keys at the same time

    • kleiba 10 months ago

      Fun fact: even pressing three keys at the same time is rare when using Emacs (although there are some three-key combos I use regularly); most shortcuts consist of consecutive key presses.
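
      The chord-versus-sequence distinction can be checked from Lisp. A small sketch, assuming only the built-in `kbd` (the example keys are taken from elsewhere in this thread):

          ;; A key sequence is several events pressed one after another.
          (length (kbd "C-x p f"))   ; => 3 (C-x, then p, then f)
          ;; A chord is a single event, however many modifiers are held down.
          (length (kbd "C-M-%"))     ; => 1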

      • fhd2 10 months ago

        I sometimes feel like playing the piano :D But the UX is better than you'd think, there's packages that show you what options you have for what key to press next, and the sequences are generally quite logical (e.g. CTRL-x followed by "p" has all the commands related to projects).

        Plus you can always just enter the command instead of using the key stroke for it. Again, the default UX for that is a bit weak, but with a few packages it becomes pretty strong.

        • ashton314 10 months ago

          > there's packages that show you what options you have for what key to press next

          Rejoice! The excellent which-key package that does this comes bundled with Emacs 30! (Emacs 30 will probably be released soon.)

          > enter command… default UX is a bit weak

          Agreed: the packages Helm, Ivy, and Vertico make this interface much nicer. I use Vertico [1] personally. Though, from Emacs 29, there are some really nice options you can set. I used the following in my Bedrock starter kit [2] to get nicer tab-completion: as soon as you hit TAB twice you'll get bumped into the Completion buffer to select something with your cursor.

          Here's the relevant config:

              (setopt completion-auto-help 'always)                  ; Open completion always; `lazy' another option
              (setopt completions-max-height 20)                     ; This is arbitrary
              (setopt completions-detailed t)
              (setopt completions-format 'one-column)
              (setopt completions-group t)
              (setopt completion-auto-select 'second-tab)            ; Much more eager
              ;(setopt completion-auto-select t)                     ; See `C-h v completion-auto-select' for more possible values
          
          There's more configuration options, of course, but this is helpful.

          [1]: https://github.com/minad/vertico [2]: https://codeberg.org/ashton314/emacs-bedrock
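
          For the which-key part, a minimal sketch, assuming an Emacs where which-key is available (bundled from Emacs 30, or installed as a package on older versions):

              ;; Turn on the key-hint popup only if the mode exists in this Emacs.
              (when (fboundp 'which-key-mode)
                (which-key-mode 1))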

          • spauldo 10 months ago

            which-key made it in? Sweet! I've been saying for years it should be in Emacs and turned on by default.

        • kleiba 10 months ago

          True. I oftentimes find myself typing out the command rather than using some obscure key sequence like C-c C-v n (case in point: https://orgmode.org/manual/Key-bindings-and-Useful-Functions...). Since Emacs does tab completion for the command name too, I personally find that a better UX than using the "shortcut" (if I can remember it at all).

          • pxc 10 months ago

            I tend to use search for infrequently used stuff and stuff I'm just trying to learn for the first time, then if I find myself using it several times in a session I look up the keybind to start practicing that. If it sticks, it sticks, and if it doesn't... the search functionality is great!
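
            Looking up the keybind can also be done from Lisp. A rough sketch, assuming the command of interest is `query-replace-regexp` (picked only as an example); interactively, `C-h w` (`where-is`) answers the same question:

                ;; Return a readable description of the first key sequence bound
                ;; to the command in the currently active keymaps, or "" if none.
                (key-description
                 (where-is-internal 'query-replace-regexp nil t))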

        • eptcyka 10 months ago

          > the sequences are generally quite logical (e.g. CTRL-x followed by "p" has all the commands related to projects).

          They really are not.

      • argiopetech 10 months ago

        Depends on whether you count shift. I use C-M-% (query-replace-regexp) fairly regularly, and that's 4.

        • kleiba 10 months ago

          Sure, shift counts. I suppose I would bind it to a more convenient keybinding if I used query-replace-regexp regularly, but note that I didn't say there weren't any such keybindings, just that they're rare.
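
          A minimal sketch of such a rebinding, assuming `C-c x` is free in your setup (the key choice is hypothetical; `C-c` followed by a letter is the range conventionally left to users):

              ;; Put the command behind C-M-% on a two-step sequence instead.
              (global-set-key (kbd "C-c x") #'query-replace-regexp)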

        • b5n 10 months ago

          I assume this varies widely across setups.

              (use-package visual-regexp
                :defer t
                :bind (("C-c r" . vr/replace)
                       ("C-c q" . vr/query-replace)
                       ("C-r" . vr/isearch-backward)
                       ("C-s" . vr/isearch-forward)))
          
              (use-package visual-regexp-steroids
                :defer t)
    • wiz21c 10 months ago

      year 2080: "M-x ai: imagine you are a smart emacs developer, write a configuration file that sets up LSP"

      answer:

      "I did it. Please note that you're using a Microsoft protocol. Microsoft has a long history of attacking the 4 core freedoms of the Free Software movement which are

      The freedom to run the program as you wish, for any purpose (freedom 0). ..."

      • pxc 10 months ago

        This is kinda ideal tbh. I like how, for instance, F-Droid warns users about anti-features and integrations with proprietary web services. Clear messaging about problematic software + freedom to nonetheless choose those problematic options is great.

        That said, I don't think this is the way the FSF evaluates software, or that they'd treat an open protocol like this. I could imagine a warning like this about integrating with a proprietary language server in particular, though— and I'd be grateful for it! A locally-run AI assistant that cared about things like that would be super cool.

      • anthk 10 months ago

        That AI would be running under GNU Hurd with Guix. Also, Scheme simplified itself so hard that it created something akin to the Common Lisp standard, unifying all the ice's and srfi's into something manageable by humans in a single package.

        Also it rewrote all of the legacy Emacs' Elisp into manageable Emacs Guile (with an uberfast JIT and/or libre Guile microcode from the FSF).

lepisma 10 months ago

Hey, author here. Didn't notice this came up on HN.

I wrote a small follow up trying to write and speak at the same time here https://lepisma.xyz/journal/2024/09/13/can-i-output-two-stre...

  • pama 10 months ago

    That's a cool idea. Could the LLM find the right location for the audio stream by simply having the context of the buffer, and the location of the text and audio cursor when the interaction starts?

    • lepisma 10 months ago

      I think it could work. In my example of writing a docstring, I can see this working out with high probability.

voltaireodactyl 10 months ago

This looks very useful and beautifully presented. Looking forward to being able to use it with a local model.

Jeff_Brown 10 months ago

I would use this for edits that are hard to do otherwise. Like, instead of typing `M-x align-regexp` and then figuring out what regular expression to type, I would just highlight a passage and say to the LLM "Can you align all the library names in this import statement?"
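
For comparison, the manual route can be wrapped in a small command once the regexp is worked out. A rough sketch, assuming the imports carry an `as` keyword (as in Python or Haskell) and that lining up those keywords is what is wanted; the command name `my/align-import-aliases` is made up for illustration:

    (defun my/align-import-aliases (beg end)
      "Align the `as' keyword of the import lines between BEG and END."
      (interactive "r")
      ;; Group 1 is the whitespace that align-regexp stretches; spacing is 1.
      (align-regexp beg end "\\(\\s-*\\)\\_<as\\_>" 1 1 nil))

Select the import block and run `M-x my/align-import-aliases`.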

BeetleB 10 months ago

I did something similar here:

https://blog.nawaz.org/posts/2023/Dec/cleaning-up-speech-rec...

I now use Whisper with a much expanded prompt and have the flow integrated both in Emacs and my WM.

Prior HN discussion:

https://news.ycombinator.com/item?id=40174921

I've since done hours of transcription with it - often transcribing whole emails. The challenge is that my brain thinks very differently while talking compared to while typing. As a result, my output is very verbose, and is very different from what I would have typed. I haven't figured out how to speak as if I'm typing.

ggm 10 months ago

"Emacs: Upgrade to MELPA"

ELPA installed s/w suite: "I'm sorry Dave, I can't do that"

  • anthk 10 months ago

    More like: Emacs: pull all the libre MELPA repos into a local .el file to be checked on demand. Hide all the repos that are proprietary or depend on proprietary software.

namidark 10 months ago

Has anyone gotten whisper.el/.cpp to work on OSX with the microphone permissions and Emacs?