To really do this you want to break the text-to-speech process into two pieces: use English to turn the text into phonemes, and then use the other language to turn the phonemes into audio.
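A rough sketch of that two-stage idea, using espeak-ng as a stand-in (it can dump its phoneme mnemonics with -x and read phonemes back when the input is wrapped in [[...]]); this is only an illustration of the pipeline, not how any OS engine is actually wired up:

    import subprocess

    TEXT = "Hello, how are you?"

    # Stage 1: English letter-to-sound rules turn the text into espeak's
    # phoneme mnemonics (-q = no audio, -x = print phonemes to stdout).
    phonemes = subprocess.run(
        ["espeak-ng", "-v", "en-us", "-q", "-x", TEXT],
        capture_output=True, text=True, check=True,
    ).stdout.split()

    # Stage 2: a Spanish voice renders those phonemes, substituting its own
    # nearest sounds for any it doesn't have. Input in [[...]] is treated
    # as phonemes rather than text.
    subprocess.run(
        ["espeak-ng", "-v", "es", "[[" + " ".join(phonemes) + "]]"],
        check=True,
    )

How good the result sounds depends entirely on how gracefully the second voice handles phonemes outside its own set, which is exactly the encoding problem raised below.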
This only works if your phonemes are encoded pretty generically, though. For example, /f/ in English is labiodental while it's bilabial in Spanish, so if you want your accent-changing to work right you'll need to either represent both as /f/ or have a reasonable model for picking the closest sound a speaker of a given language is likely to be able to reproduce for any given input.
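One way to picture that "closest sound" model: describe each phoneme by articulatory features and map an input phoneme to whichever member of the target language's inventory shares the most features with it. The feature sets and inventories below are toy illustrations, not real phonologies:

    # Map a phoneme to the nearest sound in a target language's inventory
    # by counting shared articulatory features. Purely illustrative data.
    FEATURES = {
        "f": {"voiceless", "labial", "labiodental", "fricative"},
        "v": {"voiced", "labial", "labiodental", "fricative"},
        "ɸ": {"voiceless", "labial", "bilabial", "fricative"},
        "θ": {"voiceless", "coronal", "dental", "fricative"},
        "s": {"voiceless", "coronal", "alveolar", "fricative"},
        "t": {"voiceless", "coronal", "alveolar", "plosive"},
    }

    INVENTORIES = {
        "es": ["f", "θ", "s", "t"],   # toy Spanish-ish inventory
        "nl": ["f", "v", "s", "t"],   # toy Dutch-ish inventory
    }

    def closest(phoneme: str, language: str) -> str:
        """Return the inventory phoneme sharing the most features with `phoneme`."""
        target = FEATURES[phoneme]
        return max(INVENTORIES[language],
                   key=lambda p: len(FEATURES[p] & target))

    print(closest("ɸ", "es"))  # -> 'f': nearest match for a bilabial fricative
    print(closest("θ", "nl"))  # -> 's': a voiceless coronal fricative stands in for 'th'

A real system would want a better distance than raw feature overlap, but the shape of the problem is the same.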
IMHO one of the reasons for the author's surprise is the colloquial use of the word accent, whereby one usually means a mix of pronunciation [1] and intonation [2].
I think that the surprise disappears once we look at these two factors individually. As per jefftk's comment, it is to be expected that TTS in a certain language will be limited to the phones (and thus the pronunciation) of its language. On the other hand, intonation is always bound to sound "foreign" seeing as this TTS software cannot get even the original intonation right (try listening to the sample text with the US voice to see what I mean), let alone that of a different language.
The surprising thing to some of us is how much it sounds like a human native speaker of that language speaking English. Not that it doesn't sound like 'native' English intonation, nobody would expect that; it's still surprising that after being set up to speak language A, it sounds like a human language-A speaker's accent when reading English too, even though that wasn't the intent of the setup. Perhaps it's not surprising to you that it would go like this, because you understand the technology better and so expected it!
And then there are other people in this thread who disagree and don't think most of them sound very much like a human speaker of a non-English language speaking English! So maybe it's not obvious after all...
I tried a couple, and most of them don't sound super accurate as foreign accents. The Dutch one the author highlighted is pretty far off from what I'm used to. It sounds more like a Dutch person trying to pronounce English as if it were Dutch, rather than an actual Dutch accent.
Trivially, pico2wave has two English voices, "en-US" and "en-GB", with "American" and "English" accents, respectively. Incidentally, the "en-GB" one is quite a bit better than the "en-US" one to my ear.
pico2wave also has:
German (de-DE)
English, US (en-US)
English, GB (en-GB)
Spanish (es-ES)
French (fr-FR)
Italian (it-IT)
I think pico2wave's accents induced by cramming English text through the "wrong" language sound a bit better than the few I tried on the Mozilla web speech API, and it works offline, but I don't know that they sound good enough (similar enough to a real person's accent) to be really very useful for that.
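For anyone who wants to reproduce the trick locally, a small sketch of the "wrong language" experiment with pico2wave (from libttspico-utils) and aplay on Linux; -l selects the voice, -w the output wav file, and both tools are assumed to be installed:

    import os, subprocess, tempfile

    TEXT = "Good morning, how are you today?"

    with tempfile.TemporaryDirectory() as tmp:
        wav = os.path.join(tmp, "out.wav")  # pico2wave expects a .wav filename
        # A Spanish voice reading English text gives the "accent" effect.
        subprocess.run(["pico2wave", "-l", "es-ES", "-w", wav, TEXT], check=True)
        subprocess.run(["aplay", wav], check=True)

Swapping es-ES for de-DE, fr-FR, or it-IT gives the other accents from the voice list above.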
Fascinating. I tried Hindi, and at best it's pretty similar to how Hollywood portrays a native Hindi speaker talking in English. Unlike the 1000+ Hindi speakers I know.
Author here: I discovered this when I was building a multiplication table practice app for my 7 y/o son. You can play around with that here (try quiz mode): https://hugo-tafels.waleson.com/ . Note that the compliments and encouragements are a bit .. weird .. as I just took them from a random 'compliments to kids' website.
I typed "Buongiorno, quanto fa venti per dieci?" and made the English voices read it. They sound like Stan Laurel and Oliver Hardy: they subbed themselves in Italian without knowing the language much. It surely added to their performance. You can check their accent at https://youtu.be/057aVSbqWiU
I guess this is today's lucky 10k[1] thing: speech synthesis engines in most OSes are not deep int[]-to-sound mappings; they are decades-old, hand-built, language-specific algorithms that parse sentences and synthesize audio by patching together library sounds or generating it from trigonometric functions, in whatever way their designers thought would make sense.
Some engines ignore foreign words, some pronounce them as if TEE-AYCH-EE-EYE were initialisms, and some are built to be multilingual or otherwise as flexible and accommodating as possible. OS-included engines are the flexible kind, because users will make them say "Your Soufflé au Chocolat is arriving" et cetera.
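To make "generating from trigonometric functions" concrete, here's a toy illustration of formant-style synthesis: summing sine waves at rough formant frequencies and modulating them at a pitch rate to get a vowel-ish buzz. Real engines are vastly more sophisticated; this only shows the flavour of the approach:

    import math, struct, wave

    SAMPLE_RATE = 16000
    DURATION = 0.5          # seconds
    PITCH = 120             # fundamental frequency in Hz
    FORMANTS = [(730, 1.0), (1090, 0.5), (2440, 0.25)]  # rough F1-F3 for an "ah"-like vowel

    frames = bytearray()
    for n in range(int(SAMPLE_RATE * DURATION)):
        t = n / SAMPLE_RATE
        # Sum sine waves at the formant frequencies...
        sample = sum(a * math.sin(2 * math.pi * f * t) for f, a in FORMANTS)
        # ...and modulate them at the pitch rate so it sounds voiced, not like a chord.
        sample *= 0.5 + 0.5 * math.sin(2 * math.pi * PITCH * t)
        frames += struct.pack("<h", int(sample / len(FORMANTS) * 0.8 * 32767))

    # Write 16-bit mono PCM so you can listen to the result.
    with wave.open("vowel.wav", "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(frames)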
This is a well-understood phenomenon. I believe something similar happens with vocaloids (which are basically text-to-speech programs designed to sing songs). I had no idea this was a thing until I happened to meet a person who is a vocaloid connoisseur.
It's using OS-level TTS. You don't have the required language packages installed, so it falls back to the installed system default, which in your case seems to be English.