This issue is overstated. Given that the material is in Chinese, it stands to reason that the bulk of it (though of course not all) will not violate whatever policies apply. Furthermore, there has been a series of open-weights Chinese models that give reasonable answers to questions that are sensitive in China. Unless you plan to release something customer-facing in China, it's not something to stress about.
The idea that overly burdensome regulations around open source models in the US shifts the global center of mass of LLMs to China is not implausible.
One argument against open-source LLMs that resonates with government officials is keeping them out of the hands of China and Russia. But among the very best open-weights embeddings, ~13B models, and perhaps 34B models, the leaders are Chinese. The recent DeepSeekCoder also tops EvalPlus: https://evalplus.github.io/leaderboard.html For those concerned about benchmark contamination from over-training, ChatGPT and GPT-4 are almost certainly contaminated as well, and there have been independent confirmations of DeepSeekCoder's strength.
> Given the material is in Chinese, it stands to reason the bulk (but of course not all) of it will not be in violation of whatever policies.
Especially given the lack of press freedom in China (you need a permit for each published book), the training materials are all already censored.
I still don't think China could be a safe haven, though. They haven't shut down open-source models not because they are nice, but because they are incompetent.
No, LLMs will evolve emergent behavior. A model will have capabilities beyond its training data, including saying shit about China. You just need to prompt it correctly.