Well, first you need to know which image regions to feed to the ANN, and that can involve some segmentation and pre-recognition; otherwise you're going to evaluate the net at every feasible subwindow, and that's a LOT of matrix math. A very big GPU can help, but GPUs have latency of their own, and FPGAs at that performance level are inordinately expensive.
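Rough numbers to back that up (a sketch; frame size, window, stride, and proposal count are all assumptions I picked for illustration, not from any real detector):

```python
# Back-of-the-envelope: dense sliding-window evaluation vs. only
# evaluating at segmentation-proposed regions of interest.

frame_w, frame_h = 1920, 1080   # assumed frame size
win = 50                        # assumed detector window (px)
stride = 4                      # assumed scan stride

# dense scan: one net evaluation per stride step in each dimension
dense = ((frame_w - win) // stride + 1) * ((frame_h - win) // stride + 1)

# pre-segmentation might hand you a few dozen candidates instead
proposals = 40                  # assumed proposal count

print(dense, proposals, dense // proposals)
```

Even at a coarse 4-px stride and a single scale, the dense scan is six figures of net evaluations per frame, thousands of times more than a proposal-driven pass.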
At scale, though, ASICs seem to be the sure-to-work way.
I'd be very surprised if a modern CPU couldn't handle the task, especially if you were clever about detecting regions of interest, predicting head movement, and maintaining the cache. But I'd also be surprised if they went to market with an x86 under the hood.
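For what "predicting head movement" might look like, here's a minimal sketch (everything here is a hypothetical illustration: linear extrapolation of the tracked center, then cropping a fixed ROI around it so only that region is fed to the net):

```python
# Hypothetical ROI scheme: extrapolate the next gaze/head center from
# the last two observations, then crop a fixed window around it.

def predict_center(prev, cur):
    # simple linear prediction: next = cur + (cur - prev)
    return (2 * cur[0] - prev[0], 2 * cur[1] - prev[1])

def roi(frame_w, frame_h, center, half=128):
    # clamp a (2*half x 2*half) box so it stays inside the frame
    x0 = max(0, min(frame_w - 2 * half, center[0] - half))
    y0 = max(0, min(frame_h - 2 * half, center[1] - half))
    return (x0, y0, x0 + 2 * half, y0 + 2 * half)

c = predict_center((900, 500), (940, 520))
print(c, roi(1920, 1080, c))
```

A 256x256 crop of a 1080p frame is about 3% of the pixels, which is the kind of reduction that puts a CPU back in the running.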
I remember reading a while ago about how smart TVs were using ANNs for upscaling, so it has been done at scale. *rimshot*
(1) TVs don't have strict latency requirements. I've heard latencies of 100 ms are common.
(2) Upscaling ANNs process a rather small image neighborhood radius, and the required processing power is on the order of O(r² · log r). If a minimally recognizable cat is 50x50 px and your upscaler uses a very large window of 16x16, that's already a factor of about 14.
Latencies of 100 ms may be common because TVs don't have strict latency requirements.
16x16 is a very small window. I have no idea what they're using in TVs, but 128 isn't uncommon in post-production ANN upscaling. Also consider that ANNs haven't received anywhere near the level of optimization attention that compilers have, so there's a lot of slack to be taken up if real-time processing demands it.
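Under the same assumed O(r² · log r) model (log base 2, same caveats as before), a 128-px window versus a 16-px one is a much bigger gap than the earlier 14x:

```python
import math

def cost(r):
    # same assumed cost model: O(r^2 * log r), log base 2
    return r * r * math.log2(r)

print(round(cost(128) / cost(16)))  # 128-px vs. 16-px window
```

That comes out around two orders of magnitude, which is why post-production can afford windows that a real-time TV pipeline can't.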