Author here. The paper is about the Collatz sequence: how experiments with a transformer can point at interesting facts about a complex mathematical phenomenon, and how, in supervised math transformers, model predictions and errors can be explained (this part is a follow-up to a similar paper about GCD). From an ML research perspective, the interesting (and surprising) takeaway is the particular way the long Collatz function is learned: "one loop at a time".
To me, the base conversion is a side quest. We just wanted to rule out this explanation for the model's behavior. It may be worth further investigation, but it won't be by us. Another (less important) reason is paper length: if you want to submit to peer-reviewed outlets, you need to keep the page count under a certain limit.
1) Why did you not test the standard Collatz sequence? I would think that including it, as well as testing on Z+, Z+\2Z, and 2Z+, would be a bit more informative (in addition to what you've already done). Even though the even case is a trivial step, it could indicate how much memorization the network is doing. You do notice that the model learns some shortcuts, so I think these tests could help confirm that and diagnose some of the issues (see the sketch at the end of this comment for the distinction I have in mind).
2) Is there a specific reason for the cross attention?
Regardless, I think it is an interesting paper (these wouldn't be criteria for rejection were I reviewing your paper, btw, lol. I'm just curious about your thoughts here and trying to understand better).
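To make question 1 concrete, here's a rough sketch of the distinction I mean. I'm assuming the paper trains on something like the odd-to-odd shortcut map rather than the plain step, so treat the second function as my guess at the setup, not a description of it:

```python
def collatz_step(n: int) -> int:
    """One step of the standard Collatz map, keeping the 'trivial' n/2 step."""
    return n // 2 if n % 2 == 0 else 3 * n + 1

def syracuse_step(n: int) -> int:
    """Shortcut (odd-to-odd) variant: apply 3n+1, then strip all factors of 2.
    This is only my guess at the kind of accelerated map the paper might use."""
    assert n % 2 == 1
    n = 3 * n + 1
    while n % 2 == 0:
        n //= 2
    return n

# The candidate test sets mentioned above.
positives = range(1, 100)      # Z+
odds      = range(1, 100, 2)   # Z+ \ 2Z
evens     = range(2, 100, 2)   # 2Z+

print([collatz_step(n) for n in [6, 7, 8]])    # [3, 22, 4]
print([syracuse_step(n) for n in [7, 9, 11]])  # [11, 7, 17]
```

Comparing the model on the standard step (over all of Z+, just the odds, or just the evens) versus the shortcut map seems like it would separate "memorizing the trivial halving step" from actually tracking the 3n+1 structure.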
FWIW I think the side quest is actually pretty informative here, though I agree it isn't the main point.
It might be a side quest, or it could be an elegant way to frame a category of problems that resist the ways in which transformers can learn; in turn, solving that structural deficiency so that a model can effectively learn that category of problems might enable a new leap in capabilities.
We're a handful of breakthroughs away from models reaching superhuman levels across any and all domains of cognition. It's clear that current architectures aren't going to be the end-all solution, but all we might need is a handful of well-posed categorical deficiencies whose resolution allows a smooth transition past the current jagged frontier.