You are completely right about `tvc ls` recomputing each hash, but I think it has to do this? A timestamp wouldn't be reliable, so the only trustworthy way to verify a file's contents is to hash them.
In my lazy implementation, I don't even check whether the hashes match; the program reads, compresses, and tries to write even the unchanged files. This is an obvious area to improve performance. I've noticed that git speeds up object lookups by generating two-character directories from the first two characters of each hash, so objects aren't actually stored as `.git/objects/asdf12ha89k9fhs98...`, but as `.git/objects/as/df12ha89k9fhs98...`.
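For what it's worth, here is a minimal sketch of that fan-out scheme in Rust. The `.tvc/objects` layout and the function names are my own invention, just mirroring how `.git/objects` works; note that checking whether the object file already exists also gives you the skip-unchanged-files win for free, since identical content always hashes to the same path:

```rust
use std::path::PathBuf;

/// Map a content hash to a git-style fan-out path:
/// "asdf12..." -> ".tvc/objects/as/df12..."
fn object_path(hash: &str) -> PathBuf {
    let (dir, rest) = hash.split_at(2);
    PathBuf::from(".tvc/objects").join(dir).join(rest)
}

fn store_object(hash: &str, compressed: &[u8]) -> std::io::Result<()> {
    let path = object_path(hash);
    if path.exists() {
        // Same hash means same content: no need to compress or write again.
        return Ok(());
    }
    std::fs::create_dir_all(path.parent().unwrap())?;
    std::fs::write(path, compressed)
}
```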
>why were TOML files not considered at the end
I'm just not that familiar with toml. Maybe that would be a better choice! I saw another commenter who complained about yaml. Though I would argue that the choice doesn't really matter to the user, since you would never actually write a commit object or a tree object by hand. These files are generated by git (or tvc), and only ever read by git/tvc. When you run `git cat-file <hash>`, you have to add the `-p` flag (pretty-print) to render it in a human-readable format, and at that point it's just a matter of taste whether it's shown in yaml/toml/json/xml/some special format.
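For illustration, here is what a hypothetical tvc commit object might look like in each (field names and values invented for this example, not taken from the actual code); the difference really is just taste:

```yaml
tree: 4b825dc6
parent: a94a8fe5
author: alice
message: initial commit
```

```toml
tree = "4b825dc6"
parent = "a94a8fe5"
author = "alice"
message = "initial commit"
```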
I agree, but I'm still iffy on reading every file in the repository (already an expensive operation), then hashing every one of them, every time you do an `ls` or a commit. I took a quick look, and git seems to decide whether it needs to recalculate a hash based on a combination of the modification timestamp and the file size, which is not foolproof either: the timestamp can be modified, and a file's contents can change while its size stays the same.
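In tvc terms, that check could look something like the sketch below. `IndexEntry` and `probably_unchanged` are hypothetical names, just illustrating the stat-before-hash idea:

```rust
use std::fs;
use std::time::SystemTime;

/// Hypothetical cached entry, recorded the last time the file was hashed.
struct IndexEntry {
    size: u64,
    mtime: SystemTime,
    hash: String,
}

/// True if size and mtime still match the cached stat data, in which case
/// we skip the expensive read-and-hash. As noted above, this is only a
/// heuristic: contents can change while both values stay the same.
fn probably_unchanged(path: &str, entry: &IndexEntry) -> std::io::Result<bool> {
    let meta = fs::metadata(path)?;
    Ok(meta.len() == entry.size && meta.modified()? == entry.mtime)
}
```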
I'm not too sure how to solve this myself. Apparently this is a known issue in git, called the "racy git" problem: https://git-scm.com/docs/racy-git/
But to be honest (perhaps I'm biased from working in a large repository), I'd rather take the tradeoff of not rehashing often than suffer the rare case of a file changing without its timestamp or size changing. (I suppose this might have security implications if an attacker were to place such a file into my local repository, but at that point, having them have access to my filesystem is a far larger problem...)
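If it helps, the mitigation described in that racy-git doc boils down to comparing the file's mtime against the index file's own mtime. Roughly, as a sketch (not git's actual code):

```rust
use std::fs;

/// A file whose mtime is not strictly older than the index's own mtime
/// could have been modified in the same timestamp tick the index was
/// written, so its cached stat data can't be trusted and it must be
/// re-hashed (git calls such entries "racily clean").
fn is_racily_clean(path: &str, index_path: &str) -> std::io::Result<bool> {
    let file_mtime = fs::metadata(path)?.modified()?;
    let index_mtime = fs::metadata(index_path)?.modified()?;
    Ok(file_mtime >= index_mtime)
}
```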
> I'm just not that familiar with toml... Though I would argue that the choice doesn't really matter to the user, since you would never actually write...
Again, I agree. At best, _maybe_ it would be slightly nicer for a developer or a power user debugging an issue, if they prefer the toml syntax, but ultimately it does not matter much what format it is in. I mainly asked out of curiosity since your first thoughts were yaml or json, when (purely anecdotally) I see most Rust devs prefer toml, probably from familiarity with Cargo.toml. Which, by the way, I see you use too in your repository (as is to be expected with most Rust projects), so I suppose you must be at least a little familiar with it, at least from a user's perspective. But I suppose you likely have even more experience with yaml and json, which is why those came to mind first.
> ...based on a combination of the modification timestamp and if the filesize has changed
Oh, that is interesting. I feel like the only truly reliable solution would be to have the OS generate a hash each time a file changes and store it in the file's metadata. This seems like a reasonable feature for an OS to me, but I don't think any OS does this. Also, it would force programs to rely on whichever hashing algorithm the OS uses.
>... have the OS generate a hash each time the file changes...
I'm not sure I would want this either, to be honest. If I have a 10GB file on my filesystem and I want to fseek to a specific position and change a single byte, I would probably not want the OS to re-hash the entire file, which could easily take a minute compared to not hashing at all. (Or maybe it's fine and it's fast enough on modern systems to do this every time any running program modifies a file; I don't know how much it would impact performance.)
Perhaps a higher-resolution timestamp from the OS might help though, decreasing the chance of a file ending up with the exact same timestamp (unless it was specifically crafted to be so).
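Modern OSes do already expose nanosecond-resolution mtimes, for what it's worth. On Unix you can read them from Rust like this (whether the nanosecond part is meaningful depends on the filesystem: ext4 stores it, FAT does not):

```rust
use std::fs;
use std::os::unix::fs::MetadataExt; // Unix-only extension trait

fn main() -> std::io::Result<()> {
    let meta = fs::metadata("Cargo.toml")?;
    // Seconds and nanoseconds since the epoch, straight from stat(2).
    println!("mtime: {}s + {}ns", meta.mtime(), meta.mtime_nsec());
    Ok(())
}
```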