
Not sure where "wc -l" came from. It was not in any of the tests I authored, because I am not interested in line counts. Nor am I interested in very large files, or ASCII files of Shakespeare, etc. As I stated in the beginning, I am working with files that are like "a wall of text": long lines, few if any linebreaks. Not the type of files that one can read or edit using less(1), ed(1) or vi(1).

What I am interested in is how fast the results display on the screen. For files of this type in the single digit MB range, piping the results to another program does not illustrate the speed of _displaying results to the screen_. In any event, that's not what I'm doing. I am not piping to another program. I am not looking for line counts. I am looking for patterns of characters and I need to see these on the screen. (If I wanted line counts I would just use grep -c -o.)
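
For the record, and assuming GNU grep here: -c counts matching lines even when combined with -o, so if one actually wanted match counts rather than line counts, it takes a pipe. A quick sketch, using the test.json fetched below:

   grep -c  'https:' test.json          # lines containing at least one match
   grep -co 'https:' test.json          # same number: -c still counts lines
   grep -o  'https:' test.json | wc -l  # number of individual matches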

When working with these files interactively at the command line, performing many consecutive searches,^1 the differences in speed become readily observable. Without need for time(1). grep -o is ridiculously slow. Hence I am always looking for alternatives. Even a shell script with ired is faster than grep. Alas, ripgrep is not a solution either. It's not any faster than ired _at this task for files of this type and size_.

Part of the problem in comparing results is that we are ignoring the hardware. For example, I am using a small, low-resource, underpowered computer; I imagine most software developers use far more powerful machines with vast amounts of resources.

Try some JSON as a sample, but note this is not necessarily the best example of the "walls of text" I am working with, which do not necessarily conform to any standard.

   curl "https://api.crossref.org/works?query=unix&rows=1000" > test.json

   busybox time grep -Eo '.{100}https:.{50}' test.json
   real 0m 7.93s
   user 0m 7.81s
   sys  0m 0.02s
This is still a tiny file in the world of software developers. Obviously, if this takes less than 1 second on some developer machine, then any comparison with me, an end user with an ordinary computer, is not going to make much sense.

1. Not an actual "loop", but an iterative, loop-like process of search file, edit program, compile program, run program, search file, edit program, ...

With this kind of iteration, speed becomes noticeable even if each individual search is relatively short-lived.



Well then, can you share such a file? I wasn't measuring the time of wc; I just did that to confirm the outputs were the same. The fact is that I can't reproduce your timings, and ired is significantly slower in the tests I showed above.

I tried my best to stress the importance of match frequency, and even varied the tests on that point. Yet I am still in the dark as to the match frequency in your tests.

The timing differences even in your tests seem insignificant to me, although they could matter in a loop or something; hence my use of a larger file. Otherwise the difference in wall time appears to be a matter of milliseconds. Why does that matter? Maybe I'm reading your timings wrong, but that would only deepen the mystery as to why our results are so different. Hence my request for an input that you care about, so that we can get on the same page.

Not sure if it was clear or not, but I'm the author of ripgrep. The benefit to you from this exchange is that I should be able to explain why the perf difference has to exist (or is difficult to remedy) or file a TODO for making rg -o faster.


Another iterative loop-like procedure is search, adjust pattern and/or amount of context, search, adjust pattern and/or amount of context, search, ...
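
For example, with the grep pattern from my earlier timing, the repetition counts are the amount of context, and each adjustment means another full search (the numbers here are arbitrary):

   grep -Eo '.{20}https:.{20}' test.json    # 20 bytes of context either side
   grep -Eo '.{100}https:.{50}' test.json   # wider window, as in the timing above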

If a program is sluggish, I will notice.

The reason I am searching for a pattern is because there is something I consider meaningful that follows or precedes it. Repeating patterns would generally not be something I am interested in. For example, a repeating pattern such as "httphttphttp". The search I would do would more likely be "http". If for some reason it repeats, then I will see that in the context.

For me, neither grep nor grep clones are as useful as ired. ired will show me the context including the formatting, e.g., spaces, carriage returns. It will print the pattern plus context to the screen exactly as it appears in the file, or as a hexdump, or as plain hex like xxd -p.

https://news.ycombinator.com/item?id=38564604
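
To illustrate the kind of output I mean, here is a rough sketch with standard tools (assuming GNU grep's -b with -o for byte offsets, plus xxd; the 48-byte window is arbitrary). ired does this natively, in one step:

   grep -bo 'https:' test.json | cut -d: -f1 | while read -r off; do
     start=$(( off > 16 ? off - 16 : 0 ))   # back up 16 bytes for leading context
     xxd -s "$start" -l 48 test.json        # 48-byte hex window around the match
     echo
   done

Spawning xxd once per match is slow when matches are frequent, which is exactly why a tool that does this natively appeals to me.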

And it will do all this faster than grep -o and nearly as fast as a big, fat grep clone in Rust that spits out coloured text by default, even when ired is in a shell script with multiple pipes and other programs. TBH, neither grep nor grep clones are as flexible; they are IMO not suitable for me, for this type of task. But who knows, there may be some other program I do not know about yet.

Significance can be subjective. What is important to me may not be important to someone else, and vice versa. Every user is different. Not every user is using the same hardware. Nor is every user trying to do the exact same things with their computer.

For example, I have tried all the popular UNIX shells. I would not touch zsh with a ten-foot pole. Because I can feel the sluggishness compared to working in dash or NetBSD sh. I want something smaller and lighter. I intentionally use the same shell for interactive and non-interactive use. Because I like the speed. But this is not for everyone. Some folks might like some other shell, like zsh. Because [whatever reasons]. That does not mean zsh is for everyone, either. Personally, I would never try to proclaim that the reasons these folks use zsh are "insignificant". To those users, those reasons are significant. But the size and speed differences still exist, whether any particular user deems them "significant" or not.


> If a program is sluggish, I will notice.

Well yes of course... But you haven't demonstrated ripgrep to be sluggish for your use case.

> For me, neither grep nor grep clones are as useful as ired. ired will show me the context including the formatting, e.g., spaces, carriage returns. It will print the pattern plus context to the screen exactly as it appears in the file, also in hexdump or formatted hex, like xxd -p.

Then what are you whinging about? grep isn't a command line hex editor like ired is. You're the one who came in here asking for grep -o to be faster. I never said grep (or ripgrep) could or even should replace ired in your workflow. You came in here talking about it and making claims about performance. At least for ripgrep, I think I've pretty thoroughly debunked your claims about perf. But in terms of functionality, I have no doubt whatsoever that ired is better fitted for the kinds of problems you talk about. Because of course it is. They are two completely different tools.

ired will also helpfully not report all substring results. I love how you just completely ignore the fact that your useful tool is utterly broken. I don't mean "broken" lightly. It has had hidden false negatives for 14 years. Lmao. YIKES.

> they are IMO not suitable for this type of task

Given that ripgrep gives the same output as your ired shell script (with a lot less faffing about) and does it faster than ired, I find this claim baseless. Of course, ripgrep will not be as flexible as ired for other hex editor use cases, because it's not a hex editor. But for the specific case you brought up of your own accord, because you wanted to come complain on an Internet message board, ripgrep is pretty clearly faster.
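
For concreteness, the ripgrep equivalent of your earlier grep test would be something like this, with -N and --color=never stripping line numbers and coloring so the output is comparable to grep -o:

   rg -oN --color=never '.{100}https:.{50}' test.json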

> nearly as fast as a big, fat grep clone in Rust

At least it doesn't have a 14-year-old bug that can't find ABAB in ABAABAB. Lmao.

But I see the goalposts are shifting. First it was speed. Now that that has been thoroughly debunked, you're whinging about binary size. I never made any claims about that or said that ripgrep was small. I know it's fatter than grep (although your grep is probably dynamically linked). If people like you want to be stingy with your MBs to the point that you won't use ripgrep, then I'm absolutely cool with that. You can keep your broken software.

> Significance can be subjective. What is important to me may not be important to someone else, and vice versa. Every user is different.

A trivial truism, and one that I've explicitly acknowledged throughout this discussion. I said that milliseconds of perf could matter for some use cases, but it isn't going to matter in a human-paced iteration workflow. At least, I have seen no compelling argument to the contrary.

It's even subjective whether or not you care if your tool has given you false negatives because of a bug that has existed for 14 years. Different strokes for different folks, amiright?


Okay I see you replied with more details elsewhere. I'll investigate tomorrow when I have hands on a keyboard. Thanks.



