Hi, I'm Marc, the developer of Mojeek. Sorry about this; I'm aware it's happening too often and will look into it. If you press refresh twice without doing anything else, that should fix it.
Did you reload the error message page twice? That should fix it; if not, please try searching from the homepage and let me know if neither works. Thanks.
(It was in amongst a bunch of other settings that also had those same 3 options)
As can be seen by mentally timing the delay - and by the "in 0.47s" - queries seem to be cached. That rather implies the cache isn't all that up to date.
Also, I tried my standard "bamboozle google" query, "ketmax", while looking for results suffixed with "35", "." and "zip". No dice. (ixquick eventually found it that one time.)
I'll most likely be changing the preferences back to a more traditional yes/no soon! It was meant as a "clear this setting and use whatever the default is for future searches" option (and that default may change).
Yes, the results are cached for up to an hour, which is why subsequent searches didn't take as long.
I'm limited in how much I can index, so ZIP files have been skipped for now.
Thanks for clarifying the purpose of the "clear" preference. There's no description underneath that option (and only that option, which makes it look like an outlier that was supposed to be removed), so it was far too easy to assume there was a glitch somewhere - e.g. a template fault that gave that option the same set of radio buttons as the three elements underneath it.
Ah, the cache lasts an hour. That's only four times longer than Google's turnaround time (they aim to get data from most sites into search results pages within 15 minutes), so all things considered that's not shabby at all. (I'm not using "only" sarcastically here.)
I'm curious about the indexing limitation, which I presume directly translates to current storage capacity. This may (may) be interesting: GitLab recently asked Hacker News for advice about a systems upgrade https://news.ycombinator.com/item?id=13153031 and received many, many comments. This forum is no replacement for the existing partnerships you've forged over the last 13 years, but networking on here might be worthwhile. Not presuming; just putting a datapoint out there. (The main caveat is that GitLab is within the collective on here; I can't see why Mojeek couldn't be too, though.)
Now for a bunch of other things.
I'm very curious how you're doing text analysis for the search entries.
How are you figuring out the (full-resolution) result counts (like "18,342,669" - not "16.8 million") for the "from $resultcount in $time" bit? This is something Google continually gets wrong!
Most of my interest stems from being impressed that you're using C for all of this (which absolutely makes sense; it's cheaper to run!). My curiosity is specifically about how you're doing those things from C, how you're managing inevitable crashes, etc.
I'm also curious what database you're using (and about the query-parsing architecture), because searching is actually pretty fast (even from Australia!).
Feel free to answer whatever you like here (highly appreciated!), but what would be really cool would be a multi-page writeup explaining everything in depth (basically everything that's non-competitive) and then posting the URL as a new article here (it might take a couple of goes until it "sticks" and people see and upvote it). That would be very interesting to a lot of people on here, I think.
Thanks, really appreciate the feedback! HN is actually my most frequented site, I just don't talk much!
I did see that post at the time but will read it again. Index limitation is basically just lack of servers, funds, etc.
Currently we do a full index search for every query, so we know the exact number of hits within our index. This might change in the future, though, and become "about".
No databases except what I've written myself. It started as a hobby, so everything was built from scratch from the beginning and just continued that way.
Yes thanks that's a very good idea. I'll try and put something together.
The article may be interesting to go through, but most of the comments were discussing fairly stock-standard "popular"/hyped software stacks, so YMMV with that specific data. Untangling all of the individual pieces into a coherent picture was also fairly involved (I gave up). I mentioned the link solely to show that GitLab got lots of hits and feedback, and to suggest that networking on here about server resources might be worth considering. It's just a thought; it may not be useful.
Okay, so... a small continuous load of users is triggering full-index searches over 1 billion items, within 500 ms per request. That's... that needs to go into your writeup, along with what your current usage load looks like. Prepare for inquiries and offers when you do your post!
Also, you're definitely going to have a lot of interest in the database system if it's homemade; your use case (high read and query load, moderate write load) is fairly widespread, and different implementations always lend themselves to being super-awesome at certain kinds of queries.
I'm not going to push the open-source idea myself, but you'll definitely get a bit of clamouring. That will need to be worked out; if this site is your most frequented (cool), then you probably already have a good idea of the pros and cons of open vs closed.
Very much looking forward to hearing about this, whenever it happens. Doesn't need to be immediate by any means - comprehensive, in-depth analysis takes a while, and if it's not rushed the results are very good.
It's a very good idea and I definitely should do it anyway.
Most of the 500ms is spent generating the snippet; the search itself is usually much quicker. The search is affected by load more than snippet generation is, though. I'll try and get it all written down.
Whoops, I completely forgot to finish what I was saying about the timing thing. When you first run a query, it takes 0.47s; when you redo it, it takes something like 0.00 or 0.01 seconds. I have no idea how long that lasts for.
A ZIP file for an old DOS disassembler, which Google will readily find if passed ".zip", "35" and "ketmax" in reverse order (the list reversed, not strrev) as one string with no spaces. If you leave out the "35" part of the string (and just supply the other two parts), Google has no idea what I'm talking about no matter what I try; it was ixquick that helped me find the file without that middle tidbit.
Whenever I mention this bug I'm challenged with how to refer to it in a way that humans will be able to understand (without resorting to base64 or similar...) but which Google's indexer won't "get", so that the bug stays there and I'm able to point it out to people.
My point is that Google's correlation system is nowhere near perfect. (I would intuitively categorize this with the much-vaunted systems that determine context, "did you mean" etc.)
Hi, if you click "more results from news.ycombinator.com" it is there, but for some reason the submit page is pushing it down a place. I'll look into it, thanks.
Tried it out by searching for Snapchat on news. All of the top results are from last week, but today's news of their 12% drop is hidden in the results? Wouldn't newer news be more relevant? https://www.mojeek.com/search?q=snapchat&fmt=news&news=1
What is being used to tune the search results if user tracking isn't involved? I'm guessing here, of course, but I would assume no tracking means no log of queries and no log of clicked results?
A simple search for the R programming language didn't return anything about the R programming language. I'm not trying to say this isn't impressive or that they aren't doing good work. But I don't exactly think this is a good space to get into as a young company; lots of resources have been burned by major players making subpar products.
This is the second "alt search engine" submission I've seen today. Normally I'm in favor of healthy competition in the "open/non-evil" products space, but search engines, probably more than any other web-based application, benefit from economies of scale.
So why bother starting another one when DuckDuckGo is already gaining mind and market share? It just seems wasteful to re-index the Web and develop algorithms for searching it when someone else is already doing it.
Hi, I started developing Mojeek well before DuckDuckGo, as a hobby project. Unlike DDG, we also have our own search index, which I've only recently had the funds to start growing.
https://www.mojeek.com/search?q=python&site=