Hi, I'm Marc, the developer of Mojeek. Sorry about this; I'm aware it's happening too often and will look into it. If you press refresh twice without doing anything else, that should fix it.
Did you reload the error message page twice? That should fix it; if not, please try searching from the homepage and let me know if neither works. Thanks.
(It was in amongst a bunch of other settings that also had those same 3 options)
As can be seen by mentally timing the delay - and by the "in 0.47s" - queries seem to be cached. That rather implies the cache isn't all that up to date.
Also, I tried my standard "bamboozle google" query, "ketmax", while looking for results suffixed with "35", "." and "zip". No dice. (ixquick eventually found it that one time.)
I'll most likely be changing the preferences back to a more traditional yes/no soon! It was meant as a "clear this setting and use whatever the default is for future searches" option (and that default may change).
Yes, the results are cached for up to an hour, which is why subsequent searches didn't take as long.
I'm limited in how much I can index, so ZIP files have been skipped for now.
Thanks for clarifying the purpose of the "clear" preference. There's no description underneath that option (and only that option, which makes it look like an outlier that was supposed to be removed), so it was far too easy to assume there was a glitch somewhere - e.g. a template fault that gave that option the same set of radio buttons as the three elements underneath it.
Ah, the cache lasts an hour. That's only four times longer than Google's turnaround time (they aim to get data from most sites into search results pages within 15 minutes), so all things considered that's not shabby at all. (I'm not using "only" sarcastically here.)
I'm curious about the indexing limitation, which I presume directly translates to current storage capacity. This may (may) be interesting: GitLab recently asked Hacker News for advice about a systems upgrade https://news.ycombinator.com/item?id=13153031 and received many, many comments. This forum is no replacement for the existing partnerships you've forged over the last 13 years, but networking on here might be worthwhile. Not presuming; just putting a datapoint out there. (The main caveat is that GitLab is within the collective on here; I can't see why Mojeek couldn't be too, though.)
Now for a bunch of other things.
I'm very curious how you're doing text analysis for the search entries.
How are you figuring out the (full-resolution) result counts (like "18,342,669" - not "16.8 million") for the "from $resultcount in $time" bit? This is something Google continually gets wrong!
Most of my interest stems from being impressed that you're using C for all of this (which absolutely makes sense; it's cheaper to run!). My curiosity is specifically about how you're doing those things from C, how you're managing inevitable crashes, etc.
I'm also curious what database you're using (and about the query-parsing architecture), because searching is actually pretty fast (even from Australia!).
Feel free to answer whatever you like here (highly appreciated!), but what would be really cool would be a multi-page writeup explaining everything in depth (basically everything that's non-competitive) and then posting the URL as a new article here (it might take a couple of goes until it "sticks" and people see and upvote it). That would be very interesting to a lot of people on here, I think.
Thanks, really appreciate the feedback! HN is actually my most frequented site, I just don't talk much!
I did see that post at the time but will read it again. Index limitation is basically just lack of servers, funds, etc.
Currently we do a full index search for every query, so we know the exact number of hits within our index. This might change in the future, though, and become "about".
No databases except what I've written myself. It started as a hobby, so everything was built from scratch from the beginning and just continued that way.
Yes thanks that's a very good idea. I'll try and put something together.
The article may be interesting to go through, but most of the comments were discussing fairly stock-standard "popular"/hyped software stacks, so YMMV with that specific data. Untangling all of the individual pieces into a coherent picture was also fairly involved (I gave up). I mentioned the link solely to show that GitLab got lots of hits and feedback, and to suggest that networking on here about server resources might be worth considering. It's just a thought; it may not be useful.
Okay, so... a small continuous load of users is triggering full-index searches over 1 billion items, within 500 ms per request. That's... that needs to go into your writeup, along with what your current usage load looks like. Prepare for inquiries and offers when you do your post!
Also, you're definitely going to have a lot of interest in the database system if it's homemade; your use case (high read and query load, moderate write load) is fairly widespread, and different implementations always lend themselves to being super-awesome at certain kinds of queries.
I'm not going to push the open-source idea myself, but you'll definitely get a bit of clamouring. That will need to be worked out; if this site is your most frequented (cool), then you probably already have a good idea of the pros and cons of open vs closed.
Very much looking forward to hearing about this, whenever it happens. Doesn't need to be immediate by any means - comprehensive, in-depth analysis takes a while, and if it's not rushed the results are very good.
It's a very good idea and I definitely should do it anyway.
Most of the 500ms is spent generating the snippet; the search itself is usually much quicker. The search is affected by load more than snippet generation is, though. I'll try and get it all written down.
Whoops, I completely forgot to finish what I was saying about the timing thing. When you first run a query, it takes 0.47s; when you redo it, it takes something like 0.00 or 0.01 seconds. I have no idea how long that lasts for.
A ZIP file for an old DOS disassembler, which Google will readily find if passed ".zip", "35" and "ketmax" in reverse order (the list reversed, not strrev) as one string with no spaces. If you leave out the "35" part of the string (and just supply the other two parts), Google has no idea what I'm talking about no matter what I try; it was ixquick that helped me find the file without that middle tidbit.
Whenever I mention this bug I'm challenged with how to refer to it in a way that humans will be able to understand (without resorting to base64 or similar...) but which Google's indexer won't "get", so that the bug stays there and I'm able to point it out to people.
My point is that Google's correlation system is nowhere near perfect. (I would intuitively categorize this with the much-vaunted systems that determine context, "did you mean" etc.)
Hi, if you click "more results from news.ycombinator.com" it is there, but for some reason the submit page is pushing it down a place. I'll look into it, thanks.
Tried it out by searching for Snapchat on news. All of the top results are from last week, but today's news of their 12% drop is hidden in the results? Wouldn't newer news be more relevant? https://www.mojeek.com/search?q=snapchat&fmt=news&news=1
What is being used to tune the search results if user tracking isn't involved? I'm guessing here, of course, but I would assume no tracking means no log of queries and no log of clicked results?
A simple search for the R programming language didn't return anything about the R programming language. I'm not trying to say this isn't impressive or that they aren't doing good work. But I don't exactly think this is a good space to get into as a young company; lots of resources have been burned by major players making subpar products.
This is the second "alt search engine" submission I've seen today. Normally I'm in favor of healthy competition in the "open/non-evil" products space, but search engines, probably more than any other web-based application, benefit from economies of scale.
So why bother starting another one when DuckDuckGo is already gaining mind and market share? It just seems wasteful to re-index the Web and develop algorithms for searching it when someone else is already doing it.
Hi, I started developing Mojeek well before DuckDuckGo, as a hobby project. Unlike DDG, we also have our own search index, which I've only recently had the funds to start growing.
https://www.mojeek.com/search?q=python&site=