
Thanks for the feedback! What would be your use case for that API?


I would love to have a list of URLs from all .edu domains which contain the word "publications". It's a bit silly, but I'd love to build an open version of something like Google Scholar.

I can think of many other use cases as well, where a product needs to be built from a larger, but carefully selected, set of raw inputs or pages.

BTW: Thanks for Hackday Paris 2011! Loved that event and venue :)


Hey, you're welcome! Hackday Paris 2011 brings back some nice memories ;)

One problem with what you'd like to do is the pagination. Because we have to send queries to all the shards and then re-rank them, it becomes increasingly hard (and useless for most users) to build pages with p > ~20 (which is why all search engines heavily limit their pagination). So our main infrastructure definitely won't be optimized for that :/

However sending the top ~500 pages with a keyword + a domain filter should be doable pretty easily!
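To make the pagination point concrete, here's a minimal scatter-gather sketch (the shard API is hypothetical): serving page p means every shard has to return its own top p * page_size candidates before the global merge, so the cost grows with the page number regardless of what the user actually sees.

    import heapq

    PAGE_SIZE = 10

    def search_page(shards, query, page):
        """Scatter-gather over shards: to serve a given page, every shard must
        return enough candidates to cover all earlier pages too, then the
        results are merged and re-ranked globally."""
        per_shard = page * PAGE_SIZE              # work grows with the page number
        candidates = []
        for shard in shards:
            # hypothetical shard API: returns [(score, url), ...], best first
            candidates.extend(shard.top_k(query, per_shard))
        merged = heapq.nlargest(page * PAGE_SIZE, candidates)   # global re-rank
        return merged[(page - 1) * PAGE_SIZE : page * PAGE_SIZE]

With 100 shards and page 20, that's already ~20,000 candidates shipped and merged just to show 10 results.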


This is basically my use-case too, in that it's more important to get access to _every_ document which contains that keyword and less important to rank them in a search-engine order. I think that it may be possible to do fairly inexpensively[1] but I'm still benchmarking to pick the right mix of technologies and data structures.

[1] https://www.getguesstimate.com/models/4225


(I'm not with CommonSearch. I have my own project that crawls extensively though.)

You do realize that you are talking about potentially a LOT of data?

To give you an example: The word "work" occurs on about 4% of all web-pages. So even if there were only about 2bn pages in an index, that would mean 80 million matching pages. Even if you only need their URLs that would be about 2.4gb of data assuming an average URL length of 30 bytes. Ok, compression can make that smaller, but still...

It would also mean that the server would need to make 80 million random reads to get the URLs. Even with SSDs that would take some time. Hmm, actually in this case it may be faster to just read all URL-data sequentially than to do random reads. But in both cases we would be talking about minutes needed to get all that data from disk.
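A quick back-of-envelope for those figures (the 4% match rate and 30-byte URLs are from above; the ~500 MB/s sequential throughput is an assumed disk speed):

    pages_in_index = 2_000_000_000        # ~2bn pages
    match_rate     = 0.04                 # "work" appears on ~4% of pages
    avg_url_bytes  = 30

    matching_pages = int(pages_in_index * match_rate)      # 80,000,000 pages
    result_bytes   = matching_pages * avg_url_bytes        # ~2.4 GB of URLs

    # Sequential alternative: scan *all* URL-data instead of 80M random reads.
    all_url_bytes  = pages_in_index * avg_url_bytes        # ~60 GB
    seq_mb_per_s   = 500                                    # assumed SSD throughput
    scan_seconds   = all_url_bytes / (seq_mb_per_s * 1e6)   # ~120 s, i.e. minutes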

I currently have a search-index with about 1.2bn pages - I expect to reach 2bn pages by mid-May - that could be used to get the kind of data you need. But not in a realtime API. Not that amount of result-data.


1) I'd be very interested in such a service. 2) Yep, it's a lot, but that query is quite a lot bigger than most. Assuming some constraints on the layout of the index, I estimate you'd spend roughly $70 plus taxes and compute time retrieving the indexed documents from S3 for that query. You'd always be able to reduce or expand the keywords and only retrieve as much as you could afford. I think there's value in both allowing people to tackle querying the index by themselves and providing a paid-for managed service that automates much of that.
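Not the linked Guesstimate model, just an illustrative calculator; both prices below are placeholders to check against current S3 pricing, and transfer is roughly free when the compute runs in the same region:

    def s3_retrieval_cost(n_requests, bytes_out,
                          price_per_1k_gets=0.0004,   # assumed GET price per 1,000 requests
                          price_per_gb_out=0.09):     # assumed egress price; ~$0 in-region
        """Rough retrieval cost = request charges + data transfer.
        Plug in the request count and byte volume for the query at hand."""
        return (n_requests / 1000.0) * price_per_1k_gets \
             + (bytes_out / 1e9) * price_per_gb_out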


Interesting. To be honest, a static data set would be perfectly fine for a first batch processing attempt.

> that could be used to get the kind of data you need.

Cool. Would you be interested in sharing or exchanging data?


I'm always open to new business opportunities. :)

What would be more useful to you, the raw data - meaning for each page a list of the keywords on it - or the reverse-word-index?

Raw-data may be better for batch-processing or running multiple queries at the same time.

My crawler currently outputs about 40-45gb of raw-data per day (about 30 million pages). Full crawl will be 2bn pages, updated every 2-3 months.

The reverse-word-index would be about 18gb per day for the same number of pages.

Reverse-word-index is already compressed, raw-data isn't.
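To make the two options concrete (the record layouts below are made up, just to show the shape of each):

    # Raw-data: a forward record per crawled page, listing the keywords on it.
    raw_record = {
        "url": "https://example.edu/publications",
        "keywords": ["publications", "research", "faculty"],
    }

    # Reverse-word-index: one posting list per keyword, mapping it to every
    # page that contains it. Sorted posting lists compress well.
    reverse_word_index = {
        "publications": ["https://example.edu/publications",
                         "https://other.edu/pubs"],
        "research":     ["https://example.edu/publications"],
    }

The forward form is handy for batch jobs that scan every page once; the inverted form answers "which pages contain X" directly.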

There is a small problem with the crawl though, as it does not always handle non-ascii characters on pages correctly. I'm working on that.

BTW: I also currently have a list of about 8.5bn URLs from the crawl. About 600gb uncompressed. These are the links on the crawled pages. Obviously not all of those will end up being crawled.


Would love to share experiences or collaborate (contact in profile if you are interested).


I'm not the original commenter, but there could be a huge use case in research. Lots of researchers work on UIs for search, interactive search systems, and query refinement algorithms that are really just abstract layers over an existing search engine. It used to be that we could just overlay stuff over Google, but most search engines nowadays are a pain to work with.


I see scientific agents on the horizon and with them a new wave of research. But access to raw data, e.g. publications, is very limited and restricted. It's a pity.


I'd be especially interested in a crawl results API if it could distinguish "news" sites from other content. Some of our work involves analyzing content, extracting keywords and then looking for relevant news (and other context) around those keywords. Of course the crawl would need to be relatively fresh to be useful for that.

Worst case we may build our own specialized crawler just for this purpose, but it would be nice if there was a useful search engine API we could leverage. And, of course, we'd be happy to pay for access to such an API.


One of my primary interests is building websites that can analyse other websites, and for that I need a keyword index and access to the underlying crawl data (as in, a response that can point me to the exact file offsets that contain the pages). Think [1], but with keywords instead of URLs.

[1] http://index.commoncrawl.org/
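Roughly what such a response could look like (field names loosely modeled on what the index at [1] returns for URL lookups; the host and exact names here are placeholders), plus the ranged read it enables:

    import requests

    # Hypothetical keyword-index hit: points at the exact byte range of the
    # page inside the underlying crawl archive file.
    hit = {
        "url": "https://example.edu/publications",
        "filename": "crawl-data/segment-0001/warc/part-00042.warc.gz",
        "offset": 1234567,
        "length": 8910,
    }

    # Fetching just that one record is then a single ranged HTTP GET.
    end = hit["offset"] + hit["length"] - 1
    resp = requests.get("https://data.example.org/" + hit["filename"],   # placeholder host
                        headers={"Range": "bytes=%d-%d" % (hit["offset"], end)})
    single_page_record = resp.content   # gzipped WARC record for that one page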


Thanks!

We probably won't expose the underlying crawl data ourselves but being able to reference it just like [1] does is indeed important.



