Yandex scrapes Google and other SEO learnings from the source code leak

“Fragments” of Yandex’s codebase leaked online last week. Like Google, Yandex is a platform with many aspects such as e-mail, maps, a taxi service, etc. The code leak featured bits and pieces of everything.

According to the documentation within it, Yandex’s codebase was folded into one large repository called Arcadia back in 2013. The leaked codebase is a subset of all the projects in Arcadia, and we find several components in it related to the search engine in the “Kernel,” “Library,” “Robot,” “Search,” and “ExtSearch” archives.

The move is completely unprecedented. Not since the AOL search query data leak of 2006 has anything so material related to a web search engine entered the public domain.

Although we are missing the data and many of the files that are referenced, this is the first instance of a tangible look at how a modern search engine works at the code level.

Personally, I can’t get over how fantastic the timing is to be able to actually see the code as I finish my book, “The Science of SEO,” where I talk about information retrieval, how modern search engines work, and how to build a simple one yourself.

However, I’ve been analyzing the code since last Thursday and any engineer will tell you that’s not enough time to figure out how it all works. So, I suspect there will be many more posts as I continue to tinker.

Before I jump in, I want to give a shout out to Ben Wills at Ontolo for sharing the code with me, pointing me in the initial direction of where the good stuff is, and going back and forth with me as we deciphered things. Feel free to grab the spreadsheet with all the data we’ve compiled on the ranking factors here.

Also, shout out to Ryan Jones for digging in and sharing some key findings with me over IM.

It’s not Google’s code, so why do we care?

Some believe that reviewing this codebase is a distraction and that there is nothing here that will impact how they make business decisions. I find that interesting considering these are people from the same SEO community that used the CTR model from the 2006 AOL data as the industry standard for modeling clicks across every search engine for many years to follow.

That said, Yandex is not Google. Yet both are leading web search engines that have continued to stay at the forefront of technology.

Software engineers from both companies attend the same conferences (SIGIR, ECIR, etc.) and share results and innovations in information retrieval, natural language processing/understanding and machine learning. Yandex also has a presence in Palo Alto and Google previously had a presence in Moscow.

A quick LinkedIn search turns up a few hundred engineers who worked at both companies, although we don’t know how many of them worked in Research at both companies.

In a more direct overlap, Yandex also makes use of Google’s open source technologies that have been critical to innovations in Search such as TensorFlow, BERT, MapReduce, and, to a much lesser extent, Protocol Buffers.

So while Yandex is certainly not Google, it’s also not some random search project we’re talking about here. There is a lot we can learn about how a modern search engine is built from reviewing this codebase.

At the very least, we can dispel some outdated notions that still pervade SEO tools, such as text-to-code ratios and W3C compliance, or the general belief that Google’s 200 signals are simply 200 individual on-page and off-page features rather than classes of composite factors that potentially use thousands of individual measures.

Some context on Yandex’s architecture

Without context or the ability to compile, run, and step through it, source code is very difficult to understand.

Typically, new engineers receive documentation and walkthroughs, and engage in pair programming to get onboarded into an existing codebase. There is some limited onboarding documentation related to the build process in the documentation archive, but the Yandex code also refers throughout to internal wikis that did not leak, and the comments in the code are quite sparse.

Fortunately, Yandex gives some insights into its architecture in its public documentation. There are also a couple of patents that have been published in the US that help shed some light. Namely:

As I was researching Google for my book, I developed a much deeper understanding of the structure of its ranking systems through the various whitepapers, patents, and talks from engineers, weighed against my SEO experience. I also spent a lot of time sharpening my understanding of general information retrieval best practices for web search engines. It is not surprising that there are, in fact, some best practices and similarities at play with Yandex.

The Yandex documentation discusses a dual distributed crawler system: one for real-time crawling called the “Orange Crawler” and another for general crawling.

Historically, Google is said to have had an index stratified into three buckets: one for housing real-time content, one for regularly crawled content, and one for rarely crawled content. This approach is considered a best practice in IR.

Yandex and Google differ in this respect, but the general idea of segmented crawling guided by an understanding of the frequency of updates remains.

One thing worth mentioning is that Yandex does not have a separate rendering system for JavaScript. They say this in their documentation and although they have a Webdriver-based system for visual regression testing called Gemini, they limit themselves to text-based crawling.

The documentation also discusses a sharded database structure that breaks pages into an inverted index and a document server.

Like most other web search engines, the indexing process creates a dictionary, caches pages, and then puts data into the inverted index so that the bigrams and trigrams and their placement in the document are represented.

This differs from Google, which moved to phrase-based indexing (meaning n-grams that can be much longer than trigrams) a long time ago.
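To make the mechanics concrete, here is a minimal sketch of a positional inverted index over bigrams and trigrams. This is purely illustrative Python, not Yandex’s code, and the function and variable names are mine.

```python
from collections import defaultdict

def ngrams(tokens, n):
    """Yield (position, ngram) pairs for a token list."""
    for i in range(len(tokens) - n + 1):
        yield i, " ".join(tokens[i:i + n])

def build_index(docs):
    """Map each bigram/trigram to the documents and positions where it occurs."""
    index = defaultdict(lambda: defaultdict(list))  # term -> doc_id -> [positions]
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for n in (2, 3):
            for pos, gram in ngrams(tokens, n):
                index[gram][doc_id].append(pos)
    return index

docs = {
    1: "yandex search engine ranking factors",
    2: "search engine ranking is built from many factors",
}
index = build_index(docs)
print(dict(index["search engine"]))          # {1: [1], 2: [0]}
print(dict(index["search engine ranking"]))  # {1: [1], 2: [0]}
```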

However, the Yandex system also uses BERT in its pipeline, so at some point documents and queries are converted into embeddings and nearest-neighbor search techniques are employed for ranking.

The ranking process is where things start to get more interesting.

Yandex has a layer called Metasearch where cached popular search results are served after processing the query. If the results are not found there, the search query is sent to a series of thousands of different machines in the Basic Search layer simultaneously. Each builds a posting list of relevant documents and then returns it to MatrixNet, Yandex’s neural network application for re-ranking, to build the SERP.
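Here is a rough sketch of that flow as I understand it from the documentation. The class and function names are hypothetical, and the real system obviously involves far more machinery than a cache check and a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

class Metasearch:
    """Illustrative sketch: serve popular queries from cache, otherwise fan the
    query out to every Basic Search shard and merge the returned posting lists."""

    def __init__(self, shards, reranker):
        self.cache = {}            # query -> cached SERP
        self.shards = shards       # callables: query -> list of (doc_id, score)
        self.reranker = reranker   # stand-in for MatrixNet re-ranking

    def search(self, query):
        if query in self.cache:
            return self.cache[query]
        # Query every shard in parallel; each returns its own posting list.
        with ThreadPoolExecutor() as pool:
            posting_lists = list(pool.map(lambda shard: shard(query), self.shards))
        candidates = [doc for plist in posting_lists for doc in plist]
        serp = self.reranker(query, candidates)
        self.cache[query] = serp
        return serp
```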

Based on videos where Google engineers have talked about Search infrastructure, this ranking process is quite similar to Google Search. In those talks they describe how Google runs on shared environments where a variety of applications live on every machine and jobs are distributed across those machines based on the availability of computing power.

One of the use cases is exactly this: distributing queries to an assortment of machines to quickly process the relevant index shards. Computing the posting lists is the first place we need to consider the ranking factors.

There are 17,854 ranking factors in the codebase

On the Friday after the leak, the inimitable Martin MacDonald shared a file from the codebase called web_factors_info/factors_gen.in. The file comes from the “Kernel” archive in the codebase leak and features 1,922 ranking factors.

Naturally, the SEO community ran with that number and that file to eagerly spread news of the insights within it. Many people translated the descriptions and built tools or used Google Sheets and ChatGPT to make sense of the data. All of which are great examples of the power of the community. However, the 1,922 represents just one of several sets of ranking factors in the codebase.

A deeper dive into the codebase reveals that there are numerous ranking factor files for the various subsets of Yandex’s query processing and ranking systems.

Combing those, we find that there are actually 17,854 ranking factors in total. Included in those ranking factors are a variety of metrics related to:

There is also a series of Jupyter notebooks that have 2,000 additional factors beyond those in the core code. Presumably, these Jupyter notebooks represent tests where engineers are considering additional factors to add to the code base. Again, you can review all of these features with metadata that we’ve collected from the entire code base at this link.

The Yandex documentation also clarifies that they have three classes of ranking factors: Static, Dynamic, and those that relate specifically to the user’s search and how it was performed. In their own words:

In the codebase, these are indicated in the ranking factor files with the tags TG_STATIC and TG_DYNAMIC. The search-related factors have multiple tags such as TG_QUERY_ONLY, TG_QUERY, TG_USER_SEARCH and TG_USER_SEARCH_ONLY.

While we have uncovered a potential 18k ranking factors to choose from, the documentation related to MatrixNet indicates that scoring is built from tens of thousands of factors and is customized based on the search query.

This indicates that the ranking environment is highly dynamic, similar to the Google environment. According to Google’s “Framework for evaluating scoring functions” patent, they have long had something similar where multiple functions are run and the best set of results is returned.

Finally, considering that the documentation references tens of thousands of ranking factors, we should also keep in mind that there are many other files referenced in the code that are missing from the archive. So, there is likely more that we cannot see. This is further illustrated by reviewing the images in the onboarding documentation, which show other directories that are not present in the archive.

For example, I suspect that there is more related to the DSSM in the directory /semantic-search/.

The initial weighting of ranking factors 

I initially operated under the assumption that the codebase had no weights for the ranking factors. Then I was surprised to see that the nav_linear.h file in the /search/relevance/ directory features the initial coefficients (or weights) associated with the ranking factors on full display.

This section of the code highlights 257 of the 17,000+ ranking factors we’ve identified. (Hat tip to Ryan Jones for pulling these out and lining them up with the ranking factor descriptions.)

To be clear, when you think about a search engine algorithm, you probably imagine a long and complex mathematical equation by which every page is scored based on a series of factors. While that is an oversimplification, the following screenshot is an excerpt from such an equation. The coefficients represent how important each factor is, and the resulting computed score is what would be used to score the selected pages for relevance.

These values being hardcoded suggests that this is certainly not the only place where ranking happens. Instead, this function is most likely where the initial relevance scoring is done to generate a series of posting lists for each shard being considered for ranking. In the first patent listed above, they talk about this as the concept of query-independent relevance (QIR), which then limits documents before they are reviewed for query-specific relevance (QSR).
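To make the idea concrete, here is a toy version of such a linear scoring pass. The factor names and coefficients below are made up for illustration; none of them come from nav_linear.h.

```python
# Hypothetical coefficients in the spirit of nav_linear.h; these are not Yandex's values.
COEFFICIENTS = {
    "text_relevance": 0.52,
    "link_quality": 0.31,
    "freshness": 0.08,
    "spam_score": -0.45,
}

def initial_relevance(factors):
    """Score a page as a weighted sum of its factor values."""
    return sum(COEFFICIENTS[name] * factors.get(name, 0.0) for name in COEFFICIENTS)

page = {"text_relevance": 0.8, "link_quality": 0.6, "freshness": 0.2, "spam_score": 0.1}
print(round(initial_relevance(page), 3))  # 0.573
```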

The resulting posting lists are then handed off to MatrixNet with query features for comparison. So while we don’t know the specifics of the downstream operations (yet), these weights are still valuable to understand because they tell you the requirements for a page to be eligible for consideration.

However, that brings up the next question: what do we know about MatrixNet?

There is neural network ranking code in the Kernel archive, and there are numerous references to MatrixNet and “mxnet” as well as many references to Deep Structured Semantic Models (DSSM) throughout the code.

The description of one of the ranking factors, FI_MATRIXNET, indicates that MatrixNet is applied to all of the factors.

Index:       160
CppName:     "FI_MATRIXNET"
Name:        "MatrixNet"
Tags:        [TG_DOC, TG_DYNAMIC, TG_TRANS, TG_NOT_01, TG_REARR_USE, TG_L3_MODEL_VALUE, TG_FRESHNESS_FROZEN_POOL]
Description: "MatrixNet is applied to all factors - the formula"
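Assuming the factor entries follow the key/value layout shown above, a quick script like this can tally how the factors are tagged. This is my own throwaway sketch, not tooling from the codebase.

```python
import re
from collections import Counter

def parse_factor_tags(text):
    """Count tags across factor entries, assuming the key/value layout shown above."""
    tag_counts = Counter()
    for line in text.splitlines():
        match = re.match(r"\s*Tags:\s*\[(.*)\]", line)
        if match:
            for tag in match.group(1).split(","):
                tag_counts[tag.strip()] += 1
    return tag_counts

with open("web_factors_info/factors_gen.in", encoding="utf-8") as f:
    counts = parse_factor_tags(f.read())
print(counts.most_common(10))  # e.g. how many factors are TG_STATIC vs TG_DYNAMIC
```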

There are also a lot of binary files that may be pre-trained models themselves, but it will take more time to unravel those aspects of the code.

What is immediately clear is that there are several levels to the ranking (L1, L2, L3) and there is an assortment of ranking models that can be selected at each level.

The selecting_rankings_model.cpp file suggests that different ranking models can be considered at each layer throughout the process. This is essentially how neural networks work. Each level is an aspect that completes operations, and their combined computations yield the re-ranked list of documents that ultimately appears as the SERP. I’ll follow up with a deep dive on MatrixNet when I have more time. For those that need a sneak peek, check out the Search Results Ranker patent.
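Conceptually, a multi-level ranking setup looks something like the cascade below, where each level re-scores a shrinking set of survivors with a richer model. This is an illustration of the general pattern, not Yandex’s actual L1/L2/L3 implementation.

```python
def rank_cascade(query, candidates, levels):
    """Illustrative cascade: each level re-scores the survivors of the previous
    one with a (cheaper or richer) model and keeps a shrinking top-k."""
    survivors = candidates
    for model, keep_top in levels:
        scored = sorted(survivors, key=lambda doc: model(query, doc), reverse=True)
        survivors = scored[:keep_top]
    return survivors

# e.g. levels = [(cheap_linear_model, 10_000), (matrixnet_like_model, 1_000), (final_model, 10)]
```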

For now, let’s take a look at some interesting ranking factors.

Top 5 negatively weighted initial ranking factors

Here is a list of the most negatively weighted initial ranking factors with their weights and a brief explanation based on their descriptions translated from Russian.

In summary, these factors indicate that, for the best score, you should:

Everything else on this list is out of your control.

Top 5 positively weighted initial ranking factors

Next, here is a list of the highest positively weighted initial ranking factors.

There are plenty of unexpected initial ranking factors 

What is most interesting about the initial weighted ranking factors are the unexpected ones. The following is a list of seventeen factors that stood out.

The first takeaway from reviewing these strange ranking factors, and the variety of those available across the Yandex codebase, is that there are many things that could be a ranking factor.

I suspect that Google’s reported “200 signals” are actually 200 classes of signals where each signal is a composite built from many other components. In much the same way that Google Analytics has dimensions with many metrics associated, Google Search likely has classes of ranking signals composed of many features.

Yandex scrapes Google, Bing, YouTube and TikTok

The codebase also reveals that Yandex has many parsers for other websites and their respective services. To Westerners, the most notable of those are the ones I listed in the heading above. In addition, Yandex has parsers for a variety of services that I was not familiar with as well as those for its own services.

What is immediately obvious is that the parsers are fully featured. Every meaningful component of Google’s SERP is extracted. In fact, anyone who might be considering scraping any of these services would do well to review this code.

There is other code that indicates Yandex is using some Google data as part of its DSSM calculations, but the 83 Google-named ranking factors themselves make it clear that Yandex has leaned on Google’s results quite heavily.

Obviously, Google would never pull the Bing move of copying another search engine’s results, nor be reliant on one for core ranking calculations.

Yandex has anti-SEO upper bounds for some ranking factors

315 ranking factors have thresholds beyond which any computed value indicates to the system that that feature of the page is over-optimized. 39 of these ranking factors are part of the initially weighted factors that may keep a page from being included in the initial posting list. You can find these in the spreadsheet I’ve linked above by filtering on the Rank Coefficient and the Anti-SEO column.

It’s not unreasonable to expect that all modern search engines set thresholds on certain factors that SEOs have historically abused, such as anchor text, CTR, or keyword stuffing. For example, Bing was said to leverage abusive usage of the meta keywords tag as a negative factor.
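The leak doesn’t spell out exactly how these thresholds are applied downstream, but conceptually an anti-SEO upper bound might behave something like this sketch, where a factor stops helping (or starts hurting) once it crosses the bound.

```python
def apply_upper_bound(value, threshold, penalty=0.0):
    """Conceptual sketch: once a factor exceeds its anti-SEO threshold,
    stop rewarding it (or treat it as over-optimization and penalize it)."""
    if value <= threshold:
        return value
    return penalty  # alternatively: return threshold, to simply cap the benefit

# e.g. exact-match anchor text ratio: helpful up to the threshold, then a liability
print(apply_upper_bound(0.35, threshold=0.4))  # 0.35, below the bound, kept as-is
print(apply_upper_bound(0.80, threshold=0.4))  # 0.0, over-optimized, zeroed out
```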

Yandex boosts “Vital Hosts”

Yandex has a series of boosting mechanisms throughout its codebase. These are artificial improvements to certain documents to ensure they score higher when being considered for ranking.

Below is a comment from the “boost wizard” that suggests smaller files benefit best from the boosting algorithm.

There are many types of boosts; I’ve seen one boost related to links, and I’ve also seen a series of “HandJobBoosts,” which I can only assume is an awkward translation of “manual” changes.

One of these boosts I found particularly interesting relates to “Vital Hosts,” where a vital host can be any site that is specified. Specifically mentioned in the variables is NEWS_AGENCY_RATING, which leads me to believe that Yandex gives a boost that biases its results toward certain news organizations.

Without getting into geopolitics, this is very different from Google in that they have been steadfast about not introducing biases like this into their ranking systems.

The structure of the document server

The codebase reveals how documents are stored in Yandex’s document server. This is helpful in understanding that a search engine does not simply make a copy of a page and save it to its cache; it also captures various features and metadata to use in the downstream ranking process.

The screenshot below highlights a subset of the features that are particularly interesting. Other files with SQL queries suggest that the document server has closer to 200 columns, including the DOM tree, sentence lengths, fetch time, a series of dates, antispam score, redirect chain, and whether or not the document is translated. The most complete list I’ve come across is in /robot/rthub/yql/protos/web_page_item.proto.

What is most interesting in the subset here is the number of simhashes that are employed. Simhashes are numerical representations of content and search engines use them for quick comparison to determine duplicate content. There are many cases in the robot archive that indicate that duplicate content is explicitly removed.
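For reference, here is a minimal 64-bit simhash in Python. It’s the textbook token-hashing approach, not Yandex’s implementation, but it shows why near-duplicate pages end up with fingerprints that differ in only a few bits.

```python
import hashlib

def simhash(text, bits=64):
    """Minimal simhash: hash each token, sum signed bit votes, keep the sign bits."""
    vector = [0] * bits
    for token in text.lower().split():
        digest = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            vector[i] += 1 if digest & (1 << i) else -1
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# Near-duplicate pages produce fingerprints that differ in only a few bits.
a = simhash("yandex ranking factors leaked in the source code")
b = simhash("yandex ranking factors leaked in source code")
print(hamming_distance(a, b))  # small number -> likely duplicates
```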

Also, as part of the indexing process, the code base features TF-IDF, BM25 and BERT in its text processing pipeline. It is not clear why all these mechanisms exist in the code because there is some redundancy in the use of all of them.
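For completeness, this is the textbook BM25 scoring formula that apparently sits alongside TF-IDF and the BERT embeddings. It’s the standard version, not something lifted from the codebase.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Textbook BM25: term-frequency saturation (k1) plus length normalization (b)."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        tf = doc_terms.count(term)
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc_terms) / avg_len))
    return score

corpus = [doc.split() for doc in [
    "yandex source code leak",
    "google ranking factors",
    "yandex ranking factors leak",
]]
print(bm25_score("yandex leak".split(), corpus[2], corpus))
```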

Link factors and prioritization

The codebase also reveals a lot of information about link factors and how links are prioritized.

Yandex’s link spam calculator has 89 factors it looks at. Anything marked as SF_RESERVED is deprecated. Where provided, you can find the descriptions of these factors in the Google Sheet linked above.

In particular, Yandex has host rank and some scores that appear to live on long after a site or page has developed a reputation for spamming.

Another thing Yandex does is review copy across a domain and determine whether there is duplicate content with those links. This can be sitewide link placements, links on duplicate pages, or simply links with the same anchor text coming from the same site.

This illustrates how trivial it is to discount multiple links from the same source, and clarifies how important it is to target more unique links from more diverse sources.
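The code doesn’t tell us the exact discounting formula, but conceptually it could look something like this sketch, where repeated source/anchor combinations are worth progressively less.

```python
from collections import defaultdict

def effective_links(links):
    """Conceptual sketch: repeated (source domain, anchor text) combinations get
    discounted so only the first link from a source/anchor pair counts in full."""
    seen = defaultdict(int)
    total = 0.0
    for source_domain, anchor_text in links:
        seen[(source_domain, anchor_text)] += 1
        total += 1.0 / seen[(source_domain, anchor_text)]  # 1, then 1/2, 1/3, ...
    return total

links = [
    ("blog.example.com", "best seo tool"),
    ("blog.example.com", "best seo tool"),   # sitewide repeat, heavily discounted
    ("news.example.org", "best seo tool"),   # new domain, counts in full
]
print(effective_links(links))  # 2.5 instead of 3.0
```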

What can we apply from Yandex to what we know about Google?

Naturally, this is always the question on everyone’s mind. While there are certainly many analogies between Yandex and Google, in truth, only a Google software engineer working in Search could definitively answer this question.

However, this is the wrong question.

Indeed, this code should help expand our thinking about modern search. Much of the collective understanding of search was built from what the SEO community learned in the early 2000s through testing and from the mouths of search engineers back when search was far less opaque. That, unfortunately, has not kept up with the rapid pace of innovation.

Insights from the many characteristics and factors of the Yandex leak should produce more hypotheses of things to test and consider for ranking in Google. They also introduce more things that can be analyzed and measured by SEO crawling, link analysis and ranking tools.

For example, a measure of the cosine similarity between queries and documents using BERT embeddings could be valuable to understand versus competitor pages, since it is something modern search engines are doing themselves.
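As a sketch of what that could look like in practice, here’s a version using the sentence-transformers library. The model choice is my assumption for illustration, and this is certainly not the embedding model Yandex uses.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Model choice is an assumption for illustration; any sentence-level BERT model works.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "how do search engines rank pages"
pages = {
    "your_page": "Our guide explains how search engines crawl, index and rank pages.",
    "competitor_page": "A deep dive into ranking systems used by modern search engines.",
}

query_vec = model.encode(query, convert_to_tensor=True)
for name, text in pages.items():
    page_vec = model.encode(text, convert_to_tensor=True)
    similarity = util.cos_sim(query_vec, page_vec).item()
    print(name, round(similarity, 3))
```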

Much in the way the AOL search logs moved us away from guessing the distribution of clicks on a SERP, the Yandex codebase moves us away from the abstract toward the concrete, and our “it depends” statements can be better qualified.

Therefore, this codebase is a gift that will keep on giving. It’s only been a weekend and we’ve already gotten some very compelling insights from this code.

I anticipate that some ambitious SEO engineers with far more time on their hands will keep digging and maybe even fill in enough of what is missing to compile this thing and get it working. I also believe that engineers at the various search engines will pore through it and analyze the innovations they can learn from and add to their own systems.

Simultaneously, Google’s lawyers will likely write aggressive cease and desist letters related to all scraping.

I am eager to see the evolution of our space that is driven by curious people who maximize this opportunity.

But hey, if getting insights from actual code is not valuable to you, you’re welcome to go back to doing something more important, like arguing about subdomains versus subdirectories.

The opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.