Google Scholar: not so smart?

December 18, 2018 by Tony Russell-Rose

Most of us are familiar with Google Scholar: a freely available subset of Google that indexes the world’s scholarly literature across a range of disciplines. With its database of over 389 million documents including articles, citations and patents, it has become an indispensable resource for scholars and researchers across the globe. Which is why we recently added Google Scholar integration to 2Dsearch, thereby offering a tool of immediate utility to anyone wishing to search the world’s scientific literature in a systematic manner.

Now, we’d always known that despite its extensive index, searching Google Scholar (GS) is subject to the ‘secret sauce’ of Google’s search algorithms, and that this can compromise the ability of users to formulate sophisticated, reproducible searches. But what we perhaps didn’t realise was just how limited its support for such searches appears to be. In particular, GS seems to support only one set of synonyms (i.e. a substring with terms separated by OR). For example, the following search string would seem to parse correctly:

V AND (W OR X OR Y OR Z)

But the following would not:

(A OR B OR C) AND (D OR E OR F)

How do we know this? Well, if you click on the Advanced Search option, it shows how your query is being interpreted:

In this instance, it would appear to be interpreted as a single disjunction, which is clearly not what was intended. More importantly, there is no messaging to alert the user to this apparent rewriting of their query.
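To make the apparent rewriting concrete, here is a minimal Python sketch. It assumes a hypothetical nested representation of a query (a term, or an operator with a list of sub-queries) and a hypothetical `flatten_to_disjunction` helper that mimics what the dialog seems to show. It is purely illustrative and says nothing about how Google Scholar works internally.

```python
# Hypothetical representation (an assumption for illustration only):
# a query is either a term (str) or a tuple (operator, [sub-queries]).

def terms(query):
    """Collect every term in the query, left to right."""
    if isinstance(query, str):
        return [query]
    _, children = query
    found = []
    for child in children:
        found.extend(terms(child))
    return found

def render(query, top=True):
    """Render a query as a Boolean string, parenthesising nested clauses."""
    if isinstance(query, str):
        return query
    op, children = query
    text = f" {op} ".join(render(child, top=False) for child in children)
    return text if top else f"({text})"

def flatten_to_disjunction(query):
    """Discard all structure and OR every term together."""
    return ("OR", terms(query))

intended = ("AND", [("OR", ["A", "B", "C"]), ("OR", ["D", "E", "F"])])
print(render(intended))                          # (A OR B OR C) AND (D OR E OR F)
print(render(flatten_to_disjunction(intended)))  # A OR B OR C OR D OR E OR F
```

The second line of output is the kind of single disjunction the dialog appears to display: the terms survive, but the grouping that gave the query its meaning is gone.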

But can we really be sure that the GS Advanced Search dialog is an accurate representation of how queries are actually parsed? It turns out that this dialog is in fact not a reliable indicator, since it is not possible to “construct all possible expressions in the advanced search interface due to the limited number of available entry fields”. Instead, complex expressions have to be “copied and pasted as a whole into the single entry field of the simple search interface”. You can see empirical proof of the difference by comparing the following two queries:

aaa | bbb | ccc | ddd

(aaa | bbb) (ccc | ddd)

According to the Advanced Search dialog, these two queries are identical. However, the former retrieves ~10 times as many results. Clearly these queries are not being parsed in the same way.
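Why would the first query retrieve so many more results? A small Python sketch over an invented toy corpus illustrates the set logic. It assumes that `|` means OR and that adjacent parenthesised groups are implicitly ANDed, which is an assumption about the syntax rather than documented behaviour. The corpus, `matching` and `any_of` are all hypothetical.

```python
# Toy corpus (invented data): document id -> set of terms it contains.
corpus = {
    1: {"aaa"},
    2: {"bbb", "ccc"},
    3: {"ddd"},
    4: {"aaa", "ddd"},
    5: {"eee"},
}

def matching(term):
    """Return the ids of documents containing the term."""
    return {doc_id for doc_id, words in corpus.items() if term in words}

def any_of(*terms):
    """Union of matches: the OR of the given terms."""
    result = set()
    for term in terms:
        result |= matching(term)
    return result

# aaa | bbb | ccc | ddd : one flat disjunction
flat = any_of("aaa", "bbb", "ccc", "ddd")

# (aaa | bbb) (ccc | ddd) : assumed to be an AND of two disjunctions
grouped = any_of("aaa", "bbb") & any_of("ccc", "ddd")

print(sorted(flat))     # [1, 2, 3, 4]
print(sorted(grouped))  # [2, 4]
```

Under these assumptions the grouped query can only ever return a subset of the flat one, so a large gap in result counts is what we would expect if the two queries are being parsed differently.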

Which comes as a relief, since many of the search strategies in previously published works contain multiple nested clauses, including those referred to in my Medium post. It is reassuring to learn that despite the alarming ‘evidence’ provided by the Advanced Search dialog, we can have some degree of cautious optimism that the searches these published works rely on have been interpreted in the manner their authors intended.

As a result, we will shortly be updating our own query translation support to accommodate these insights. In the meantime, if you’d like to try out some complex queries on Google Scholar yourself (without the cutting and pasting :), head on over to 2Dsearch and let us know what you think.

Originally published at isquared.wordpress.com on December 18, 2018.

Tony Russell-Rose

Technology innovator with extensive experience in search and information retrieval. PhD in AI / NLP. Specialist skills in user experience, HCI and human factors