It is mentioned in the text, but I am sharing the Figshare supplementary data here for slightly easier discoverability: http://figshare.com/articles/Dark_Research_supplementary_materials/1284356
Really nice work, Ross.
Really interesting study... I have a few comments. The first is how comparable are PLoS and Zootaxa from the perspective of search engines? Am I correct that you used a complete set of Zootaxa PDFs obtained from the NHM? In other words, articles that are both open access and behind a paywall? This is unlikely to be the same as the collection visible to search engines (the bulk of Zootaxa articles are not open access). Perhaps a better question is how the open access subset of Zootaxa compares to PLoS?
Do we know whether Google Scholar indexes PLoS PDFs or indexes the HTML? I know Google Scholar can be pretty opaque about its methods, but are you possibly confounding different media (PDF versus HTML) with different degrees of access?
Did you talk to Zhi-Qiang Zhang (editor of Zootaxa)? He should be able to tell you what access Google Scholar (and other search engines) have to Zootaxa content. I know that Zhi-Qiang is very keen to maximise the visibility of Zootaxa content (which is one reason why they've adopted DOIs, and moved to the OJS platform for publishing content). You are making various statements about how you think search engines access content; it would be interesting to actually know.
Lastly, as I'm sure you are aware, there is a world of difference between a well funded organisation like PLoS that serves a (mostly) well funded scientific community (which has grants to cover the costs of publishing), and a shoe-string operation serving a community that is not all well funded. You might say that "all things being equal" publishing in PLoS is a better bet in terms of exposing your research, but things are anything but equal.
line 63: concerning the inclusion of MAS, you might want to look at this preprint paper about the possible deprecation of MAS: http://arxiv.org/abs/1404.7045
table 3: Do you compare the mean recall with the n/a values included? Is this a fair comparison?
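To illustrate why the handling of n/a values matters for this comparison, here is a minimal Python sketch. The recall values are made up for illustration and are not taken from table 3; the function name `mean_recall` and its `na_as_zero` flag are likewise hypothetical.

```python
from statistics import mean

def mean_recall(values, na_as_zero=False):
    """Mean of recall values, where None stands for an n/a entry.

    na_as_zero=False: n/a entries are dropped before averaging.
    na_as_zero=True:  n/a entries are counted as a recall of 0.
    """
    if na_as_zero:
        return mean(0.0 if v is None else v for v in values)
    present = [v for v in values if v is not None]
    return mean(present) if present else None

# Hypothetical recall values for one search engine (None = n/a).
recalls = [0.75, None, 0.25, None]

excluded = mean_recall(recalls)                   # averages 0.75 and 0.25 only
as_zero = mean_recall(recalls, na_as_zero=True)   # averages all four entries
print(excluded, as_zero)
```

The two conventions give different means for the same row, so it would help to state which one the table uses.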
177: Concerning the importance of terms in the title, this is not apparent from the results. Such a result would also be useful for discussing the importance of choosing a good title.
186: OA doesn't seem to solve the problem for Scopus
205: perhaps consider moving this paragraph to the Discussion, as that would make the recommendations clearer
212: considering PLOS ONE is also available in HTML & XML, which could be more easily indexable than PDF, I'm unsure whether the result should be attributed fully to open access as done in this speculation, or whether it has a technical reason. While you control for the difference between PDF & HTML in your local search method, this does not mean GS also does this. You might want to compare with an OA journal that only does OA, or a closed journal with HTML (& XML) versions.
217: I'm unsure whether the current analysis is sufficient to call research "dark research"; less than 100% recall for a single search term does not mean the paper is not discoverable with another. Boeker, Vach & Motschall (2013, doi:10.1186/1471-2288-13-131) demonstrate systematic review with much more comprehensive queries. One might ask whether single keyword searches can constitute a demonstration of systematic review.
I'm wondering if you could use the authors' own keywords as a recall test? 1) mine the keywords the author provides; 2) test whether submitting those exact keywords to the search index (all of them at first, then perhaps iterating through fewer) returns the paper via the various search engines. This would require mining the keywords, which would be trivial if papers were nicely marked up, but of course won't be in reality. Following that you could, perhaps, script the analysis of whether a hit happens for a search engine? The idea is that if we can't find the paper via exactly the words the author thinks are most important, and indeed those that should index a paper, then something is (globally) amiss.
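The keyword test described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `keyword_recall` is a hypothetical helper, and `toy_search` is an in-memory stand-in for a real search engine, since no particular search API is assumed here. Keywords are treated as single tokens to keep the toy index simple.

```python
from itertools import combinations

def keyword_recall(keywords, target_id, search, min_terms=1):
    """Return the largest keyword subset whose query retrieves target_id.

    Tries all keywords first, then iterates through smaller subsets.
    `search` is any callable mapping a query string to a list of ids.
    Returns None if no subset retrieves the paper.
    """
    for size in range(len(keywords), min_terms - 1, -1):
        for subset in combinations(keywords, size):
            if target_id in search(" ".join(subset)):
                return subset
    return None

# Hypothetical in-memory index standing in for a search engine.
TOY_INDEX = {
    "doc-1": "dark research discoverability zootaxa",
    "doc-2": "open access plos indexing",
}

def toy_search(query):
    """Return ids of documents containing every term in the query."""
    terms = query.lower().split()
    return [doc_id for doc_id, text in TOY_INDEX.items()
            if all(term in text.split() for term in terms)]

hit = keyword_recall(["zootaxa", "discoverability", "plos"], "doc-1", toy_search)
print(hit)
```

In a real run, `toy_search` would be replaced by one wrapper per search engine, and a result of `None` would flag a paper that cannot be found even with the author's own keywords.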
I do not think your abstract reflects what you have going for you in this paper until the last three sentences. I would suggest that you revise your abstract if possible to sell your good work a bit better.
I greatly enjoyed this work, and as I was scrolling through the feedback I ran across one of your responses which is exactly why I enjoyed the work:
"I want to test the discoverability of articles (regardless of OA or not). Yes, it does seem reasonable to pre-suppose that open access articles might be advantaged, but until we prove that with data I can't just make that assumption."
As a librarian I have struggled with blanket complaints about the accessibility of materials without any sufficient data to really back up why we chose to use certain retrieval systems.
I see potential for subject liaison librarians to be able to apply your methods to better understand information retrieval from a different perspective. Typically we as librarians tend to approach retrieval from a materials perspective and rely on our data-savvy colleagues or technology librarians to assist us with understanding the technical aspects of information retrieval. I have learned over the past three years that this is a poor practice as all librarians in 2015 are technology librarians and have to obtain the skills to analyze these tools properly.
I found your idea to attempt to ascertain "causitive mechanism(s) preventing Zootaxa content from being more discoverable via services such as GS" to be another angle to approach the retrieval issues.
I appreciated that you "scored the sections of the article the keywords occurred in." I think this adds another layer to research and I hope that you continue to keep the "fine-grain" as a part of your future analysis.
I am specifically an Open Education Librarian and I find that some of my time is dedicated to finding work-arounds for access to materials. While it is true that open journal articles might appear to be more easily discovered, we do need consistent data to support why this might be so. We can no longer get away with promoting open content without this data.
Good work, Ross. Thank you.
I'm not sure about other disciplines, but for medical content, it depends on the question. So you structure the research question you want to retrieve articles for in a PICO (Population, Intervention, Comparator, Outcome) format. This has been recommended by groups like Cochrane and deemed the most effective starting point, which ensures that the most relevant and sufficient databases are consulted for retrieving articles.
There seems to be no point looking up numerous databases with complex coding systems when enough articles are retrieved to see the effect via meta-analysis, and ultimately answer the research question. That said, retrieval codes and database refinement are areas of ongoing research.
I find, however, that pre-prints and unpublished content not indexed in databases are a growing thing, and I guess the challenge would be designing retrieval systems that would differentiate between good and bad research designs. To go even further: retrieving data from repositories and anywhere in the online diaspora, and having tools to distinguish between "good" and "bad" data. Right now there are bias appraisal tools, but they are qualitative and not quantitative.
I've been thinking about this the last couple of days... some sort of relativist transparency mechanism. Since novel findings are novel, transparency tools would remain retrospective, and can only differentiate good and bad data from existing literature. If the tool was "meta" and measured the in-betweens, so a relative pattern between data and not THE data, then maybe a differentiation between good and bad data might be possible even for novel findings. I can't think of theoretical paradigms for this, or perhaps there are some in computational frameworks ...