Our Work
Semantic Scholar Publications
We are an interdisciplinary research team focused on AI, HCI, ML, NLP, accessibility and computational social science in support of Semantic Scholar's mission of accelerating science. Our team is part of the Allen Institute for AI, a nonprofit research institute advancing AI for the common good.
Scholarly text is often laden with jargon, or specialized language that can facilitate efficient in-group communication within fields but hinder understanding for readers outside them.
Recent work has shown that infusing layout features into language models (LMs) improves processing of visually-rich documents such as scientific papers.
What is the effect of releasing a preprint of a paper before it is submitted for peer review? No randomized controlled trial has been conducted to answer this question.
This work introduces a simple yet effective framework for handling such complex queries by decomposing the query into individual clues, routing those as sub-queries to specialized retrievers, and ensembling the results.
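A minimal sketch of this decompose-route-ensemble idea is below; the clue types, retriever registry, and reciprocal-rank fusion are illustrative assumptions rather than the paper's actual implementation.

```python
# Illustrative sketch: split a complex query into typed clues, send each clue to a
# retriever specialized for that clue type, and fuse the ranked lists.
from collections import defaultdict

def ensemble_search(query, decompose, retrievers, top_k=10):
    """decompose(query) -> [(clue_type, clue_text), ...];
    retrievers maps clue_type -> callable returning a ranked list of doc ids."""
    scores = defaultdict(float)
    for clue_type, clue_text in decompose(query):
        ranked_doc_ids = retrievers[clue_type](clue_text)   # e.g. dense, sparse, metadata
        for rank, doc_id in enumerate(ranked_doc_ids):
            scores[doc_id] += 1.0 / (60 + rank)              # reciprocal-rank fusion
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```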
This work proposes a question-answering framework for decontextualization that allows for better handling of user information needs and preferences when determining the scope of rewriting, and presents results showing state-of-the-art LLMs under this framework remain competitive with end-to-end approaches.
It is shown how a simple embedding recycling (ER) technique that caches activations from an intermediate layer of a pretrained model and learns task-specific adapters on the later layers is broadly effective and reveals important areas for future work.
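A simplified sketch of the cached-activation idea follows: the lower layers of a frozen encoder run once, their hidden states are stored, and only a small adapter and task head are trained on top. The model name, layer index, and adapter shape are assumptions for illustration, not the paper's configuration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()  # frozen

@torch.no_grad()
def cache_intermediate(texts, layer=6):
    """Cache hidden states from an intermediate layer (computed once, reused across tasks)."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    out = encoder(**batch, output_hidden_states=True)
    return out.hidden_states[layer], batch["attention_mask"]

class TaskHead(torch.nn.Module):
    """Lightweight task-specific layers trained on top of the cached activations."""
    def __init__(self, hidden=768, num_labels=2):
        super().__init__()
        self.adapter = torch.nn.Sequential(
            torch.nn.Linear(hidden, 128), torch.nn.ReLU(), torch.nn.Linear(128, hidden))
        self.classifier = torch.nn.Linear(hidden, num_labels)

    def forward(self, cached_states, mask):
        h = cached_states + self.adapter(cached_states)            # residual adapter
        pooled = (h * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)
        return self.classifier(pooled)
```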
LongEval is presented, a set of guidelines for human evaluation of faithfulness in long-form summaries that addresses the following challenges: How can high inter-annotator agreement on faithfulness scores be achieved?
A neural baseline method is introduced for entity linking (EL) on scientific tables containing many out-of-knowledge-base mentions, and it significantly outperforms a state-of-the-art generic table EL method.
CiteSee is a paper reading tool that leverages a user’s publishing, reading, and saving activities to provide personalized visual augmentations and context around citations to help users prioritize their exploration.
In order to help scholars understand and follow a research topic, significant research has been devoted to creating systems that help schola...
This work designs a system, Relatedly, that scaffolds exploring and reading multiple related work paragraphs on a topic, with features including dynamic re-ranking and highlighting to spotlight unexplored dissimilar information, auto-generated descriptive paragraph headings, and low-lighting of redundant information.
It is argued that developing AI supports for expository writing has unique and exciting research challenges and can lead to high real-world impacts.
We present Queer in AI as a case study for community-led participatory design in AI. We examine how participatory design and intersectional ...
This work introduces Scim, a novel intelligent interface that helps experienced researchers skim – or rapidly review – a paper to attain a cursory understanding of its contents and discusses design considerations and tensions for the design of future intelligent skimming tools.
This paper describes the Semantic Reader Project, a collaborative effort across multiple institutions to explore automatic creation of dynamic reading interfaces for research papers, and develops and releases a production reading interface that will incorporate the best features as they mature.
Traditionally, writing assistance systems have focused on short or even single-word suggestions. Recently, large language models like GPT-3 ...
This paper combines public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction to build the Semantic Scholar Academic Graph, the largest open scientific literature graph to date.
It is found that existing summarizers suffer large reductions in performance when applied as-is to this more realistic task, though training summarizers with retrieved inputs can reduce their sensitivity to retrieval errors.
Empirical results suggest that scale is not the only way forward and that novel algorithms can be a promising alternative; this work also yields Gen-A-tomic, a new corpus of generics that is the largest and highest-quality available to date.
This work considers design choices for the annotation interface used to elicit human judgments and their impact on reproducibility, and develops an automated mechanism for maintaining annotator quality via a probabilistic model that detects and excludes noisy annotators.
This paper proposes to train a GenQA model by transferring knowledge from a trained AS2 model, and to use the AS2 model's prediction scores for loss weighting and score-conditioned input/output shaping to aid the knowledge transfer.
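A rough sketch of score-based loss weighting under this setup: each training example's generation loss is scaled by the frozen AS2 model's confidence in the candidate sentence. The `as2_model.score` and `gen_model.nll` interfaces are hypothetical placeholders, not the paper's code.

```python
import torch

def weighted_genqa_loss(gen_model, as2_model, question, candidates, target_answer):
    """Scale each candidate's generation loss by the AS2 teacher's confidence."""
    losses, weights = [], []
    for cand in candidates:
        weight = torch.sigmoid(as2_model.score(question, cand))   # assumed teacher API
        loss = gen_model.nll(question=question, context=cand, target=target_answer)
        losses.append(loss)
        weights.append(weight.detach())        # do not backprop into the teacher
    losses, weights = torch.stack(losses), torch.stack(weights)
    return (weights * losses).sum() / weights.sum()
```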
This paper proposes three novel sentence-level transformer pre-training objectives that incorporate paragraph-level semantics within and across documents, to improve the performance of transformers for AS2, and mitigate the requirement of large labeled datasets.
This paper proposes a Multiple Heads Student architecture (named CERBERUS), an efficient neural network designed to distill an ensemble of large transformers into a single smaller model, rivaling the state-of-the-art large AS2 models that have 2.7x more parameters and run 2x slower.
It is shown how state-of-the-art models struggle to generalize across task formats, and that simple multi-task training fails to improve them, and a new approach that learns multiple embeddings per document, each tailored to a different format, can improve performance.
This paper introduces GenTyDiQA, an extension of the TyDiQA dataset with well-formed and complete answers for Arabic, Bengali, English, Japanese, and Russian questions, and presents the first cross-lingual answer sentence generation system (Cross-Lingual GenQA).
BLOOM is a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers and achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning.
It may be possible for readers of all abilities to organically leave traces in papers, and these traces can be used to facilitate navigation tasks, in particular for low-vision readers.
This work develops a tool, integrated into users’ reading process, that helps them leverage authors’ existing summarization of threads, typically found in introduction or related work sections, in order to situate their own work’s contributions.
This work introduces a new technique, polymorphic lenses, that improves exploratory search over a KG by obtaining new leverage from the existing preference models that KG-based systems maintain for recommending content.
SciFact-Open is presented, a new test collection designed to evaluate the performance of scientific claim verification systems on a corpus of 500K research abstracts, and it is found that systems developed on smaller corpora struggle to generalize to SciFact-Open, exhibiting performance drops of at least 15 F1.
It is empirically demonstrated that MulCo provides improved ability to fuse local and global contexts encoded using BERT and GNN compared to the current state of the art.
MultiVerS is presented, which predicts a fact-checking label and identifies rationales in a multitask fashion based on a shared encoding of the claim and full document context; this approach allows MultiVerS to perform weakly supervised domain adaptation by training on scientific documents labeled using high-precision heuristics.
This work presents FEB -- a standardized collection of four existing English-language datasets and associated metrics -- identifies the right prompting approach by extensively exploring natural language prompts on FEB, and demonstrates that making progress on few-shot self-rationalization is possible.
This work proposes a novel method for equipping long-context QA models with an additional sequence-level objective for better identification of the supporting evidence, via an additional contrastive supervision signal in finetuning.
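One way such a sequence-level contrastive objective could look, assuming pooled question and paragraph vectors are available; the temperature, pooling, and loss weighting are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def evidence_contrastive_loss(question_vec, paragraph_vecs, evidence_mask, temperature=0.1):
    """question_vec: (d,); paragraph_vecs: (n, d); evidence_mask: (n,) bool tensor.
    Pulls the question toward evidence paragraphs and away from the rest."""
    sims = F.cosine_similarity(question_vec.unsqueeze(0), paragraph_vecs) / temperature
    log_probs = F.log_softmax(sims, dim=0)
    return -(log_probs[evidence_mask]).mean()   # NLL over the true evidence paragraphs

# Combined with the usual answer loss during finetuning, e.g.:
# total_loss = qa_loss + contrastive_weight * evidence_contrastive_loss(q, p, mask)
```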
A novel system automatically retrieves patient-specific literature based on intensive care unit (ICU) patient information, aggregates relevant papers, and fuses them with internal admission notes to form outcome predictions; it substantially boosts predictive accuracy on three challenging tasks in comparison to strong recent baselines.
This paper shows that popular pre-trained transformers perform poorly when used for fine-tuning on multi-candidate inference tasks, and proposes a new pre-training objective that models the paragraph-level semantics across multiple input sentences.
Multi-LexSum is introduced, a collection of 9,280 expert-authored summaries drawn from ongoing Civil Rights Litigation Clearinghouse (CRLC) writing; it presents a challenging multi-document summarization task given the length of the source documents, which often exceed two hundred pages per case.
The framework presented is a multi-party international governance structure focused on language data, incorporating the technical and organizational tools needed to support its work.
We present Aspire, a new scientific document similarity model based on matching fine-grained aspects.
We introduce new methods for incorporating VIsual LAyout (VILA) structures, e.g., the grouping of page texts into text lines or text blocks, into language models to further improve performance on automated scientific document understanding.
Grounding model predictions in clinically-relevant symptoms can improve generalizability while producing a model that is easier to inspect, and this approach can still perform competitively on in-domain data.
This tutorial aims at bringing interested NLP researchers up to speed about the recent and ongoing techniques for zero- and few-shot learning with pretrained language models.
This work presents a novel framework informed by linguistic theory to generate exemplars (specific cases when a generic holds true or false), and highlights the importance of linguistic theory-based controllability for generating exemplars, the insufficiency of knowledge bases as a source of exemplars, and the challenges exemplars pose for the task of natural language inference.
This work proposes scientific claim generation, the task of generating one or more atomic and verifiable claims from scientific sentences, and demonstrates its usefulness in zero-shot fact checking for biomedical claims; it also proposes CLAIMGEN-BART, a new supervised method for generating claims supported by the literature, as well as KBIN, a novel method for generating claim negations.
ACCoRD, an end-to-end system tackling the novel task of generating sets of descriptions of scientific concepts, takes advantage of the myriad ways a concept is mentioned across the scientific literature to produce distinct, diverse descriptions of target scientific concepts in terms of different reference concepts.
PRIMERA is introduced, a pre-trained model for multi-document representation with a focus on summarization that reduces the need for dataset-specific architectures and large amounts of fine-tuning labeled data.
A novel computational representation that automatically breaks up products into fine-grained functional facets is proposed that leads to a significant boost in search accuracy and in the quality of creative inspirations, outperforming strong baselines and state-of-the-art representations of product texts by 50-60%.
This work introduces multiple new methods for augmenting recommendations with textual relevance messages that highlight knowledge-graph connections between recommended papers and a user’s publication and interaction history and develops a novel method that highlights connections with proxy authors of interest to users.
This work contributes two datasets to the study of mentorship, one of which has over 300,000 ground truth academic mentor-mentee pairs obtained from multiple diverse, manually-curated sources, and linked to the Semantic Scholar (S2) knowledge graph.
We construct a faceted representation of authors with information gleaned from their papers and inferred author personas, and use it to develop an approach that locates commonalities ("bridges") and contrasts between scientists. This approach helps users discover authors considered useful for generating novel research directions.
A National Science Foundation Convergence Accelerator project is described to build a set of Knowledge Network Programming Infrastructure systems that address the frustratingly slow process of building, using, and scaling large knowledge networks.
A novel paper reading experience that integrates relevant information about follow-on work directly into a paper, allowing readers to learn about newer papers and see how a paper is discussed by its citing papers in the context of the reference paper.
PINOCCHIO is presented, a new decoding method that improves the consistency of a transformer-based abstractive summarizer by constraining beam search to avoid hallucinations.
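The general idea of constraining beam search with a consistency check can be sketched as follows; the groundedness test below is a crude placeholder for illustration only, not PINOCCHIO's actual consistency measure.

```python
import heapq

def constrained_beam_search(step_fn, source_tokens, beam_size=4, max_len=64, eos="</s>"):
    """step_fn(prefix) -> list of (token, log_prob) proposals from the summarizer."""
    grounded = set(source_tokens)
    beams = [(0.0, [])]                                   # (cumulative neg. log prob, tokens)
    for _ in range(max_len):
        candidates = []
        for neg_logp, tokens in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((neg_logp, tokens))     # keep finished hypotheses
                continue
            for tok, logp in step_fn(tokens):
                if tok.isalpha() and tok not in grounded and tok != eos:
                    continue                              # prune ungrounded content words
                candidates.append((neg_logp - logp, tokens + [tok]))
        if not candidates:
            break
        beams = heapq.nsmallest(beam_size, candidates)
        if all(t and t[-1] == eos for _, t in beams):
            break
    return min(beams)[1] if beams else []
```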
This paper introduces LIMEADE, the first general framework that translates both positive and negative advice (expressed using high-level vocabulary such as that employed by post-hoc explanations) into an update to an arbitrary, underlying opaque model.
To improve access to medical papers, we introduce a novel interactive interface-Paper Plain-with four features powered by natural language processing: definitions of unfamiliar terms, in-situ plain language section summaries, a collection of key questions that guide readers to answering passages, and plain language summaries of the answering passages.
This work examines an extreme evaluation setting wherein only a single known relevant document per query is available for evaluation, and finds that although the predictions of these One-Shot Labelers frequently disagree with human assessments, the labels they produce yield a far more reliable ranking of systems than the single known relevant documents do alone.
Our goal is to bolster the ability of researchers and clinicians to keep track of difficulties, limitations and emerging hypotheses.
Few-shot NLP research lacks a unified, challenging-yet-realistic evaluation setup. In response, we introduce FLEX, a rigorous few-shot learning NLP benchmark and public leaderboard measuring four transfer types. We also present UniFew, a simple, competitive baseline that does not rely on heavy prompt engineering or complex meta-learning methods.
This paper proposes generating personalized scientific concept descriptions that are tailored to the user’s expertise and context and outlines a complete architecture for the task and releases an expert-annotated resource, ACCoRD.
This work releases MS^2 (Multi-Document Summarization of Medical Studies), a dataset of over 470K documents and 20K summaries derived from the scientific literature; it facilitates the development of systems that can assess and aggregate contradictory evidence across multiple studies, and is the first large-scale, publicly available multi-document summarization dataset in the biomedical domain.
A new pretrained language model for cross-document tasks.
We present SciA11y, a system that renders inaccessible scientific paper PDFs into HTML.
An extension of cross-document coreference with a referential hierarchy over mention clusters, in the scientific document domain. New task, dataset and models with applications in faceted document retrieval and knowledge base construction.
Integrating scientific language models and graph embeddings for boosting drug discovery.
In response to the challenge of author name disambiguation (AND), we present S2AND, a unified benchmark dataset for AND on scholarly papers, as well as an open-source reference model implementation.
PAWLS is a new annotation tool designed specifically for the PDF document format. PAWLS supports span-based textual annotation, N-ary relations and freeform, non-textual bounding boxes, all of which can be exported in convenient formats for training multi-modal machine learning models.
We address the task of citation text generation: given a pair of scientific documents, explain their relationship in natural language text in the manner of a citation from one text to the other.
We introduce ParsiNLU, the first benchmark in Persian language that includes a range of high-level tasks -- Reading Comprehension, Textual Entailment, etc. These datasets are collected in a multitude of ways, often involving manual annotations by native speakers.
We highlight three understudied phenomena for citation context analysis and release MultiCite, a new dataset of 12.6K citation contexts from 1.2K computational linguistics papers that fully models these phenomena.
We present an overview of the SCIVER shared task. In addition to surveying the participating systems, we provide several insights into modeling approaches to support continued progress and future research on scientific claim verification.
To navigate the collection of COVID19 papers from different domains, we present a KB of mechanisms relating to COVID19, to support domain-agnostic search and exploration of general activities, functions, influences and associations in these papers.
Qasper is a dataset of 5049 questions over 1585 NLP papers designed to facilitate document-grounded, information-seeking QA. Existing models that do well on other QA tasks do not perform well on these questions.
A new robust and lightweight tool for acquiring, managing, and performing typical operations over datasets used in IR, with a primary focus on textual datasets used for ad-hoc search.
This work conducts mixed-method user studies on three datasets, where an AI with accuracy comparable to humans helps participants solve a task (explaining itself in some conditions), and observes complementary improvements from AI augmentation that were not increased by explanations.
We introduce ScholarPhi, an augmented reading interface that brings definitions of technical terms and symbols to readers when and where they need them most.
Accessibility research has grown substantially in the past few decades, yet there has been no literature review of the field. To understand current and historical trends, we created and analyzed a dataset of accessibility papers appearing at CHI and ASSETS since ASSETS' founding in 1994.
CODE introduces neuron-level analyses and transformations aimed at identifying and removing redundant computation from the networks that compose an ensemble, which enables CODE to train large DNN ensembles in a fraction of the time and memory footprint needed by current techniques.
This paper provides a comprehensive overview of the structure and results of TREC-COVID, an information retrieval (IR) shared task to evaluate search on scientific literature related to COVID-19.
The majority of scientific papers are distributed in PDF, a format that poses challenges for accessibility, especially for blind and low vision (BLV) readers. We characterize the scope of this problem...
An open-source library for streamlining the usage of deep learning in document image analysis research and applications.
An analysis of 2.87 million computer science papers reveals that, if current trends continue, parity between the number of male and female authors will not be reached in this century. With optimistic projection models, gender parity is forecast to be reached by 2100 in CS, but projected to be reached within two to three decades in the biomedical literature.
It is argued that AI systems should be trained in a human-centered manner, directly optimized for team performance, and the benefit of modeling teamwork during training is shown through improvements in expected team utility across datasets, considering parameters such as human skill and the cost of mistakes.
In this paper, we present a new method for generating extended summaries of long papers.
This work introduces GENIE, an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks. GENIE automatically posts leaderboard submissions to crowdsourcing platforms and presents both manual and automatic metrics on the leaderboard.
This review discusses the corpora, modeling resources, systems and shared tasks that have been introduced for COVID-19, and lists 39 systems that provide functionality such as search, discovery, visualization and summarization over the COVID-19 literature.
The results suggest that while CORD-19 exhibits a strong tilt toward recent and topically focused articles, the knowledge being explored to attack the pandemic encompasses a much longer time span and is very interdisciplinary.
This work adapts the Golden Rules Set (a language-specific set of sentence boundary exemplars), originally implemented as the Ruby gem pragmatic_segmenter, to Python, with additional improvements and functionality.
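Assuming the Python port referred to here is the pysbd package, usage looks roughly like this:

```python
import pysbd

seg = pysbd.Segmenter(language="en", clean=False)
sentences = seg.segment("Dr. Smith et al. (2020) report p < 0.05. See Fig. 2 for details.")
# Expected to split into two sentences without breaking on abbreviations like "Dr." or "et al."
print(sentences)
```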
The task of definition detection is important for scholarly papers, because papers often make use of technical terminology that may be unfamiliar to readers. We develop a new definition detection system, HEDDEx, that utilizes syntactic features, transformer encoders, and heuristic filters, and evaluate it on a standard sentence-level benchmark.
We construct SciFact, a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts annotated with labels and rationales. We develop baseline models for SciFact, and demonstrate that these models benefit from combined training on a large dataset of claims about Wikipedia articles, together with the new SciFact data.
We introduce TLDR generation for scientific papers, a new automatic summarization task with high source compression and provide a new dataset and models for effective generation of TLDRs.
SciSight is a novel framework for exploratory search of COVID-19 research that integrates two key capabilities: first, exploring interactions between biomedical facets (e.g., proteins, genes, drugs, diseases, patient characteristics); and second, discovering groups of researchers and how they are connected.
We present a zero-shot ranking algorithm that adapts to COVID-related scientific literature. Our approach filters training data from another collection down to medical-related queries, uses a neural reranking model pre-trained on scientific text (SciBERT), and filters the target document collection.
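A rough sketch of this pipeline under stated assumptions: the medical-query filter, language filter, and reranker interface below are illustrative stand-ins, not the paper's exact components.

```python
def covid_zero_shot_pipeline(train_queries, target_docs, bm25, rerank, is_medical):
    """bm25(query, docs) -> ranked docs; rerank(query, text) -> relevance score."""
    # 1) keep only medical-related training queries from the source collection
    medical_queries = [q for q in train_queries if is_medical(q)]
    # ... fine-tune the SciBERT-based reranker on `medical_queries` (omitted) ...

    # 2) filter the target document collection (illustrative criterion)
    candidates = [d for d in target_docs if d.get("language") == "en"]

    # 3) first-stage retrieval followed by neural reranking
    def search(query, top_k=100):
        first_stage = bm25(query, candidates)[:top_k]
        return sorted(first_stage, key=lambda d: rerank(query, d["abstract"]), reverse=True)
    return search
```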
To address challenges in figure retrieval and figure-to-text alignment, we introduce MedICaT, a dataset of medical images in context.
A new comprehensive framework for Analyzing the Behavior of Neural IR ModeLs (ABNIRML), which includes new types of diagnostic tests that allow us to probe several characteristics, such as sensitivity to word order, that are not addressed by previous techniques.
This work investigates G-DAUG^C, a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting, and demonstrates that it produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance.
Ontologies are critical to support the types of big data analysis necessary for kidney precision medicine, where heterogeneous clinical, imaging and biopsy data from diverse sources must be combined to define a patient's phenotype.
A novel, unsupervised method for extracting scientific concepts from papers, based on the intuition that each scientific concept is likely to be introduced or popularized by a single paper that is disproportionately cited by subsequent papers mentioning the concept.
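The stated intuition can be sketched as follows; the phrase-mention test and thresholds are illustrative assumptions, not the paper's method.

```python
from collections import Counter

def find_origin_paper(phrase, papers, min_mentions=20, min_share=0.5):
    """papers: iterable of dicts with 'text' and 'citations' (list of cited paper ids).
    Returns the id of the paper disproportionately cited by papers mentioning `phrase`."""
    mentioning = [p for p in papers if phrase.lower() in p["text"].lower()]
    if len(mentioning) < min_mentions:
        return None                                   # too rare to score reliably
    counts = Counter(c for p in mentioning for c in set(p["citations"]))
    if not counts:
        return None
    cited_id, n = counts.most_common(1)[0]
    share = n / len(mentioning)
    # a concept-defining paper is cited by most papers that mention the concept
    return cited_id if share >= min_share else None
```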