The NucScholar project has two primary goals: 1) automated processing and sorting of nuclear science publications, and 2) an intuitive tool for natural-language search of the literature.
To meet the first goal, NucScholar seeks to develop a literature-processing pipeline that automatically retrieves, sorts, and extracts information from nuclear science articles. The text of each article is extracted directly from the PDF file and then processed with well-established natural language processing techniques (lemmatization, stemming, and stop-word removal). The resulting text is organized into a term-document matrix in which each entry is the frequency of a given word in a specific document. The singular value decomposition of the term-document matrix yields a new basis consisting of linear combinations of nuclear science terms, or “topics”, that allow for nuanced categorization of papers. This topic modeling is guided by a dictionary of nuclear-science-specific words generated from the indices of relevant textbooks.
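The term-document matrix and its singular value decomposition can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the four-sentence toy corpus, the eight-word `vocabulary`, and the helper `term_document_matrix` are hypothetical and are not part of NucScholar. Lemmatization, stemming, and stop-word removal are omitted, and restricting the matrix rows to the dictionary words is only one possible reading of how the dictionary guides the topic modeling.

```python
import re
from collections import Counter

import numpy as np

# Hypothetical toy corpus standing in for text extracted from PDF files.
docs = [
    "neutron capture cross section measured for neutron rich isotopes",
    "neutron capture rates and cross section data",
    "gamma decay of excited states and gamma ray spectroscopy",
    "gamma ray spectroscopy of excited nuclear states",
]

# Hypothetical stand-in for the dictionary built from textbook indices.
vocabulary = ["neutron", "capture", "cross", "section",
              "gamma", "decay", "spectroscopy", "excited"]


def term_document_matrix(documents, vocab):
    """Return a (terms x documents) matrix of raw word counts."""
    matrix = np.zeros((len(vocab), len(documents)))
    for j, text in enumerate(documents):
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        for i, term in enumerate(vocab):
            matrix[i, j] = counts[term]
    return matrix


A = term_document_matrix(docs, vocabulary)

# Thin SVD: A = U @ diag(s) @ Vt, with singular values in descending order.
# Each column of U is a "topic", i.e. a linear combination of terms.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of topics kept (an arbitrary choice for this toy corpus)

# Coordinates of each document in the k-dimensional topic basis.
doc_topics = (np.diag(s[:k]) @ Vt[:k]).T

# Crude categorization: the topic on which a document has the largest
# absolute coordinate. The sign of a singular vector is arbitrary.
labels = np.abs(doc_topics).argmax(axis=1)

# Term with the largest absolute weight in each retained topic.
top_terms = [vocabulary[int(np.abs(U[:, t]).argmax())] for t in range(k)]
```

In this toy corpus the two groups of documents share no dictionary words, so each retained topic is built from the terms of a single group. A real corpus would have overlapping vocabulary, and the topics would mix terms accordingly.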
The second goal of the project is a nuclear science search interface in which a user asks a specific, technical question and receives a specific, correct answer. To implement natural-language queries, we intend to fine-tune a pre-trained, generic language model (specifically Bidirectional Encoder Representations from Transformers, or BERT) on nuclear science literature. Once the BERT model has been adapted to the nuclear context, users will be able to interact with the search engine through an intuitive question-answer system, a significant improvement over the pedantic and difficult-to-use search utilities that are currently available. A query will return not only an answer to the initial question, but also a list of the most relevant papers and links to the appropriate entries in nuclear science databases.