A wealth of software is available, and more is being created. Yet, finding software for a given purpose remains difficult. Few resources exist to help users discover alternatives or understand the differences between them.
Searching the web has drawbacks: (1) It's often difficult to think of good search terms. (2) Good software can be buried in thousands of search results. (3) Search results don't tell you how software packages differ in their features.
Asking colleagues has limits too: (1) Most people don't know about all the possible options or how the options differ. (2) Answers are potentially biased: what people do know may be out of date or incorrect.
Searching the literature is also problematic: (1) Papers are static, but software evolves rapidly. (2) Not all software projects have a publication associated with them. (3) Looking for software in papers is time-consuming.
Lacking better information, people often don’t use the best options, and sometimes unintentionally recreate existing tools. Time and money are wasted, reproducibility of research suffers, and funding agencies get a poor return on investment.
We believe a comprehensive software index could make it easier to find software by providing results grouped along different dimensions and containing specific details about each software resource. This would help users find software more effectively and compare alternatives more systematically.
Software catalogs are not new, but many have failed, either because they were too simplistic (providing incomplete or misleading content) or because they relied on humans to create and maintain entries. Humans don’t scale well; automation is the only feasible way of cataloging the vast and ever-growing number of constantly evolving software resources.
The Comprehensive and Automated Software Inventory Creation System (CASICS) is a project to create a proof of concept of such an automated index. CASICS uses machine learning techniques to analyze source code in software repositories such as GitHub.
The features are extracted from source code via language-aware parsers (for identifier names, libraries, doc strings, text strings, comments, etc.). Identifier expansion methods convert short identifiers (e.g., "readfromdb") to more meaningful strings ("read from database").
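The identifier expansion step described above can be illustrated with a minimal sketch. It is not the CASICS implementation: the vocabulary, the abbreviation table, and all function names are hypothetical, and a real system would need a far larger dictionary and smarter scoring. The sketch splits on explicit markers (underscores, digits, camelCase), segments unmarked runs such as "readfromdb" with a small dictionary, and then expands known abbreviations.

```python
import re

# Hypothetical, tiny vocabulary and abbreviation table, for illustration only.
VOCABULARY = {"read", "from", "db", "write", "to", "file", "get", "user", "id"}
ABBREVIATIONS = {"db": "database", "id": "identifier"}


def split_on_markers(identifier):
    """Split on underscores, digits, and camelCase boundaries; lowercase the result."""
    tokens = []
    for part in re.split(r"[_\d]+", identifier):
        # An uppercase run not followed by lowercase (e.g. "ID"), or a capitalized word.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [token.lower() for token in tokens]


def segment(token, vocabulary):
    """Split a token into the fewest vocabulary words; return [token] if none fit."""
    best = [None] * (len(token) + 1)  # best[i] = best split of token[:i]
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            word = token[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1] if best[-1] is not None else [token]


def expand_identifier(identifier):
    """Turn a source code identifier into a space-separated phrase."""
    words = []
    for token in split_on_markers(identifier):
        for word in segment(token, VOCABULARY):
            words.append(ABBREVIATIONS.get(word, word))
    return " ".join(words)


print(expand_identifier("readfromdb"))
print(expand_identifier("getUserID"))
```

Tokens that cannot be segmented with the vocabulary are passed through unchanged, so unknown identifiers degrade gracefully rather than being dropped.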
The features are used as input to supervised, hierarchical multi-label classifiers that label software with respect to predefined ontologies. We are using a combination of the Software Ontology (SWO), the Library of Congress Subject Headings, and custom ontologies.
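To make "hierarchical multi-label" concrete, here is a small sketch of one common way to turn per-term scores into labels that respect an ontology: a term is assigned only if it and all of its ancestors pass a threshold. The ontology fragment, the scores, and the threshold are invented for illustration; the scores stand in for the output of trained classifiers, and nothing here is claimed to be how CASICS does it.

```python
# Hypothetical ontology fragment: child -> parent (None marks a root term).
PARENT = {
    "software": None,
    "data management": "software",
    "database": "data management",
    "visualization": "software",
    "plotting": "visualization",
}


def children_of(parent_map):
    """Invert a child -> parent map into parent -> list of children."""
    children = {}
    for node, parent in parent_map.items():
        children.setdefault(parent, []).append(node)
    return children


def assign_labels(scores, parent_map, threshold=0.5):
    """Walk the ontology top-down, descending only through accepted terms."""
    children = children_of(parent_map)
    assigned = []
    frontier = list(children.get(None, []))
    while frontier:
        node = frontier.pop(0)
        if scores.get(node, 0.0) >= threshold:
            assigned.append(node)
            frontier.extend(children.get(node, []))
    return assigned


# Made-up scores, standing in for per-term classifier outputs.
scores = {
    "software": 0.99,
    "data management": 0.8,
    "database": 0.7,
    "visualization": 0.2,
    "plotting": 0.9,
}
print(assign_labels(scores, PARENT))
```

In this example "plotting" is rejected despite its high score, because its parent "visualization" falls below the threshold. This keeps every assigned label consistent with the hierarchy.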
Software can be placed in a hierarchical browser organized by the ontology. Search can be augmented with ontologies to recognize when users are looking for known types of features.
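The following sketch shows how ontology labels could support both browsing and search. The catalog entries, project names, and ontology fragment are hypothetical. Each entry is indexed under its labels and under every ancestor of those labels, so browsing a broad term lists everything beneath it, and a query that mentions a known ontology term retrieves the matching entries. The substring matching is deliberately naive.

```python
from collections import defaultdict

# Hypothetical ontology fragment: child -> parent (None marks a root term).
PARENT = {
    "software": None,
    "data management": "software",
    "database": "data management",
    "visualization": "software",
    "plotting": "visualization",
}

# Hypothetical catalog: project name -> ontology labels assigned to it.
CATALOG = {
    "example-sql-client": ["database"],
    "example-chart-lib": ["plotting"],
    "example-etl-tool": ["data management"],
}


def ancestors(term, parent_map):
    """Return the chain of ancestors of a term, nearest first."""
    chain = []
    parent = parent_map.get(term)
    while parent is not None:
        chain.append(parent)
        parent = parent_map.get(parent)
    return chain


def build_index(catalog, parent_map):
    """Index each entry under its labels and all ancestors of those labels."""
    index = defaultdict(set)
    for name, labels in catalog.items():
        for label in labels:
            index[label].add(name)
            for ancestor in ancestors(label, parent_map):
                index[ancestor].add(name)
    return index


def search(query, index):
    """Return entries filed under any ontology term mentioned in the query."""
    query = query.lower()
    hits = set()
    for term, names in index.items():
        if term in query:  # naive substring match, for illustration only
            hits |= names
    return sorted(hits)


index = build_index(CATALOG, PARENT)
print(sorted(index["software"]))                  # browse a top-level term
print(search("tools for data management", index))
```

Indexing under ancestors trades extra storage for simple lookups: browsing any level of the hierarchy becomes a single dictionary access.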
CASICS is still a work in progress, but we have developed many software components. We have also released some as independent Python packages that we hope other projects can use.
A separate page provides more information about the CASICS architecture, general principles, and how the different components interact with each other.
We developed CASICS with modularity and reuse in mind. Our GitHub organization for CASICS holds the repositories. There you can find Python packages such as Dassie (a database of terms in the Library of Congress Subject Headings), Nostril (a nonsense string detector) and Spiral (a library for splitting identifiers found in source code).