

Overview of the CASICS system architecture

The goal of the Comprehensive and Automated Software Inventory Creation System (CASICS) project is to develop machine learning techniques to analyze source code in software repositories. The basic approach is this:

  • Collect software source code from open-source repositories
  • Extract features from source code files to obtain identifiers, documentation strings, comments, and other text, and use identifier expansion methods to convert short identifiers into more meaningful strings
  • Use the features as input into supervised, hierarchical multi-label classifiers to label software with respect to predefined ontologies
  • Use the ontologies to (1) present the classified results in an organized, hierarchical browser view and (2) support more intelligent software search

The following diagram illustrates the main software components in the overall CASICS scheme.

The modular components can run on separate computers and communicate over TCP/IP network connections using the MongoDB API, the Python Pyro4 API, direct file system access, or a REST API (notably, GitHub's REST API).

  • Collector: a package that interacts with a repository hosting service such as GitHub and collects information about software projects. For each project, it extracts the name, owner, description, "readme" file (if any), list of files at the top level, and other metadata, and stores them in the CASICS Database.
  • Downloader: a package that takes a list of GitHub repositories and downloads the files to a local filesystem. This permits analysis tools to work on local copies of the source files.
  • Extractor: a package that extracts identifiers, text and other features from source code files. It uses language-aware parsers to extract elements intelligently. For example, it extracts Python imports, class names, function names, variable names, comments, and other elements as separate lists of each kind, so that machine learning methods can treat them differently.
  • CASICS Database: a MongoDB-based database that stores the metadata extracted from the code repositories. The other modules in CASICS communicate with the database for their various needs.
  • Annotator: a browser-based annotation interface used by CASICS annotators to add ontology terms to repository entries in the database. The CASICS Annotator is written in a combination of Python and JavaScript.
  • Analyzers: multiple packages that each perform some kind of inference using source code files and repository metadata.
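
To make the Extractor's idea of "separate lists of each kind" concrete, here is a minimal sketch for Python sources using only the standard library. It is not the CASICS Extractor's actual code; the function name `extract_features` and the dictionary keys are assumptions chosen for illustration.

```python
import ast
import io
import tokenize


def extract_features(source: str) -> dict[str, list[str]]:
    """Collect imports, class names, function names, variable names,
    and comments from Python source text as separate lists.

    Hypothetical helper for illustration; not part of CASICS.
    """
    features = {
        "imports": [],
        "classes": [],
        "functions": [],
        "variables": [],
        "comments": [],
    }

    # Walk the syntax tree to pick out named program elements.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            features["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Relative imports such as "from . import x" have no module name.
            features["imports"].append(node.module or ".")
        elif isinstance(node, ast.ClassDef):
            features["classes"].append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            features["functions"].append(node.name)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            features["variables"].append(node.id)

    # Comments are not part of the syntax tree, so read them from the tokens.
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.COMMENT:
            features["comments"].append(tok.string.lstrip("# ").rstrip())

    return features


SAMPLE = '''
import os
from collections import OrderedDict

# Cache of parsed repositories
class RepoCache:
    def lookup(self, repo_id):
        result = os.path.join("cache", repo_id)
        return result
'''

if __name__ == "__main__":
    for kind, names in extract_features(SAMPLE).items():
        print(kind, names)
```

Keeping each kind of element in its own list lets a downstream classifier weight, say, import names differently from free-text comments.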

As part of developing CASICS, we also developed some independent software libraries that can be used for other purposes and projects:

  • Dassie (Library of Congress Terms): a database of terms from the Library of Congress Subject Headings (LCSH) controlled vocabulary. We converted a copy of the LCSH terms into a MongoDB database that makes explicit the "is-a" relationships between LCSH terms. Dassie is a system that allows other programs to use normal MongoDB network API calls to search for LCSH terms and their relationships.
  • Nostril (Nonsense String Evaluator): a Python module that infers whether a given medium-length string of characters is likely to be random gibberish or something meaningful. Its main use is to decide whether short strings returned by source code mining methods are likely to be meaningful (for example, program identifiers) or are instead random characters or other non-identifier strings.
  • Spiral (SPlitters for IdentifieRs: A Library): a Python library of functions for splitting identifiers found in source code files.
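
As a rough illustration of what identifier splitting involves, the following sketch splits on underscores, camel-case boundaries, and digit runs with a single regular expression. It is a simplified stand-in written for this overview, not Spiral's API; the name `split_identifier` is hypothetical.

```python
import re

# Three kinds of token, tried in order at each position:
#   1. a run of capitals not followed by a lowercase letter (acronyms: "HTTP")
#   2. an optional capital followed by lowercase letters ("get", "Response")
#   3. a run of digits ("2")
# Underscores and other separators match nothing and are skipped.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(identifier: str) -> list[str]:
    """Split a source code identifier into its component tokens.

    Hypothetical helper for illustration; not part of Spiral.
    """
    return _TOKEN.findall(identifier)


if __name__ == "__main__":
    for name in ("getHTTPResponse", "max_retry_count2", "XMLParser"):
        print(name, split_identifier(name))
```

A purely pattern-based splitter like this one cannot separate identifiers written without case changes or separators (for example, `readme`), which is one reason a dedicated library is useful.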

We have also been developing new ontologies for areas where we could not find existing ontologies with suitable terms or sufficient breadth:

  • Sofiont (Software Interface Ontology): an ontology that provides terms for both human interface types and programmatic (API) interface types.