SemVer versioning: how we handled it with linear interval arithmetic
2021-09-28

SemVer versioning made it difficult to automate processing. We turned to linear interval arithmetic to come up with a unified, language-agnostic semantic versioning approach.

The semantic versioning (SemVer) specification can be
considered the de-facto standard for tracking the states of software during its
evolution. In practice, however, many languages and ecosystems practice their own
"SemVer versioning" rather than adopting the standard as-is, which has led to a
variety of semantic versioning flavors that are not necessarily compatible with
the original SemVer spec.

GitLab provides a Dependency Scanning (DS)
feature that automatically detects vulnerabilities in the dependencies of a
software project for a variety of different languages. DS relies on the
GitLab Advisory Database,
which is updated daily and provides information about
vulnerable packages expressed in the package-specific (native)
semantic version dialect. GitLab also recently launched an Open Source Edition of the GitLab Advisory Database.

At GitLab we use a semi-automated process for advisory generation: we extract
advisory data that includes package names and vulnerable versions from
data-sources such as NVD and generate advisories that
adhere to the GitLab advisory format before they are curated and stored in our
GitLab Advisory Database.

The plethora of SemVer versioning in the wild posed a major
challenge for the level of automation we could apply in the advisory generation
process: the different semantic version dialects prevented us from building
generic mechanisms around version matching, version verification (i.e., the
process of verifying whether or not versions are available on the relevant package
registry), fixed version inference, etc. Moreover, since advisory generation
requires us to extract and update advisory data at scale from data-sources with
hundreds of thousands of vulnerability entries, translating and/or verifying
versions by hand is not a viable, scalable solution.

Having a generic method to digest and process a variety of different SemVer versioning dialects was an important building block for automating large parts of the advisory generation process. This led to the development of
semver_dialects, a
utility that helps process semantic versions in a generic, language-agnostic manner. It
was recently open-sourced (MIT) and published on rubygems.org.

Understand the SemVer spec

The SemVer spec is the de-facto standard for tracking states of software projects during their evolution
by associating unique, comparable version numbers to distinct states, and by
encoding semantic properties into the semantic version strings so that a version
change implicitly conveys information about the nature of the change.

A semantic version consists of a prefix (version core) and a suffix that holds
pre-release and/or build information. A version core consists of three numeric
components that are delimited by a dot (.):

  • major: backwards-incompatible changes
  • minor: new backwards-compatible functionality
  • patch: backwards-compatible bug fixes

Consider a software project that uses SemVer and has two releases, 1.0.0 and
1.0.1. Just by looking at the semantic version strings,
it is clear that 1.0.1 is the newer release and 1.0.0 the older one.
In addition, the higher patch component signals that 1.0.1
is an improved state of the software: it fixes a bug
that was present in 1.0.0.
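The comparison described above can be sketched in a few lines of Ruby. This is only an illustration, not the semver_dialects implementation: parse_core is a hypothetical helper, and the sketch assumes both versions consist of a plain three-segment version core without a suffix.

```ruby
# Hypothetical helper: split a version core "MAJOR.MINOR.PATCH" into integers.
def parse_core(version)
  version.split('.').map { |segment| Integer(segment, 10) }
end

older = parse_core('1.0.0')
newer = parse_core('1.0.1')

# Array#<=> compares element by element, from left to right.
comparison = older <=> newer

# Name of the first component that differs between the two releases.
changed = %i[major minor patch].zip(older, newer).find { |_, a, b| a != b }&.first

puts comparison # -1, i.e. 1.0.0 sorts before 1.0.1
puts changed    # patch
```

Because only the patch component changed, a reader (or a tool) can conclude that the newer release contains backwards-compatible bug fixes.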

Semantic version processing is particularly useful in the context of Dependency Scanning (DS). DS is the process of automatically detecting (and potentially fixing)
vulnerabilities related to the dependencies of a software project: dependencies
of a software project are checked against a set of configuration files (so
called advisories) that contain information about vulnerable dependencies;
advisories usually include the versions of the vulnerable dependency.
Vulnerable versions are usually expressed in terms of version intervals: for example this out-of-bounds read vulnerability for the Python tensorflow package contains information about the vulnerable version by listing the four version intervals below:

  1. up to 2.1.4
  2. from 2.2.0 up to 2.2.3
  3. from 2.3.0 up to 2.3.3
  4. from 2.4.0 up to 2.4.2

While SemVer is very concise and clear about the syntax and semantics of
semantic versions, it does not specify how to express and represent semantic
version constraints. In addition, SemVer is purposefully simplistic to foster
its adoption. In practice, many ecosystems appear to have required features that
go beyond SemVer, which led to the development of many SemVer versioning flavours as well
as a variety of different native constraint matching syntaxes, some of which
deviate from the official SemVer specification. Depending on the ecosystem you
are working with, the same semantic version string may be interpreted
differently: for example, Maven and pip/PyPI treat version 1.2.3.SP
differently because pip/PyPI lacks the notion of an SP post release. Apart
from that, 1.2.3.SP cannot be considered a valid semantic version according
to the SemVer spec.

Today we have a variety of different semantic versioning schemes.

This SemVer versioning fragmentation limited the degree of automation we could apply to our
advisory extraction/generation process. This limitation motivated the
development of a methodology and a tool, semver_dialects, that help to digest and process semantic versions in a language-agnostic way and, hence, reduce the manual advisory curation effort.

Below, you can see an excerpt of the advisory information that is extracted and
generated by our semi-automated advisory generation process:

# ...
affected_range: ">=1.9,<=2.7.1||==2.8"
fixed_versions:
- "2.7.2"
- "2.8.1"
not_impacted: "All versions before 1.9, all versions after 2.7.1 before 2.8, all versions
  after 2.8"
solution: "Upgrade to versions 2.7.2, 2.8.1 or above."
# ...

In the excerpt above:

  • affected_range denotes the range of affected versions, expressed in the machine-readable, native syntax used by the package manager/registry (in this case PyPI).
  • fixed_versions denotes the concrete versions in which the vulnerability has been fixed.
  • not_impacted provides a textual description of the versions that are not affected.
  • solution provides information about how to remediate the vulnerability.
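To make the affected_range syntax from the excerpt concrete, here is a minimal Ruby sketch that evaluates it against a version. It assumes that "," means AND and "||" means OR, and that versions are purely numeric and dot-separated; the helper names (segments, compare, satisfies?, affected?) are hypothetical and not part of semver_dialects.

```ruby
# Hypothetical helper: turn "2.7.1" into [2, 7, 1].
def segments(version)
  version.split('.').map { |s| Integer(s, 10) }
end

# Compare two versions after padding the shorter one with zeros.
def compare(a, b)
  sa = segments(a)
  sb = segments(b)
  len = [sa.length, sb.length].max
  sa += [0] * (len - sa.length)
  sb += [0] * (len - sb.length)
  sa <=> sb
end

# Interpret the result of compare (-1, 0, 1) for a given operator.
def satisfies?(cmp, operator)
  case operator
  when '>=' then cmp >= 0
  when '<=' then cmp <= 0
  when '==' then cmp.zero?
  when '>'  then cmp.positive?
  when '<'  then cmp.negative?
  end
end

# "a,b||c" is read as (a AND b) OR c.
def affected?(version, range)
  range.split('||').any? do |conjunction|
    conjunction.split(',').all? do |constraint|
      operator, bound = constraint.strip.match(/\A(>=|<=|==|>|<)\s*(.+)\z/).captures
      satisfies?(compare(version, bound), operator)
    end
  end
end

range = '>=1.9,<=2.7.1||==2.8'
puts affected?('2.7.1', range) # true
puts affected?('2.7.2', range) # false
puts affected?('2.8', range)   # true
```

The results agree with the not_impacted description in the excerpt: 2.7.2 lies between the two affected ranges and is therefore not affected.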

To be able to extract and generate advisories like the one illustrated
above in a language/ecosystem agnostic way, we implemented and open-sourced a
generic semantic version representation and processing approach called
semver_dialects.

In the advisory excerpt above, the affected_range field contains the version
constraints in the native constraint syntax (in this case PyPI for Python).
fixed_versions can be inferred by inverting the affected_range (which yields
the non-affected versions) and by selecting, from the native package registry,
the first available version that falls into the range of non-affected versions; this step
requires our approach to be able to parse the native semantic version syntax.
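The inference step can be sketched as follows. This is one possible reading of the description above, not the actual implementation: the affected range is assumed to be already parsed into closed [lower, upper] pairs, the list of available versions is made up for illustration (a real run would query the package registry), and all helper names are hypothetical.

```ruby
# Hypothetical helper: numeric sort key, padded to three segments.
def key(version)
  segs = version.split('.').map { |s| Integer(s, 10) }
  segs + [0] * [3 - segs.length, 0].max
end

# True if the version lies inside any closed [low, high] interval.
def affected?(version, intervals)
  intervals.any? do |low, high|
    (key(low) <=> key(version)) <= 0 && (key(version) <=> key(high)) <= 0
  end
end

# For each affected interval, pick the first available version above its
# upper bound that is not affected by any interval.
def fixed_versions(intervals, available)
  sorted = available.sort_by { |v| key(v) }
  candidates = intervals.map do |_low, high|
    sorted.find { |v| (key(v) <=> key(high)).positive? && !affected?(v, intervals) }
  end
  candidates.compact.uniq
end

# ">=1.9,<=2.7.1||==2.8" expressed as closed intervals.
intervals = [%w[1.9 2.7.1], %w[2.8 2.8]]
# Illustrative registry contents (assumption).
available = %w[2.8.1 1.8 2.7.2 2.9 1.9 2.7.1 2.8]

puts fixed_versions(intervals, available).inspect # ["2.7.2", "2.8.1"]
```

With these assumed inputs the sketch reproduces the fixed_versions listed in the excerpt.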

In order to deal with SemVer versioning and automatically process and generate the fields according to this
description, our semver_dialects implementation had to satisfy the following requirements:

  1. Provide a unified interface to the language specific dialects.
  2. Match semantic versions in a language agnostic way.
  3. Invert ranges.
  4. Cope with scattered, non-consecutive ranges.
  5. Parse and produce different version syntaxes.
  6. Parse and match versions/constraints in a best-effort manner.

SemVer versioning representation

First, we need a generic representation of a semantic version to start with. We
assume that a semantic version is composed of prefix and suffix where the
prefix contains segments for major, minor and patch version components as defined in the
SemVer specification. The suffix may hold additional information about pre/post
releases etc. As illustrated below, the major, minor and patch prefix segments
can be accessed by means of the corresponding methods.

s1 = SemanticVersion.new('1.2.3')
puts "segments: #{s1}"
# segments: 1:2:3
puts "major #{s1.major}"
# major 1
puts "minor #{s1.minor}"
# minor 2
puts "patch #{s1.patch}"
# patch 3

We cannot generally assume that all versions we would like to process
fully adhere to the SemVer spec, which requires a version prefix (core) to
consist of three segments: major, minor and patch. Hence, by default, we
remove redundant, trailing zeros from the prefix to ensure that
2.0.0, 2.0 and 2 are considered identical.
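A minimal sketch of this trailing-zero removal, assuming a purely numeric prefix; strip_trailing_zeros is a hypothetical helper, not a semver_dialects method.

```ruby
# Hypothetical helper: drop redundant trailing zeros from a numeric prefix,
# always keeping at least one segment.
def strip_trailing_zeros(version)
  segs = version.split('.').map { |s| Integer(s, 10) }
  segs.pop while segs.length > 1 && segs.last.zero?
  segs
end

puts strip_trailing_zeros('2.0.0').inspect # [2]
puts strip_trailing_zeros('2.0').inspect   # [2]
puts strip_trailing_zeros('2').inspect     # [2]
puts strip_trailing_zeros('2.0.1').inspect # [2, 0, 1]
```

Only trailing zeros are dropped; a zero followed by a non-zero segment, as in 2.0.1, is significant and stays.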

semver_dialects translates language-specific version suffixes into numeric values. This process
can be described as version normalization. For example, the Maven (pre-)release
candidate version 2.0.0.RC1 can be translated to a numeric representation
with prefix 2 and suffix -1:1 by mapping RC to a numeric value (in this
example -1), thus rendering it numerically comparable.

After this normalization step, semantic version matching for two versions vA
and vB can be implemented by simply comparing their segments numerically in a
pairwise fashion. For unknown suffixes that cannot be mapped to the numeric
domain, we use lexical matching as a default fallback strategy.

In summary, comparing two semantic versions is a two-step process:

  1. Normalization: Extend both semantic versions to have the same prefix length and suffix
    lengths by appending zeros.
  2. Comparison: Iterate over segments and compare each of them numerically.

For example, after normalizing the versions 2.0.0.RC1 and 2.0.0 to 2:-1:1
and 2:0:0, respectively, we can iterate over the segments (delimited by
: in the example) which we can compare numerically to successfully identify
2:-1:1 as being the smaller (release-candidate) version in comparison to
2:0:0.
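The two steps can be sketched in Ruby as shown below. This is an illustration under stated assumptions rather than the semver_dialects code: only the RC mapping to -1 is taken from the example above, unknown suffix tokens are kept as strings and compared lexically, and the helper names (suffix_segment, normalize, compare_versions) are hypothetical.

```ruby
# Map one suffix token to a comparable value (assumption: only RC is known).
def suffix_segment(token)
  return Integer(token, 10) if token.match?(/\A\d+\z/)
  return -1 if token.casecmp?('rc')

  token # unknown suffix: kept as a string for the lexical fallback
end

# Split into numeric prefix and suffix, strip trailing zeros from the prefix.
def normalize(version)
  tokens = version.scan(/\d+|[A-Za-z]+/)
  prefix = tokens.take_while { |t| t.match?(/\A\d+\z/) }
  suffix = tokens.drop(prefix.length)
  prefix = prefix.map { |t| Integer(t, 10) }
  prefix.pop while prefix.length > 1 && prefix.last.zero?
  prefix + suffix.map { |t| suffix_segment(t) }
end

# Pad both versions with zeros to equal length, then compare pairwise.
def compare_versions(a, b)
  sa = normalize(a)
  sb = normalize(b)
  len = [sa.length, sb.length].max
  sa += [0] * (len - sa.length)
  sb += [0] * (len - sb.length)
  sa.zip(sb).each do |x, y|
    numeric = x.is_a?(Integer) && y.is_a?(Integer)
    cmp = numeric ? (x <=> y) : (x.to_s <=> y.to_s)
    return cmp unless cmp.zero?
  end
  0
end

puts normalize('2.0.0.RC1').join(':')        # 2:-1:1
puts compare_versions('2.0.0.RC1', '2.0.0')  # -1
puts compare_versions('2.0', '2')            # 0
```

In this sketch 2.0.0 is first reduced to the single segment 2 and then padded back to 2:0:0 during comparison, which matches the normalized form used in the example.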

Constraint syntax - everything is a linear interval

Translating semantic versions into a generic representation makes them
numerically comparable, which is already useful but not sufficient to express SemVer versioning constraints in a language-agnostic fashion.

For representing semantic version constraints in a generic way,
we rely on linear intervals. For the purpose of this blog post, we define an interval as an ordered pair of two
semantic versions, which we refer to as lower and upper
bounds (or cuts). For the sake of simplicity, for the remainder of
this section we will use simple integers as examples for lower and upper bounds.

Linear intervals capture semantic version ranges symbolically which makes them
very versatile and space efficient. At the same time, we can rely on
well-established mathematical models borrowed from linear interval arithmetic
that enable us to translate/express any type of constraint in terms of
mathematical set operations on intervals.
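As a rough illustration of the idea, the Ruby sketch below models an interval over integers with open or closed bounds and shows membership and inversion (complement). The Interval struct, its field names and the use of nil for an unbounded side are assumptions made for this example; they are not taken from semver_dialects.

```ruby
# Hypothetical interval model over integers; a nil bound means "unbounded".
Interval = Struct.new(:lower, :upper, :lower_closed, :upper_closed) do
  # True if x lies within the interval, honouring open/closed bounds.
  def cover?(x)
    above = lower.nil? || (lower_closed ? x >= lower : x > lower)
    below = upper.nil? || (upper_closed ? x <= upper : x < upper)
    above && below
  end

  # Complement on the number line: everything outside this interval.
  # A closed bound becomes open in the complement, and vice versa.
  def invert
    parts = []
    parts << Interval.new(nil, lower, false, !lower_closed) unless lower.nil?
    parts << Interval.new(upper, nil, !upper_closed, false) unless upper.nil?
    parts
  end
end

# Membership in a union of (possibly non-consecutive) intervals.
def member?(intervals, x)
  intervals.any? { |interval| interval.cover?(x) }
end

affected = Interval.new(3, 7, true, false) # [3, 7)
safe = affected.invert                     # (-inf, 3) and [7, +inf)

puts affected.cover?(3) # true
puts affected.cover?(7) # false
puts member?(safe, 2)   # true
puts member?(safe, 7)   # true
```

Representing the inverted range as a list of intervals also covers scattered, non-consecutive ranges, since a union of intervals is simply a longer list.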

In the table below you can find all the different types of intervals we
considered to model semantic version constraints and a corresponding
description where L stands for left, R stands for right with a and b
being the lower and upper bounds, respectively.

Type of interval Example Description
