Descriptors Computation and Storage in CLAM

The problem to solve

Several projects inside the MTG are developing feature extraction from audio data. Each project uses its own set of features and its own feature container structures. Extraction algorithms are coupled with those data structures, so reusing extraction algorithms among projects has become a tricky task. Because the CLAM goals as a framework include maximizing this kind of reuse, it is clear that CLAM must provide a general solution for this problem.

Several partial solutions have emerged from different sources. Each one solves a different issue, so a synthesized proposal must be elaborated from them.

Those solutions have been analyzed in a separate document, which discusses the different aspects of the problem each one addresses.

Besides that, CLAM and other MTG projects have already implemented a lot of code that performs concrete computations. That code has been analyzed in order to identify the kinds of dependencies, the scopes, and the kinds of results it produces. A sketchy document with notes about it has been compiled.

About computing descriptors

In order to extract information from audio, a lot of interdependent computations are needed. An extractor is an entity that performs a given computation, taking its dependencies as input. Most extractors perform the same computation repeatedly over a shifting context, as in the figure on the left. The sequence of contexts the extractor shifts over is called the scope, so the information retrieved by an extractor is related to a context. For instance, most computations, such as the one that obtains the Mel Cepstrum, use the frame scope; this means that they are computed and stored for each frame, and we can say that the values they produce are related to the frame scope. Different pieces of information have different scopes, and some of them may share one.

Besides being the place holder for the computed information, the context is also the reference point used to locate any data dependencies for the computation. For instance, in order to track the fundamental frequency for a frame, we take the peak array of that frame's context, and we may also want to take the results of the previous N frames to check the regularity of the fundamental.

Very often, the dependencies needed to compute information belonging to one scope are found outside that scope. Thus, mechanisms to relate contexts across scopes must be provided. The existing kinds of scope relations and the ways to handle them are addressed in the following sections.
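As an illustration, the following self-contained sketch (the class is hypothetical, not an existing CLAM extractor) shifts over the frame scope and, for each context, reads its dependency from the current frame and the N previous ones, in the spirit of the fundamental regularity example above:

#include <cmath>
#include <vector>

// Hypothetical sketch, not an existing CLAM class. For every frame context
// it computes how much the frame energy deviates from the mean energy of
// the previous 'nPreviousFrames' contexts.
class EnergyRegularityExtractor
{
public:
	explicit EnergyRegularityExtractor(unsigned nPreviousFrames)
		: mPreviousFrames(nPreviousFrames) {}

	// 'energies' plays the role of the dependency attribute: one value per
	// context of the frame scope. The result belongs to the same scope.
	std::vector<double> ComputeForAllContexts(const std::vector<double> & energies) const
	{
		std::vector<double> result(energies.size(), 0.0);
		for (unsigned context = 0; context < energies.size(); context++)
		{
			// Dependencies are located relative to the current context
			unsigned first = context > mPreviousFrames ? context - mPreviousFrames : 0;
			double mean = 0.0;
			for (unsigned i = first; i <= context; i++) mean += energies[i];
			mean /= (context - first + 1);
			result[context] = std::fabs(energies[context] - mean);
		}
		return result;
	}
private:
	unsigned mPreviousFrames;
};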

A special case of computation should also be taken into account: computations that just define how those shifting contexts are delimited (segmentation, bufferizing...). Segmentation does not have to be done in the time domain only; a 'segmentation' can also be taken over a semantic domain, for example an instrument 'segmentation' when describing each melodic line of a song.

Summarizing, the computations we should support shift over a sequence of contexts within a scope, may take their dependencies from other contexts and other scopes, and may themselves define how the contexts of a scope are delimited.

Glossary

Description context:
The target of a description task: the place a description value is related to; a concrete position, region or instance within a given description scope.
Also referred to as Context, for short.
Description scope:
A set of description contexts which share the same attributes. It often represents a dimension whose positions or regions are the contexts; it may also represent a class whose instances are the contexts.
Also referred to as Scope, for short.
Scope attribute:
A named value that every context in the scope has. Each attribute has a type that is enforced for every value in order to bind the extractors.
Also referred to as Attribute, for short.
Attribute value:
The value that an attribute takes when instantiated for a given description context.
Extractor:
A procedure that takes some data and computes the value of an attribute for each description context within a scope.
Extractor Binding:
The action of relating an extractor to its target attribute and its dependency sources.
Extractor Target Attribute:
The attribute that will be computed by a given bound extractor.
Extractor Target Context:
The context for which the extractor is computing the attribute value at a given moment during the whole computation.
Dependency:
The way an extractor gets its input values, specified relative to the current context.
Pool:
A container for description data.
ScopePool:
The pool that contains the description data for a single scope. A scope pool has a cardinality which defines the number of contexts in the scope.

The proposal on memory storage

Discriminating scope links and calculation dependencies

One weakness of the current CLAM way of joining descriptors into data structures is that they are aggregated by computation origin. For example, if you want to compute a frame descriptor and you need the frame audio to calculate it, you place it in the AudioDescriptors object linked to the windowed audio of the frame, just because you need that Audio to calculate it. But many features that can be taken from the audio are not always applicable. For example, ADSR descriptors are taken from the audio but only make sense for the audio of a note segment, not for the audio of a frame.

The AudioClas approach suggests a simplification of this: descriptors are aggregated by frame, regardless of whether they come from the audio, the frame spectrum... What distinguishes such a descriptor from the one calculated on the next audio frame is not that it is calculated on a different audio buffer but the frame index it is based on. Of course, there are more scopes than just frame and global, so the AudioClas model should be generalized.

Low coupling using attribute names

The greatest drawback for sharing extraction algorithms is the data type coupling on the descriptor containers. Because every project has its own descriptors, there are conflicts on how to define the shared container structure. The Amadeus and AudioClas solutions solve this by using generic containers and accessing attributes by name.

Coupling on names still exists, but it is a minimal and necessary coupling on a real dependency. Coupling on container structure, in contrast, is coupling on side computations that a given project may not need to include.
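As a minimal illustration of the difference (the types and names here are just stand-ins, not proposed interfaces):

#include <map>
#include <string>

// Coupling on container structure: every project has to compile this struct,
// including fields for computations it may never perform.
struct FrameDescriptors
{
	double spectralCentroid;
	double energy;
	// ... descriptors of every other project would have to be added here
};

// Coupling on names only: a generic container indexed by attribute name.
// A project only refers to the names of the attributes it actually uses.
typedef std::map<std::string, double> GenericFrameDescriptors;

void Example()
{
	GenericFrameDescriptors frame;
	frame["SpectralCentroid"] = 440.0; // no shared struct definition needed
	double centroid = frame["SpectralCentroid"];
	(void) centroid;
}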

Pools: Storing descriptors by scope kind

The AudioClas solution has two data pools: a global pool and a frame pool. This is a scope based distinction. Other MTG projects will have more than two scopes, so we should generalize that solution.

The concepts we will deal with are represented in the figure on the right. A given scope definition is an aggregation of a set of attribute specifications, each one specifying its name, type, calculation algorithm, dependencies...

Pools are the data containers, the instances of a given scope definition. They contain values in a two-dimensional table: the cross product between the shifting contexts and the attributes. Pools have a generic interface which uses the scope definition in order to access the concrete data.
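A possible shape for such a pool, as a sketch only (for simplicity every value is a double here; supporting several value types is addressed in the next section):

#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of a scope definition and its pool; not proposed names.
struct ScopeDefinition
{
	std::string scopeName;                   // e.g. "Frame"
	std::vector<std::string> attributeNames; // the attributes of the scope
};

class ScopePool
{
public:
	// 'nContexts' is the cardinality of the scope (e.g. the number of frames)
	ScopePool(const ScopeDefinition & definition, unsigned nContexts)
	{
		for (unsigned i = 0; i < definition.attributeNames.size(); i++)
			mTable[definition.attributeNames[i]].resize(nContexts);
	}
	// One cell of the contexts x attributes table
	double & Value(const std::string & attribute, unsigned context)
	{
		return mTable[attribute][context];
	}
private:
	std::map<std::string, std::vector<double> > mTable; // attribute -> one value per context
};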

Those 'scopes' are more general than the current temporal segments. For example, spectral bands may be another kind of scope if you want to attach a descriptor to each band. There can also be different scopes for different semantic spaces (channels, notes...), each with its own descriptors and its own sequence of items.

Allow multiple types on storage

The AudioClas approach only deals with scalar and vectorial TData descriptors. We can enhance this by providing a multi-pool solution like the Amadeus DescriptionSet. Alternatively, we can use an approach like the one used for the DynamicType internal representation, adapted to this purpose.

The proposal is to combine both ideas: a generic container where each attribute keeps its own typed storage, described by the scope definition.
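A minimal sketch of how such per-attribute typed storage could be arranged (illustrative only; these are not proposed class names, and a DynamicType-like raw buffer would be an alternative layout):

#include <vector>

// Hypothetical sketch: a common base class plus one templated pool per
// attribute type lets a single scope pool hold attributes of different
// types (TData scalars, vectors, Spectrum objects...).
class AbstractAttributePool
{
public:
	virtual ~AbstractAttributePool() {}
	virtual unsigned Size() const = 0;
};

template <typename AttributeType>
class AttributePool : public AbstractAttributePool
{
public:
	explicit AttributePool(unsigned nContexts) : mValues(nContexts) {}
	unsigned Size() const { return (unsigned) mValues.size(); }
	AttributeType & Value(unsigned context) { return mValues[context]; }
private:
	std::vector<AttributeType> mValues;
};

// A scope pool would keep one AbstractAttributePool per attribute and cast
// back to the concrete AttributePool<AttributeType> when an attribute is
// accessed by name, checking the type declared in the scope definition.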

Scopes and Pools on XML

Another problem related to having multiple types is how to deal with XML passivation. The Amadeus solution already solves this by dumping a data dictionary with the descriptor names and types. We can do a similar thing, since we have a scope definition containing all the related descriptor definitions that can be dumped as a data dictionary.

Inside a given application we may not need to passivate the data dictionary, since the mapping between names and types is supposed to be the same. The loading can then be done directly through the type specific methods of the attribute definitions.

Scope relations

We should be able to use a scope division as a descriptor of a higher level scope. For example, a melody is divided into phrases and phrases are divided into notes. Those are hierarchical relations between scope spaces.

Scope divisions can also be orthogonal. That is, we can make a semantic division of the time domain of a song into structural parts, and we can also divide the song in time according to some other, independent criterion, which should have a different descriptor set. Moreover, we can divide the song not in time but on an instrument basis, in order to extract the score for each instrument. This is why we call it multidimensional: from a given point we can branch along different dimensions/criteria to generate a semantic sequence to be described.
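One simple way of representing such relations, sketched here with hypothetical names, is to let a higher level scope store, as ordinary attributes, the range of lower level contexts that each of its contexts covers:

#include <vector>

// Hypothetical sketch: a hierarchical scope relation expressed as attributes
// of the higher level scope. Each context of the "Phrase" scope stores the
// range of "Note" contexts that the phrase covers. An orthogonal division
// would simply be another scope holding its own, independent ranges.
struct PhraseScopeAttributes
{
	std::vector<unsigned> firstNote; // one value per phrase context
	std::vector<unsigned> lastNote;  // one value per phrase context
};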

Memory layout

There are some alternatives for the memory layout. Currently, only the first one is implemented. A combination of them could be useful, but simpler is better, and such a combination will only be implemented if a strong requirement is found.

The proposal on dependency and computation

The following sections are about calculation and are still work in progress.

The previous sections have covered the aspects concerning how descriptors are stored and structured once calculated.

Providing a way to specify dependencies (using scope way paths)

Among the provided solutions there are two basic approaches. Well, in fact there are three, but one is do-it-yourself and we will not take it into account.

On one hand, AudioClas uses a high level approach: it handles descriptor dependencies by their names. This is good because dependencies are not hard coded but can be validated at runtime.

On the other hand, the Statistics approach is mainly low level, in the sense that it is not used to calculate named values, which would give them a special meaning, but arithmetical expressions and statistical functions over raw, meaningless data. Dependencies here are mainly arithmetical.

Kinds of dependencies

Whenever an extractor has to compute a given piece of information for a given context, it needs to get that information from locations that are relative to this context. There are several ways of specifying this dependency.

Most of the alternatives below may take a convenience size parameter to indicate a contiguous range along the context. Because multiple dependencies can be declared, this size parameter is not strictly needed: specifying multiple dependencies is equivalent.
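As an illustration, the following hypothetical struct mirrors the fields used by the <Dependency> element in the example description scheme at the end of this document:

#include <string>

// Hypothetical sketch of a dependency specifier; the field names mirror the
// <Dependency> element of the example description scheme below.
struct DependencySpec
{
	std::string type;              // e.g. "Indirect": locate the data through another attribute
	std::string scope;             // scope the input data lives in, e.g. "Sample"
	std::string attribute;         // attribute to read, e.g. "Signal"
	std::string indirectAttribute; // attribute of the target context giving the position, e.g. "Center"
	std::string size;              // optional contiguous range, possibly a parameter such as "$FrameSize"
};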

Extractor binding

Binding an extractor means telling it where the data comes from and where the data is to be stored. In order to tell where the data is to be stored, a scope and a field identifier should be specified. This binding also specifies the scope the extractor will travel along.

In order to tell where the dependency data comes from, a dependency specifier should be provided, following the patterns of the previous section. The dependency specifier should have an associated type derived from the target dependency scope. Also, the extractor should have an interface type for each dependency slot. Their conformance can be checked at runtime, but at configuration time rather than during the computation.
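The following self-contained sketch illustrates the binding step; the class and method names are assumptions, not a defined API:

#include <iostream>
#include <string>
#include <vector>

// Hypothetical sketch of the binding step: the extractor records its target
// attribute and one source specification per dependency slot, so that the
// bindings can be validated at configuration time, before any computation.
class BoundExtractor
{
public:
	void BindTarget(const std::string & scope, const std::string & attribute)
	{
		mTargetScope = scope;
		mTargetAttribute = attribute;
	}
	void BindDependency(const std::string & slot, const std::string & scope,
		const std::string & attribute, int relativeContext)
	{
		mDependencies.push_back(slot + " <- " + scope + ":" + attribute);
		(void) relativeContext; // offset relative to the target context
	}
	bool IsProperlyBound() const
	{
		// A real implementation would also check the attribute types declared
		// in the scope definitions against the types of the dependency slots.
		return !mTargetScope.empty() && !mDependencies.empty();
	}
private:
	std::string mTargetScope, mTargetAttribute;
	std::vector<std::string> mDependencies;
};

int main()
{
	BoundExtractor fundamentalTracker;
	fundamentalTracker.BindTarget("Frame", "Fundamental");
	fundamentalTracker.BindDependency("Peaks", "Frame", "SpectralPeaks", 0);
	fundamentalTracker.BindDependency("PreviousFundamental", "Frame", "Fundamental", -1);
	if (!fundamentalTracker.IsProperlyBound())
		std::cerr << "configuration error: extractor not properly bound" << std::endl;
	return 0;
}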

Goal API

The intended usage interface is something like this. Note that the description scheme is loaded from an XML file.

CLAM::DescriptionScheme scheme("DescriptionScheme.xml");
scheme.addPlugin("DescriptionSchemeExtension.xml");
scheme.setParameter("FrameSize",256);
CLAM::DescriptionDataPool pool(scheme);
pool.ExtractFrom("mysong.mp3");
CLAM::XmlStorage::Dump(pool,"Description.xml","SimacDescription");
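The idea behind this sketch is that the scheme declares the scopes, attributes and extractors, the pool is allocated according to that scheme, and ExtractFrom runs the bound extractors over the audio file so that the filled pool can finally be dumped to XML.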

The description scheme should look something like this:

<Parameter name="FrameSize" type="Integer" units="SampleSize" />
...
<Attribute scope="Frame" name="Center" type="SamplePosition" />
<Attribute scope="Frame" name="SpectralDistribution" type="Spectrum" />
....
<Extractor name="" >
	<Target scope="Frame" attribute="SpectralDistribution" />
	<Dependency
		type="Indirect"
		scope="Sample"
		attribute="Signal"
		indirectAttribute="Center"
		size = "$FrameSize"
	/>
</Extractor>
...