Several projects inside the MTG are developing feature extraction from audio data. Each project uses its own set of features and its own feature container structures. Extraction algorithms are coupled to those data structures, so reusing extraction algorithms among projects has become a tricky task. Because CLAM's goals as a framework include maximizing this kind of reuse, it is clear that CLAM must provide a general solution to this problem.
Several partial solutions have emerged from different sources. Each one solves a different issue, so it is clear that a synthesis of them must be elaborated. The solutions to be synthesized are:
Those solutions have been analyzed in a separate document. Each one faces different aspects of the problem, such as:
Besides that, CLAM and other MTG projects have implemented a lot of code that performs concrete computations. That code has been analyzed in order to detect the kinds of dependencies, scopes and results it involves. A sketchy document with some notes about them has been compiled.
In order to extract information from audio, a lot of interdependent computations are needed. An extractor is an entity that performs a given computation by taking its dependencies as input. Most extractors perform the same computation repeatedly over a shifting context, as in the figure on the left. The sequence of contexts the extractor shifts over is called the scope. Thus, the information retrieved by an extractor is related to its context. For instance, most computations, like the one to get the Mel Cepstrum, use the frame scope; that means they are computed and stored once for each frame. We can say that the values taken from those computations are related to the frame scope. Different pieces of information have different scopes, and some of them may share the same one.
Besides being the place holder for the computed information, the context is also the reference point used to locate any data dependencies of the computation. For instance, in order to track the fundamental frequency for a frame, we should take the peak array of the context frame, and we may also want to take the results of the previous N frames in order to check the regularity of the fundamental.
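As an illustration only, a minimal sketch of such a context relative computation could look like this (all the types and names below are hypothetical, not existing CLAM code):

#include <vector>

// Sketch only: hypothetical types and names, not the actual CLAM API.
struct FramePeaks { std::vector<double> frequencies; };

// Computes the fundamental for 'currentFrame', taking two context
// relative dependencies: the peak array of the same frame, and the
// fundamentals already computed for the previous nPrevious frames.
double EstimateFundamental(const std::vector<FramePeaks>& peaks,
                           const std::vector<double>& previousFundamentals,
                           unsigned currentFrame, unsigned nPrevious)
{
    const FramePeaks& current = peaks[currentFrame];
    double candidate = current.frequencies.empty() ? 0.0 : current.frequencies[0];

    // Naive regularity check against the mean of the previous N frames.
    unsigned first = currentFrame > nPrevious ? currentFrame - nPrevious : 0;
    double sum = 0.0;
    for (unsigned i = first; i < currentFrame; ++i)
        sum += previousFundamentals[i];
    unsigned count = currentFrame - first;
    if (count != 0 && candidate != 0.0)
    {
        double mean = sum / count;
        if (candidate > 2.0 * mean || candidate < 0.5 * mean)
            candidate = mean; // discard an irregular jump
    }
    return candidate;
}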
Very often, the dependencies needed to compute information belonging to a scope are found outside that scope. Thus, mechanisms to relate contexts along the scopes must be provided. The existing kinds of scope relations and the ways to handle them will be addressed in the following sections.
A special case of computation should be taken into account: computations that just tell how the shifting contexts themselves are defined (segmentation, buffering...). Such segmentation does not have to be done on the time domain only. You can also take a 'segmentation' over a semantic domain, for example an instrument 'segmentation' when describing each melodic line in a song.
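For instance, a minimal sketch of a time domain context-defining computation, assuming a simple energy threshold and hypothetical names, could be:

#include <vector>

// Sketch only: a computation whose result is not a descriptor value but
// the definition of the shifting contexts themselves (note segments).
struct Segment { double begin, end; };

std::vector<Segment> SegmentateNotes(const std::vector<double>& frameEnergy,
                                     double threshold, double frameDuration)
{
    std::vector<Segment> notes;
    bool inside = false;
    double start = 0.0;
    for (unsigned i = 0; i < frameEnergy.size(); ++i)
    {
        if (frameEnergy[i] > threshold && !inside)
        {
            start = i * frameDuration;
            inside = true;
        }
        else if (frameEnergy[i] <= threshold && inside)
        {
            Segment note = { start, i * frameDuration };
            notes.push_back(note);
            inside = false;
        }
    }
    if (inside)
    {
        Segment note = { start, frameEnergy.size() * frameDuration };
        notes.push_back(note);
    }
    return notes;
}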
Summarizing, the computations we should support have the following properties:
One weakness of the current CLAM way of joining descriptors into data structures is that they are aggregated by computation origin. For example, if you want to compute a frame descriptor and you need the frame audio to calculate it, you place it in the AudioDescriptors object linked to the windowed audio of the frame, just because you need that Audio to calculate it. But many features that can be taken from an audio are not always applicable. For example, ADSR descriptors are taken from the audio, but they only make sense for the audio of a note segment, not for the audio of a frame.
The AudioClas approach suggests a simplification of this. Descriptors are aggregated by frame, regardless of whether they come from the audio, the frame spectrum, etc. What distinguishes a descriptor from the one calculated on the next audio frame is not the audio frame it is calculated on, but the frame index it refers to. Of course, there are more scopes than just frame and global, so the AudioClas model should be generalized.
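The difference can be sketched with two hypothetical layouts (illustrative only, not existing code):

// Current CLAM style: descriptors aggregated by computation origin,
// hanging from the data object they were computed from.
struct AudioDescriptors    { double zeroCrossingRate; };
struct SpectrumDescriptors { double spectralCentroid; };

// Generalized AudioClas style: everything computed for frame i is
// aggregated under the frame scope, whatever data it came from.
struct FrameDescriptors
{
    double zeroCrossingRate;  // computed from the frame audio
    double spectralCentroid;  // computed from the frame spectrum
};
// framePool[i] would then hold all the descriptors of frame i.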
The greatest drawback to sharing extraction algorithms is the data type coupling on the descriptor containers. Because every project has its own descriptors, there are conflicts on how to define the shared container structure. The Amadeus and AudioClas solutions solve this by using generic containers and accessing attributes by name.
Coupling on names still exists, but it is a minimal and needed coupling on a dependency. Coupling on the container structure is a coupling on side computations that a given project may not need to include.
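A minimal sketch of such name-based access, assuming a hypothetical FramePool class and attributes of a single type for brevity, could be:

#include <map>
#include <string>
#include <vector>

// Sketch only: the coupling between projects is reduced to the
// attribute names; no shared container structure is needed.
class FramePool
{
    std::map<std::string, std::vector<double> > _attributes;
public:
    void AddAttribute(const std::string& name, unsigned nFrames)
    {
        _attributes[name].resize(nFrames);
    }
    double& Value(const std::string& name, unsigned frame)
    {
        return _attributes[name][frame];
    }
};

// Usage: only the name "SpectralCentroid" couples the two projects.
// FramePool pool;
// pool.AddAttribute("SpectralCentroid", 100);
// pool.Value("SpectralCentroid", 12) = 440.0;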
The AudioClas solution has two data pools: a global pool and a frame pool. This is a scope based distinction. Other MTG projects will have more than two scopes, so we should generalize that solution.
The concepts we will deal with are represented in the figure on the right. A given scope definition is an aggregation of a set of attribute specifications, each one specifying its name, type, calculation algorithm, dependencies...
Pools are the data containers, the instances of a given scope definition. They contain values in a two dimensional table: the cross product between the shifting contexts and the attributes. Pools have a generic interface that uses the scope definition in order to access the concrete data.
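A minimal sketch of these two concepts, with hypothetical names and a single value type for brevity, could be:

#include <stdexcept>
#include <string>
#include <vector>

// Sketch only. A scope definition aggregates attribute specifications;
// a pool instantiates it as the cross product between the shifting
// contexts and the attributes.
struct AttributeSpec { std::string name, type; };

struct ScopeDefinition
{
    std::string scopeName;                  // e.g. "Frame"
    std::vector<AttributeSpec> attributes;  // e.g. Center, SpectralDistribution...
};

class Pool
{
    const ScopeDefinition& _definition;
    std::vector<std::vector<double> > _data;  // attributes x contexts
public:
    Pool(const ScopeDefinition& definition, unsigned nContexts)
        : _definition(definition)
        , _data(definition.attributes.size(), std::vector<double>(nContexts))
    {}
    // The generic interface resolves attribute names through the
    // scope definition instead of a project specific structure.
    double& Value(const std::string& attribute, unsigned context)
    {
        for (unsigned i = 0; i < _definition.attributes.size(); ++i)
            if (_definition.attributes[i].name == attribute)
                return _data[i][context];
        throw std::runtime_error("Unknown attribute: " + attribute);
    }
};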
Those 'scopes' are more general than the current temporal segments. For example, spectral bands may be another kind of scope if you want to attach a descriptor to each band. There can also be different scopes for different semantic spaces (channels, notes...), each one with its own descriptors and its own sequence of items.
The AudioClas approach only deals with scalar and vectorial TData descriptors. We can enhance this by providing a multi pool solution like the Amadeus DescriptionSet. Alternatively, we can use an approach like the one used for the DynamicType internal representation, adapting it for:
The proposal is:
Another problem related to having multiple types is how to deal with XML passivation. The Amadeus solution already solves this by dumping a data dictionary with descriptor names and types. We can do a similar thing, as we have a scope definition with all the related descriptor definitions ready to be dumped as a data dictionary.
Inside a given application, we may not need to passivate the data dictionary, as the mapping between types and names is supposed to be the same. The loading can then be done directly by the type specific methods of the attribute definitions.
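As an illustration, and without fixing any concrete format, the dumped document could look something like this (all element and attribute names below are illustrative):

<DescriptionDataPool>
    <!-- The data dictionary: one entry per attribute definition -->
    <DataDictionary>
        <Attribute scope="Frame" name="Center" type="SamplePosition" />
        <Attribute scope="Frame" name="SpectralCentroid" type="Float" />
    </DataDictionary>
    <!-- The pool values, decodable using the dictionary above -->
    <ScopePool scope="Frame" size="2">
        <AttributePool name="Center">256 512</AttributePool>
        <AttributePool name="SpectralCentroid">440.0 452.3</AttributePool>
    </ScopePool>
</DescriptionDataPool>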
We should use a scope division as a descriptor for a higher level scope space. For example, a melody is divided into phrases, and phrases are divided into notes. Those are hierarchical relations between scope spaces.
Scope divisions can also be orthogonal. That is, we can do a semantic division of the time domain of a song by structural parts, and we can also divide the song in time by any other independent criterion, which would have a different descriptor set. Moreover, we can divide the song not in time but on an instrument basis, in order to extract the score of each instrument. This is why I called it multidimensional: from a given point we can branch on different dimensions/criteria to generate a semantic sequence to be described.
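A hypothetical sketch of hierarchical and orthogonal divisions over a song (names are illustrative only):

#include <vector>

// Sketch only: a division is itself describable data, stored as an
// attribute of the higher level scope.
struct Division { std::vector<unsigned> boundaries; }; // indices into the divided sequence

struct SongDescription
{
    // Hierarchical: Melody -> Phrases -> Notes
    Division phrases;                        // divides the melody into phrases
    std::vector<Division> notesPerPhrase;    // divides each phrase into notes

    // Orthogonal: independent divisions of the same song, each one
    // branching on its own dimension/criterion and descriptor set.
    Division structuralParts;  // temporal: intro, verse, chorus...
    Division instruments;      // non temporal: one item per instrument
};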
There are some alternatives for the memory layout. Currently, the first one is implemented. A combination of them could be useful, but simpler is better, and such a combination will only be implemented if a strong requirement for it is found.
The following sections are about calculation and are still 'work-in-progress'.
The previous sections have explained the aspects concerning the structure and storage of descriptors once they are calculated.
Among the provided solutions there are two basic approaches. Well, indeed there are three, but one is a do-it-yourself approach and I won't take it into account.
On one hand, AudioClas uses a high level approach: it deals with descriptor dependencies by their names. This is good because dependencies are not hard coded but can be validated at runtime.
On the other hand, the Statistics approach is mainly low level, in the sense that it is not used to calculate named values, which would give them some special meaning, but arithmetical expressions and statistical functions over raw, meaningless data. Dependencies here are mainly arithmetical.
Whenever an extractor has to compute a given piece of information for a given context, it needs to get its input data from locations that are relative to that context. There are several ways of specifying this dependency.
Most of the alternatives below may take a convenience size parameter to indicate a contiguous range along the context. Because multiple dependencies can be specified, this size parameter is not strictly needed: specifying multiple dependencies is equivalent.
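A sketch of a possible dependency specifier, modelled after the XML example at the end of this document (all fields are illustrative):

#include <string>

// Sketch only: a dependency specifier locates the source data relative
// to the context the extractor is currently visiting.
struct DependencySpec
{
    std::string type;              // e.g. "Direct", "Relative", "Indirect"
    std::string scope;             // scope the data lives in, e.g. "Sample"
    std::string attribute;         // attribute to read, e.g. "Signal"
    int offset;                    // relative lookup: -1 would mean the previous context
    std::string indirectAttribute; // indirect lookup: attribute holding the target position
    unsigned size;                 // convenience contiguous range along the context
};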
Binding an extractor means telling it where its data comes from and where its data is to be stored. In order to tell where the data is to be stored, a scope and a field identifier should be specified. This binding also determines the scope the extractor will travel on.
In order to tell where the dependency data comes from, a dependency specifier should be provided, following the patterns in the previous section. The dependency specifier should have an associated type derived from the target dependency scope. Also, the extractor should have an interface type for each dependency slot. Their conformance can be checked not at run time but at configuration time.
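A hypothetical sketch of such a binding interface (not the actual CLAM API) could be:

#include <string>
#include <vector>

struct DependencySpec { std::string type, scope, attribute; }; // cut down from the sketch above

class Extractor
{
    std::string _targetScope, _targetAttribute;
    std::vector<DependencySpec> _dependencies;
public:
    // Where the result is stored; this also fixes the scope the
    // extractor travels on.
    void BindTarget(const std::string& scope, const std::string& attribute)
    {
        _targetScope = scope;
        _targetAttribute = attribute;
    }
    // Where a dependency slot takes its data from; conformance between
    // the specifier and the slot interface can be checked here, at
    // configuration time, instead of on every computation.
    void BindDependency(const DependencySpec& spec)
    {
        _dependencies.push_back(spec);
    }
};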
The intended usage of the interface is something like this. Note that the description scheme is loaded from an XML file.
CLAM::DescriptionScheme scheme("DescriptionScheme.xml");
scheme.addPlugin("DescriptionSchemeExtension.xml");
scheme.setParameter("FrameSize", 256);
CLAM::DescriptionDataPool pool(scheme);
pool.ExtractFrom("mysong.mp3");
CLAM::XmlStorage::Dump(pool, "Description.xml", "SimacDescription");
The description scheme should look something like this:
<Parameter name="FrameSize" type="Integer" units="SampleSize" />
...
<Attribute scope="Frame" name="Center" type="SamplePosition" />
<Attribute scope="Frame" name="SpectralDistribution" type="Spectrum" />
...
<Extractor name="" >
    <Target scope="Frame" attribute="SpectralDistribution" />
    <Dependency type="Indirect" scope="Sample" attribute="Signal" indirectAttribute="Center" size="$FrameSize" />
</Extractor>
...