Several projects inside the MTG are developing feature extraction from audio data. Each project uses its own set of features and its own feature container structures. Extraction algorithms are coupled to those data structures, so reusing extraction algorithms among projects has become a tricky task. Because CLAM's goals as a framework include maximizing this kind of reuse, it is clear that CLAM must provide a general solution to this problem.
Several partial solutions have emerged from different sources. Each one solves a different issue, so it is clear that a synthesis of them must be elaborated. The solutions to be synthesized are:
Those solutions have been analyzed in a separate document. Each one faces different aspects of the problem, such as:
Besides that, CLAM and other MTG projects have implemented a lot of code that performs concrete computations. That code has been analyzed in order to detect the kinds of dependencies, scopes and results it involves. A sketchy document with some notes about them has been compiled.
In order to extract information from audio, a lot of interdependent computations are needed. An extractor is an entity that performs a given computation by taking its dependencies as input. Most extractors perform the same computation repeatedly over a shifting context, as in the figure on the left. The sequence of contexts the extractor shifts over is called the scope. Thus, the information retrieved by an extractor is related to its context. For instance, most computations, like the one to get the Mel Cepstrum, use the frame scope; that means they are computed and stored once for each frame. We can say that the values taken from those computations are related to the frame scope. Different pieces of information have different scopes, and some of them may share the same one.
Besides being the place holder for the computed information, the context is also the reference point used to locate any data dependencies of the computation. For instance, in order to track the fundamental frequency for a frame, we should take the peak array of the context frame, and we may also want to take the results of the previous N frames in order to check the regularity of the fundamental.
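As an illustration only, a minimal sketch of such a context relative computation could look like this (all the types and names below are hypothetical, not existing CLAM code):

#include <vector>

// Sketch only: hypothetical types and names, not the actual CLAM API.
struct FramePeaks { std::vector<double> frequencies; };

// Computes the fundamental for 'currentFrame', taking two context
// relative dependencies: the peak array of the same frame, and the
// fundamentals already computed for the previous nPrevious frames.
double EstimateFundamental(const std::vector<FramePeaks>& peaks,
                           const std::vector<double>& previousFundamentals,
                           unsigned currentFrame, unsigned nPrevious)
{
    const FramePeaks& current = peaks[currentFrame];
    double candidate = current.frequencies.empty() ? 0.0 : current.frequencies[0];

    // Naive regularity check against the mean of the previous N frames.
    unsigned first = currentFrame > nPrevious ? currentFrame - nPrevious : 0;
    double sum = 0.0;
    for (unsigned i = first; i < currentFrame; ++i)
        sum += previousFundamentals[i];
    unsigned count = currentFrame - first;
    if (count != 0 && candidate != 0.0)
    {
        double mean = sum / count;
        if (candidate > 2.0 * mean || candidate < 0.5 * mean)
            candidate = mean; // discard an irregular jump
    }
    return candidate;
}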
Very often, the dependencies needed to compute information belonging to a scope are found outside that scope. Thus, mechanisms to relate contexts along the scopes must be provided. The existing kinds of scope relations and the ways to handle them will be addressed in the following sections.
A special case of computation should be taken into account: computations that just tell how the shifting contexts themselves are defined (segmentation, buffering...). Such segmentation does not have to be done on the time domain only. You can also take a 'segmentation' over a semantic domain, for example an instrument 'segmentation' when describing each melodic line in a song.
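For instance, a minimal sketch of a time domain context-defining computation, assuming a simple energy threshold and hypothetical names, could be:

#include <vector>

// Sketch only: a computation whose result is not a descriptor value but
// the definition of the shifting contexts themselves (note segments).
struct Segment { double begin, end; };

std::vector<Segment> SegmentateNotes(const std::vector<double>& frameEnergy,
                                     double threshold, double frameDuration)
{
    std::vector<Segment> notes;
    bool inside = false;
    double start = 0.0;
    for (unsigned i = 0; i < frameEnergy.size(); ++i)
    {
        if (frameEnergy[i] > threshold && !inside)
        {
            start = i * frameDuration;
            inside = true;
        }
        else if (frameEnergy[i] <= threshold && inside)
        {
            Segment note = { start, i * frameDuration };
            notes.push_back(note);
            inside = false;
        }
    }
    if (inside)
    {
        Segment note = { start, frameEnergy.size() * frameDuration };
        notes.push_back(note);
    }
    return notes;
}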
Summarizing, the computations we should support have the following properties:
One weakness of the current CLAM way of joining descriptors into data structures is that they are aggregated by computation origin. For example, if you want to compute a frame descriptor and you need the frame audio to calculate it, you place it in the AudioDescriptors object linked to the windowed audio of the frame, just because you need that Audio to calculate it. But many features that can be taken from an audio are not always applicable. For example, ADSR descriptors are taken from the audio, but they only make sense for the audio of a note segment, not for the audio of a frame.
The AudioClas approach suggests a simplification of this. Descriptors are aggregated by frame, regardless of whether they come from the audio, the frame spectrum, etc. What distinguishes a descriptor from the one calculated on the next audio frame is not the audio frame it is calculated on, but the frame index it refers to. Of course, there are more scopes than just frame and global, so the AudioClas model should be generalized.
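The difference can be sketched with two hypothetical layouts (illustrative only, not existing code):

// Current CLAM style: descriptors aggregated by computation origin,
// hanging from the data object they were computed from.
struct AudioDescriptors    { double zeroCrossingRate; };
struct SpectrumDescriptors { double spectralCentroid; };

// Generalized AudioClas style: everything computed for frame i is
// aggregated under the frame scope, whatever data it came from.
struct FrameDescriptors
{
    double zeroCrossingRate;  // computed from the frame audio
    double spectralCentroid;  // computed from the frame spectrum
};
// framePool[i] would then hold all the descriptors of frame i.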
The greatest drawback to sharing extraction algorithms is the data type coupling on the descriptor containers. Because every project has its own descriptors, there are conflicts on how to define the shared container structure. The Amadeus and AudioClas solutions solve this by using generic containers and accessing attributes by name.
Coupling on names still exists, but it is a minimal and needed coupling on a dependency. Coupling on the container structure is a coupling on side computations that a given project may not need to include.
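A minimal sketch of such name-based access, assuming a hypothetical FramePool class and attributes of a single type for brevity, could be:

#include <map>
#include <string>
#include <vector>

// Sketch only: the coupling between projects is reduced to the
// attribute names; no shared container structure is needed.
class FramePool
{
    std::map<std::string, std::vector<double> > _attributes;
public:
    void AddAttribute(const std::string& name, unsigned nFrames)
    {
        _attributes[name].resize(nFrames);
    }
    double& Value(const std::string& name, unsigned frame)
    {
        return _attributes[name][frame];
    }
};

// Usage: only the name "SpectralCentroid" couples the two projects.
// FramePool pool;
// pool.AddAttribute("SpectralCentroid", 100);
// pool.Value("SpectralCentroid", 12) = 440.0;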
The AudioClas solution has two data pools: a global pool and a frame pool. This is a scope based distinction. Other MTG projects will have more than two scopes, so we should generalize that solution.
The concepts we will deal with are represented in the figure on the right. A given scope definition is an aggregation of a set of attribute specifications, each one specifying its name, type, calculation algorithm, dependencies...
Pools are the data containers, the instances of a given scope definition. They contain values in a two dimensional table: the cross product between the shifting contexts and the attributes. Pools have a generic interface that uses the scope definition in order to access the concrete data.
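A minimal sketch of these two concepts, with hypothetical names and a single value type for brevity, could be:

#include <stdexcept>
#include <string>
#include <vector>

// Sketch only. A scope definition aggregates attribute specifications;
// a pool instantiates it as the cross product between the shifting
// contexts and the attributes.
struct AttributeSpec { std::string name, type; };

struct ScopeDefinition
{
    std::string scopeName;                  // e.g. "Frame"
    std::vector<AttributeSpec> attributes;  // e.g. Center, SpectralDistribution...
};

class Pool
{
    const ScopeDefinition& _definition;
    std::vector<std::vector<double> > _data;  // attributes x contexts
public:
    Pool(const ScopeDefinition& definition, unsigned nContexts)
        : _definition(definition)
        , _data(definition.attributes.size(), std::vector<double>(nContexts))
    {}
    // The generic interface resolves attribute names through the
    // scope definition instead of a project specific structure.
    double& Value(const std::string& attribute, unsigned context)
    {
        for (unsigned i = 0; i < _definition.attributes.size(); ++i)
            if (_definition.attributes[i].name == attribute)
                return _data[i][context];
        throw std::runtime_error("Unknown attribute: " + attribute);
    }
};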
Those 'scopes' are more general than the current temporal segments. For example, spectral bands may be another kind of scope if you want to attach a descriptor to each band. There can also be different scopes for different semantic spaces (channels, notes...), each one with its own descriptors and its own sequence of items.
The AudioClas approach only deals with scalar and vectorial TData descriptors. We can enhance this by providing a multi pool solution like the Amadeus DescriptionSet. Alternatively, we can use an approach like the one used for the DynamicType internal representation, adapting it for:
The proposal is:
Another problem related to having multiple types is how to deal with XML passivation. The Amadeus solution already solves this by dumping a data dictionary with descriptor names and types. We can do a similar thing, as we have a scope definition with all the related descriptor definitions ready to be dumped as a data dictionary.
Inside a given application, we may not need to passivate the data dictionary, as the mapping between types and names is supposed to be the same. The loading can then be done directly by the type specific methods of the attribute definitions.
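As an illustration, and without fixing any concrete format, the dumped document could look something like this (all element and attribute names below are illustrative):

<DescriptionDataPool>
    <!-- The data dictionary: one entry per attribute definition -->
    <DataDictionary>
        <Attribute scope="Frame" name="Center" type="SamplePosition" />
        <Attribute scope="Frame" name="SpectralCentroid" type="Float" />
    </DataDictionary>
    <!-- The pool values, decodable using the dictionary above -->
    <ScopePool scope="Frame" size="2">
        <AttributePool name="Center">256 512</AttributePool>
        <AttributePool name="SpectralCentroid">440.0 452.3</AttributePool>
    </ScopePool>
</DescriptionDataPool>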
We should use a scope division as a descriptor for a higher level scope space. For example, a melody is divided into phrases, and phrases are divided into notes. Those are hierarchical relations between scope spaces.
Scope divisions can also be orthogonal. That is, we can do a semantic division of the time domain of a song by structural parts, and we can also divide the song in time by any other independent criterion, which would have a different descriptor set. Moreover, we can divide the song not in time but on an instrument basis, in order to extract the score of each instrument. This is why I called it multidimensional: from a given point we can branch on different dimensions/criteria to generate a semantic sequence to be described.
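A hypothetical sketch of hierarchical and orthogonal divisions over a song (names are illustrative only):

#include <vector>

// Sketch only: a division is itself describable data, stored as an
// attribute of the higher level scope.
struct Division { std::vector<unsigned> boundaries; }; // indices into the divided sequence

struct SongDescription
{
    // Hierarchical: Melody -> Phrases -> Notes
    Division phrases;                        // divides the melody into phrases
    std::vector<Division> notesPerPhrase;    // divides each phrase into notes

    // Orthogonal: independent divisions of the same song, each one
    // branching on its own dimension/criterion and descriptor set.
    Division structuralParts;  // temporal: intro, verse, chorus...
    Division instruments;      // non temporal: one item per instrument
};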
There are some alternatives for the memory layout. Currently, the first one is implemented. A combination of them could be useful, but simpler is better, and such a combination will only be implemented if a strong requirement for it is found.
The following sections are about calculation and are still 'work-in-progress'.
The previous sections have explained the aspects concerning the structure and storage of descriptors once they are calculated.
Among the provided solutions there are two basic approaches. Well, indeed there are three, but one is a do-it-yourself approach and I won't take it into account.
On one hand, AudioClas uses a high level approach: it deals with descriptor dependencies by their names. This is good because dependencies are not hard coded but can be validated at runtime.
On the other hand, the Statistics approach is mainly low level, in the sense that it is not used to calculate named values, which would give them some special meaning, but arithmetical expressions and statistical functions over raw, meaningless data. Dependencies here are mainly arithmetical.
Whenever an extractor has to compute a given piece of information for a given context, it needs to get its input data from locations that are relative to that context. There are several ways of specifying this dependency.
Most of the alternatives below may take a convenience size parameter to indicate a contiguous range along the context. Because multiple dependencies can be specified, this size parameter is not strictly needed: specifying multiple dependencies is equivalent.
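A sketch of a possible dependency specifier, modelled after the XML example at the end of this document (all fields are illustrative):

#include <string>

// Sketch only: a dependency specifier locates the source data relative
// to the context the extractor is currently visiting.
struct DependencySpec
{
    std::string type;              // e.g. "Direct", "Relative", "Indirect"
    std::string scope;             // scope the data lives in, e.g. "Sample"
    std::string attribute;         // attribute to read, e.g. "Signal"
    int offset;                    // relative lookup: -1 would mean the previous context
    std::string indirectAttribute; // indirect lookup: attribute holding the target position
    unsigned size;                 // convenience contiguous range along the context
};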
Binding an extractor means telling it where its data comes from and where its data is to be stored. In order to tell where the data is to be stored, a scope and a field identifier should be specified. This binding also determines the scope the extractor will travel on.
In order to tell where the dependency data comes from, a dependency specifier should be provided, following the patterns in the previous section. The dependency specifier should have an associated type derived from the target dependency scope. Also, the extractor should have an interface type for each dependency slot. Their conformance can be checked not at run time but at configuration time.
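A hypothetical sketch of such a binding interface (not the actual CLAM API) could be:

#include <string>
#include <vector>

struct DependencySpec { std::string type, scope, attribute; }; // cut down from the sketch above

class Extractor
{
    std::string _targetScope, _targetAttribute;
    std::vector<DependencySpec> _dependencies;
public:
    // Where the result is stored; this also fixes the scope the
    // extractor travels on.
    void BindTarget(const std::string& scope, const std::string& attribute)
    {
        _targetScope = scope;
        _targetAttribute = attribute;
    }
    // Where a dependency slot takes its data from; conformance between
    // the specifier and the slot interface can be checked here, at
    // configuration time, instead of on every computation.
    void BindDependency(const DependencySpec& spec)
    {
        _dependencies.push_back(spec);
    }
};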
The intended usage of the interface is something like this. Note that the description scheme is loaded from an XML file.
CLAM::DescriptionScheme scheme("DescriptionScheme.xml");
scheme.addPlugin("DescriptionSchemeExtension.xml");
scheme.setParameter("FrameSize", 256);
CLAM::DescriptionDataPool pool(scheme);
pool.ExtractFrom("mysong.mp3");
CLAM::XmlStorage::Dump(pool, "Description.xml", "SimacDescription");
The description scheme should look something like this:
<Parameter name="FrameSize" type="Integer" units="SampleSize" />
...
<Attribute scope="Frame" name="Center" type="SamplePosition" />
<Attribute scope="Frame" name="SpectralDistribution" type="Spectrum" />
...
<Extractor name="" >
    <Target scope="Frame" attribute="SpectralDistribution" />
    <Dependency type="Indirect" scope="Sample" attribute="Signal" indirectAttribute="Center" size="$FrameSize" />
</Extractor>
...