Planet CLAM is a window into the world, work and lives of CLAM hackers and contributors. The planet is open to any blog feed that occasionally relates to CLAM or its sibling projects, such as TestFarm. Ask on the devel list to get in.

PLANET CLAM

September 07, 2013

AP-Gen new release (LADSPA and VST support)

AP-Gen speeds up and eases plugin development by generating base source code for different standards and operating systems, so that the developer can focus on the actual goal: the digital audio processing. To achieve this, it starts from normalized … Continue reading

July 30, 2013

VST cross compiling in Linux

1. Install mingw32 and wine:
   $ sudo apt-get install mingw32
   $ sudo apt-get install wine
2. Download Steinberg VST SDK 2.4 and unzip it.
3. Create a PLUGIN_NAME.def file:
   LIBRARY     ''
   DESCRIPTION ''
   EXPORTS     main=VSTPluginMain
4. … Continue reading

July 23, 2013

Recommendations as Personalized Learning to Rank

As I have explained in other publications such as the Netflix Techblog, ranking is a very important part of a Recommender System. Although the Netflix Prize focused on rating prediction, ranking is in most cases a much better formulation for the recommendation problem. In this post I give some more motivation, and an introduction to the problem of personalized learning to rank, with pointers to some solutions. The post is motivated, among other things, by a proposal I sent for a tutorial at this year's Recsys. Coincidentally, my former colleagues in Telefonica, who have been working on learning to rank for some time, proposed a very similar one. I encourage you to use this post as an introduction to their tutorial, which you should definitely attend.

The goal of a ranking system is to find the best possible ordering of a set of items for a user, within a specific context, in real-time. We optimize ranking algorithms to give the highest scores to titles that a member is most likely to play and enjoy.

If you are looking for a ranking function that optimizes consumption, an obvious baseline is item popularity. The reason is clear: on average, a user is most likely to like what most others like. Think of the following situation: You walk into a room full of people you know nothing about, and you are asked to prepare a list of ten books each person likes. You will get $10 for each book you guess right. Of course, your best bet in this case would be to prepare identical lists with the "10 most liked books in recent times". Chances are the people in the room are a fair sample of the overall population, and you end up making some money.

However, popularity is the opposite of personalization. As the previous example shows, it will produce the same ordering of items for every member. The goal becomes to find a personalized ranking function that is better than item popularity, so we can better satisfy users with varying tastes. Our goal is to recommend the items that each user is most likely to enjoy. One way to approach this is to ask users to rate a few titles they have read in the past in order to build a rating prediction component. Then, we can use the user's predicted rating of each item as an adjunct to item popularity. Using predicted ratings on their own as a ranking function can lead to items that are too niche or unfamiliar, and can exclude items that the user would want to watch even though they may not rate them highly. To compensate for this, rather than using either popularity or predicted rating on their own, we would like to produce rankings that balance both of these aspects. At this point, we are ready to build a ranking prediction model using these two features.

Let us start with a very simple scoring approach by choosing our ranking function to be a linear combination of popularity and predicted rating. This gives an equation of the form score(u,v) = w1 p(v) + w2 r(u,v) + b, where u=user, v=video item, p=popularity and r=predicted rating. This equation defines a two-dimensional space such as the one depicted in the following figure.


Once we have such a function, we can pass a set of videos through our function and sort them in descending order according to the score. First, though, we need to determine the weights w1 and w2 in our model (the bias b is constant and thus ends up not affecting the final ordering). We can formulate this as a machine learning problem: select positive and negative examples from your historical data and let a machine learning algorithm learn the weights that optimize our goal. This family of machine learning problems is known as "Learning to Rank" and is central to application scenarios such as search engines or ad targeting. A crucial difference in the case of ranked recommendations is the importance of personalization: we do not expect a global notion of relevance, but rather look for ways of optimizing a personalized model.
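As a minimal sketch of the scoring and sorting step described above, the following snippet applies the linear score to a handful of candidates for one user. The weights, the item ids and the feature values are made up for illustration; in practice the weights would be learned from historical data.

```javascript
// Hypothetical weights; in a real system they are learned, not hand-picked.
const weights = { w1: 0.3, w2: 0.7, b: 0.0 };

// score(u, v) = w1 * p(v) + w2 * r(u, v) + b, for a fixed user u
function score(item, w) {
    return w.w1 * item.popularity + w.w2 * item.predictedRating + w.b;
}

// Returns a new array sorted by descending score; the input is left untouched.
function rank(items, w) {
    return items.slice().sort(function (a, b) {
        return score(b, w) - score(a, w);
    });
}

// Made-up candidates: popularity p(v) and predicted rating r(u, v), both in [0, 1]
const candidates = [
    { id: "A", popularity: 0.9, predictedRating: 0.2 },
    { id: "B", popularity: 0.4, predictedRating: 0.8 },
    { id: "C", popularity: 0.1, predictedRating: 0.5 }
];
const ranked = rank(candidates, weights).map(function (v) { return v.id; });
```

Changing only the bias b shifts every score by the same amount, so the resulting order stays the same.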

As you might guess, the previous two-dimensional model is a very basic baseline. Apart from popularity and rating prediction, you can think of adding all kinds of features related to the user, the item, or the user-item pair. Below you can see a graph showing the improvement we have seen at Netflix after adding many different features and optimizing the models.


The traditional pointwise approach to learning to rank described above treats ranking as a simple binary classification problem where the only inputs are positive and negative examples. Typical models used in this context include Logistic Regression, Support Vector Machines, Random Forests or Gradient Boosted Decision Trees.

There is a growing research effort in finding better approaches to ranking. The pairwise approach to ranking, for instance, optimizes a loss function defined on pairwise preferences from the user. The goal is to minimize the number of inversions in the resulting ranking. Once we have reformulated the problem this way, we can transform it back into the previous binary classification problem. Examples of such an approach are RankSVM [Chapelle and Keerthi, 2010, Efficient algorithms for ranking with SVMs], RankBoost [Freund et al., 2003, An efficient boosting algorithm for combining preferences], or RankNet [Burges et al., 2005, Learning to rank using gradient descent].
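To make the pairwise idea more concrete, here is a small sketch with two hypothetical helpers. countInversions counts how many stated preferences a ranking violates. toBinaryExamples shows one common way of turning preferences back into a binary classification problem, by using feature differences as examples. The item ids and feature vectors are invented for illustration, and this is not the exact procedure of any of the cited methods.

```javascript
// ranking: item ids, best first. preferences: list of [better, worse] id pairs.
function countInversions(ranking, preferences) {
    const position = new Map();
    ranking.forEach(function (id, index) { position.set(id, index); });
    let inversions = 0;
    preferences.forEach(function (pair) {
        const better = position.get(pair[0]);
        const worse = position.get(pair[1]);
        // An inversion: the preferred item is ranked below the other one
        if (better !== undefined && worse !== undefined && better > worse) {
            inversions += 1;
        }
    });
    return inversions;
}

// Each preference yields a positive example (better minus worse)
// and a negative example (worse minus better).
function toBinaryExamples(features, preferences) {
    const examples = [];
    preferences.forEach(function (pair) {
        const better = features[pair[0]];
        const worse = features[pair[1]];
        const diff = better.map(function (x, i) { return x - worse[i]; });
        examples.push({ x: diff, label: 1 });
        examples.push({ x: diff.map(function (x) { return -x; }), label: 0 });
    });
    return examples;
}

const preferences = [["B", "A"], ["A", "C"], ["B", "C"]];
const inversions = countInversions(["A", "B", "C"], preferences);
const examples = toBinaryExamples({ A: [1, 2], B: [3, 1], C: [0, 0] }, preferences);
```

Any binary classifier can then be trained on examples, and its decision function can be used as the scoring function of the ranker.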

We can also try to directly optimize the ranking of the whole list by using a listwise approach. RankCosine [Xia et al., 2008. Listwise approach to learning to rank: theory and algorithm], for example, uses similarity between the ranking list and the ground truth as a loss function. ListNet [Cao et al., 2007. Learning to rank: From pairwise approach to listwise approach] uses KL-divergence as loss function by defining a probability distribution. RankALS [Takacs and Tikk. 2012. Alternating least squares for personalized ranking] is a recent approach that defines an objective function that directly includes the ranking optimization and then uses Alternating Least Squares (ALS) for optimizing.

Whatever ranking approach we use, we need to use rank-specific information retrieval metrics to measure the performance of the model. Some of those metrics include Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG), Mean Reciprocal Rank (MRR), or Fraction of Concordant Pairs (FCP). What we would ideally like to do is to directly optimize those same metrics. However, it is hard to optimize machine-learned models directly on these measures since they are not differentiable and standard methods such as gradient descent or ALS cannot be directly applied. In order to optimize those metrics, some methods find a smoothed version of the objective function to run Gradient Descent. CLiMF optimizes MRR [Shi et al. 2012. CLiMF: learning to maximize reciprocal rank with collaborative less-is-more filtering], and TFMAP [Shi et al. 2012. TFMAP: optimizing MAP for top-n context-aware recommendation] optimizes MAP in a similar way. The same authors have very recently added a third variation in which they use a similar approach to optimize "graded relevance" domains such as ratings [Shi et al., "Gapfm: Optimal Top-N Recommendations for Graded Relevance Domains"]. AdaRank [Xu and Li. 2007. AdaRank: a boosting algorithm for information retrieval] uses boosting to optimize NDCG. Another method to optimize NDCG is NDCG-Boost [Valizadegan et al. 2009. Learning to Rank by Optimizing NDCG Measure], which optimizes expectation of NDCG over all possible permutations. SVM-MAP [Xu et al. 2008. Directly optimizing evaluation measures in learning to rank] relaxes the MAP metric by adding it to the SVM constraints. It is even possible to directly optimize the non-differentiable IR metrics by using techniques such as Genetic Programming, Simulated Annealing [Karimzadehgan et al. 2011. A stochastic learning-to-rank algorithm and its application to contextual advertising], or even Particle Swarming [Diaz-Aviles et al. 2012. Swarming to rank for recommender systems].
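For reference, two of the metrics mentioned above can be sketched in a few lines. The snippet uses a common formulation of each one (reciprocal rank of the first relevant item, and DCG with a 2^rel - 1 gain and a log2 discount); other variants exist, and the function names are just illustrative. Both metrics depend on the scores only through the sorted positions, which is why they are not differentiable with respect to the model parameters.

```javascript
// relevance: relevance of the items in ranked order (0 = not relevant).
// Reciprocal rank: 1 / position of the first relevant item, or 0 if there is none.
function reciprocalRank(relevance) {
    const first = relevance.findIndex(function (rel) { return rel > 0; });
    return first === -1 ? 0 : 1 / (first + 1);
}

// MRR: mean of the reciprocal rank over the ranked lists of all users.
function meanReciprocalRank(lists) {
    const total = lists.reduce(function (sum, relevance) {
        return sum + reciprocalRank(relevance);
    }, 0);
    return lists.length === 0 ? 0 : total / lists.length;
}

// Discounted cumulative gain with gain 2^rel - 1 and discount log2(position + 1)
function dcg(relevance) {
    return relevance.reduce(function (sum, rel, i) {
        return sum + (Math.pow(2, rel) - 1) / Math.log2(i + 2);
    }, 0);
}

// NDCG: DCG normalized by the DCG of the ideal (descending relevance) ordering
function ndcg(relevance) {
    const ideal = relevance.slice().sort(function (a, b) { return b - a; });
    const idealDcg = dcg(ideal);
    return idealDcg === 0 ? 0 : dcg(relevance) / idealDcg;
}

const mrr = meanReciprocalRank([[0, 0, 1, 0], [1, 0, 0, 0]]); // (1/3 + 1) / 2
const perfect = ndcg([3, 2, 1]);                              // already ideal: 1
```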

As I mentioned at the beginning of the post, the traditional formulation for the recommender problem was that of a rating prediction. However, learning to rank offers a much better formal framework in most contexts. There is a lot of interesting research happening in this area, but it is definitely worthwhile for more researchers to focus their efforts on what is a very real and practical problem where one can have a great impact.

July 22, 2013

Reasons to not use locks: Priority inversion and general purpose vs realtime OS

“Let's say your GUI thread is holding a shared lock when the audio callback runs. In order for your audio callback to return the buffer on time it first needs to wait for your GUI thread to release the lock. … Continue reading

July 09, 2013

The Bla Face

My latest experiments involved animated SVGs and webapps for mobile devices (FirefoxOS…). They also scratch the surface of the HTML5 audio tag.

The result is this irritating application: The Bla Face. A talking head that stares around, blinks and speaks the ‘bla’ language.

Take a look at it and read more if you are interested in how it was done.

Animating Inkscape illustrations

I drew the SVG face as an example for an Inkscape course I was teaching as a volunteer at a women’s association in my town. This was to show the students that, once you have a vector drawing, it is quite easy to animate it like a puppet. I just moved the parts directly in Inkscape, for example, moving the nodes of the mouth, or moving the pupils.

Playing with that is quite fun, but the truth is that, although the SVG standard provides means to automate animations, and the Internet is full of examples and documentation on how to do it, it must be done either by changing the XML (SMIL, CSS) or by programming with JavaScript. There is no native SVG FLOSS authoring tool available that I know of. In fact, the state of the art would be something like this:

  • Synfig: Full interface to animate. It imports and exports SVGs, but the animation is not native SVG and you pay the price.
  • Tupi: Promising interface concept. It works with SVG, but not at the internal level. It still needs work.
  • Sozi and JessyInk: Although they just animate the viewport, not the figures, and their authoring UI is quite pedestrian, I do like how they integrate the animation into the SVG output.
  • A blueprint exists on how to make animations inside Inkscape. It was written some years ago and it is still there.

So if I want to animate the face I have to code some SMIL/JavaScript. Not something that I could teach my current students but, at least, let’s use it as a means to learn webapp development. Hands on.

Embedding svg into HTML5, different ways unified.

The web is full of references on the different ways to insert an SVG inside HTML5. Just to learn how it works I tried most of them. I discarded the img method, which blocks access to the DOM, and the embed method, which is deprecated.

Inline SVG

The first method consists of inserting the SVG inline into the HTML5. It has the drawback that every time you edit the SVG in Inkscape you have to copy the changes over. No problem, there are many techniques to insert it dynamically. I used an idiom that I had already used for plots in TestFarm, and that I like a lot: a class of div that emulates an img with a src attribute.

<!-- Method one: Inline SVG (dynamically inserted) -->
<div
    id='faceit'
    class='loadsvg'
    src='blaface.svg'
    ></div>

Calling the following function (which requires JQuery) takes all such div tags and uses their src attributes to dynamically load the SVG.

/// For every .loadsvg, loads SVG file specified by the 'src' attribute
function loadsvgs()
{
    $.each($(".loadsvg"), function() {
        // Declared with var so that it does not leak as a global
        var xhr = new XMLHttpRequest();
        // Third argument false: synchronous request, blocks until loaded
        xhr.open("GET", $(this).attr('src'), false);
        // Following line is just to be on the safe side;
        // not needed if your server delivers SVG with correct MIME type
        xhr.overrideMimeType("image/svg+xml");
        xhr.send("");
        $(this).prepend(
            xhr.responseXML.documentElement);
    });
}

In this case, the document used to create new elements is the HTML root, that is, document, and you can get the root SVG node by looking up “#faceit > svg”.

Object

The second method is the object tag.

<object
    id='faceit'
    data="blaface.svg"
    type="image/svg+xml"
    ></object>

It is cleaner, since it does not need any additional JavaScript to load. When using object, the root SVG element is not even inside the HTML DOM. You have to look up the #faceit element and access its contentDocument attribute, which is a DOM document itself. Because they are different DOM documents, new SVG elements cannot be created from the HTML document as we did previously.

This couple of functions will abstract this complexity from the rest of the code:

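// SVG namespace used by createElementNS
// (declared here in case it is not already defined elsewhere)
var svgns = "http://www.w3.org/2000/svg";
function svgRoot()
{
    var container = $(document).find("#faceit")[0];
    // For object, iframe and embed: the SVG lives in its own document
    if (container.contentDocument)
        return container.contentDocument;
    // Inline SVG: return the children of the container
    return $(container).children();
}
function svgNew(elementType)
{
    var svg = svgRoot();
    try {
        return svg.createElementNS(svgns, elementType);
    }
    catch(e) {
        // When svg is inline, no svg document, use the html document
        return document.createElementNS(svgns, elementType);
    }
}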

iframe

I don’t like the iframe solution that much because, instead of adapting automatically to the size of the image, you have to set the size by hand, and the image gets clipped if you set it wrong. But it works in older browsers and it is not deprecated like embed:

<iframe
    id='faceit'
    src="blaface.svg"
    type="image/svg+xml"
    height='350px'
    width='250px'
    style='border: none; text-align:center;'
    ></iframe>

You can also play with the SVG view port to get the SVG resized, without losing proportions.

In terms of JavaScript, the same code that works for object works for iframe.

css

The CSS part of the head makes them look the same whatever the method.

Animating the eye pupils

Before doing any animation, my advice: change the automatic ids of the SVG objects to be animated into something nice. You can use object properties dialog or the XML view in Inkscape.

Eye pupils can be moved to stare around randomly. Both pupils have been grouped so that moving such group, #eyepupils, is enough. The JavaScript code that moves it follows:

var previousGlance = '0,0';
function glance()
{
    var svg = svgRoot();
    var eyes = $(svg).find("#eyepupils");
    var eyesanimation = $(eyes).find("#eyesanimation")[0];

    if (eyesanimation === undefined)
    {
        eyesanimation = svgNew("animateMotion");
        $(eyesanimation).attr({
            'id': 'eyesanimation',
            'begin': 'indefinite', // Required to trigger it at will
            'dur': '0.3s',
            'fill': 'freeze',
            });
        $(eyes).append(eyesanimation);
    }
    var x = Math.random()*15-7;
    var y = Math.random()*10-5;
    var currentGlance = [x,y].join(',');
    $(eyesanimation).attr('path', "M "+previousGlance+" L "+currentGlance);
    previousGlance = currentGlance;
    eyesanimation.beginElement();

    var nextGlance = Math.random()*1000+4000;
    window.setTimeout(glance, nextGlance);
}
glance();

So the strategy is to introduce an animateMotion element into the group, or to reuse the previous one, set the motion path, trigger the animation and schedule the next glance.

Animating mouth and eyelids

To animate the eyelids and the mouth, instead of moving an object we have to move the control nodes of a path. Control nodes are not first-class citizens in SVG; they are encoded in a compact format as the string value of the d attribute of the path. I added the following function to convert structured JS data into such a string:

function encodePath(path)
{
    return path.map(function(e) {
        if ($.isArray(e)) return e.join(",");
        return e;
        }).join(" ");
}

With this helper, it becomes handy to write simple functions that return parametrized variations of a given object. For instance, to have a mouth path with a parametrized opening factor:

function mouthPath(openness)
{
    return encodePath([
        "M",
        [173.28125, 249.5],
        "L",
        [71.5625, 250.8125],
        "C",
        [81.799543, 251.14273],
        [103.83158, 253.0+openness], // Incoming tangent
        [121.25, 253.0+openness], // Mid lower point
        "C",
        [138.66843, 253.0+openness], // Outgoing tangent
        [160.7326, 251.48139],
        [173.28125, 249.5],
        "z"
    ]);
}

And to apply it:

$(svgRoot()).find("#mouth").attr("d", mouthPath(20));
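To see what the encoded string looks like, here is a small sketch of the same helper written without JQuery (using Array.isArray instead of $.isArray, which is my substitution), applied to a short made-up path:

```javascript
// Same idea as encodePath above, but with no JQuery dependency
function encodePathPlain(path) {
    return path.map(function (e) {
        // Coordinate pairs become "x,y"; command letters are kept as they are
        return Array.isArray(e) ? e.join(",") : e;
    }).join(" ");
}

// A made-up path: move to (10,20), line to (30,40.5), close the path
const d = encodePathPlain(["M", [10, 20], "L", [30, 40.5], "z"]);
// d is "M 10,20 L 30,40.5 z"
```

The resulting string can be assigned directly to the d attribute of a path element.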

But if we want a smooth animation we should insert an attribute animation. For example, if we want to smoothly open and close the mouth as if saying ‘bla’, the function would be quite similar to the one for the eye pupils, but now we use an animate element instead of animateMotion, we specify the attributeName, and instead of providing a movement path in the path attribute we provide in values a sequence of paths, separated by semicolons, to morph along.

function bla()
{
    var svg = svgRoot();
    var mouth = $(svg).find("#mouth");
    var blaanimation = $(mouth).find("#blaanimation")[0];
    if (blaanimation === undefined)
    {
        blaanimation = svgNew("animate");
        $(blaanimation).attr({
            'attributeName': 'd',
            'id': 'blaanimation',
            'begin': 'indefinite',
            'dur': 0.3,
            });
        $(mouth).append(blaanimation);
    }
    var syllable = [
        mouthPath(0),
        mouthPath(10),
        mouthPath(0),
        ].join(";");
    $(blaanimation)
        .off()
        .attr('values', syllable)
        ;
    blaanimation.beginElement();
    sayBla(); // Triggers the audio
    var nextBla = Math.random()*2000+600;
    window.setTimeout(bla, nextBla);
}

The actual code is quite a bit more complicated because it makes words of many syllables (bla’s) and tries to synchronize the lip movement with the audio. First of all, it sets the repeatCount attribute to a random number between 1 and 4.

    var syllables = Math.floor(Math.random()*4)+1;
    $(blaanimation)
        .off()
        .attr('values', syllable)
        .attr('repeatCount', syllables)
        ;

And then spacing them proportional to the word length:

    var wordseconds = (syllables+1)*0.3;
    var nextBla = Math.random()*2000+wordseconds*1000;
    window.setTimeout(bla, nextBla);

Regarding the lip sync, sayBla is defined like this:

function sayBla()
{
    var blaaudio = $("#blaaudio")[0];
    blaaudio.pause();
    blaaudio.currentTime=0;
    blaaudio.play();
}

So the smart move would be adding a handler to the repeat event of the animation. But this does not seem to work on Chrome. Instead, we rely on a timer again.

    for (var i=1; i<syllables; i++)
        window.setTimeout(sayBla, i*0.3*1000);
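The timing arithmetic behind those timers can be isolated in a pure function, which makes it easy to check without a browser. This is only a sketch: syllableOffsets and wordMilliseconds are hypothetical helpers, and 0.3 seconds is the animation duration used above.

```javascript
// Start offset in milliseconds of each syllable of a word
function syllableOffsets(syllables, syllableSeconds) {
    const offsets = [];
    for (let i = 0; i < syllables; i++) {
        // Rounded to avoid floating point noise in the products
        offsets.push(Math.round(i * syllableSeconds * 1000));
    }
    return offsets;
}

// Time reserved for a word: its syllables plus one extra syllable of pause
function wordMilliseconds(syllables, syllableSeconds) {
    return Math.round((syllables + 1) * syllableSeconds * 1000);
}

const offsets = syllableOffsets(3, 0.3);    // [0, 300, 600]
const duration = wordMilliseconds(3, 0.3);  // 1200
```

The first offset is 0 because the first syllable is voiced by the direct call to sayBla, which is why the loop above starts at i=1.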

When animating the eyelids, more browser issues pop up. The eyelid on one eye is an inverted and displaced clone of the other. Firefox won’t apply JavaScript-triggered animations to clones. If you set the values without animation, they work; if the animations are triggered by the begin attribute, they work; but if you trigger an animation with beginElement, it won’t work.

User interface and FirefoxOS integration

Flashy buttons and checkboxes, panel dialogs that get hidden, the debug log side panel… All that is CSSery that I tried to keep simple enough so that it can be pulled out. So just take a look at the CSS.

As I said, besides SVG animation I wanted to learn webapp development for FirefoxOS. My first glance at the environment as a developer has been a mix of good and bad impressions. On one side, using Linux + Gecko as the engine for the whole system is quite smart. On the other, the simulator is clearly an alpha that eats many computer resources. Anyway, let’s see how it evolves.

In this project I tried to minimize the use of libraries, just using [requirejs] (a library dependency solver) and [Zepto] (a reduced JQuery), because the minimal Firefox example already provides them. But there is a wide ecology of libraries that everybody uses. The next thing to investigate is how to work with VoloJs to deploy projects, and that wide ecology of available libraries.

You have many JQuery-like foundation frameworks such as Prototype, Underscore, Backbone… Then you have libraries for user interface components such as Dojo, JQuery Mobile, React, YUI, Hammer, w2ui, m-project… Too many to know which one to use.

June 10, 2013

Tools for Data Processing @ Webscale


A couple of days ago, I attended the Analytics @Webscale workshop at Facebook. I found this workshop to be very interesting from a technical perspective. This conference was mostly organized by Facebook Engineering, but they invited LinkedIn, and Twitter to present, and the result was pretty balanced. I think the presentations, though biased to what the 3 "Social Giants" do, were a good summary of many of the problems webscale companies face when dealing with Big Data. It is interesting to see how similar problems can be solved in different ways. I recently described how we address at Netflix many of these issues in our Netflix Techblog. It is also interesting to see how much sharing and interaction there is nowadays in the infrastructure space, with companies releasing most of what they do as open source, and using - and even building upon - what their main business competitors have created.

These are my barely edited notes:

Twitter presented several components in their infrastructure. They use Thrift on HDFS to store their logs. They have now built Twitter Parquet, a columnar storage format that improves storage efficiency by allowing columns to be read one at a time.

@squarecog talking about Parquet

They also presented their DAL (Data Access Layer), built on top of HCatalog.




Of course, they also talked about Twitter Storm, which is their approach to distributed/nearline computation. Every time I hear about Storm it sounds better. Storm now supports different parts of their production algorithms. For example, the ranking and scoring of tweets for real-time search is based on a Storm topology.



Finally, they also presented a new tool called Summingbird. It is not open sourced yet, but they are planning on doing so soon. Summingbird is a DSL on top of Scalding that allows defining workflows that integrate offline batch processing from Hadoop and near-line processing from Storm.




LinkedIn also talked about their approach to combining offline/near-line/real-time computation, although I always get the sense that they lean much more towards the former. They talked about three main tools: Kafka, their publish-subscribe system; Azkaban, a batch job scheduler we have talked about using in the past; and Espresso, a timeline-consistent NoSQL database.


Facebook also presented their whole stack. Some known tools, some not so much. Facebook Scuba is a distributed in-memory stats store that allows them to read distributed logs and query them fast. Facebook Presto was a new tool presented as the solution to get fast queries out of Exabyte-scale data stores. The sentence "A good day for me is when I can run 6 Hive queries" supposedly attributed to a FB data scientist stuck in my mind ;-). Morse is a different distributed approach to fast in-memory data loading. And, Puma/ptail is a different approach to "tailing" logs, in this case into HBase. 



Another Facebook tool that was mentioned by all three companies is Giraph. (To be fair, Giraph was started at Yahoo, but Facebook hired its creator, Avery Ching.) Giraph is a graph-based distributed computation framework that works on top of Hadoop. Facebook claims they ran a Page Rank on a graph with a trillion edges on 200 machines in less than 6 minutes/iteration. Giraph is another alternative to Graphlab. Both LinkedIn and Twitter are using it. In the case of Twitter, it is interesting to hear that they now prefer it to their own in-house (although single-node) Cassovary. It will be interesting to see all these graph processing tools side by side in this year's Graphlab workshop.

Another interesting thread I heard from different speakers as well as coffee-break discussions was the use of Mesos vs. Yarn or even Spark. It is clear that many of us are looking forward to the NextGen Mapreduce tools to reach some level of maturity.


May 07, 2013

TestFarm 2.0 released

We just released TestFarm 2.0. Now on GitHub.

You can install it by running:

sudo pip install testfarm

In Debian/Ubuntu, if you install python-stdeb first, TestFarm will be installed as a deb package that you can remove like any other Debian package.

This release is a major rewrite on the server side. You can expect it to be more reliable, more scalable and easier to install. It is also easier to maintain.
Most changes are in the server and the client-server interface. The client API is mostly the same and migration of existing clients should be quite straightforward.

Regarding CLAM, it would be nice if we could get a bunch of CLAM TestFarm clients. Clients are now easier to set up. In order to set one up, please contact us.


March 21, 2013

Refactoring the TestFarm server

I spent some time refactoring the TestFarm server for scalability and maintainability. TestFarm is a continuous integration platform we implemented for CLAM and other projects.

Read more to know why and how this refactoring took place.


That’s the thing: the CLAM TestFarm server and the client doing the build for Linux were hosted by Barcelona Media, but, since none of the active CLAM developers (neither Pau, Nael, Xavi nor me) works for them anymore, we are not going to ask them to keep that box alive. Thanks to Barcelona Media for hosting it all those many years!!

The only publicly available hosts we have are at DreamHost, including clam-project.org, and they cannot stand the current TestFarm server load.

The Barcelona Media server hosted both the client doing the Linux build and the server generating the website. We recently noted with surprise that, as time goes on and more executions are logged into the server, the load comes more from the website generation than from the actual Linux build. So the whole idea is to fix the scalability flaws we found in the website generation, and to move just the website generation to clam-project.org, while keeping the clients on private machines. Hey, one client can be yours!

How TestFarm works

Just in case you are not familiar with TestFarm, let me briefly introduce its workflow. A distributed team is working on a project hosted on some source code repositories (git, svn…). Project volunteers deploy TestFarm clients on their private machines to cover a range of supported platforms. TestFarm clients monitor the repositories for updates to the code base. When they detect an update, an execution starts (compilations, tests… whatever) and they send the execution events to the server. The TestFarm server logs those events and builds a website with all that information.

Similar systems work in a different way: on those systems the server is a master that tells its slaves to build a given revision of the code for it. TestFarm is designed with volunteers in mind. The clients are the ones that decide how often they check the repository, and the whole transaction is controlled by the client, so you don’t need a fixed IP or a 24/7 connected server.
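The client-driven workflow can be sketched with a tiny in-memory simulation. TestFarm itself is written in Python and its real API is different; every name below (InMemoryServer, pollOnce, the event types, the use of the revision as execution id) is a hypothetical stand-in chosen only to illustrate who initiates what.

```javascript
// Hypothetical in-memory stand-in for the server; not TestFarm's API.
class InMemoryServer {
    constructor() { this.events = []; }
    // The only thing the server does on arrival is to store the event
    log(event) { this.events.push(event); }
}

// One polling step, entirely driven by the client
function pollOnce(client, repository, server) {
    const head = repository.head;
    if (head === client.lastSeen) return false; // no update, nothing to report
    // Every event carries project, client and execution identifiers
    const context = { project: client.project, client: client.name, execution: head };
    server.log(Object.assign({ type: "executionStarted" }, context));
    client.tasks.forEach(function (task) {
        const ok = task.run(head);
        server.log(Object.assign({ type: "taskFinished", task: task.name, ok: ok }, context));
    });
    server.log(Object.assign({ type: "executionEnded" }, context));
    client.lastSeen = head;
    return true;
}

const server = new InMemoryServer();
const repository = { head: "r101" };
const client = {
    project: "CLAM", name: "linux-box", lastSeen: "r100",
    tasks: [{ name: "build", run: function (revision) { return true; } }]
};
pollOnce(client, repository, server); // new revision: three events are logged
pollOnce(client, repository, server); // same revision: nothing happens
```

The server never contacts the client: it only receives events, so the client can sit behind a dynamic IP and poll whenever it is switched on.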

For consistency I changed the meaning of some of the terms used in TestFarm. The short glossary at the end of this article may help as a reference.

The flaws

Long logs parsed over and over. At some point in TestFarm development, clients stalled when reporting their events because they had to wait for the web service to regenerate the website. This was naively solved by moving web generation away from the web service to a cron job, so that the web service just added entries to a monolithic log. That solved the client stalls but made web generation heavier, as parsing the monolithic log took a lot of memory and was done over and over.

Coupled generation and logging. Whenever we tried to fix the former flaw by changing the structure of the logs to fit their usage, we hit a harder problem. Website generation was accessing the information directly from the logs, with different files obtaining different information. Changing the structure of the logs affected a lot of code.

Lack of tests for web file generation. The generated files for the website had no tests backing them, so whenever you tried to make the former changes you were not able to know whether the output had changed.

Lack of context for log events. Log events mimic the events of a local TestFarm execution, and that is not appropriate. In a local execution you have just a single client and events occur sequentially, so whenever you start a command you know it is part of the last started task. You cannot make this assumption when dealing with different clients, unterminated executions… So the context cannot be implicit, and the event must carry the whole information.

(Re)Design principles

Encapsulated logs. In order to be able to change the log structure easily, access to the logs for reading and writing has to be encapsulated in a class, Server. This class provides two interfaces. The logging interface is the one exported as a web service to the clients, and by manipulating it we can simulate a given client situation for testing. The information gathering interface is used by the code generating the website, and provides a semantic way of accessing information from the logs. Instead of accessing log event lines and interpreting them, the Server class provides the semantics that we previously tried to derive from the logs.

All the state is in the log files. A different Server instance is created each time a client logs an event, and we use yet another Server instance to generate the website. All the state of the Server class should be built from the log files; any information held by the Server should be a mere cached copy of the data on the filesystem.

Split and organized logs. Log events are no longer saved in a monolithic log but split into a filesystem structure organized by projects, clients and executions. The former server achieved that, but as a post-processing step. Providing full context for the log events makes this split possible at arrival time. This structure also makes it easy to gather information such as how many executions a given client has, with a simple glob.
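As a sketch of why full context makes the split trivial, the following snippet routes each event to a path derived from the event alone. The directory layout, the file name and the helper names are assumptions for illustration; the real TestFarm layout may differ.

```javascript
// Hypothetical layout: <root>/<project>/<client>/<execution>/log
function logPath(root, event) {
    return [root, event.project, event.client, event.execution, "log"].join("/");
}

// Poor man's glob: count the distinct executions under <root>/<project>/<client>/
function countExecutions(paths, root, project, client) {
    const prefix = [root, project, client].join("/") + "/";
    const executions = new Set();
    paths.forEach(function (path) {
        if (path.startsWith(prefix)) {
            executions.add(path.slice(prefix.length).split("/")[0]);
        }
    });
    return executions.size;
}

// Made-up events; only the context fields matter for routing
const paths = [
    logPath("logs", { project: "CLAM", client: "linux-box", execution: "20130321-120000" }),
    logPath("logs", { project: "CLAM", client: "linux-box", execution: "20130321-180000" }),
    logPath("logs", { project: "CLAM", client: "mac-box", execution: "20130321-120000" })
];
const linuxExecutions = countExecutions(paths, "logs", "CLAM", "linux-box");
```

No event needs to be parsed to answer the question: the directory structure already holds the answer.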

Cheaper information has its own access point. As said before, some information can be retrieved in a cheaper way that requires less parsing or no parsing at all. The Server provides an interface to get that information directly. This way we ensure that no superfluous parsing is done.

Query results as dynamic attributes. Some structured information from a single source, such as an execution log, is provided as an object with dynamic attributes. It is like having a dictionary but accessing it as attributes with meaningful names. The Hyde metadata system inspired this JavaScript-like approach. It is quite fast to implement in Python and quite comfortable to use.

Information caching. Some information bits, such as the completion status of a finished execution or the current status of a given client, require parsing a lot of information to get some bits that can be cached in files. Specific interface for that information will be provided so that eventually the query could rely on the cached information.
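Since this is not implemented yet, the following is only a sketch of how a file based cache could stay consistent with the rule that all the state lives on the filesystem. The function name and the use of JSON for the cache file are assumptions.

```python
import json
import os

def cached_query(cache_path, compute):
    """Return the value stored at cache_path as JSON.

    If the file does not exist yet, call compute(), store its
    result in the file and return it.
    """
    if os.path.exists(cache_path):
        with open(cache_path) as cache:
            return json.load(cache)
    result = compute()
    with open(cache_path, "w") as cache:
        json.dump(result, cache)
    return result
```

This simple form never invalidates the cache, so it only fits information that cannot change anymore, such as the summary of a finished execution.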

One generator class per output file. Original server had all the methods for all the outputs mixed in a single class. It was very hard to understand which methods collaborate to get the same output. I created one class per output, which conveniently groups collaborating methods and you can name them in a simpler way as the class provides context. Common code is mainly information gathering, and has been moved to the Server class.

Not writing files yet. Page classes do not write any files. This eases the testing of the output by checking strings. This could be inconvenient if the generated files get big. A controller WebGenerator class orchestrates the file generators by providing the Server or the specific data.

Rely on Server for ‘now’. Because many functionalities depend on the current time, tests take control over that variable by using Server’s ‘now’ attribute. It returns the current time unless you set it to an arbitrary one.
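One way to get that behaviour is a property, as in this sketch. It shows only the time related part of a hypothetical Server class; the private attribute name is an assumption.

```python
import datetime

class Server(object):
    """Only the time related part of a hypothetical Server class."""

    def __init__(self):
        self._now = None

    @property
    def now(self):
        # Real clock unless a test has fixed the time
        if self._now is not None:
            return self._now
        return datetime.datetime.now()

    @now.setter
    def now(self, value):
        self._now = value
```

A test sets server.now to a fixed datetime before generating a page, so the expected strings do not depend on when the test runs.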

New features

Speed. Isn’t it a feature? Yep it is. Splitting logs and reducing the need of parsing them all the time is a huge speed up. Further speed up can be added in the future by caching summarized information from logs, which now can be implemented quite transparently.

Easier to extend. Gathering data is now easier, so adding new features like the ones presented below is now quite easy.

Client summary on top of the history. Now you have a clear indicator at the top of the history page (the former index page) telling the status of every client. It has the status of the last complete execution (red, green, whatever) and the client status (waiting for updates, unresponsive, running). In previous versions you had to scroll down to see the current status of some clients.

Better stat plots. I got rid of Ploticus. Not that it is bad, but dependencies are a problem if you have to host the service. I used Google’s CoreChart. It gives me nice looking SVG-or-whatever-in-your-browser plots with useful tooltips.

This is at the cost of adding a couple of external JavaScript files. But placing a plot becomes just a matter of writing this HTML:

<div class="plot" src="yourdatafile.json"></div>

and writing in ‘yourdatafile.json’:

[
    ["Executions", "param1", "param2"],
    ["20130323-101010", 13, 43],
    ["20130323-103010", 15, 35],
    ["20130323-105010", 17, 78],
    ....
]
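On the server side, producing such a file from the gathered stats takes a few lines of Python. This is just a sketch: the function name and the shape of the input data are assumptions.

```python
import json

def plot_data(executions, keys):
    """Build the rows of a plot data file.

    executions: list of (execution_name, stats_dict) pairs.
    keys: names of the stats to plot, in column order.
    """
    rows = [["Executions"] + list(keys)]
    for name, stats in executions:
        # A stat missing in one execution becomes null in the JSON
        rows.append([name] + [stats.get(key) for key in keys])
    return rows


executions = [
    ("20130323-101010", {"param1": 13, "param2": 43}),
    ("20130323-103010", {"param1": 15, "param2": 35}),
]
content = json.dumps(plot_data(executions, ["param1", "param2"]), indent=4)
```

Here content holds the text of the data file as a string, in line with page classes that return strings instead of writing files.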

JSON based summary page. This feature has been here for a while, but I had to reimplement it. The summary page is not generated at the server at all. It is fully rendered and updated in the browser by fetching some JSON data.

JSON based TestFarm monitor tray icon applet. Also around for a while but unexplained: testfarm-indicator is a tray icon applet to monitor the status of several TestFarm projects without having to visit their web pages. Whenever a client gets red, you will see the farm icon bursting into flames. It also relies on the JSON data generated for each project.

Status and TODO’s

Web generation is feature complete, replicating what former server did and adding some features. Still the communication between the clients and the server is not implemented. This is a key stage and a requirement before getting it integrated into the stable repository at SourceForge. Meanwhile I am developing it using a separate git repository. By now you can emulate client events and get nice screenshots of the generated web.

So my TODO list on TestFarm looks like this:

  • Connecting the clients so that they use the new Server API
  • Aborting executions
  • Caching gathered information
    • Execution summary once finished or aborted
    • Client state
  • Making it safe against client-provided names (that get into the file system!!)
  • Web interface for creation and configuration of Projects, adding clients…
  • Signed client messages for security

Glossary

  • Server: Web-service collecting information from clients to generate TestFarm website for one or many projects.
  • Project: Set of clients, often sharing code base, that are displayed side-by-side in a TestFarm web.
  • Client: Set of tasks executed once and again in the same environment.
  • Execution: Result of executing once the sequence of tasks for a client.
  • Task: A sequence of commands under a descriptive label.
  • Command: An executed shell statement. Its execution provides an ansi/text output and a status value.
  • Command modifiers: functions that parse the output of a command to add extra information.
    • Info: generates text that will be always displayed (output is shown just when the command fails)
    • Stats: dictionary of numbers. Value evolution along executions will be plotted for each key.
    • Success: boolean overriding the command status value

February 22, 2013

JACK engine for ipyclam

Here you have a cute alternative to QJackCtl if you have plenty of Ardour multichannel ports connected in fancy ways. Auto-completion, broadcasting and Python slices on ports will be your friends.

The ipyclam design enables modular systems other than CLAM to be controlled with the same interactive API just by reimplementing the internal engine. This could be exploited to provide interactive consoles for systems such as Galan, Patchage… But a JACK engine is not just a proof of concept but also something that is useful within the CLAM NetworkEditor work-flow. You can explore and handle external JACK connections from the new NetworkEditor console in a similar way you can handle internal CLAM connections.

The screenshot is an independent ipyclam qtconsole with a simple session. Notice how quick it is to get a description of the system status and how visual and replicable it is. Unit creation, deletion and renaming make no sense here (except maybe for testing), so I decided to ignore those messages. Transport control is not implemented either, even though it would make sense.

Other changes on ipyclam

JACK engine has been a milestone of a set of changes I did this week to wrap-up ipyclam for the Linux Audio Conference 2013. Let me summarize them here:

Separate UI module

Now UI related methods are no longer part of the Engine. They have been moved away and separated into independent modules for PyQt4 and PySide. Besides making the engines independent of Qt/CLAM specifics, splitting PyQt4 and PySide into two modules finally achieves the goal of having code which is independent of the Python binding, just by changing the imports.

if usePyQt :
    import ipyclam.ui.PyQt4 as ui
    from PyQt4 import QtGui, QtCore
else:
    import ipyclam.ui.PySide as ui
    from PySide import QtGui, QtCore

# using ui, QtGui and QtCore freely
app = QtGui.QApplication([])
w1 = ui.createWidget("Oscilloscope")
w2 = ui.loadUi("file.ui")
...

Engines are engines

Let’s forget about terms like ‘back-end’ and ‘network proxy’. ‘Back-end’ was an ambiguous term because we used it in CLAM for other purposes: Audio back-ends for JACK, PortAudio, LADSPA… Moreover, the corresponding classes had a nasty legacy name: XXX_NetworkProxy. So I started calling them engines and the classes XXX_Engine. Now we have Clam_Engine, Dummy_Engine and the brand new Jack_Engine.

The class naming is not definitive. I would like to make a submodule per engine with the same classes inside, including the Engine and the configuration wrapper. The goal should be having code like this:

from ipyclam.engines.clam import Engine as Clam_Engine, Config as Clam_Config
from ipyclam.engines.dummy import Engine as Dummy_Engine, Config as Dummy_Config

But I still have to figure out how to implement it and it is not a priority.

CLAM and Dummy engines have been unified

They were meant to be analogous but it was clear they weren’t when I tried to pass one engine’s tests with the other. It has been the hardest task of this week, but the good news is that once they were unified, implementing the JACK engine has been a one evening task.

Now all the tests are shared, with some required divergences explicitly extracted. Still, most of the configuration work is yet to be done.

Consistent connection broadcasting

There were inconsistencies, even within a single engine, in the behaviour of connections when one of the sides was not a single connector.

There are three kinds of connectible sides:

  • A simple connector
  • A connector set: _inports, _outports… and slices
  • A processing

The behaviour has been defined as follows:

set1 > set2
# is equivalent to ordered one to one connection
for connector1,connector2 in zip(set1,set2) :
    connector1 > connector2

connector > set2
# is equivalent to one to many connection
for connector2 in set2 :
    connector > connector2

processing1 > processing2
# is equivalent to
processing1._outcontrols > processing2._incontrols
processing1._outports > processing2._inports

# for cases like
processing > set
# or
processing > connector
# the processing is substituted by the matching set
# _inports, _outports, _incontrols or _outcontrols
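To make the first two rules concrete, here is a toy model using operator overloading. It is not the ipyclam implementation: the class names, the peers list and the returned connection count are assumptions made for the example, and direction checking is left out.

```python
class Connector(object):
    """Toy connector that just remembers what it was connected to."""

    def __init__(self, name):
        self.name = name
        self.peers = []

    def __gt__(self, other):
        # connector > set: one to many connection
        if isinstance(other, ConnectorSet):
            return sum(self > peer for peer in other.connectors)
        # connector > connector: a single connection
        self.peers.append(other)
        return 1


class ConnectorSet(object):
    """Toy ordered set of connectors."""

    def __init__(self, connectors):
        self.connectors = list(connectors)

    def __gt__(self, other):
        # set > set: ordered one to one connection, like zip
        if isinstance(other, ConnectorSet):
            pairs = zip(self.connectors, other.connectors)
            return sum(a > b for a, b in pairs)
        # Other combinations are out of the scope of this sketch
        return NotImplemented
```

As with zip, connecting sets of different sizes pairs connectors up to the shortest set and leaves the rest unconnected.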

The angle operators now always check connector directions. The connect method doesn’t, although when both sides are processing units the forward direction is implied.

a > b
b < a
a.connect(b)

January 24, 2013

10 "Little" lessons for life that I learned from running

(Sorry for allowing myself to depart from the usual geeky computer science algorithmic talk in this blog. I owed it to myself and my biggest hobby to write a post like this. I hope you bear with me.)

Around 3 years ago, I smoked, I was overweight, and only exercised occasionally. Being a fan of radical turns in my life, I decided one day to go on a week-long liquid diet, I stopped smoking, and I took up running, with the only goal in my mind to some time run the half marathon in my home town. Little did I know that the decision to run would change my life in so many ways. This last year 2012, I have run 3 marathons, 4 half marathons, and a 199 mile relay with a team of 12. But, beyond that, I am convinced that I owe part of my personal and professional success these past years to the fact that I am a runner.

This post is my little homage to running and to the many lessons I have found in my journey.



When I started running I had lots of problems. The main one was due to an old knee injury that came back to haunt me. I had ACL surgery when I was 16, and ever since my right knee has not been the same. When my knee started hurting this time, I visited several doctors, some specialized in sports. All of them recommended I should give up running. Some told me straight out that I would never be able to run a marathon. It took me lots of visits to the chiropractor, and lots of quads exercises over months to get back to running. But I overcame these initial hurdles, and went on to run not one but several marathons.

Lesson 1. Beginnings are hard: Starting anything new in life will be hard. You will need to invest lots of energies, and at times you will want to give up. The more important and significant the change is, the more it will take from you.


After I finished my first marathon in Santa Cruz, and when I thought all my knee problems were long gone, my knee started hurting again. This was nothing like what I had experienced when starting. Still, it could have been enough to stop me from trying again. But, it didn't. I focused on recovering. Soon I was back on the road.

Lesson 2. There will be ups and downs: Once you have overcome the initial difficulties in starting something new, you will be tempted to think that everything else should be easy. But, life, like most running courses, will have hills with ups and downs.


It is hard to wake up at 6 am for the morning run. It is easy to stay in bed when your legs are still sore from yesterday's training. It is tough to go out running when it is raining or freezing outside. It is even harder to decide not to stop when you hit the wall on mile 20 of a marathon. All these day to day small decisions end up adding up and making the difference between you improving and accomplishing your running goals.

Lesson 3. The importance of those small decisions: The small day to day decisions play a huge role in building your character. They will end up determining your long term success and the direction your life takes.


When you are not at your best, it is even harder to face all these small decisions I mentioned. If you are down for some time because of an injury, it is tough to start again on your own. Having a group of friends that share your passion for running is extremely important. I am fortunate to have a large group of friends that push me to become better, and help me get up when I fall.

Lesson 4. You are not alone - the power of social influence... and friends: Whatever new adventure you start in life, it is important to have people around you that understand and support it. People that share your passion can make a difference when you need it.


As much as I have appreciated having that extra support from friends and other fellow runners, there are many times I have felt the pressure of having to make a decision on my own. Many of those small decisions, such as getting out of bed on a rainy day, for example. Nobody is going to make them for you. I have also felt alone in many of my training runs. And, of course, in mile 20 of a marathon, when everyone is giving their best but you can only see strangers around you. In all those moments it is important to be strong and be ready to carry on, on your own.

Lesson 5. But, you will be alone: No matter how many friends support you, you will have to face important decisions on your own, and carry your own weight.


It is well known that "repetition leads to mastery". This is even more so for activities that require developing physical strength and resistance. There is no other secret to becoming a better runner than to run, and run often. Putting on more miles is the goal. Everything else will come.

Lesson 6. Repeat, repeat, repeat, repeat: Repetition is the key to mastering most things in life. If you want to become good at doing something, ask yourself how you can invest thousands of hours in it (read about the 10k hour rule in Malcolm Gladwell's Outliers).



As much as repetition is needed to improve, it is hard to do so without a goal in mind. During my time running I have learned the power of having concrete goals. Setting up goals that are achievable in the long run, but not too easy to get to. As I have progressed, I have learned to be more demanding. My current goals are to do a 3:30 marathon, and a 1:30 half. The first one is achievable, the second one will need much more work. But these goals will keep me going and focused for some time.

Lesson 6. Set your goals: Setting ambitious but achievable goals in life will help you push harder and will keep you focused and looking forward.


When I look back at the way I started running, I realize how many things I did wrong. I have learned so much since then. I have read books, watched movies and online videos, talked to people that know much more than I do. I have also learned from looking at the data that I generate from each of my trainings. I have also learned to listen to and understand my body. I am fortunate enough that I love learning, and I have enjoyed every bit of this learning experience.

Lesson 7. Data and knowledge: Use all the information around you to improve your life. Data about you can give you insights into how to become better. And any knowledge you gain from external sources can make a difference when taking a decision.


One of the reasons why beginnings are hard (Lesson 1) is that people that start running tend to overdo it by, for example, increasing distance and pace at the same time. This typically leads to injury, and frustration. One of the most important things to learn when starting to run is to understand your own limitations. Even when you do, you will be tempted to push too hard by continuing to run when your leg hurts, or by doing one too many races in a short period. I have done all of the above. But it is important to remember that everyone has their limits and forcing beyond them can result in long term problems.

Lesson 8. Everyone has their limits: Pushing yourself hard is good. However, there is such a thing as pushing *too* hard. You need to understand where your limits are to push them further, but only little by little.



No matter how hard it can get at some points, no matter how long it can take you, there is no doubt that you can do whatever you set your mind to. I don't have any special conditions for running, and I have never had. I don't think I will ever be a "great" runner. However, now I look back and laugh when I remember my unreachable goal a little over 3 years ago was "only" to run a half marathon. If someone like me, with little or no pre-existing conditions, family and work obligations, and very little time, can do it, so can you.

Lesson 9. But, yes you can: No matter how low you fall or how far your goal is, you can do it. Only think about the many people just like you who have done it before (e.g. estimates are that around 0.1 to 0.5% of the US population has completed a marathon). Why should you be any less?



As a conclusion, let me stress that the fact that anyone can run does not mean that running is easy or that it requires no effort. It is precisely the fact that it is hard and requires effort for a long period of time that makes it worthwhile. Like most good things in life.

Lesson 10. All good things come hard: Think about it, all worthy things in life require effort and dedication. Being healthy, fit, happy, having a career, or a family, they all require your energy and long time investment. Just go with it, and enjoy every bit of the journey.

January 21, 2013

ipyclam: Embedded IPython console in NetworkEditor

Take a look at the screenshot.

Yes, it is what it seems to be: A Qt-based IPython console embedded into CLAM NetworkEditor. Yes, ipyclam and NetworkEditor share the same network. And yes, sadly, although ipyclam sees anything we do in NetworkEditor, NetworkEditor does not refresh on ipyclam changes yet.

Still it is quite a big step forward because we have been blocked at that point for a while and it has been unblocked in just an afternoon.

In the past months, Xavi and I spent many hours looking for a proper API in IPython to do what we wanted without getting too low level (and postal). Now that IPython provides that new API (and documented examples), it has been quite straightforward to integrate.

A text console but it is not

The new Qt based IPython console is amazing. It is even better in many ways than the standard IPython terminal most of us are used to. It provides:

  • syntax highlighting as you write,
  • graceful multi-line editing,
  • function API hints when writing calls,
  • navigable tab completion,
  • customizable graphical representations of results,
  • many hooks to add our bells and whistles.

So it seems a text console, but it is not. Yep, we can add many bells and whistles there.

Back to the original ipyclam goal

Lately, we realized that ipyclam’s strength is clearly in empowering CLAM prototyping, by adding interactive Python scripting and PyQt/PySide to the QtDesigner/NetworkEditor graphical design tandem. But the original motivation for ipyclam, which was very shortsighted, I must admit, was providing NetworkEditor with a complementary console to deal with big and complex networks where graphical patching was not practical. And that’s a milestone that we have not reached yet.

I’ll try to abstract the problem so that it could be useful beyond CLAM. We had a domain object implemented in C++ and we wanted to provide a Python console (a Qt widget) to manipulate it. So the first goal, which Xavi achieved very successfully, was to build a Python wrapper, ipyclam, for the domain object, a CLAM::Network.

IPython separated quite well the concepts of Python kernel and Qt console. Indeed it separated those concepts so much that they ran in different processes, and that made our goal unfeasible, at least by using the stable high level API of IPython.

Luckily, IPython 1.3 provides high level interfaces and examples for having both kernel and console in the same process connected via a local pipe. I took that code and changed it to be able to inject the CLAM::Network, and I got a Python script that serves as a Qt based ipyclam console, which I named ipyclam_qtconsole.py and which is working in the repository. I modified the code from Jarosz Pawel to get the kernel encapsulated inside the console widget itself and I added a function to easily inject objects into the interpreter’s main namespace, for example the ipyclam network object.

Still, the example was a Python program using a PySide based widget. Luckily, we had already completed our quest to get PySide and PyQt4 working together with C++, and we already have a few functions to transfer QObjects up and down, from and to C++. So if I could abstract the whole thing into a QWidget, I could use it as is. I still can’t believe it worked that well.

namespace py = boost::python;
QWidget * GetIPyClamConsole(CLAM::Network & network)
{
    try
    {
        Py_Initialize();
        // Dummy __main__ namespace, to run execs
        py::object _main = py::import("__main__");
        py::object _main_ns = _main.attr("__dict__");
        // Adding working dir to the Python search path
        py::exec("import sys; sys.path.append('.')" , _main_ns, _main_ns);
        // Simulate that we have a working command line (expected by IPython)
        py::exec("sys.argv=['ipyclam']\n", _main_ns, _main_ns);
        // Build an ipyclam network having the CLAM network as backend
        py::object ipyclamModule = py::import("ipyclam");
        py::object proxy = py::object(py::ptr(&network)); // The proxy backend
        py::object net = ipyclamModule.attr("Network")(proxy); // The ipyclam network
        // Creating the IPython based console widget

        py::object consoleModule = py::import("ipyclam.ipyclam_qtconsole");
        py::object console = consoleModule.attr("IPythonConsoleQtWidget")();

        // Injecting the network into the namespace
        console.attr("namespace_inject")("net", net);

        // Unwrapping the PySide based qt console to use it as an abstract QWidget
        QWidget * consoleWidget = (QWidget*) shibokenUnwrap(console.ptr());
        return consoleWidget;
    }
    catch (py::error_already_set & e)
    {
        std::cerr << "Run time Python error!" << std::endl;
        PyErr_Print();
        return 0;
    }

}

TODO’s

Well, it is still not fully working. This is what else we have to get done:

  • Clean-up dependencies
    • NetworkEditor depends on ipyclam on run-time
    • NetworkEditor depends on Boost-Python and Shiboken on compile-time, do we need both?
  • Refresh the canvas when Python changes the network, including:
    • Processing creation and deletion
    • Processing connection
    • Configuration affecting connector
    • Renames
    • Changes on the back-end
  • Providing documentation so that the Qt console hints help the user
  • Polishing ipyclam for some nasty things you get when you play a little with it.
  • Controlling the banner and the prompt (we could control it on Terminal based but the same interface seems to be gone in Qt)

September 17, 2012

Recsys 2012: A long (and likely biased) summary

After a great week in beautiful and sunny Dublin (yes, sunny), it is time to look back and recap on the most interesting things that happened in the 2012 Recsys Conference. I have been attending the conference since its first edition in Minnesota. And, it has been great to see the conference mature to become the premier event for recommendation technologies. I can't hide that this is my favorite conference for several reasons: perfect size, great community, good involvement from industry, and good side program of tutorials, workshops, and demos.

This year I arrived a bit late and missed the first day of tutorials, and first day of the conference. But I was able to catch up after jumping right in with my 90 minute tutorial on "Building Industrial-scale Real-world Recommender Systems".

In my tutorial (see slides here), I talked about the importance of four different issues in real-world recommender systems:
  • Paying attention to user interaction models that support things like explanations, diversity, or novelty.
  • Coming up with algorithms that, beyond rating prediction, focus on other aspects of recommendation such as similarity, or, in particular, ranking.
  • Using results of online A/B tests, and coming up with offline model metrics that correlate with the former.
  • Understanding the software architectures where your recommender system will be deployed.
I was happy to see that some of these issues not only were mentioned, but almost became common threads throughout the conference. Of course, this might be in the eye of the beholder, and others might have come back with the impression that the main topics were others (I recommend you read these two other Recsys 2012 summaries by Daniel Tunkelang and Daniele Quercia). In any case, grouping in topics will help me summarize the many things I found interesting.

Online A/B Testing and offline metrics

I am glad to see that this has become a relevant topic for the conference, because many of us believe this is one of the most important topics that need to be addressed by both industry and academia. One of these people is Ron Kohavi, who delivered a great keynote on "Online Controlled Experiments: Introduction, Learnings, and Humbling Statistics", where he described his learnings of many years of AB Testing in Amazon and Microsoft. It is funny that I cited his KDD 2012 paper in two slides in my tutorial, not knowing that he was in the audience. I recommend you go through his slides, it was one of the best talks of the conference for sure.

The importance of finding relevant metrics was, as a matter of fact, the focus of a workshop we organized with Harald Steck (Netflix), Pablo Castells (UAM), Arjen de Vries, and Christian Posse (LinkedIn). The title of the workshop was "Recommendation Utility Evaluation: Beyond RMSE". Unfortunately, I was not able to attend. But, I do know the keynote by Carlos Gomez-Uribe, also from Netflix, was very well received. And, the workshop as a whole went very well with several interesting papers and even more interesting discussions. You can access the papers on the website.

A couple of papers in the main track of the conference also touched upon the importance of optimizing several objectives at the same time. In "Multiple Objective Optimization in Recommender Systems", Mario Rodriguez and others explain how they design LinkedIn recommendations by optimizing for several objectives at once (e.g. a candidate that is good for the job and who is also open to new opportunities). They report results from an AB Test run on LinkedIn. In "Pareto-Efficient Hybridization for Multi-Objective Recommender Systems", Marco Tulio Ribeiro and others from Universidade Federal de Minas Gerais & Zunnit Technologies take the multi-objective approach a step further. In their case, they optimize the system to not only be accurate, but also present novel or diverse items.

Some other papers went beyond the academic experimental procedure and implemented real systems that were tested with users. A good example is "Finding a Needle in a Haystack of Reviews: Cold Start Context-Based Hotel Recommender System" by researchers from the Tel Aviv Yaffo College and Technicolor.

Learning to Rank

Another hot topic in this year's Recsys was ranking (or Top-n Recommendations as some prefer to call it). It is good to see that after some time publicly speaking about the importance of ranking approaches, the community seems now to be much more focused on ranking than on rating prediction. Not only was there a whole session devoted to ranking, but many other papers in the conference also dealt with the topic in some way or another.

I will start by mentioning the very good work by my former colleagues from Telefonica. Their paper "CLiMF: Learning to Maximize Reciprocal Rank with Collaborative Less-is-More Filtering" won the best-paper award. And, I think most of us thought that it was very well-deserved. It is a very good piece of work. Well motivated, evaluated, and, it addresses a very practical issue. It is great to see the Recsys team at Telefonica that I started be acknowledged with this award. You can access the paper here and the slides here.
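For readers not familiar with the metric in the paper's title: the reciprocal rank of a recommendation list is the inverse of the position of the first relevant item, and the mean reciprocal rank (MRR) averages it over users. The following is a minimal sketch of the metric only, not of the CLiMF algorithm, and the function names are mine.

```python
def reciprocal_rank(ranked_items, relevant_items):
    """Inverse of the position of the first relevant item (0.0 if none)."""
    for position, item in enumerate(ranked_items, start=1):
        if item in relevant_items:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings, relevants):
    """Average reciprocal rank over users.

    rankings: list of ranked item lists, one per user.
    relevants: list of sets of relevant items, in the same user order.
    """
    values = [
        reciprocal_rank(ranking, relevant)
        for ranking, relevant in zip(rankings, relevants)
    ]
    return sum(values) / len(values)
```

Note that only the first relevant item counts, which is why the metric suits scenarios where a few good recommendations at the top are all that matters.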

In that same session, researchers from the Université Paris 6 presented "Ranking with Non-Random Missing Ratings: Influence of Popularity and Positivity on Evaluation Metrics", an interesting study on the very important issue of negative sampling, and popularity bias in learning to rank. The paper discusses these effects on the AUC (Area Under the Curve) measure, a measure that is not very well-behaved, nor very much used in evaluating ranking algorithms. Still, it is a valuable first step in a very interesting line of work. It is interesting to point out that the CLiMF paper addressed the issue of negative sampling in a radically different way: only considering positive samples. Yet another interesting paper in that session was "Sparse Linear Methods with Side Information for Top-N Recommendations", a model for multidimensional context-aware learning to rank.
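As a reminder of what AUC measures in a ranking setting, it can be read as the fraction of (positive, negative) item pairs that the model orders correctly. The sketch below computes it that way for a single user; the function name and input format are assumptions, and treating every non-positive item as a negative is precisely the kind of choice the missing-ratings discussion is about.

```python
def auc(scores, positives):
    """Fraction of (positive, negative) pairs ranked in the right order.

    scores: dict mapping each item to its predicted score.
    positives: set of items known to be relevant for the user.
    Every other scored item is treated as a negative.
    """
    pos = [s for item, s in scores.items() if item in positives]
    neg = [s for item, s in scores.items() if item not in positives]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # A tie between a positive and a negative counts as half a win
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / float(len(pos) * len(neg))
```

Because every pair weighs the same, a mistake at the bottom of the list costs as much as one at the top, which is one reason AUC is not always a good fit for top-n evaluation.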

Another ranking paper, "Alternating Least Squares for Personalized Ranking" by Gábor Takács from Széchenyi István University and Domonkos Tikk from Gravity R&D, received an honorable mention. The main author coined an (un)popular sentence during his presentation when he invited anyone not interested in Mathematics to leave the room. An unnecessary invitation in a conference that prides itself on being inclusively multidisciplinary. In Recsys, psychologists are sitting through systems presentations as much as mathematicians are sitting through user-centric sessions, and that is what makes the conference appealing. In any case, the paper presents an interesting way to combine a ranking-based objective function introduced in last year's KDD with the use of ALS instead of SGD to come up with another approach to learning to rank.

Two papers dealing with recommendations in Social Networks also focused on ranking. "On Top-k Recommendation Using Social Networks" by researchers from NYU and Bell Labs, and "Real-Time Top-N Recommendation in Social Streams" by Ernesto Diaz-Aviles and other researchers from the University of Hannover. The same first author had an interesting short paper in the poster session: "Swarming to Rank for Recommender System". In that poster he proposes the use of a Particle Swarm Optimization algorithm to directly optimize ranking metrics such as MAP. The method proposes an interesting alternative to the use of Genetic Algorithms or Simulated Annealing for this purpose.
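Since MAP comes up here, this is a minimal sketch of how average precision is computed for one user; MAP is its mean over all users. It only illustrates the metric being optimized, not the swarm optimization method, and the function name is mine.

```python
def average_precision(ranked_items, relevant_items):
    """Average of the precision values at each relevant position."""
    if not relevant_items:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked_items, start=1):
        if item in relevant_items:
            hits += 1
            # Precision of the list cut at this position
            precision_sum += hits / float(position)
    return precision_sum / len(relevant_items)
```

The metric depends on the ranking only through positions, so it is not differentiable with respect to the model scores, which is why gradient-free optimizers are an option for optimizing it directly.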

Finally, the industry keynote by Ralf Herbrich from Facebook, also introduced the world of Bayesian Factor Models for large-scale distributed ranking. This method, introduced by the same author and others from MSR as "Matchbox" is now used in different settings. For example, the poster "The Xbox Recommendation System" presented its applicability for recommending movies and games for the Xbox. And, in "Collaborative Learning of Preference Rankings" the authors apply it to... sushi recommendation!

User-centric, interfaces & explanations

This was probably the third big area of focus of the conference, with many contributions in papers, tutorials, and workshops. The first day, there were actually two tutorials that would fall into this category. In "Personality-based Recommender Systems: An Overview", the authors presented the idea of using personality traits for modeling user profiles. Among other things, they introduced their proposal to use PersonalityML, an XML-based language for personality description. Interestingly, in the industry session, we saw that this is actually a quite practical thing to do. Thore Graepel from Microsoft explained their experiments in using The Big Five personality traits for personalization. In the other tutorial, "Conducting User Experiments in Recommender Systems", Bart Knijnenburg gave a thorough overview of how to conduct user studies for recommender systems. He also introduced his model for using structural equations to model the effects to evaluate. Again, I missed this tutorial, but I was fortunate to hear a very similar presentation by him in Netflix.

In "Inspectability and Control in Social Recommenders", Bart himself (and researchers from UCSB) analyze the effect of giving more information and control to users in the context of social recommendations. A similar idea is explored in the short paper "The Influence of Knowledgeable Explanations on Users' Perception of a Recommender System" by Markus Zanker.

Two papers addressed the issue of how much information we should require from users. In "User Effort vs. Accuracy in Rating-Based Elicitation", Paolo Cremonesi and others analyze how many ratings "are enough" for producing satisfying recommendations in a cold-start setting. And, in "How Many Bits Per Rating?", the Movielens crew try to quantify the amount of information and noise in user ratings from an information-theoretical perspective. It is an interesting continuation of my work on user ratings noise. However, as the first author himself admitted, this is just initial work.
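
To give a rough feel for what "bits per rating" can mean, here is a small sketch that computes the Shannon entropy of an observed rating distribution. This is only my own illustration under simple assumptions, not the method used in the paper: it ignores noise entirely, so it should be read as an upper bound on the information a single rating could carry. The function name is hypothetical.

```python
import math
from collections import Counter


def rating_entropy_bits(ratings):
    """Shannon entropy (in bits) of the empirical distribution of ratings."""
    total = len(ratings)
    if total == 0:
        return 0.0
    counts = Counter(ratings)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        # Each rating value contributes -p * log2(p) bits.
        entropy -= p * math.log2(p)
    return entropy
```

On a five-star scale the result can never exceed log2(5), roughly 2.32 bits, and it gets smaller the more skewed the ratings are towards a few values.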

Other highlights of user-centric work that fell more on the UI side were the paper "TasteWeights: A Visual Interactive Hybrid Recommender System" by my friends at UCSB, as well as the many papers presented in the Workshop on Interfaces for Recommender Systems.

Data & Machine Learning Challenges

If somebody thought that data and machine learning challenges would fade away after the Netflix Prize, this year's Recsys was a clear example that this is far from being the case. Many challenges have taken over since then: the yearly KDD Cups, Kaggle, Overstock, last year's MoviePilot challenge, the Mendeley Challenge... This year Recsys had a Tutorial/Panel and a Workshop on Recommender Systems Challenges, both organized by Alan Said, Domonkos Tikk, and others. I could not attend the Tutorial since it was happening at the same time as mine. But I was able to catch some interesting presentations in the Workshop. Domonkos Tikk from Gravity R&D gave a very interesting presentation on how they evolved from being a team in the Netflix Prize to a real-world company with very interesting projects. Kris Jack from Mendeley also gave two interesting talks on the Mendeley recommender systems. In one of them, he explained how they make use of AWS and Mahout in a system that can generate personalized recommendations for about $60 a month. In the other, he talked about their perspective on data challenges.

Context-aware and location-based recommendations

This has become a traditional area of interest in Recsys. It has now matured to the point that it has its own session and two workshops: "Personalizing the Local Mobile Experience" and the "Workshop on Context-Aware Recommender Systems". But, besides having its own session in the conference, several papers in other sessions also deal with context-aware recommendations. I have already mentioned "Sparse Linear Methods with Side Information for Top-N Recommendations", for example. Other interesting papers in this area were "Context-Aware Music Recommendation Based on Latent Topic Sequential Patterns", on the issue of playlist generation, and "Ads and the City: Considering Geographic Distance Goes a Long Way" for location-aware recommendations.

Social

A similar area that has already matured over several Recsys is Social. It has its own session and workshop, the "Workshop on Recommender Systems and the Social Web", and it transcends into many other papers. In this area, the paper that I have not mentioned in other categories and found interesting was "Spotting Trends: The Wisdom of the Few". One of the reasons I found the paper interesting is that it builds on our idea of using a reduced set of experts for recommendations, what we called "The Wisdom of the Few".

Others

And yes, I still have some interesting stuff from the poster session that I could not fit into any of the above categories.

First, the short paper "Using Graph Partitioning Techniques for Neighbour Selection in User-Based Collaborative Filtering" by Alejandro Bellogin. Alejandro won the Best Short Paper Award for a great piece of work and presentation. He described how to use Normalized Cut graph clustering to group similar users and improve neighborhood formation in standard kNN Collaborative Filtering.

I also liked the poster "Local Learning of Item Dissimilarity Using Content and Link Structure", another graph-based approach, in this case to learn a similarity function.

Finally, in "When Recommenders Fail: Predicting Recommender Failure for Algorithm Selection and Combination", Michael Ekstrand starts to tap into an extremely important question: when and why do some recommendation algorithms fail? This question has been informally discussed in the context of hybrid recommenders and ensembles. But, there is clearly much more work to do, and many things to understand.


----------------------


Well, if you made it all the way here, it means that you are really interested in Recommender Systems. So, chances are that I will be seeing you at next year's Recsys. Hope to see you in Hong Kong!

December 04, 2011

June 17, 2011

June 15, 2011

How it was done…

http://www.planetatortuga.com/noticias.item.3875/los-violentos-en-#parlamentcamp-son-policias-infiltrados.-ver-fotos-rt-plz.html

From there:
(there used to be a high-quality video here, with hundreds of thousands of views and hundreds of comments… and it was deleted: http://www.youtube.com/embed/YcmvzRvsf8g…. here goes another copy in lower quality)

Update: another link on the same subject: http://jmgoig.wordpress.com/2011/06/15/estrategias-del-poder-para-desprestigiar-movimientos-sociales-el-caso-parlamentcamp/

April 04, 2011

Ubuntu PPA for CLAM

For the convenience of Ubuntu users, we have deployed a personal package archive (PPA) on Launchpad.

https://launchpad.net/~dgarcia-ubuntu/+archive/ppa

Instructions are available on the same page. It currently contains libraries, extension plugins, NetworkEditor and Chordata packages for maverick, on the i386 and amd64 platforms.


September 20, 2010

High abstraction level audio plugins specification (and code generation)

If you ever wrote at least 2 audio plugins in your life, for sure you have noticed you had to write a lot of duplicated code. In other words, most of the times, writing a plugin there is very little … Continue reading

March 08, 2010

CLAM Chordata 1.0

The CLAM project is pleased to announce the first stable release of Chordata, which is released in parallel to the 1.4.0 release of the CLAM framework.

Chordata is a simple but powerful application that analyses the chords of any music file on your computer. You can use it to move back and forth through the song while watching insightful visualizations of its tonal features. Key bindings and mouse interactions for song navigation are designed with a musician holding an instrument in mind.

Chordata live: http://www.youtube.com/watch?v=xVmkIznjUPE
The tutorial: http://clam-project.org/wiki/Chordata_tutorial
Download it at http://clam-project.org

This application was developed by Pawel Bartkiewicz as his GSoC 2008 project, reusing existing CLAM technologies under a better-suited interface, which is now Chordata. Please enjoy it.