Posts Tagged ‘Hackathon’

Never too experienced to make a basic mistake

April 15, 2013

I was one of the 170 or so people at the Data Science hackathon in London over the weekend. As always this was well run by Carlos and his team who kept us fed, watered and connected to the Internet.

One of the three challenges involved a dataset containing pairs of Twitter users, A and B, where one of the pair had been ranked, by a person, as more influential than the other (the data was provided by PeerIndex, an event sponsor). The dataset contained 22 attributes, 11 for each user of the pair, plus a 0/1 value indicating who was the more influential; there was a training dataset of 5.5K pairs to learn against and a test dataset to make predictions against. The data was not messy or even sparse; how hard could it be?

Talks had been organized for the morning and afternoon. While Microsoft (one of the event sponsors) told us about Azure and F#, I sat at the back trying out various machine learning packages. Yes, the technical evangelists told us, Linux as well as Windows instances were available in Azure, support was available for the usual big data languages (e.g., Python and R; the Microsoft people seemed to be much more familiar with Python) plus dot net (this was the first time I had heard the use of dot net proposed as a big data solution for the Cloud).

Some members of Team Outliers from previous hackathons (Jonny, Bob and me) formed a team and after the talks had finished the Microsoft people+partners sat at our table (probably because our age distribution was similar to theirs, i.e., at the opposite end of the range to most teams; some of the Microsoft people got very involved in trying to produce a solution to the visualization challenge).

Integrating F# with big data seems to involve providing an interface to R packages (this is done by interfacing to the packages installed on a local R installation) and getting the IDE to know about the names of columns contained in data that has been read. Since I think the world needs new general purpose programming languages as much as it needs holes in the head, I won't say any more.

When in challenge solving mode I was using cross-validation to check the models being built and scoring around 0.76 (AUC, the metric used by the organizers). Looking at the leader board later in the afternoon showed several teams scoring over 0.85, a big difference; what technique were they using to get such a big improvement?

A note: even when trained on data with 0/1 response values, machine learners don't produce models that return zero or one; many return values in the range 0..1 (some use other ranges), and the usual technique is to treat all values greater than 0.5 as a 1 (or TRUE or ‘yes’, etc) and all other values as a 0 (or FALSE or ‘no’, etc). This (x > 0.5) test had to be done to cross-validate models using the training data and I was using the same technique for the test data. With an hour to go in the 24 hour hackathon we found out (change from ‘I’ to ‘we’ to spread the mistake around) that the required test data output was a probability in the range 0..1, not just a 0/1 value; the example answer had this behavior and this requirement was explained in the bottom right of the submission page! How many times have I told others to carefully read the problem requirements? Thankfully everybody was tired and Jonny & Bob did not have the energy to throw me out of the window for leading them so badly astray.

Having AUC as the metric should have raised a red flag; it does not make much sense for a 0/1 answer, but it makes sense for PeerIndex because they will want to trade off recall/precision. Also, it's a good idea to remove one's ego when asked the question: are lots of people doing something clever or are you doing something stupid?
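
To make that concrete, here is a minimal R sketch (on made-up data, not the PeerIndex files) of why collapsing probabilities to 0/1 hurts when AUC is the metric: AUC scores how well the predictions rank the positives above the negatives, and thresholding throws most of that ranking away. The auc helper below is a hypothetical implementation based on the Mann-Whitney relationship, not anything supplied by the organizers.

    # Made-up example: 0/1 outcomes and model output in the range 0..1.
    set.seed(42)
    actual <- rbinom(1000, 1, 0.5)              # 0/1 labels, as in the training data
    prob   <- plogis(2 * actual + rnorm(1000))  # fake model probabilities

    # AUC = probability that a random positive is ranked above a random negative.
    auc <- function(labels, scores) {
      r <- rank(scores)
      n_pos <- sum(labels == 1)
      n_neg <- sum(labels == 0)
      (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    }

    auc(actual, prob)                    # uses the full ranking
    auc(actual, as.numeric(prob > 0.5))  # thresholded 0/1 submission: usually lower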

While we are on the subject of doing the wrong thing, one of the top three teams gave an excellent example of why sales/marketing don’t like technical people talking to clients. Having just won a prize donated by Microsoft for an app using Azure, the team proceeded to give a demo and explain how they had done everything using Google services and made it appear within a browser frame as if it were hosted on Azure. A couple of us sitting at the back were debating whether Microsoft would jump in and disqualify them.

What did I learn that I did not already know this weekend? There are some R machine learning packages on CRAN that don’t include a predict function (there should be a research-only subsection on CRAN for packages like this) and some ranking algorithms need more than 6G of memory to process 5.5K pairs.

There seemed to be a lot more people using Python, compared to R. Perhaps having the sample solution in Python pushed the fence sitters that way. There also seemed to be more women present, but that may have been because there were more people at this event than previous ones and I am responding to absolute numbers rather than percentages.

Success does not require understanding

July 23, 2012

I took part in the second Data Science London Hackathon last weekend (also my second hackathon) and it was a very different experience compared to the first hackathon. Once again Carlos and his team really looked after us.

  • The data was released 24 hours before the competition started and even though I had spent less than half an hour looking at it, at the start of the competition I did not feel under any time pressure; those 24 hours allowed me to get used to the data and come up with some useful looking ideas.
  • The instructions for the first competition had suggested that people form teams of 3-5 and there was a lot of networking happening before the start time. There was no such suggestion this time and as I networked around looking for people to work with I was surprised by the number of people who wanted to work alone; Jonny and Kannappan were the only members from my previous team (the Outliers) who had entered this event, with Kannappan wanting to work alone and Jonny joining me to create a two person team.
  • There was less community spirit this time; possible reasons include a lot more single-person teams sitting in the corner doing their own thing, fewer people attending (it is the middle of the holiday season), fewer people staying over until the Sunday (perhaps single-person teams got disheartened and left, or the extra 24 hours of data access meant that teams ran out of ideas/commitment after 36 hours), or simply that being reduced to a single-person team myself (Jonny had to leave at 20:00) meant I paid more attention to what was happening on the floor.

The problem was to predict what ratings different people would give to various music artists. We were given data involving 50 artists and 48,645 users (artists and users were anonymous) in five files (one contained the training dataset and another the test dataset).

A quick analysis of the data showed that while there were several thousand rows of data per artist there were only half a dozen rows per person, a very sparse dataset.
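
A few lines of R are enough to see that sparsity; the file and column names below (train.csv, Artist, User) are guesses at the layout rather than the actual names supplied.

    # Hedged sketch: count rows per artist and per user in the training data.
    train <- read.csv("train.csv")

    summary(as.vector(table(train$Artist)))  # several thousand rows per artist
    summary(as.vector(table(train$User)))    # only a handful of rows per user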

The most frequent technique I heard mentioned during my initial conversations with attendees was machine learning. In my line of work I am constantly trying to understand what is going on (the purpose of this understanding is to control and make things better) and consider anybody who uses machine learning as being clueless, dim-witted or just plain lazy; the problem with machine learning is that it gives answers without explanations (OK, decision trees do provide some insights). This insistence on understanding turned out to be my major mistake; the competition metric is based on correctness of answers and not on how well a competitor understands the problem domain. I had a brief conversation with a senior executive from EMI (who supplied the dataset and provided some of the sponsorship money) who showed up on Sunday morning and he had no problem with machine learning providing answers and not explanations.

Having been overly ambitious last time, team Outliers went for extreme simplicity and started out with the linear model glm(Rating ~ AGE + GENDER...) being built for each artist (i.e., 50 models). For a small amount of work we got a score of just over 21 and a place of around 70th on the leader board; now we just needed to include information along the lines of “people who like Artist X also like Artist Y”. Unfortunately the only other member of my team (who did not share my view of machine learning and knew something about it) had a prior appointment and had to leave me consuming lots of CPU time on a wild goose chase that required me to have understanding.
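
A minimal sketch of that per-artist approach, assuming data frames train and test containing Artist, Rating, AGE and GENDER columns (the quote above only gives the formula; the rest of the layout is guesswork):

    # Fit one linear model per artist (about 50 of them).
    models <- lapply(split(train, train$Artist), function(d)
      glm(Rating ~ AGE + GENDER, data = d))

    # Predict each test row using the model for its artist.
    test$Rating <- sapply(seq_len(nrow(test)), function(i) {
      row <- test[i, ]
      predict(models[[as.character(row$Artist)]], newdata = row)
    })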

The advantages of being in a team include getting feedback from other members (e.g., why are you wasting your time doing that, look how much better this approach is) and having access to different skill sets (e.g., knowing what magic pixie dust values to use for the optional parameters to machine learning routines). It was Sunday morning before I abandoned the ‘understanding’ approach and started thrashing around using various machine learning techniques, which told me that people demographics (e.g., age and gender) were not particularly good predictors compared to other data, but did not reduce my score to the 13-14 range that could be seen on the leader board’s top 20.

Realizing that seven hours was not enough time to learn how to drive R’s machine learning packages well enough to get me into the top ten, I switched tack and spent a lot more time wandering around chatting to people; those whose score was worse than mine were generally willing to chat. Some had gotten completely bogged down in data cleaning and figuring out how to handle missing data (a subject rarely covered in books but of huge importance in real life). I was surprised to find one team doing most of their coding in SQL (my suggestion to only consider Age+Gender improved their score from 35 to 22), and I mocked the people using Clojure (people using a Lisp-derived language think they have discovered the one true way and will suffer from self-doubt if they are not mocked regularly). Afterwards it struck me that well over 50% of attendees were not British (based on their accents); was this yet another indicator of how far British universities had dumbed down mathematics teaching, that natives did not feel up to the challenge (well done to the Bristol undergraduate who turned up), or were the most gung-ho technical folk in London those who had traveled here to work from abroad?

The London winner was Dell Zhang, the only other person sitting at the table I was on (he sat opposite me throughout the competition), who worked quietly away for the whole 24 hours and seemed permanently unimpressed by the score he was achieving; he described his technique as “brute force random forest using Python” (the source will be made available on the Data Science website).

Reading through posts made by competitors after the event was as interesting as last time. Factorization Machines seem to be the hot new technique for making predictions based on very sparse data, and libFM is the software I needed to know about last weekend (no R package providing an interface to this C++ code is available yet).
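
For future reference, libFM reads libSVM-style sparse rows (target index:value ...), and the usual encoding for rating data is one indicator feature per user and per artist. A hedged sketch of writing that format from R, with made-up column names:

    # Convert (User, Artist, Rating) rows into libFM/libSVM sparse format.
    user_id   <- as.integer(factor(train$User)) - 1                  # features 0..U-1
    artist_id <- length(unique(train$User)) +
                 as.integer(factor(train$Artist)) - 1                # features U..U+A-1

    writeLines(sprintf("%g %d:1 %d:1", train$Rating, user_id, artist_id),
               "train.libfm")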

Incompetence borne of excessive cleverness

April 30, 2012

I have just got back from the 24 hour Data Science Global Hackathon; I was an on-site participant at Hub Westminster in London (thanks to Carlos and his team for doing such a great job looking after us all {around 50 turned up from the 100 who registered; the percentage was similar in other cities around the world}). Participants had to be registered by 11:00 UTC, self-form into 3-5 person teams ready for the start at 12:00 UTC and finish 24 hours later. The world-wide event had been organized by our London hosts who told us they expected the winning team to come from those in the room; Team Outliers (Wang, Jonny, Kannappan, Bob, Simon, yours truly and Fran for the afternoon) started in an optimistic mood.

At 12:00 an air-quality training dataset + test points was made available, and teams were given the opportunity to submit eight predictions in each of the two 12 hour time periods. The on-line submissions were evaluated by Kaggle (one of the sponsors, along with EMC) to produce a mean estimated error that was used to rank teams.

The day before the event I had seen a press release saying that the task would involve air-quality and a quick trawl of the Internet threw up just the R package I needed, OpenAir; I also read a couple of Wikipedia articles on air pollution.

Team Outliers individually spent the first hour becoming familiar with the data and then had a get-together to discuss ideas. Since I had a marker pen, was sitting next to a whiteboard and was the only person with some gray hair, I attempted to manage the herding of the data science cats and later went on to plot the pollution monitor sites on Google Maps as well as producing some visually impressive wind roses (these did not contribute anything towards producing a better solution, but if we had had a client they could have been used to give the impression we were doing something useful).
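
For what it is worth, the wind roses came from the openair package's windRose function; the data frame and column names below are assumptions about how our monitoring data was arranged, not the competition's actual names.

    # Hedged sketch: openair expects wind speed and direction columns
    # (ws and wd by default); here the assumed column names are given explicitly.
    library(openair)

    windRose(site_data, ws = "wind_speed", wd = "wind_direction")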

People had various ideas about the techniques to use for building the best model and how the measurements present in the training set might be used to predict air quality (the training data had names such as target_N_S, where N and S were small integer values denoting the kind of pollution and the site where the measurement was made). The training set included measurements of wind speed/direction data and hours of sunlight, and a couple of members wanted to investigate if these would make good predictors. Team Outliers had people looking at all the fancy stuff you find in textbooks, e.g., ARIMA for time series, svm for machine learning, and I was looking at getting the data into the form needed by OpenAir.
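
A sketch of the kind of per-target time-series model being considered, using base R's arima function (the column name, frequency and model order are for illustration only):

    # Fit an ARIMA model to one pollutant series and forecast ahead.
    y   <- ts(train$target_1_1, frequency = 24)  # assumed hourly layout
    fit <- arima(y, order = c(1, 0, 1))           # order chosen for illustration only
    predict(fit, n.ahead = 24)$pred               # forecast the next 24 hours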

There were all sorts of problems with the data, just like real life, e.g., missing values (lots of them), some kinds of quality (i.e., pollutants) only being measured at one or two sites, and fuzzy values such as the ‘most common month’ (whatever that might be). Some people were looking at how best to overcome this data quality problem.

20:30 arrived, and some great food was laid out for dinner; no actual predictions yet, but they would be arriving real soon now. A couple of hours later my data formatting project crashed and burned (being a 32-bit system, R got upset about my request to create a vector needing 5.2 Gig; no chance of using the R-based OpenAir package, which needed data in a format different from the one we had been given). Fifteen minutes to midnight I decided that we either used the eight submissions permitted in the first 12 hours or lost them, and wrote a dozen lines of R that built a linear model using one predictor variable (which I knew from some earlier plots was far from linear, but the coding was trivial, and the lm function would take no time to build separate models for each of the 30-odd response variables). I submitted the predictions, and we appeared on the score board at number 65 out of 112. Being better than 47 other teams was a bit of a surprise.
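
Those dozen lines are not reproduced here; the sketch below is a hedged reconstruction of that kind of last-ditch code, with solar_radiation standing in for whichever single predictor was actually used:

    # One lm() per response column, all driven by a single (hypothetical) predictor.
    response_cols <- grep("^target_", names(train), value = TRUE)

    predictions <- sapply(response_cols, function(col) {
      m <- lm(reformulate("solar_radiation", response = col), data = train)
      predict(m, newdata = test)
    })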

Panic over, we realised that the first 12 hours ended at 00:00 UTC, which was 01:00 BST (British Summer Time), so we had another hour. Wang made a couple of submissions that improved our score, and at around 02:00 I went to grab a few hours sleep.

I was back online at 06:00 to find that team Outliers had slipped to 95th place as the Melbourne and San Francisco teams had improved during their daytime. More good food for breakfast at 08:30.

Jonny drew attention to the fact that the mean absolute error in our team’s current score was almost twice as great as that of the sample solution provided with the data. We had long ago dismissed this solution as being too simplistic (it was effectively a database solution, in that it calculated the mean value of the various pollutants in the training set at various chunkID and hour points, which were used as keys for ‘looking up’ prediction values required by the test dataset). Maybe team members ought to focus their attention on tweaking this very simple approach rather than our ‘cleverer’ approaches.

I suggested modifying the sample solution to use the median rather than the mean (less susceptible to outliers); this boosted our ranking back into the 60s. Jonny and Simon tried using a rolling mean, no improvement; Wang tried other variations, no improvement.
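
A hedged sketch of that tweak: the sample solution amounted to a per-(chunkID, hour) lookup table, so switching from mean to median is a one-word change (the column names follow the description above; the real sample code may have looked quite different):

    # Build a (chunkID, hour) lookup table for one target and apply it to the test set.
    lookup <- aggregate(target_1_1 ~ chunkID + hour, data = train,
                        FUN = median)   # the sample solution used mean here

    key <- function(d) paste(d$chunkID, d$hour)
    test_pred <- lookup$target_1_1[match(key(test), key(lookup))]  # NA where no match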

Team Outliers finished the Hackathon in equal 61st, along with 22 other teams, out of 114 submissions.

What did we do wrong? Mistakes include:

  • trying to do too much for people with our various skill levels in the time available. For instance, I don’t regularly reformat large data tables or try to calculate the error in machine learning models, and while I can easily knock out the code I still have to sit down and think about what needs to be written, something that somebody who does this sort of thing regularly would just know how to do; other team members seemed very familiar with the theory but were not used to churning out code quickly.
  • spending too much time studying all the various kinds of measurements available in the training set, many of which were not available in the test dataset. We should have started off ignoring measurements in the training set that were not available in the test set (a two-line check, sketched after this list), perhaps looking to use these later if time permitted.
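
That first step would have been a couple of lines of R (the data frame names here are hypothetical):

    # Keep only the measurement columns present in both the training and test data.
    shared <- intersect(names(train), names(test))
    train_usable <- train[, shared]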

Members of team Outliers enjoyed themselves, but were a little crestfallen that our clever stuff was not as good as such a crude, but insightful, approach. Most of us used R; a few made use of awk, Python, spreadsheets and Unix shell.

Our hosts are looking to run more data science hackathons this year, in particular one related to the music industry in a few months time. If you are interested in taking part, keep an eye on their website.

Update (later the next day)

At least one team achieved some good results using ARIMA. Fran had started building an ARIMA model but had to leave, and nobody else picked it up; I should have been paying more attention to ensure that ideas did not disappear when people left.