Home > Uncategorized > Predicting the next value in an integer sequence

Predicting the next value in an integer sequence

There was a Kaggle meetup group hackathon on Saturday. Integer sequence learning is a recently posted Kaggle challenge; build a model to predict the next value in an integer sequence, with example data coming from the On-Line Encyclopedia of Integer Sequences. How could I not want to try my hand at this challenge, and I signed up for the hackathon hoping to find like-minded folk.

My only previous encounter with mining the OEIS was a paper that attempted to combine two or more existing sequences to match one existing sequence and to contribute a few sequences I had found.

The event was well attended, and I found a fellow enthusiast in Lampros.

We kicked around a few ideas and while I jumped in and started investigating the characteristics of the data, Lampros started searching for solutions that others had found to the next value in an integer sequence problem (a much more sensible approach, but probably not as much fun as jumping in feet first; this was his first hackathon, so he has not picked up any bad habits yet ;-).

This is one of those few problems when over-fitting is required for things to work.

I concentrated on characterizing the 113,845 sequences in the training set and the following is a summary (code and data):

  • 74,465 sequences contained a maximum value less than 2^32 and 98,916 less than 2^64. So symbolic maths is going to be needed (a shame since I had found tscount, an R package for analyzing count time series),
  • a goodly 72,202 sequences are in sorted order, leaving 41,643 to go up and down,
  • The least significant digit of the values in many sequences are a subset of the ten possible values. The following is a list of the number of unique least significant digits in the training sequences:
        1     2     3     4     5     6     7     8     9    10 
     1289  4692  6800 11589 15773 13701  7635  8644 10837 32885

    After removing sequences containing fewer than 20 values (the -1 count)

       -1     1     2     3     4     5     6     7     8     9    10 
    33398   764  3006  3171  6068  8930  7898  3690  5604  9073 32243

Lampros found Martin Rubey’s work on guessing formulas for sequences, which had an Open Source implementation using FriCAS (source code, which needs Steel Bank Common Lisp to build, which itself needs your OS to support 512 open files {OS X default is 256}).

Other software and papers include: the gfun package plus paper, sequence prediction in Mathematica and a Masters thesis on Inductive Inference of Integer Sequences.

These systems are all based on fitting, a potentially very complicated, polynomial to each sequence and achieve a success rate of around 20% on the complete OEIS.

What are the characteristics of the 80% for which a polynomial fit does not predict the next value? Perhaps there are not enough values specified in these sequence to fit the necessary polynomial. Perhaps the values grow at a faster rate than can be fitted by any polynomial expression, e.g., the Busy beaver function.

A 11:00-17:00 hackathon is not really enough time to get anything sensible up and running. People doing Masters projects had more hours and were managing to get around 20% correct.

The competition leaderboard currently has somebody with a score of 0.74. This is a big lead over everybody else. Am I being cynical by thinking the model might be reading the OEIS text to find the formula describing the sequence and then evaluating this?

Update

Comparing Computer Models Solving Number Series Problems provides an interesting review of systems using an AI based approach, i.e., trying to mimic what people do.

  1. June 9, 2016 21:37 | #1

    Do you think the 0.74 score is genuine? Given that the prediction targets are in the training set it seems incredibly easy to data snoop.

  2. June 9, 2016 21:49 | #2

    @Maurice
    I never thought to check for that possibility.

    Many sequences in the OEIS include a formula for deriving them, which is how I thought the test set had been generated and how the 0.74 score had been achieved.

  3. June 21, 2016 19:32 | #3

    @Derek Jones

    I’ve actually taken a look at the training data now and realised there are a different set of sequences in there so I was wrong about the potential for data snooping. Looks like the most likely explanation is as you say, or some kind of automated lookup on the OEIS website. Either way, the Kaggle administrators have removed that submission and the current highest score is a more believable 0.23.

  1. No trackbacks yet.