Full Fact checking of number words
I was at the Full Fact hackathon last Friday (yes, a weekday hackathon; it looked interesting and interesting hackathons have been very thin on the ground in the last six months). Full Fact is an independent fact checking charity; the event was hosted by Facebook.
Full Fact are aiming to check facts in real-time, for instance tweeting information about inaccurate statements made during live political debates on TV. Real-time responses sound ambitious, but they are willing to go with what is available, e.g., previously checked facts built up from intensive checking activities after programs have been aired.
The existing infrastructure is very basic; it is still early days.
Being a numbers person, I volunteered to help out analyzing numbers. Transcriptions of what people say often contain numbers written as words rather than numeric literals, e.g., eleven rather than 11. Converting number words to numeric literals would enable searches to be made over a range of values. There is an existing database of checked facts, and Solr is the search engine used in-house; it supports numeric range searches over numeric literals.
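For instance, once values are stored as numeric literals, a Solr range query can find checked facts whose value falls between two bounds. A minimal sketch, assuming the facts are indexed with a numeric field called value (the field name is hypothetical) and using pysolr, one commonly used Python client:

```python
# Hedged sketch: numeric range search against a Solr index of checked
# facts. The core name "facts" and the field name "value" are
# assumptions, not Full Fact's actual schema.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/facts")

# Find checked facts whose numeric value lies between 10 and 20.
for doc in solr.search("value:[10 TO 20]"):
    print(doc)
```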
Converting number words to numeric literals sounds like a common problem, and I expected to be able to choose from a range of fancy packages in Python (the in-house development language).
Much to my surprise, the best existing code I could find was rudimentary (e.g., no support for fractions or ranking words such as first, second).
spaCy was used to tokenize sentences and decide whether a token was numeric, with text2num converting the token to a numeric literal (nltk has not kept up with advances in NLP).
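The core of the pipeline was only a few lines. A minimal sketch of the idea, assuming spaCy's en_core_web_sm model and the text_to_num package's text2num(word, lang) interface (the text2num module we used may have a slightly different calling convention):

```python
# Sketch: find number-word tokens with spaCy and convert them to
# numeric literals. Multi-word numbers ("twenty three") would need
# their tokens joining first; this handles single tokens only.
import spacy
from text_to_num import text2num  # interface assumed

nlp = spacy.load("en_core_web_sm")  # model name assumed

def number_words_to_literals(sentence):
    """Return (word, value) pairs for number-word tokens in sentence."""
    pairs = []
    for token in nlp(sentence):
        if token.like_num and token.text.isalpha():  # a word, not digits
            try:
                pairs.append((token.text, text2num(token.text, "en")))
            except ValueError:
                pass  # words like "thousands" have no single value
    return pairs

print(number_words_to_literals("Crime fell by eleven percent last year"))
# -> [('eleven', 11)]
```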
I quickly encountered a bug in spaCy, which failed to categorize eighteen as a number word; an update was available on github a few hours after I reported the problem+fix :-). The fact that such an obvious problem had not been reported before suggests that few people are using this functionality.
Jenna, the other team member writing code, used beautifulsoup to extract sentences from the test data (formatted in XML).
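A sketch of that extraction step, assuming each sentence is wrapped in a <sentence> element (the actual schema of the test data may differ; BeautifulSoup's "xml" parser also requires lxml to be installed):

```python
# Hedged sketch: pull sentence text out of the XML test data.
from bs4 import BeautifulSoup

with open("test_data.xml") as f:
    soup = BeautifulSoup(f, "xml")

sentences = [s.get_text() for s in soup.find_all("sentence")]
```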
Number words do not always have clear-cut values, e.g., several thousand, thousands, high percentage, and character sequences that could be dates. Then there are fraction words (e.g., half, quarter) and ranking words (e.g., first, second), all everyday uses that will need to be handled. It is also important to be able to distinguish between dates, percentages and ‘raw’ numbers.
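One possible approach is lookup tables that map fraction and ranking words to values, and vague quantifiers to a range rather than a single number; the mappings below are an illustrative sketch, not what Full Fact uses:

```python
# Illustrative tables; the chosen ranges for vague words are guesses.
FRACTION_WORDS = {"half": 0.5, "quarter": 0.25}
RANKING_WORDS = {"first": 1, "second": 2, "third": 3, "fourth": 4}
VAGUE_WORDS = {"several": (3, 7), "thousands": (2_000, 100_000)}

def classify_number_word(word):
    """Return (kind, value) for a number word, or None if unknown."""
    w = word.lower()
    # Note: words such as "second" (time unit) and "quarter" (of a
    # year) are ambiguous, so surrounding context is still needed.
    if w in FRACTION_WORDS:
        return ("fraction", FRACTION_WORDS[w])
    if w in RANKING_WORDS:
        return ("rank", RANKING_WORDS[w])
    if w in VAGUE_WORDS:
        return ("range", VAGUE_WORDS[w])
    return None
```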
The UK is not the only country with independent fact checking organizations; a member of Chequeado, in Argentina, was at the hack. Obviously, number-word handling will have to deal with the conventions of other languages.
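Some conversion libraries already cover several languages; a hedged example, assuming Spanish ("es") is among the language codes the text_to_num package supports:

```python
from text_to_num import text2num

# Assumes "es" is a supported language code in text_to_num.
print(text2num("dieciocho", "es"))  # -> 18
```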
Full Fact are looking to run more hackathons in the UK. Keep your eyes open for hackathon announcements. In the meantime, if you know of a good Python library for handling word to number conversion, please let me know.
Hi Derek, thanks for all your efforts on Friday. My associate Martin White mentioned this paper which may be of interest http://www.mitpressjournals.org/doi/abs/10.1162/COLI_a_00086#.WIXIiH2YLSl
@Charlie Hull
Yes, the NumGen project did some interesting analysis. Chris Cummins’ PhD thesis contains lots of interesting material.
This work relates to how people express uncertainty in the value they are trying to convey, e.g., “I think there are 100 people” suggests that the speaker has greater uncertainty about the actual value than the statement “I think there are 101 people”.