In part three of this short introductory series, Jim Vine outlines some of the machine learning techniques that are likely to be deployed in analysing housing’s Big Data.
Whilst I do not want to get hung up on individual technologies or particular algorithms, it is not possible to give a full flavour of the sorts of insights that we can gain from Big Data without at least a high-level overview of the types of analysis we can perform. This short blog post therefore outlines the broad types of machine learning process that are available, to give a rough idea of the kinds of relationships we might be able to model.
Regression will be familiar to many people who have undertaken analysis of data in the past. It is the process of trying to find a statistical relationship between one or more explanatory variables and an outcome variable. In conducting a regression we hope to find factors in our data that can be combined to give some estimate of the value of a particular outcome of interest. For example, with a basket of characteristics of each home, and data on many of these homes’ void periods, we might be able to generate a model that gives an estimate of likely void periods for other homes.
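To make the void-period example concrete, here is a minimal sketch of regression using ordinary least squares with a single explanatory variable, written with only the Python standard library. The data points (property age in years, void period in days) are invented purely for illustration; a real model would draw on a much larger basket of characteristics.

```python
# Invented illustrative data: (property_age_years, void_period_days)
data = [(5, 10), (12, 18), (20, 25), (30, 34), (40, 45)]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

# Slope and intercept of the least-squares line: y = a + b * x
b = sum((x - mean_x) * (y - mean_y) for x, y in data) / \
    sum((x - mean_x) ** 2 for x, _ in data)
a = mean_y - b * mean_x

def predict_void_days(age_years):
    """Estimate the likely void period for a home of the given age."""
    return a + b * age_years

print(round(predict_void_days(25), 1))
```

In practice you would use a library such as scikit-learn and many explanatory variables at once, but the principle is the same: fit a relationship to past outcomes, then use it to estimate outcomes for other homes.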
Classification is similar to regression in that we attempt to build a model from a set of explanatory variables, but in this case the outcome we are trying to model is a set of categories. This can be as few as two categories: for example, a classification model built from past data might learn the shape of the boundary that separates boilers that have broken down from those that have not. If we can create a strong model of this behaviour, then we can monitor the readings of working boilers going forwards, and could proactively attend to likely issues as they start to cross the line into the area of likely failure.
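A minimal sketch of the boiler example might look like the following, using a nearest-centroid classifier in pure Python. The two features (ignition delay in seconds, weekly pressure drop in bar) and all of the readings are invented for illustration; a production model would use more features and a more sophisticated algorithm.

```python
import math

# Invented past readings: (ignition_delay_s, pressure_drop_bar)
working = [(1.0, 0.02), (1.2, 0.03), (0.9, 0.01)]
failed  = [(3.5, 0.20), (4.0, 0.25), (3.8, 0.30)]

def centroid(points):
    """Mean position of a set of 2-D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

c_working, c_failed = centroid(working), centroid(failed)

def classify(reading):
    """Label a boiler reading by whichever class centroid is nearer."""
    d_w = math.dist(reading, c_working)
    d_f = math.dist(reading, c_failed)
    return "likely failure" if d_f < d_w else "working"

# A reading drifting towards the failure region of the feature space
print(classify((3.2, 0.18)))
```

The "line" described in the text is, in this sketch, the set of points equidistant from the two centroids: readings crossing it would trigger a proactive visit.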
Clustering is most obviously used in customer segmentation. It differs from classification in that you do not have a particular underlying outcome that you are grouping people on. Instead, you ask the computer to find groups of people (or homes or other entities) that are similar across a number of variables; the analyst then typically looks at the clusters the computer has produced and creates descriptive labels for them. Whilst these clusters do not directly predict drivers of behaviour in the way that a classification model would aim to, they might be useful in situations where there is no existing record of outcomes to look back on to observe the past propensities of different groups.
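To illustrate, here is a minimal k-means sketch (k = 2) in pure Python. Each point is (weekly rent, repair calls per year), with values invented for illustration; note that the algorithm itself knows nothing about what the resulting groups "mean", which is why the analyst's labelling step matters.

```python
import math

# Invented data: (weekly_rent, repair_calls_per_year)
points = [(80, 1), (85, 2), (90, 1), (200, 8), (210, 9), (190, 7)]

# Start with two arbitrary points as the initial centroids
centroids = [points[0], points[3]]

for _ in range(10):  # a few refinement iterations are plenty here
    clusters = [[], []]
    for p in points:
        # Assign each point to its nearest centroid
        idx = min(range(2), key=lambda i: math.dist(p, centroids[i]))
        clusters[idx].append(p)
    # Move each centroid to the mean of the points assigned to it
    centroids = [
        (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
        for c in clusters
    ]

for i, c in enumerate(clusters):
    print(f"cluster {i}: {sorted(c)}")
```

With this toy data the algorithm separates low-rent/low-repair homes from high-rent/high-repair ones; it would then be the analyst's job to decide what, if anything, those segments represent.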
Anomaly detection is used to identify outliers: cases that are outside the expected ranges of behaviour. This might be useful for identifying properties that are experiencing unusually high rates of repair requirements. These will not necessarily be just the properties with the highest rates of repairs, as there are certain points in a property's lifecycle where you would expect more repairs (‘snagging’ for new properties, and after a certain period when several components start to wear out).
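The simplest form of this idea can be sketched with z-scores over yearly repair counts, using only the standard library. The property references and counts below are invented; note that this naive version flags only the raw extremes, whereas the lifecycle point in the text means a real model would also condition on factors such as property age.

```python
import statistics

# Invented yearly repair counts per property reference
repairs = {"P001": 3, "P002": 4, "P003": 2, "P004": 5, "P005": 3,
           "P006": 4, "P007": 3, "P008": 2, "P009": 4, "P010": 20}

mean = statistics.mean(repairs.values())
sd = statistics.stdev(repairs.values())

# Flag properties more than two standard deviations above the mean
outliers = [ref for ref, n in repairs.items() if (n - mean) / sd > 2]
print(outliers)
```

A more realistic approach would measure each property's deviation from the repair rate expected for its age and archetype, so that a brand-new home in its snagging period is not wrongly flagged.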
There are also a few other types of machine learning technique that are perhaps less likely to find use in Housing Big Data, such as recommender systems (commonly used by retailers to suggest other things you might like based on purchase history) and reinforcement learning (where feedback from performance informs successive iterations, such as this helicopter learning to fly). There are also forms of analysis that data scientists might want in their toolkit, such as dimension reduction, which perform technical tasks to ease the main forms of analysis.
>> Next : Part 4: Potential benefits
HACT’s Housing Big Data project has been generously supported by the Nominet Trust, the UK’s only dedicated Tech for Good funder that invests in the use of technology to transform the way we address social challenges.