Good prospects ahead for data mining
This topic is sponsored by Australian university mathematical sciences departments and the Australian Government's National Innovation Awareness Strategy.
Using simple statistics and some sophisticated computational techniques, data miners are quarrying our vast reserves of raw data for little gems of knowledge.
Don’t look now, but you’re being monitored. Each time you withdraw money from a bank, make a phone call, log onto the Internet, rent a video, or even claim flybuy points at a supermarket, the transaction is more often than not recorded and stored by computers. As a result, masses and masses of data (megabytes, gigabytes, terabytes) are piling up in the electronic vaults of companies, governments and research institutions.
What use are all these data? Up until the early 1990s, the answer to this was ‘not much’. But statisticians and data miners now have faster analysis tools that can help sift and analyse the stockpiles of data, turning up valuable and often surprising information.
What is data mining?
Data mining can be defined as the exploration and analysis of large data sets, in order to discover meaningful patterns and rules. Automation is essential. Staring at a huge spreadsheet is not a good way to analyse any data. The trick is to find effective ways to combine the computer's power to process data with the human eye's ability to detect patterns. The techniques of data mining are designed for, and work best with, large data sets.
How data mining works
Data mining is a component of a wider process called ‘knowledge discovery from databases’. It involves scientists from a wide range of disciplines, including mathematicians, computer scientists and statisticians, as well as those working in fields such as machine learning, artificial intelligence, information retrieval and pattern recognition.
Before a data set can be mined, it first has to be ‘cleaned’. This removes errors, ensures consistency and takes missing values into account. The clean data are then ‘mined’ for unusual patterns by computer algorithms, and the patterns interpreted (usually by humans) to produce new knowledge.
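As a rough illustration of the cleaning step, here is a sketch in Python. The field names and the strategy of simply dropping records with missing values are illustrative assumptions, not a prescription from the article; real projects choose among several ways of handling missing data.

```python
def clean(records, required=("age", "occupation")):
    """Drop records missing a required field and normalise text values."""
    cleaned = []
    for r in records:
        if any(r.get(field) in (None, "") for field in required):
            continue  # missing value: discard the record (one simple strategy)
        # Ensure consistency: strip stray spaces, use one case throughout
        r = dict(r, occupation=r["occupation"].strip().lower())
        cleaned.append(r)
    return cleaned

raw = [
    {"age": 34, "occupation": " Teacher "},
    {"age": None, "occupation": "clerk"},   # missing age: will be dropped
]
print(clean(raw))  # [{'age': 34, 'occupation': 'teacher'}]
```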
Data mining may use quite simple statistical techniques or highly sophisticated data analysis. What is new for data miners is the application of these techniques to vast quantities of data. Because the data set is so large, data miners can be extravagant with it. For example, data mining techniques may start by sampling or selecting just some of the data, called the 'training' data (perhaps 20 per cent or less of the total). An algorithm is then applied; its task is to explore the training data, seeking patterns in it. Patterns are then tested and refined on data which have been kept aside for this purpose, called the 'test' data. In addition to the training and test sets, it is wise to use a 'validation' set to estimate generalisation error, that is, to see how well the model performs under conditions of actual use.
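The three-way split described above can be sketched in a few lines of Python. The 20 per cent training fraction follows the article's example; the even split of the remainder between test and validation data is an assumption made here for illustration.

```python
import random

def three_way_split(records, train_frac=0.2, test_frac=0.4, seed=42):
    """Shuffle records, then carve off training, test and validation sets.

    Training data is explored for patterns, test data refines them, and
    the validation data estimates how well the result will generalise.
    """
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation

# Example: 1000 hypothetical customer record IDs
train, test, validation = three_way_split(list(range(1000)))
print(len(train), len(test), len(validation))  # 200 400 400
```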
Consider a bank that wants to learn more about the people to whom it lends money. If data mining can reveal information about the kinds of people that are most likely to want a loan of a particular type, the bank could target its marketing accordingly.
So, a computer armed with algorithms is given the task of mining the bank’s databank for useful knowledge. The databank contains the records of the bank’s customers over a number of years. It includes a large amount of information on each client such as age, sex, marital status, occupation, number of children, and so on.
Using test data, an algorithm identifies characteristics that distinguish customers who took out a particular kind of loan from those who didn’t. Eventually, it develops ‘rules’ by which it can identify customers who are likely to be ‘good prospects’ for such a loan. These rules are then used to identify such customers on the remainder of the database. Finally, another algorithm is used to sort the database into clusters or groups of people with many similar attributes, in the hope that these might reveal interesting and unexpected patterns. The patterns revealed by these clusters are then interpreted by the data miners, in collaboration with bank personnel.
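A toy version of the rule-finding step might look like the following. The customer fields, the `took_loan` flag and the thresholds are all invented for illustration; a real algorithm would be considerably more sophisticated.

```python
def learn_rules(training_data, min_rate=0.6, min_support=3):
    """Find attribute values whose customers took the loan unusually often.

    training_data: list of dicts, e.g.
        {"occupation": "teacher", "children": 2, "took_loan": True}
    Returns rules as (attribute, value) pairs: customers matching one of
    these are flagged as 'good prospects' for the loan.
    """
    counts = {}  # (attribute, value) -> [loans taken, customers seen]
    for record in training_data:
        for attr, value in record.items():
            if attr == "took_loan":
                continue
            entry = counts.setdefault((attr, value), [0, 0])
            entry[0] += record["took_loan"]
            entry[1] += 1
    # Keep values seen often enough, with a high enough loan take-up rate
    return [key for key, (taken, seen) in counts.items()
            if seen >= min_support and taken / seen >= min_rate]
```

Rules learned this way from the training data would then be checked against the test data before being applied to the rest of the databank.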
Data structure and data size
Size on its own is not enough to allow the effective use of data mining techniques. Suppose that a bank has huge amounts of data on just half a dozen business customers. The sheer volume may make the data hard to process and summarise, yet for finding results that apply to other business customers it is a very small sample: just six customers. Before one can even think about dividing data between training sets and test sets, data from a large number of business customers would be needed.
Approaches to data mining
Data mining can perform a number of tasks, some of which are described below.
- Classification finds a rule or a formula for organising data into classes. For example, a bank may wish to classify clients requesting loans into categories based on the likelihood of repayment. A rule or formula for making the classification is developed from the data in the training set. (Examples of classification methods are linear discriminant analysis, decision trees and neural networks.) The reliability of the rule or formula is then evaluated using the test set of data. This gives an indication of how well the procedure will work on the remaining bulk of the data.
- Like classification, clustering breaks a large database into different subgroups or clusters. It differs from classification because there are no predefined classes: the clusters are put together on the basis of similarity to each other, and it is up to the data miners to determine whether the clusters offer any useful insight.
- As the name suggests, market basket analysis can be used to determine which things go together. It’s a form of clustering; for example, a market basket analysis of supermarket sales records might reveal that shopping trolleys containing cheese are also likely to contain pickled onions. The retailer could use this information in arranging its shelves or for targeting an advertising campaign.
- Regression uses values of one or more explanatory variables to explain or predict an outcome variable. For example, insurance risk analysts use regression when they have to estimate the average value of a claim (an outcome variable) as a function of variables such as the age and gender of policy-holders (explanatory variables). These explanatory variables are often called rating factors, because insurance companies use them when setting the rates of premiums.
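The market basket idea from the list above amounts to counting which items appear together. A minimal sketch (the transactions are invented, and real analyses use more refined measures than raw pair counts):

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(baskets, min_count=2):
    """Count how often each pair of items lands in the same basket."""
    pair_counts = Counter()
    for basket in baskets:
        # sorted() so ('cheese', 'milk') and ('milk', 'cheese') count as one
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

baskets = [
    ["cheese", "pickled onions", "bread"],
    ["cheese", "pickled onions"],
    ["bread", "milk"],
]
print(frequent_pairs(baskets))  # {('cheese', 'pickled onions'): 2}
```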
In all these cases, the fundamental aim is to find something unusual, something that we might not expect just by using common sense.
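As a worked example of the regression task, here is an ordinary least-squares fit of average claim value against policy-holder age. The numbers are invented and chosen to lie on an exact line, purely to make the fitted slope and intercept easy to check by eye.

```python
def least_squares(xs, ys):
    """Fit y = a + b*x by ordinary least squares; return (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Hypothetical rating data: policy-holder age vs average claim value ($)
ages   = [20, 30, 40, 50, 60]
claims = [900, 800, 700, 600, 500]
a, b = least_squares(ages, claims)
print(a, b)  # 1100.0 -10.0, i.e. claim = 1100 - 10 * age
```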
The future for data mining
As the examples above illustrate, data mining has considerable commercial application, but it can also be applied in many other fields. It has been used by law enforcement agencies to identify criminals by looking for patterns and relationships in the texts of statements taken from dozens or hundreds of suspects. Tax collectors, including those at the Australian Tax Office, are using data mining to spot fraudulent behaviour. Data mining is also well suited to the analysis of scientific data, such as those amassed by astronomers (Box 1: Data mining the stars from Canberra to the cosmos).
The electronic monitoring of our lives will undoubtedly increase, and the mountains of data will grow. Many scientific and ethical issues concerning data mining require resolution, so quite a bit of spadework is still required (Box 2: Challenges in data mining). Nevertheless, expect the practice to unearth some interesting information in coming years.
Posted September 1999.