Good prospects ahead for data mining
Using simple statistics and some sophisticated computational techniques, data miners are quarrying our vast reserves of raw data for little gems of knowledge.
Key text
Don’t look now but you’re being monitored. Each time you withdraw money from a bank, make a phone call, log onto the Internet, rent a video, or even claim flybuy points at a supermarket, the transactions are more often than not recorded and stored by computers. As a result, masses and masses of data (megabytes, gigabytes, terabytes) are piling up in the electronic vaults of companies, governments and research institutions.

What use are all these data? Up until the early 1990s, the answer to this was ‘not much’. But statisticians and data miners now have faster analysis tools that can help sift and analyse the stockpiles of data, turning up valuable and often surprising information.

What is data mining?
Data mining can be defined as the exploration and analysis of large data sets, in order to discover meaningful patterns and rules. Automation is essential. Staring at a huge spreadsheet is not a good way to analyse any data. The trick is to find effective ways to combine the computer's power to process data with the human eye's ability to detect patterns. The techniques of data mining are designed for, and work best with, large data sets.

How data mining works
Data mining is a component of a wider process called ‘knowledge discovery from databases’. It involves scientists from a wide range of disciplines, including mathematicians, computer scientists and statisticians, as well as those working in fields such as machine learning, artificial intelligence, information retrieval and pattern recognition.

Before a data set can be mined, it first has to be ‘cleaned’. This removes errors, ensures consistency and takes missing values into account. The clean data are then ‘mined’ for unusual patterns by computer algorithms, and the patterns interpreted (usually by humans) to produce new knowledge. Data mining may use quite simple statistical techniques or it may use highly sophisticated data analysis.
What is new for data miners is the employment of these techniques on vast quantities of data. Because the data set is so large, data miners can be extravagant with it. For example, data mining techniques may start by sampling or selecting just some of the data, called the 'training' data, perhaps 20 per cent or less of the total. An algorithm is then applied. Its task is to explore the training data, seeking patterns in it. Patterns are then tested and refined on data which have been kept aside for this purpose, called the 'test' data. In addition to the training and test sets, it is wise to use a 'validation' set to estimate generalisation error, in order to see how well the model performs under conditions of actual use.

Consider a bank that wants to learn more about the people to whom it lends money. If data mining can reveal information about the kinds of people that are most likely to want a loan of a particular type, the bank could target its marketing accordingly. So, a computer armed with algorithms is given the task of mining the bank’s databank for useful knowledge. The databank contains the records of the bank’s customers over a number of years. It includes a large amount of information on each client such as age, sex, marital status, occupation, number of children, and so on.

Using training data, an algorithm identifies characteristics that distinguish customers who took out a particular kind of loan from those who didn’t. Eventually, it develops ‘rules’ by which it can identify customers who are likely to be ‘good prospects’ for such a loan. These rules are then used to identify such customers on the remainder of the database. Finally, another algorithm is used to sort the database into clusters or groups of people with many similar attributes, in the hope that these might reveal interesting and unexpected patterns. The patterns revealed by these clusters are then interpreted by the data miners, in collaboration with bank personnel.
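The sampling step described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `split_records`, the 20 per cent test share and the use of plain customer ID numbers are all hypothetical choices made for the example; the text specifies only that the training data might be 20 per cent or less of the total.

```python
import random


def split_records(records, train_pct=20, test_pct=20, seed=0):
    """Shuffle the records, then split them into training, test and validation sets.

    The percentages are illustrative assumptions, not values from the article.
    Whatever is left after the training and test sets becomes the validation set.
    """
    if train_pct <= 0 or test_pct <= 0 or train_pct + test_pct >= 100:
        raise ValueError("percentages must be positive and leave room for validation")
    shuffled = list(records)
    # A fixed seed makes the random split repeatable.
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_test = len(shuffled) * test_pct // 100
    training = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return training, test, validation


# Hypothetical data: 1000 customer ID numbers standing in for full records.
customers = list(range(1000))
training, test, validation = split_records(customers)
print(len(training), len(test), len(validation))  # 200 200 600
```

Shuffling before splitting matters because records are often stored in a meaningful order (by date or branch, say), and an unshuffled slice could give the algorithm an unrepresentative sample.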
Data structure and data size
Size on its own is not enough to allow the effective use of data mining techniques. Suppose that a bank has huge amounts of data on just half a dozen business customers. For finding results that apply to other business customers this is a very small sample, consisting of just six customers! There may be problems in processing and summarising the data, because of its size. Before one can think about dividing data between training sets and test sets, data from a large number of business customers would be needed.

Approaches to data mining
Data mining can perform a number of tasks, some of which are described below.
In all these cases, the fundamental aim is to find something unusual, something that we might not expect just by using common sense.

The future for data mining
As the examples above illustrate, data mining has considerable commercial application, but it can also be applied in many other fields. It has been used by law enforcement agencies to identify criminals by looking for patterns and relationships in the texts of statements taken from dozens or hundreds of suspects. Tax collectors, including those at the Australian Tax Office, are using data mining to spot fraudulent behaviour. Data mining is also well suited to the analysis of scientific data, such as those amassed by astronomers (Box 1: Data mining the stars from Canberra to the cosmos).

The electronic monitoring of our lives will undoubtedly increase, and the mountains of data will grow. Many scientific and ethical issues concerning data mining remain unresolved, so quite a bit of spadework is still needed (Box 2: Challenges in data mining). Nevertheless, expect the practice to unearth some interesting information in coming years.
Box 1: Data mining the stars from Canberra to the cosmos
Already, astronomers know of many different kinds of star, from quasars to black holes, from red giants to white dwarfs. But they are always on the lookout for new star types, since these might aid our understanding of the processes that shape our universe. So astronomers have enlisted the assistance of data miners at the Cooperative Research Centre for Advanced Computational Systems based in Canberra. Their task is to sort through the data and identify clusters of stars in the hope that some of these may represent star types that were previously unknown.

This is just one project being conducted within the data-mining program at the CRC. Others are focused on solving business problems in sectors such as retail, finance, health care, government, telecommunications, and manufacturing. These include the detection of fraud against Medicare and the Australian Tax Office, and the modelling of motor vehicle claim frequency and cost on behalf of an insurance company.

Related sites
Other examples of scientific data mining
Box 2: Challenges in data mining
If, on average, each of the five factors has ten possible values (age, for example, might be categorised into brackets such as 21-25, 26-30, and so on), then the total number of possible combinations equals $10^5$ (which is 100,000). This is the number of possible values or outcomes for the formula. But there are potentially thousands of factors that might influence customer behaviour. Let’s say there are a thousand, each with ten possible values. The total number of possible values is therefore $10^{1000}$ (1 followed by a thousand zeroes, a huge number). This is often called the ‘curse of dimensionality’; as the dimensions of databases grow, so too will the curse.

Another challenge is to achieve the ideal of ‘scalability’, which holds that if a database doubles in size then it should only take twice as long to mine it using the same-sized computer. The problem is, scientists are finding that this linear effect doesn’t always apply and that the time needed to run an algorithm can actually increase exponentially as the database grows.

Ethical issues
Data miners are faced with a host of other technical challenges, but there are also significant ethical questions. For example, data mining might identify groups that are less profitable to companies or more prone to anti-social behaviour. This could lead to discrimination against certain customers. Other ethical questions associated with data mining will undoubtedly arise as large companies and government institutions employ the techniques more widely. As the science advances, it will be important that our understanding of the social effects increases at the same rate.
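Returning to the arithmetic behind the ‘curse of dimensionality’ in this box: the counts quoted there can be checked with a few lines of Python. This is a minimal sketch; the function name `count_combinations` is a hypothetical helper, and it assumes, as the text does, that every factor has the same number of possible values.

```python
def count_combinations(num_factors, values_per_factor=10):
    """Number of distinct combinations when every factor has the same number of values."""
    return values_per_factor ** num_factors


five_factors = count_combinations(5)
thousand_factors = count_combinations(1000)

print(five_factors)  # 100000
# Python integers have arbitrary precision, so the huge value can be held exactly.
# Counting its digits confirms it is 1 followed by a thousand zeroes.
print(len(str(thousand_factors)) - 1)  # 1000
```

Each added factor multiplies the number of combinations by ten, which is why no realistic database could contain even one example of every combination once the factors number in the hundreds.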
New Scientist
9 June 2006 Pentagon sets its sights on social networking websites (by Paul Marks) Reports that the US National Security Agency is gathering personal data to map social networks for counter-terrorism purposes.
12 May 2006 Covert surveillance of US phone records revealed (by Tom Simonite) Reports on sophisticated data mining of phone records.
6 January 2006 MP3 players to select tunes to your taste (by Kurt Kleiner) Describes a new technology to let your computer recommend new music you might like, based on an analysis of the tunes you enjoy.
15 October 2005, pages 44-47 Pets to the rescue (by Janet Ginsburg) Discusses the use of a database of veterinary pet treatments to forewarn of possible human diseases.
29 September 2005 NASA forges alliance with Google (by Maggie McKee) Reports that NASA and Google will cooperate on a number of projects, including one on ways to manage large amounts of data.
5 December 2004 Cyber detective links up crimes (by Duncan Graham-Rowe) Describes a system to spot patterns in criminal activities to solve more crimes.
Science
1 July 2005, page 94 How will big pictures emerge from a sea of biological data? (by Elizabeth Pennisi) Describes an automated search tool that could discover previously unidentified patterns in the growing quantity of biological data.
Scientific American
May 2005, pages 70-73 Molecular treasure hunt (by Gary Stix) Describes software used to find previously undiscovered gene or protein pathways by combing through hundreds of thousands of journal articles.
24 January 2005 Seeking better web searches (by Javed Mostafa) Summarises recent developments in web search engines.
An introductory article about data mining.
Data mining: Extending the information warehouse framework (IBM, USA)
Describes data mining and its potential benefits to users. Includes examples.
Data mining in a scientific environment (Charles Sturt University, Australia)
Provides examples of data mining methods used by scientists.
Data mining: What is data mining? (University of California, Anderson School of Management, USA)
Provides an introduction to the mining of data, information and knowledge.
An overview of data mining at Dun and Bradstreet (Thearling.com)
Describes the scope of data mining, data mining techniques and how data mining can be used to extract information from a large database. Written from a business perspective.
computer memory. Computer memory is measured in bytes.
decision tree. A hierarchy of rules within a computer program, represented by a tree-like structure, that enables a set of data to be classified. A series of selection criteria classify the data into smaller and smaller categories.
linear discriminant analysis. A method of classification that uses a weighted sum. For each object that is to be classified, linear discriminant analysis takes a weighted sum of values of the variables that determine the classification. The value of the weighted sum is then used to determine the classification. For example, a bank may wish to classify loan customers into those at risk of defaulting and those not at risk, based on salary and financial commitments. In the plot of financial commitments against salary, a linear discriminant function appears as a line. The high-risk customers will have a low salary and high financial commitments and lie above the line, while those with a high salary and low financial commitments will have low risk and lie below the line.
neural network. A statistical analysis procedure based on models of nervous system learning in animals. Neural networks have the ability to ‘learn’ from a collection of examples to discover patterns and trends. These data-mining techniques can be used in forecasting or predicting. For more information see An introduction to neural networks (University of Stirling, UK).
regression. A regression relationship allows the approximate prediction of one variable from the value of one or more other variables. For example, we might be interested in the prediction of the weight of Australian women given their height. Such a relationship is commonly expressed in the form of a mathematical equation, often a straight line equation.
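The weighted sum in the linear discriminant analysis entry above can be made concrete with a small Python sketch of the bank example. The weights, the dollar figures and the function names `risk_score` and `classify` are all hypothetical: a real analysis would estimate the weights from past customer records, whereas here they are simply chosen by hand so that the dividing line is easy to see.

```python
def risk_score(salary, commitments, w_salary=-1.0, w_commitments=3.0, bias=0.0):
    """Weighted sum of the two variables.

    The weights are made up for illustration, not fitted to data. With these
    values the score is zero along the line commitments = salary / 3.
    """
    return w_salary * salary + w_commitments * commitments + bias


def classify(salary, commitments):
    """Points above the line (positive score) are high risk; the rest are low risk."""
    return "high risk" if risk_score(salary, commitments) > 0 else "low risk"


# Low salary, high commitments: above the line.
print(classify(salary=30_000, commitments=20_000))  # high risk
# High salary, low commitments: below the line.
print(classify(salary=90_000, commitments=10_000))  # low risk
```

Because the score is a straight-line function of the two variables, the boundary between the two classes is the line the glossary entry describes.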
External sites are not endorsed by the Australian Academy of Science. Posted September 1999. The Australian Foundation for Science is also a supporter of Nova.
This topic is sponsored by Australian university mathematical sciences departments and the Australian Government's National Innovation Awareness Strategy.