Good prospects ahead for data mining

Box 2 | Challenges in data mining

It’s not hard to generate a large data set. Consider a commercial enterprise that wants to know more about the people to whom it sells products. It believes there are at least five factors (also called dimensions) that influence customer behaviour. These might be age, occupation, marital status, number of children and social status. The company sets out to develop a formula that relates these factors to each other, hoping for a simple procedure for assessing potential clients.

If, on average, each of the five factors has ten possible values (age, for example, might be categorised into brackets such as 21-25, 26-30, and so on), then the total number of possible combinations equals 10⁵ (which is 100,000). This is the number of possible values or outcomes for the formula. But there are potentially thousands of factors that might influence customer behaviour. Let’s say there are a thousand, each with ten possible values. The total number of possible values is therefore 10¹⁰⁰⁰ (1 followed by a thousand zeroes, a huge number). This is often called the ‘curse of dimensionality’; as the dimensions of databases grow, so too will the curse.
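The arithmetic above can be checked with a short Python sketch. The function name `count_combinations` is made up for this illustration, and the code assumes, as the text does, that every factor has the same number of possible values.

```python
def count_combinations(num_factors: int, values_per_factor: int = 10) -> int:
    """Number of distinct combinations when every factor has the
    same number of possible values (a simplifying assumption)."""
    return values_per_factor ** num_factors


# Five factors with ten values each.
five_factors = count_combinations(5)

# A thousand factors with ten values each.
thousand_factors = count_combinations(1000)

print(five_factors)
# Count the zeroes that follow the leading 1 instead of printing the number.
print(len(str(thousand_factors)) - 1)
```

Adding a single extra factor multiplies the total by ten, which is why the count becomes unmanageable long before the number of factors reaches the thousands.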

Another challenge is to achieve the ideal of ‘scalability’, which holds that if a database doubles in size then it should take only twice as long to mine it using the same-sized computer. The problem is that this linear relationship doesn’t always hold: scientists are finding that the time needed to run an algorithm can actually increase exponentially as the database grows.
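The gap between linear and exponential growth can be illustrated with a rough Python sketch. Both cost models are assumptions chosen for illustration rather than measurements of any real mining algorithm: `linear_scan_steps` stands for a single pass over the records, and `exhaustive_subset_steps` stands for a naive search that examines every non-empty subset of a set of attributes.

```python
def linear_scan_steps(num_records: int) -> int:
    """Hypothetical cost of one pass over the data: one step per record."""
    return num_records


def exhaustive_subset_steps(num_attributes: int) -> int:
    """Hypothetical cost of checking every non-empty subset of attributes.
    A set of d attributes has 2**d - 1 non-empty subsets."""
    return 2 ** num_attributes - 1


# Double the input size twice and compare how the step counts grow.
for size in (10, 20, 40):
    print(size, linear_scan_steps(size), exhaustive_subset_steps(size))
```

Under these assumptions, doubling the input from 10 to 20 doubles the work of the linear scan, while the exhaustive search grows from about a thousand steps to about a million.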

Ethical issues

Data miners are faced with a host of other technical challenges, but there are also significant ethical questions. For example, data mining might identify groups that are less profitable to companies or more prone to anti-social behaviour. This could lead to discrimination against certain customers.

Other ethical questions associated with data mining will undoubtedly arise as large companies and government institutions employ the techniques more widely. As the science advances, it will be important that our understanding of the social effects increases at the same rate.

External sites are not endorsed by the Australian Academy of Science.
Posted September 1999.