Published by the Australian Academy of Science


Good prospects ahead for data mining

Box 2 | Challenges in data mining


It’s not hard to generate a large data set. Consider a commercial enterprise that wants to know more about the people to whom it sells products. It believes there are at least five factors (also called dimensions) that influence customer behaviour. These might be age, occupation, marital status, number of children and social status. The company sets out to develop a formula that relates these factors to each other, hoping for a simple procedure for assessing potential clients.

If, on average, each of the five factors has ten possible values (age, for example, might be categorised into brackets such as 21-25, 26-30, and so on), then the total number of possible combinations equals 10^5 (which is 100,000). This is the number of possible values or outcomes for the formula. But there are potentially thousands of factors that might influence customer behaviour. Let’s say there are a thousand, each with ten possible values. The total number of possible values is therefore 10^1000 (1 followed by a thousand zeroes – a huge number). This is often called the ‘curse of dimensionality’; as the dimensions of databases grow, so too will the curse.
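The arithmetic behind these figures can be checked with a short Python sketch. It assumes, as the text does, that every factor has the same number of possible values; the function name `count_combinations` is purely illustrative.

```python
def count_combinations(num_factors: int, values_per_factor: int) -> int:
    """Count the distinct combinations when each factor has the same number of values.

    Assumption: every factor is independent and has exactly
    `values_per_factor` possible values, as in the example above.
    """
    return values_per_factor ** num_factors


# Five factors with ten values each.
five_factors = count_combinations(5, 10)
print(five_factors)

# A thousand factors with ten values each. Python integers have
# arbitrary precision, so the exact value can be computed; we only
# report how many zeroes follow the leading 1.
thousand_factors = count_combinations(1000, 10)
print(len(str(thousand_factors)) - 1)
```

Adding a single extra factor multiplies the count by ten, which is why the total grows so quickly as dimensions are added.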

Another challenge is to achieve the ideal of ‘scalability’, which holds that if a database doubles in size then it should only take twice as long to mine it using the same-sized computer. The problem is, scientists are finding that this linear effect doesn’t always apply and that the time needed to run an algorithm can actually increase exponentially as the database grows.
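The difference between linear and exponential growth can be illustrated with two hypothetical cost models. This is only a sketch: the step counts below are assumptions chosen for illustration (one step per record for the linear case, one step per possible subset of records for the exponential case) and do not describe any particular data mining algorithm.

```python
def linear_steps(n_records: int) -> int:
    """Hypothetical scalable algorithm: one step per record."""
    return n_records


def exponential_steps(n_records: int) -> int:
    """Hypothetical non-scalable algorithm: one step per subset of records."""
    return 2 ** n_records


def doubling_ratio(cost, n_records: int) -> float:
    """How many times more steps are needed when the database doubles in size."""
    return cost(2 * n_records) / cost(n_records)


# Doubling a 10-record database under each cost model.
print(doubling_ratio(linear_steps, 10))
print(doubling_ratio(exponential_steps, 10))
```

Under the linear model the work doubles when the database doubles, which is the scalability ideal. Under the exponential model the same doubling multiplies the work by 2^10, and the penalty gets worse as the database grows.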

Ethical issues

Data miners are faced with a host of other technical challenges, but there are also significant ethical questions. For example, data mining might identify groups that are less profitable to companies or more prone to anti-social behaviour. This could lead to discrimination against certain customers.

Other ethical questions associated with data mining will undoubtedly arise as large companies and government institutions employ the techniques more widely. As the science advances, it will be important that our understanding of the social effects increases at the same rate.

Other boxes

Box 1. Data mining the stars – from Canberra to the cosmos


Posted September 1999.



The Australian Foundation for Science is a supporter of Nova.

This topic is sponsored by Australian university mathematical sciences departments and the Australian Government's National Innovation Awareness Strategy.


© Australian Academy of Science