AAO Colloquium.

Thursday 22 November 2001, 3:30pm, AAO Conference Room

Data Mining and the Virtual Observatory

David Schade

Canadian Astronomy Data Centre

A successful Global Virtual Observatory (GVO) initiative must balance the need for high-level systems and protocols that enable access to large, distributed datasets against the need to ensure that sufficient content exists to justify the GVO's expense and effort scientifically. If a powerful GVO infrastructure were miraculously created today, it would find itself desperately short of the high-quality data content that would make the scientific payoff of the GVO a realizable goal. The production of usable content is therefore an enormous challenge for the success of the GVO.

Data mining is the extraction of knowledge from very large collections of data, and our interest in this field is strengthened by the anticipation of huge, eminently minable datasets from a new generation of instruments, including wide-field, multi-object spectrographs such as the 2dF and wide-field imagers like CFH12k and MegaCam at CFHT. The surveys executed with these instruments present an opportunity to develop exactly the type of high-quality content that is needed before advanced data mining and GVO systems can fulfill their promise. We have developed a coherent end-to-end design for a processing and data warehousing system. We have created a scalable distributed processing system to meet the increased demands on our pipeline capabilities, and we have explored solutions to the challenge of querying catalog databases effectively when they contain many objects, each with many potentially queryable parameters. The ultimate product of the GVO should be a fully cross-identified "Catalogue of the Universe" which completely integrates our astrophysical understanding with the catalogue of multi-wavelength observations. Such a catalogue would encapsulate our understanding of the population of objects in the universe at a given point in time.