An automated approach for retrieving hierarchical data from HTML tables
- 1 November 1999
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 466-474
- https://doi.org/10.1145/319950.320052
Abstract
Among the HTML elements, HTML tables [RHJ98] encapsulate hierarchically structured data (hierarchical data in short) in a tabular structure. HTML tables do not come with a rigid schema, and almost any form of two-dimensional table is acceptable according to the HTML grammar. This relaxation complicates the process of retrieving hierarchical data from HTML tables. In this paper, we propose an automated approach for retrieving hierarchical data from HTML tables. The proposed approach constructs the content tree of an HTML table, which captures the intended hierarchy of the data content of the table, without requiring the internal structure of the table to be known beforehand. Also, the user of the content tree does not deal with HTML tags while retrieving the desired data from the content tree. Our approach can be employed by (i) a query language written for retrieving hierarchically structured data, extracted from either the contents of HTML tables or other sources, (ii) a processor for converting HTML tables to XML documents, and (iii) a data warehousing repository for collecting hierarchical data from HTML tables and storing materialized views of the tables. The time complexity of the proposed retrieval approach is proportional to the number of HTML elements in an HTML table.
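The abstract does not spell out the construction algorithm, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a simple table whose header rows consist solely of `th` cells, whose first column holds row labels, and which uses `colspan` but not `rowspan`. The names `TableGrid` and `content_tree` are hypothetical. Each cell is visited a constant number of times per header row, so for a fixed header depth the work grows linearly with the number of cells.

```python
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Collect table cells into rows of (text, colspan, is_header) tuples."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # [text_parts, colspan, is_header] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears without a <tr>
                self.rows.append([])
            span = int(dict(attrs).get("colspan") or 1)
            self._cell = [[], span, tag == "th"]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, span, is_header = self._cell
            self.rows[-1].append(("".join(parts).strip(), span, is_header))
            self._cell = None


def content_tree(html):
    """Build a nested dict: row label -> header labels (top down) -> cell value."""
    parser = TableGrid()
    parser.feed(html)
    parser.close()

    # Repeat each cell across the columns it spans so rows align by column index.
    grid = [
        [(text, is_header) for text, span, is_header in row for _ in range(span)]
        for row in parser.rows
    ]
    header_rows = [r for r in grid if r and all(h for _, h in r)]
    data_rows = [r for r in grid if r and not all(h for _, h in r)]

    tree = {}
    for row in data_rows:
        row_label = row[0][0]  # assumption: first column labels the row
        for col in range(1, len(row)):
            path = [row_label]
            for header_row in header_rows:
                if col < len(header_row):
                    label = header_row[col][0]
                    if label and label != path[-1]:
                        path.append(label)
            node = tree
            for label in path[:-1]:
                node = node.setdefault(label, {})
            node[path[-1]] = row[col][0]
    return tree


# Made-up sample table, used only to exercise the sketch.
SAMPLE = """
<table>
  <tr><th></th><th colspan="2">1998</th></tr>
  <tr><th></th><th>Q1</th><th>Q2</th></tr>
  <tr><td>Sales</td><td>10</td><td>12</td></tr>
  <tr><td>Costs</td><td>7</td><td>8</td></tr>
</table>
"""

tree = content_tree(SAMPLE)
print(tree["Sales"]["1998"]["Q2"])  # prints 12
```

Once the tree is built, a value is retrieved by following labels alone, which mirrors the abstract's point that the consumer of the content tree never handles HTML tags.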
This publication has 3 references indexed in Scilit:
- Formal models of Web queries. Published by Association for Computing Machinery (ACM), 1997
- The Lorel query language for semistructured data. International Journal on Digital Libraries, 1997
- Querying documents in object databases. International Journal on Digital Libraries, 1997