PHP Classes

HTML SQL: Parse and extract information from HTML using SQL

Ratings: 74%
Unique User Downloads: Total: 5,526
Download Rankings: All time: 443, This week: 488 (down)
Version: htmlsql 1.0.0
License: BSD License
Categories: HTML, Text processing
Description 

Author

This class can be used to parse and extract information from HTML documents using a query language similar to SQL to define the information to be extracted.

The class can open HTML documents stored as local files or as remote pages using the Snoopy class.

The class can execute a query with a syntax similar to SQL SELECT statements to find tags in the opened document whose attributes match the query condition.

The occurrences that it finds are returned as result set rows that may contain a given list of attributes of the matched tags.
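
To make the idea of "result set rows" concrete, the following is a minimal sketch of what a query such as SELECT href,title FROM a WHERE $class == "list" is meant to return. It does not use htmlSQL itself. It relies on PHP's bundled DOM extension instead, and the function name selectAttributes() and its parameters are hypothetical, chosen only for this illustration.

```php
<?php
// Hypothetical helper, not part of htmlSQL: emulates
//   SELECT href,title FROM a WHERE $class == "list"
// using PHP's bundled DOM extension.
function selectAttributes(
    string $html,
    string $tag,
    array $attributes,
    string $whereAttr,
    string $whereValue
): array {
    $doc = new DOMDocument();
    // Collect parser warnings internally; real-world HTML is rarely well formed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    foreach ($doc->getElementsByTagName($tag) as $element) {
        if ($element->getAttribute($whereAttr) !== $whereValue) {
            continue; // the WHERE condition does not match this tag
        }
        $row = [];
        foreach ($attributes as $name) {
            // getAttribute() returns an empty string for missing attributes.
            $row[$name] = $element->getAttribute($name);
        }
        $rows[] = $row;
    }
    return $rows;
}

$html = '<ul>'
      . '<li><a class="list" href="/a" title="First">A</a></li>'
      . '<li><a class="other" href="/b" title="Second">B</a></li>'
      . '<li><a class="list" href="/c" title="Third">C</a></li>'
      . '</ul>';

// One row per matching <a> tag, each holding the requested attributes.
$rows = selectAttributes($html, 'a', ['href', 'title'], 'class', 'list');
print_r($rows);
```

Each row is an associative array keyed by attribute name, which mirrors the "list of attributes of the matched tags" described above. The real class may structure its rows differently.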

Innovation Award
PHP Programming Innovation award nominee
May 2006
Number 2


Prize: One subscription to the PHP Magazine
Certain types of applications need to retrieve HTML pages and extract information from them to be processed for specific purposes.

Often, parsing HTML pages to extract only the relevant information is not an easy task. On the other hand, most Web developers are very familiar with SQL and can use it to define what information they want from their database tables.

This class provides a means to extract data from HTML pages using a query language very similar to SQL. It greatly simplifies the implementation of scripts that need to process data from HTML pages.

Manuel Lemos
Name: J. <contact>
Classes: 1 package by
Country: Germany
Innovation award
Nominee: 1x

 

Details

htmlSQL - Version 0.5 - README
---------------------------------------------------------------------

AUTHOR: Jonas John (http://www.jonasjohn.de/)

DESCRIPTION:
---------------------------------------------------------------------

htmlSQL is an experimental PHP class which allows you to access HTML
values with an SQL-like syntax. This means that you don't have to write
complex functions (regular expressions) to extract specific values.

The htmlSQL queries look like this:

    SELECT href,title FROM a WHERE $class == "list"
           ^ Attributes    ^       ^ search query (can be empty)
             to return     ^ HTML tag to search in
                             "*" is possible = all tags

This query returns an array with all links that contain the attribute
class="list".

All web transfers in htmlSQL use the awesome Snoopy class
(package version 1.2.3 - URL: http://snoopy.sourceforge.net/).
Snoopy is not required for file or string queries.

You will find all Snoopy-related documents (copyright, readme, etc.) in
the snoopy_data/ folder.

HOW TO USE:
---------------------------------------------------------------------

Just include the "snoopy.class.php" and the "htmlsql.class.php" files
in your PHP scripts and look at the examples (examples/) to get an idea
of how to use the htmlSQL class. It should be very simple :-)

BACKGROUND / IDEA:
---------------------------------------------------------------------

I had this idea while extracting some data from a website. As I realized
that the algorithms and functions to extract links and other tags are
often the same, I had the idea to combine all these functions into one
universally usable class. While drinking a coffee and thinking about that
problem, I thought it would be cool to access HTML elements by using SQL.
So I started creating this class...

WARNING:
---------------------------------------------------------------------

The eval() function is used for the WHERE statement. Make sure that all
user data is checked and filtered against malicious PHP code.
Never trust user input!

TODO:
---------------------------------------------------------------------

- enhance the HTML parser
- test htmlSQL with invalid and bad HTML files
- replace the ugly eval() method for the WHERE statement with a custom
  method
- more error checks
- include the LIMIT function/method like in SQL

LICENSE:
---------------------------------------------------------------------

htmlSQL uses a modified BSD license. You will find the full license text
in "htmlsql.class.php".
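
The WARNING and TODO sections of the README both point at the eval() call behind the WHERE statement. As an illustration only, here is a minimal sketch of how a very small subset of conditions could be checked without eval(). It is not htmlSQL code: the function name matchesCondition() is hypothetical, and it supports only the forms $attribute == "value" and $attribute != "value", whereas the real class accepts whatever PHP expression eval() can run.

```php
<?php
// Hypothetical sketch, not htmlSQL code: evaluates conditions of the form
//   $attribute == "value"   or   $attribute != "value"
// against a tag's attributes without calling eval().
function matchesCondition(array $attributes, string $condition): bool
{
    $pattern = '/^\s*\$([A-Za-z_][A-Za-z0-9_-]*)\s*(==|!=)\s*"([^"]*)"\s*$/';
    if (!preg_match($pattern, $condition, $m)) {
        // Anything outside this tiny grammar is rejected, never executed.
        throw new InvalidArgumentException('Unsupported WHERE condition');
    }
    [, $name, $operator, $expected] = $m;

    // A missing attribute is treated as an empty string (an assumption
    // made for this sketch).
    $actual = $attributes[$name] ?? '';

    return $operator === '=='
        ? $actual === $expected
        : $actual !== $expected;
}

// Attributes of one matched tag, e.g. <a class="list" href="/a">.
$tag = ['class' => 'list', 'href' => '/a'];

var_dump(matchesCondition($tag, '$class == "list"')); // bool(true)
var_dump(matchesCondition($tag, '$class != "list"')); // bool(false)
```

Because the condition is matched against a fixed pattern and compared as plain strings, input such as a function call cannot reach the PHP interpreter. The trade-off is expressiveness: combining conditions or using other operators would need a larger grammar.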

Screenshots (1)  
  • htmlsql_syntax_example.png
Files (20)

File                 Role    Description
examples/                    (15 files)
htmlsql.class.php    Class   Contains the main htmlSQL class
snoopy.class.php     Class   The famous Snoopy class by Monte Ohrt - v1.01
readme.txt           Doc.    English readme with description and todo list
readme_german.txt    Doc.    The same as readme.txt, in German

The PHP Classes site has supported package installation using the Composer tool since 2013, as you may verify by reading this instructions page.
Version Control: 0%
Unique User Downloads: Total: 5,526, This week: 0
Download Rankings: All time: 443, This week: 488 (down)
User Ratings (all time)

Utility: 95%
Consistency: 91%
Documentation: 86%
Examples: 87%
Tests: -
Videos: -
Overall: 74%
Rank: 112

User Comments (4)

Really useful!!!
11 years ago (Massimiliano Chichi)
Rating: 77%

Excellent idea, very neat coded and great examples.
15 years ago (Matt)
Rating: 80%

Really helpful and efficient
15 years ago (LiliwoL)
Rating: 75%

This is a brilliant class.
17 years ago (Wayne Zeller)
Rating: 77%