Year After Year

Categories: others

3. Web Crawler and Data Extraction

For a general-purpose web crawler, data extraction is rather complicated. The crawler needs to identify the data region first and then extract the data in that region using some algorithms plus some sort of NLP techniques. Detecting the data region is a rather difficult task: a seemingly simple problem like automatically finding the title and body of a news article has led to thousands of top conference and journal papers. I will skip this part and focus on a much easier scenario where manual labelling is involved. The only problem left is then how to extract the data, known as wrapper induction (WI). A wrapper is a program that extracts data from an information source (like a Web page) so that an information integration system can access the data in a unified way. Apart from manually writing wrappers, there are two main approaches to wrapper generation – the supervised approach and the unsupervised approach. The supervised approach uses supervised learning to learn data extraction rules (like a regular expression) from a set of manually labelled examples. The unsupervised approach, e.g. RoadRunner [2], EXALG [1] and DEPTA [4], does not need any labelled examples. To make the extraction as accurate as possible, the approach discussed below combines manual labelling with the partial tree alignment algorithm of DEPTA.

A very common pattern in Web design is to make a list page which contains general information about records. The list page can be either a single page or multiple pages chained by a “next page” link. Additionally, each record has a link to its corresponding record page with detailed information. Both the list pages and the record pages are generated from predefined templates. Fig. 1 shows an example of a list page: it contains multiple records, each with a link to its record page. The records share the same underlying template, and so do the record pages.

Figure 1: A list page containing records and links to record pages

WI systems like RoadRunner and EXALG are based on this observation. Multiple similar pages are necessary because single Web pages are both noisy and incomplete. For example, in news body extraction, a Web page may contain several discontinuous parts that all look like news bodies, and determining which one is the real news body is difficult. But if you put multiple news pages together, the part that is consistently identified as a news body is highly likely to be the real news body region. Another example is product information extraction: blank items may be invisible on a page, so a single page may not display all the product information. A reasonable approach is to compare as many pages as possible to determine all the product information items.

This observation also lets us maximize the efficiency of the Web crawler. The crawler starts from the first list page, visits all the record page links on it, and then moves on to the next list page until no more “next page” link exists.
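This traversal can be sketched as follows, with list pages modelled as plain data for illustration; in the real crawler the record links and the “next page” link come from the template-based extraction described below.

```java
import java.util.*;

// A minimal sketch of the list-page traversal. A ListPage here is plain data:
// its record-page links and an optional "next page" link (null at the end).
public class CrawlSketch {
    static class ListPage {
        final List<String> recordLinks;
        final String nextPage;
        ListPage(List<String> recordLinks, String nextPage) {
            this.recordLinks = recordLinks;
            this.nextPage = nextPage;
        }
    }

    // Visit every record link of every list page, following "next page" links.
    static List<String> crawl(String firstListPage, Map<String, ListPage> site) {
        List<String> visitedRecords = new ArrayList<String>();
        String current = firstListPage;
        while (current != null) {
            ListPage page = site.get(current);
            visitedRecords.addAll(page.recordLinks); // visit the record pages
            current = page.nextPage;                 // follow the "next page" link
        }
        return visitedRecords;
    }
}
```

The `ListPage` data model is hypothetical; a real implementation would fetch and parse each page at this point.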

The only remaining problem is how to find the common underlying template. This can be solved by running the partial tree alignment algorithm [4] on the DOM trees obtained in the previous section. The following subsection introduces the algorithm. The tree it produces is called the template tree and is used in the data extraction.

3.1 Partial tree alignment algorithm

The partial tree alignment algorithm was proposed by Yanhong Zhai and Bing Liu in [4]. Given a forest, it calculates the tree edit distance between each pair of trees in the forest. The calculation also yields a mapping between the two trees, which makes it possible to merge them. Repeating this procedure on the forest eventually leaves a single tree.

The tree edit distance between two labelled ordered trees A and B is the cost of the minimum set of operations needed to transform A into B. The standard operations are node removal, node insertion and node replacement, each with an assigned cost. The concept of mapping is formally defined as [3]:

Let X be a tree and let X[i] be the i-th node of X in a preorder walk. A mapping M between a tree A of size n1 and a tree B of size n2 is a set of ordered pairs (i, j), one node from each tree, satisfying the following conditions for all (i1, j1), (i2, j2) ∈ M:

1. i1 = i2 iff j1 = j2;
2. A[i1] is on the left of A[i2] iff B[j1] is on the left of B[j2];
3. A[i1] is an ancestor of A[i2] iff B[j1] is an ancestor of B[j2].

In the mapping, the hierarchical relation between nodes and the order between sibling nodes are both preserved. Fig. 2 shows a mapping example.

Figure 2: A general tree mapping example (from [4])
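As a small illustration (mine, not from [3]): number the nodes of each tree in preorder and store each node's subtree size; then node i is an ancestor of node j iff i < j < i + size[i], and the three conditions can be checked mechanically.

```java
// Checks whether a set of (i, j) pairs is a valid mapping between two trees.
// Each tree is given as a subtree-size array indexed by preorder position,
// so node i is an ancestor of node j iff i < j && j < i + size[i].
public class TreeMapping {
    static boolean ancestor(int[] size, int i, int j) {
        return i < j && j < i + size[i];
    }

    static boolean validMapping(int[][] pairs, int[] sizeA, int[] sizeB) {
        for (int[] p : pairs) {
            for (int[] q : pairs) {
                int i1 = p[0], j1 = p[1], i2 = q[0], j2 = q[1];
                if ((i1 == i2) != (j1 == j2)) return false;             // condition 1
                // "left of" = earlier in preorder and not an ancestor
                boolean leftA = i1 < i2 && !ancestor(sizeA, i1, i2);
                boolean leftB = j1 < j2 && !ancestor(sizeB, j1, j2);
                if (leftA != leftB) return false;                        // condition 2
                if (ancestor(sizeA, i1, i2) != ancestor(sizeB, j1, j2))  // condition 3
                    return false;
            }
        }
        return true;
    }
}
```

For example, with two identical trees (a root with two leaf children), the identity mapping is valid while swapping the two leaves violates condition 2.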

The standard tree matching problem can be solved in $\mathcal{O}(n_1n_2h_1h_2)$ time using dynamic programming, where h1 and h2 are the heights of the trees. In the standard setting, the mapping may cross levels (node a in tree A and node b in tree B in Fig. 2). The simple tree matching algorithm considers a restricted version where neither node replacement nor level crossing is allowed; its time complexity is $\mathcal{O}(n_1n_2)$. Let Ai be the subtree rooted at the i-th child of A’s root and Bj the subtree rooted at the j-th child of B’s root. Denote by M[i, j] the size of the maximal matching between $\{A_1, \cdots, A_i\}$ and $\{B_1, \cdots, B_j\}$. The recursive formula is $M[i, j] = \max(M[i - 1, j], M[i, j - 1], M[i - 1, j - 1] + W[i, j])$, where W[i, j] is the size of the maximal matching between Ai and Bj. After filling the table, we can trace back to find the matched nodes between the two trees.
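The recursion above can be sketched directly in Java; the W[i, j] term is the recursive call on the two subtrees, a root-label mismatch yields 0, and a match contributes 1. This is a minimal illustration, not the authors' implementation.

```java
import java.util.*;

// A minimal sketch of the simple tree matching algorithm: no node replacement,
// no level crossing. Returns the size of the maximum matching between A and B.
public class SimpleTreeMatching {
    static class Node {
        final String label;
        final List<Node> children = new ArrayList<Node>();
        Node(String label, Node... kids) {
            this.label = label;
            children.addAll(Arrays.asList(kids));
        }
    }

    static int match(Node a, Node b) {
        if (!a.label.equals(b.label)) return 0;   // mismatched roots: W = 0
        int m = a.children.size(), n = b.children.size();
        int[][] M = new int[m + 1][n + 1];        // M[i][j] over the child forests
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                M[i][j] = Math.max(Math.max(M[i - 1][j], M[i][j - 1]),
                        M[i - 1][j - 1] + match(a.children.get(i - 1),
                                                b.children.get(j - 1)));
            }
        }
        return M[m][n] + 1;                       // +1 for the matched roots
    }
}
```

Tracing back through M (not shown) recovers which child pairs were actually matched.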

Partial tree alignment works on the result of the simple tree matching algorithm: it merges two trees based on their matched nodes. Any merging strategy that preserves the order of the sibling nodes will work. After the algorithm is repeated enough times, all the trees are merged into one tree, which corresponds to the underlying template of the pages.
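The merge step can be illustrated at a single level of the tree. This sketch is hypothetical (not the code from [4]): given which children of the second root matched which children of the first, it inserts the unmatched children at positions that keep sibling order.

```java
import java.util.*;

// One-level sketch of the merge step. Children of b are folded into a copy of
// a's children; matched maps a b-child index to the a-child index it matched.
// Unmatched b-children are inserted right after the last placed node, which
// preserves the order of the siblings.
public class PartialAlignMerge {
    static List<String> mergeLevel(List<String> a, List<String> b,
                                   Map<Integer, Integer> matched) {
        List<String> merged = new ArrayList<String>(a);
        int insertAt = 0; // position just after the last placed node
        int shift = 0;    // number of nodes inserted so far
        for (int j = 0; j < b.size(); j++) {
            Integer i = matched.get(j);
            if (i != null) {
                insertAt = i + shift + 1;       // jump past the matched node
            } else {
                merged.add(insertAt, b.get(j)); // insert the unmatched node
                insertAt++;
                shift++;
            }
        }
        return merged;
    }
}
```

In [4], an unmatched node whose position is ambiguous is deferred to a later round; the sketch above simply inserts it after the last placed node.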

The above discussion works for general trees. For DOM trees, the node type also matters in matching. Common node types are element, attribute, text and comment. I suggest the following rules for DOM tree matching (comment nodes are removed beforehand).

1. Two nodes match only if their node types are the same.
2. Two element nodes match iff they have the same tag name. A more rigorous rule is that they match iff they have the same tag name and class name.
3. A mismatch scores 0 points and a match scores 1 point, except for the special cases below.
4. Two text nodes always match, regardless of their texts.
5. If two text nodes share the same prefix, the score is 2. If their texts belong to the same ontology (this requires NLP techniques), the score is higher than 1.

Here, the goal of tree matching is to find the matching with the maximal score, and two trees match if their matching score is above a specific threshold. Given a forest, define a score for each tree as a function of the number of trees in the forest it matches and the sum of the corresponding matching scores. The tree with the highest score is the reference tree of the forest, and its score is the matching score of the forest. Denoting the two quantities by n and s respectively, a possible scoring function is n + s / n, a linear combination of the number of matched trees and the average matching score.
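The five matching rules above can be sketched as a scoring function over node pairs. The simplified node representation, the prefix length and the sameOntology stub are hypothetical placeholders; only the rules themselves come from the list above.

```java
// A sketch of the node-level scoring rules. ELEMENT/TEXT plus a string value
// is a simplified stand-in for real DOM nodes; sameOntology is a placeholder
// for an NLP-based check.
public class NodeScore {
    enum Type { ELEMENT, TEXT }

    static double score(Type ta, String va, Type tb, String vb) {
        if (ta != tb) return 0;                   // rule 1: node types must agree
        if (ta == Type.ELEMENT) {
            return va.equals(vb) ? 1 : 0;         // rule 2: same tag name
        }
        // rules 4 and 5: text nodes always match ...
        if (sharePrefix(va, vb)) return 2;        // ... a shared prefix scores 2
        if (sameOntology(va, vb)) return 1.5;     // ... same ontology scores > 1
        return 1;                                 // rule 3: a plain match scores 1
    }

    static boolean sharePrefix(String a, String b) {
        int k = 3; // hypothetical prefix length
        return a.length() >= k && b.length() >= k && a.startsWith(b.substring(0, k));
    }

    static boolean sameOntology(String a, String b) {
        return false; // placeholder for an NLP technique
    }
}
```

The total matching score of two trees is then the sum of these node scores over the matched pairs found by simple tree matching.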

For a list page, find the node whose children (which form a forest) have the highest matching score; merging all the other children into the reference tree yields the template tree. The list page, the node and the template tree are called the list tree, the record zone tree and the record tree respectively. If the scoring function does not lead to the desired record zone tree, one can choose the record zone tree manually and compute its record tree. For a group of record pages (also a forest), compute the reference tree among these pages; merging all the other pages into it yields the template tree, called the record detail tree.

3.2 Data extraction using the template tree

A template tree represents the underlying template of a series of Web pages; it is also a wrapper. The typical way to use a wrapper is to convert it into an automaton and match new pages with the automaton. Regular expressions seem natural here, since a regular expression describes a finite automaton. However, the regular expression corresponding to a template tree is extremely complicated, and the regular expression implementation in Java is not robust against such patterns: matching can take practically forever, and Java provides no way to monitor the matching procedure and terminate it on timeout. So regular expressions are not a good choice here.

An alternative approach is to run simple tree matching between the template tree and the DOM tree of the new page and trace back to obtain the node mapping. Then, for any node in the template tree, we can find its corresponding node in the DOM tree of the new page. As the time complexity is $\mathcal{O}(n_1n_2)$ and a page usually has only thousands of nodes, the matching is fast. This approach turns out to be very efficient in practice.

The real data extraction system works as follows.

1. Labelling phase.
   1. Feed the target list page to the system; the system shows the detected record zone tree and record tree. If the detected record zone tree is the desired one, move on; otherwise point the system to the correct record zone tree and it shows the corresponding record tree.
   2. Label the node in the record tree which contains the “next page” link, and label the nodes which contain the needed data. If the data in the record tree is enough, go directly to the extracting phase.
   3. Label the node in the record tree which contains the link to the record page.
   4. The system finds the corresponding node of the labelled link node in all the siblings of the record tree and visits all the retrieved record links to get the forest of record pages.
   5. The system computes the record detail tree of the forest and shows it to the user. If the tree is not the desired one, the user points the system to the desired one and the record detail tree is computed again.
   6. Label the nodes in the record detail tree which contain the needed data.
2. Extracting phase.
   1. Visit the target list page and convert it into a DOM tree. Compute the mapping from the list tree to the DOM tree. For each child of the record zone tree’s corresponding node, compute the mapping from the record tree to the subtree rooted at that child and extract the labelled nodes (including the link node).
   2. If information on the record pages is needed, visit all the extracted record page links and compute the mapping from the record detail tree to the DOM tree of each record page. Extract the corresponding nodes of all the labelled nodes.
   3. Follow the corresponding node of the “next page” node to the next list page and loop until no more pages are available.

After you get the information in the nodes, all you need to do is convert it into your desired format. Writing some regular expressions to match things like money, time and cellphone numbers will be necessary. A crawler using this method crawls about 100,000 houses for rent/sale from www.fang.com every day. The crawler runs on a dual-core machine with 4 GB of memory and a 4 Mbps Internet connection; it saturates the Internet bandwidth with an average CPU usage of about 25%.
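For illustration, here are two such patterns, assuming Chinese rental listings; the exact patterns are site-specific, and the names below are mine.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative patterns for post-processing extracted text.
public class FieldPatterns {
    // Chinese mobile numbers: 11 digits starting with 1, second digit 3-9.
    static final Pattern CELLPHONE = Pattern.compile("1[3-9]\\d{9}");
    // A price like "3500元" or "3500.5元"; group 1 captures the number.
    static final Pattern MONEY = Pattern.compile("(\\d+(?:\\.\\d+)?)元");

    // Return the first match of the given group, or null if there is none.
    static String first(Pattern p, String text, int group) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(group) : null;
    }
}
```

Patterns like these run on the short text of a single labelled node, so the backtracking concerns raised earlier for whole-page matching do not apply.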

References
[1] A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 337-348. ACM, 2003.
[2] V. Crescenzi, G. Mecca, P. Merialdo, et al. Roadrunner: Towards automatic data extraction from large web sites. In Proceedings of the international conference on very large data bases, pages 109-118. Citeseer, 2001.
[3] K.C. Tai. The tree-to-tree correction problem. Journal of the ACM (JACM), 26(3):433, 1979.
[4] Y. Zhai and B. Liu. Web data extraction based on partial tree alignment. In Proceedings of the 14th international conference on World Wide Web, pages 76-85. ACM, 2005.

Categories: information retrieval

2. Parsing an HTML Page

There are a lot of choices for parsing an HTML page; for a list of possible parsers, search Google for “html parser”. Basically, there are two ways to parse an HTML page – with an embedded parser or with a local web browser.

For the first method, you can write your own HTML parser or use a library such as HTML Parser, JTidy or NekoHTML. These parsers simply parse the content of the HTML page using the encoding you provide and correct syntax errors in the HTML tags. You can output the parsed HTML page or get the parsed DOM tree, and you can manipulate the DOM tree directly or convert it into your own memory model. This kind of parser is usually fast but not very robust. I have tried the three parsers above and would recommend HTML Parser.

For the second method, you may use the JDesktop Integration Components (JDIC) or the Standard Widget Toolkit (SWT). Both provide access to a local web browser (IE or Firefox). With SWT you can use XULRunner even if no web browser is installed; XULRunner is a Mozilla runtime package that can be used to bootstrap XUL + XPCOM applications as rich as Firefox and Thunderbird. Since the page is resolved by a real browser, all the resources used by the page (like styles and images) are loaded from the remote server and the JavaScript inside the page is executed. Parsing a page therefore takes much longer than with the first method, but this method is extremely robust and the rendering quality is very high: what you get is exactly what you would see in a web browser. This characteristic is especially important when you try to extract data from pages generated mainly by JavaScript, where the first method does not work at all. The method is also easy to use: create an instance of the local browser and pass it the target url, and the browser provides a parsed DOM tree or the parsed document. A small problem is that the parsed page keeps its original encoding, so you still have to detect the encoding of the web page and convert the page into your project encoding.

In the following subsections, I will show one example for each of the above two methods.

2.1 Example for Working with HTML Parser

To use HTML Parser, add htmllexer.jar and htmlparser.jar to your Java build path. The following snippet parses an HTML page and traverses the body of the DOM tree. Note that the class of “node” in the code is org.htmlparser.Node, which does not implement the standard W3C node interface org.w3c.dom.Node; you should write your own adapter if you want to use the W3C node type.

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.HtmlPage;

try {
    Parser parser = Parser.createParser(content, "utf-8");
    HtmlPage page = new HtmlPage(parser);
    parser.visitAllNodesWith(page);
    NodeList list = page.getBody();
    for (NodeIterator iterator = list.elements(); iterator.hasMoreNodes();) {
        Node node = iterator.nextNode();
        // process the node here
    }
} catch (ParserException e) {
    e.printStackTrace();
}

2.2 Example for Working with SWT & XULRunner

Simply add swt.jar to your Java build path and you can work with SWT. Using XULRunner is a little more complicated: you have to download the right version of XULRunner for your platform and execute the registration command in the XULRunner directory.
To make XULRunner available to all users, use the command:

• Windows: XULRunner --register-global
• Linux: ./XULRunner --register-global
• Mac: ./XULRunner-bin --register-global

To make it available only to the current user, use the command:

• Windows: XULRunner --register-user
• Linux: ./XULRunner --register-user
• Mac: ./XULRunner-bin --register-user

You also need to add MozillaGlue.jar and MozillaInterfaces.jar to your Java build path. These two packages can be found in the xulrunner-sdk archive. Now you can use XULRunner in your application.

The following snippet resolves a url using XULRunner. You can disable the browser cache by passing the header “Cache-Control: no-cache” to the setUrl() method. The completed() event fires when page resolving is finished, so you can put the page-processing trigger there. The sample code gets the DOM tree of the page; note that it is not a standard W3C DOM tree.

Display display = new Display();
final Shell shell = new Shell(display);
FillLayout layout = new FillLayout();
shell.setLayout(layout);

final Browser browser = new Browser(shell, SWT.BORDER | SWT.MOZILLA);
browser.setUrl(url, null, new String[] { "Cache-Control: no-cache" });
browser.addProgressListener(new ProgressAdapter() {
    public void completed(ProgressEvent event) {
        // the page is fully loaded; trigger the processing here
        nsIWebBrowser webBrowser = (nsIWebBrowser) browser.getWebBrowser();
        nsIDOMWindow domWindow = webBrowser.getContentDOMWindow();
        nsIDOMDocument document = domWindow.getDocument();
        nsIDOMElement documentElement = document.getDocumentElement();
    }
});
shell.open();
while (!shell.isDisposed()) {
    if (!display.readAndDispatch()) {
        display.sleep();
    }
}
display.dispose();

Notes on Chinese Web Data Extraction in Java (part 1)

Note. The code is developed with Eclipse and tested under JDK 1.6. To make the code run correctly, you need to set the encoding of the project to utf-8 and include some necessary libraries. All the code will be available at http://sourceforge.net/projects/ptawebdataextra.

Correctly loading a Chinese Web page in Java is not a trivial task. Given a target url, you need to read the content from the url and then decode it using the right encoding. Chinese Web pages can be encoded in utf-8, gbk, gb2312, gb18030, big5, etc. If you do not use the right encoding when resolving a page, you will only get meaningless characters mixed with HTML tags. The same holds for most Web pages not written in English. Java does not handle the encoding issue automatically, so your code is responsible for detecting the encoding of Web pages.

The first step is to get an HTTP connection to the target url, which can be done with the following code.

public static HttpURLConnection getConnection(URL url)
        throws IOException {
    HttpURLConnection httpurlconnection = null;
    try {
        URLConnection urlconnection = url.openConnection();
        urlconnection.setConnectTimeout(60000);
        urlconnection.setReadTimeout(60000);
        urlconnection.connect();

        if (!(urlconnection instanceof HttpURLConnection)) {
            return null;
        }

        httpurlconnection = (HttpURLConnection) urlconnection;
        int responsecode = httpurlconnection.getResponseCode();
        switch (responsecode) {
        case HttpURLConnection.HTTP_OK:
        case HttpURLConnection.HTTP_MOVED_PERM:
        case HttpURLConnection.HTTP_MOVED_TEMP:
            break;
        default:
            System.err.println("Invalid response code: "
                    + responsecode + " " + url);
            httpurlconnection.disconnect();
            return null;
        }
    } catch (IOException ioexception) {
        System.err.println("unable to connect: " + ioexception);
        if (httpurlconnection != null) {
            httpurlconnection.disconnect();
        }
        throw ioexception;
    }
    return httpurlconnection;
}

The code first gets a URLConnection instance and then sets the timeout parameters; these must be set before calling the connect() method. It then calls getResponseCode() to get the response code and, if the code is valid, returns the connection cast to HttpURLConnection.

The next step is to get an InputStream from the HttpURLConnection. The code retries three times before giving up and returning null.

public static InputStream getInputStream(HttpURLConnection connection) {
    InputStream inputstream = null;
    for (int i = 0; i < 3; ++i) {
        try {
            inputstream = connection.getInputStream();
            break;
        } catch (IOException e) {
            System.err.println("error opening connection " + e);
        }
    }
    return inputstream;
}

The third step is the most important part: it reads the content of the connection and detects the encoding at the same time. The code is as follows.

public static final int STREAM_BUFFER_SIZE = 4096;
public static final String DEFAULT_ENCODING = "utf-8";

public static String[] getContent(HttpURLConnection connection)
        throws IOException {
    InputStream inputstream = null;
    try {
        inputstream = getInputStream(connection);
        if (inputstream == null) {
            return null;
        }
        UniversalDetector detector = new UniversalDetector(null);
        List<byte[]> byteList = new ArrayList<byte[]>();
        List<Integer> byteLength = new ArrayList<Integer>();
        byte[] buf = new byte[STREAM_BUFFER_SIZE];
        int nread = 0, nTotal = 0;
        // Read the stream and feed the bytes to the encoding detector
        // until the detector is confident about the encoding.
        while ((nread = inputstream.read(buf)) > 0) {
            detector.handleData(buf, 0, nread);
            byteList.add(buf);
            byteLength.add(nread);
            nTotal += nread;
            buf = new byte[STREAM_BUFFER_SIZE];
            if (detector.isDone()) {
                break;
            }
        }
        // Read up the remaining bytes without decoding them (see below).
        while ((nread = inputstream.read(buf)) > 0) {
            byteList.add(buf);
            byteLength.add(nread);
            nTotal += nread;
            buf = new byte[STREAM_BUFFER_SIZE];
        }
        detector.dataEnd();
        String encoding = detector.getDetectedCharset();
        detector.reset();
        if (encoding == null) {
            encoding = DEFAULT_ENCODING;
        }
        // Concatenate all the chunks and decode them in one go.
        byte[] contentByte = new byte[nTotal];
        int offSet = 0, l = byteList.size();
        for (int i = 0; i < l; ++i) {
            byte[] bytes = byteList.get(i);
            int length = byteLength.get(i);
            System.arraycopy(bytes, 0, contentByte, offSet, length);
            offSet += length;
        }
        return new String[] { encoding, new String(contentByte, encoding) };
    } finally {
        if (inputstream != null) {
            inputstream.close();
        }
    }
}

The encoding detection uses a library called ‘juniversalchardet’, a Java port of ‘universalchardet’, the encoding detector library of Mozilla. To use the library, construct an instance of org.mozilla.universalchardet.UniversalDetector, feed data to it by calling UniversalDetector.handleData(), and notify it of the end of data by calling UniversalDetector.dataEnd(); the detected encoding name is then available from UniversalDetector.getDetectedCharset(). Please refer to http://code.google.com/p/juniversalchardet for more details.

The getContent function reads bytes from the input stream and feeds them to the encoding detector while also storing them in a list. When the detection is done, the function reads up the remaining bytes, concatenates all the bytes into one array and decodes the array using the detected encoding. There is a small trick here: you should not decode the remaining bytes separately using the detected encoding, because the detection may stop in the middle of a multi-byte character. In that case the remaining bytes start with the tail of a character, and decoding them on their own produces unreadable characters.
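The effect can be demonstrated with a small example: a Chinese character is three bytes in UTF-8 (two in gbk), and cutting the byte stream inside a character yields the Unicode replacement character when the fragment is decoded.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Demonstrates why the remaining bytes must not be decoded separately:
// cutting a UTF-8 byte stream inside a multi-byte character corrupts the text.
public class SplitDecode {
    static String decodeHead(String s, int nBytes) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        byte[] head = Arrays.copyOfRange(bytes, 0, nBytes);
        // a malformed trailing sequence becomes the replacement character U+FFFD
        return new String(head, StandardCharsets.UTF_8);
    }
}
```

“汉字” is six bytes in UTF-8 (three per character), so cutting at four bytes splits the second character and corrupts it, while cutting at three bytes keeps the first character intact.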

Finally, putting the above three functions together, we get the following function which reads the content from a given url.

public static String getContent(URL url) {
    HttpURLConnection connection = null;
    try {
        connection = NetUtilities.getConnection(url);
        if (connection != null) {
            String[] resource = NetUtilities.getContent(connection);
            if (resource != null) {
                return resource[1];
            }
        }
    } catch (Exception e) {
        // ignore and fall through to return null
    } finally {
        if (connection != null) {
            connection.disconnect();
        }
    }
    return null;
}
Categories: information retrieval

The National Day Holiday That Just Passed

Categories: others

Hello world!

September 28, 2010

Welcome to WordPress.com. This is your first post. Edit or delete it and start blogging!

Categories: Uncategorized