Year After Year

January 5, 2011

When I was in primary school, people were always saying how things would be in the 21st century. In the blink of an eye the first decade of the 21st century is over, yet few of the things I heard back then have come true. I finished primary school in 2000; over this decade I completed junior high, senior high and university, and now I am living the life of a "Beijing drifter" in Zhongguancun, China's Silicon Valley.

It has been only half a year since I graduated from university, and only four and a half years since high school. Yet whenever I meet high school or university classmates now, we always end up talking about our school days, as if we had already grown old and those days were long behind us. One topic that never fails to come up over dinner is Liu. I visited him at the temple in mid-December, so I can report a little on how he is doing. Apart from the temple's standard jacket, his appearance and voice are basically unchanged, and so is his distinctive way of walking, which still makes him easy to pick out in a crowd. Quite a few people at the temple with experiences similar to his are practising together; besides studying the scriptures every day, they also take part in manual labour. Chanting sutras is hard work in itself, and for someone who has spent his whole life studying, the temple's physical labour is an equally vexing form of practice. Questions like how to put your strength into carrying things or how to clean a toilet are problems he had never faced before. Of course, in this small courtyard far from the noise of the world, with food and clothing taken care of and a group of like-minded people practising together, life is really quite pleasant. As far as I can see, he is doing very well, and there is nothing to feel sorry about. Of our old high school classmates, more than half are already on the other side of the ocean, and most of the rest are still studying. We rush about every day for nothing more than three meals and a roof over our heads; the so-called bright future is only what others see, and finding a group of like-minded people is harder still. Thinking about it this way, Liu is really quite lucky. Perhaps many years from now another figure like Master Hongyi will emerge.

At the end of the year I gave myself a holiday, resting from Christmas until now. During the break I went back to Shanghai to see university classmates and teachers, and then went home for a visit. For the past two years I have only gone home once a year, leaving pitifully little time for my family. This time I brought nothing with me and dealt with no business; the only thing I did was keep my family company, especially the elders, and chat with them. We all wish the old folks health and long life, and medical care keeps improving, but watching their health decline you know there are not many chances left to sit and talk together. Whenever I think of this I miss my childhood terribly and wish I could go back to that age when I understood nothing. The weather this year has been extremely abnormal. They called it the coldest winter in a thousand years, yet Beijing has not seen a single snowfall so far, while back home it has already snowed three times. On many of the days I was there the daily high was only around zero degrees, as low as in Beijing, but with no heating the days were hard to get through, and even harder for the elderly; even with the air conditioner running all day it was uncomfortable. My grandfather passed away at this time last year. The night of the vigil was bitterly cold; after staying outside all night I caught a cold, and the cough lasted the whole winter.

Like this bitterly cold winter, our website has not picked up either; recently it was even penalized by Baidu, and the number of indexed pages dropped from twenty-five thousand to zero! The basic features of the site are all finished, and the next step is to optimize the existing features and develop the ones that will make the site distinctive. Over the past half year I have looked at a great many websites, successful and unsuccessful, read a lot of material on running a website, and followed plenty of news and forecasts about the IT industry. I have seen a lot and learned many things I never knew or never considered as a student, but in truth I have mostly been passively absorbing other people's opinions. To run a website or start a business, what matters is your own thinking; by always following the crowd you will never build anything great. I hope that in the new year I will think more for myself and see directions others have not yet discovered. The other issue is passion. I crave that state of "a few young people in a garage tinkering day and night with their new toy", which is said to be the thing Bill Gates fears most. The passion, sweat and talent of young people will surely create the Apple and Microsoft of a new era, and the protagonists of a modern-day Pirates of Silicon Valley will be us and our peers!

Categories: others

Notes on Chinese Web Data Extraction in Java (part 3)

October 26, 2010

3. Web Crawler and Data Extraction

For a general-purpose web crawler, the task of data extraction is rather complicated. It needs to identify the data region first and then extract the data in that region using some algorithm, usually combined with NLP techniques. Detecting the data region is a hard problem in itself: even a seemingly ordinary task like automatically finding the title and body of a news article has produced thousands of top conference and journal papers. I will skip this part and focus on a much easier scenario that involves manual labelling. The only remaining problem is then how to extract the data, i.e. wrapper induction (WI). A wrapper is a program that extracts data from an information source (such as a Web page) so that an information integration system can access the data in a unified way. Apart from writing wrappers by hand, there are two main approaches to wrapper generation: the supervised approach and the unsupervised approach. The supervised approach uses supervised learning to learn data extraction rules (such as a regular expression) from a set of manually labelled examples. Unsupervised approaches, such as RoadRunner [2], EXALG [1] and DEPTA [4], do not need any labelled examples. To make the extraction as accurate as possible, the approach discussed below combines manual labelling with the partial tree alignment algorithm from DEPTA.

A very common pattern in Web design is to make a list page which contains summary information about a set of records. The list page can be a single page or multiple pages, each with a “next page” link pointing to the next page of records. Additionally, each record has a link to its corresponding record page with detailed information. Both the list page and the record page are generated from predefined templates. Fig. 1 shows an example of a list page: it has multiple records, and each record has a link to its record page. The records share the same underlying template, and the record pages also share the same underlying template.

Figure 1: A list page contains records and links to record pages

WI systems like RoadRunner and EXALG are all based on this observation. The need for multiple similar pages comes from the large amount of noise in Web pages and the incompleteness of information on any single page. For example, in the problem of news body extraction, a Web page may contain several discontinuous parts identified as news bodies, and determining which part is the real news body is difficult. But if you put multiple news pages together, the part that is consistently identified as a news body is very likely to be the real news body region. Another example is product information extraction: items with blank values may simply not be displayed, so a single page may not show all the product information fields. A reasonable approach is to compare as many pages as possible to determine the full set of product information items.

This observation also enables us to maximize the efficiency of Web crawlers. The crawler starts from the first list page and visits all the links to record pages on this page, and then visits the next list page until no more “next page” link exists.
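
The traversal itself is small. Here is a rough sketch of the loop; the three abstract helpers are placeholders for the parsing and extraction steps described in the rest of these notes, and their names are mine, not those of any library.

import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.w3c.dom.Document;

public abstract class ListCrawler {

    /** Parse a page into a DOM tree (see parts 1 and 2 of these notes). */
    protected abstract Document loadPage(URL url) throws Exception;

    /** Return the record-page links found on a list page. */
    protected abstract List<URL> extractRecordLinks(Document listPage);

    /** Return the "next page" link, or null if there is none. */
    protected abstract URL findNextPageLink(Document listPage);

    /** Collect all record-page URLs by walking the chain of list pages. */
    public List<URL> collectRecordLinks(URL firstListPage) throws Exception {
        List<URL> recordPages = new ArrayList<URL>();
        URL current = firstListPage;
        while (current != null) {
            Document listPage = loadPage(current);
            recordPages.addAll(extractRecordLinks(listPage));
            current = findNextPageLink(listPage);   // stops when no "next page" link exists
        }
        return recordPages;
    }
}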

The only remaining problem now is how to find the common underlying template. This can be solved by running the partial tree alignment algorithm [4] on the DOM trees obtained in the previous section. The following subsection introduces the partial tree alignment algorithm. The tree obtained from the algorithm is called the template tree, and it is what the data extraction works on.

3.1 Partial tree alignment algorithm

The partial tree alignment algorithm was proposed by Yanhong Zhai and Bing Liu in [4]. Given a forest, it calculates the tree edit distance between each pair of trees in the forest. Computing the tree edit distance between two trees also yields a mapping between them, which makes it possible to merge the two trees. Repeating this procedure over the forest eventually leaves only one tree.

The tree edit distance between two labelled ordered trees A and B is the cost of the minimum set of operations needed to transform A into B. The standard operations are node removal, node insertion and node replacement, with a cost assigned to each operation. The concept of a mapping is formally defined in [3] as follows:

Let X be a tree and let X[i] be the i-th node of tree X in a preorder walk of the tree. A mapping M between a tree A of size n1 and a tree B of size n2 is a set of ordered pairs (i, j), where i refers to a node of A and j to a node of B, satisfying the following conditions for all (i1, j1), (i2, j2) ∈ M:

  1. i1 = i2 iff j1 = j2;
  2. A[i1] is on the left of A[i2] iff B[j1] is on the left of B[j2];
  3. A[i1] is an ancestor of A[i2] iff B[j1] is an ancestor of B[j2].

In the matching, the hierarchical relation between nodes and the order between sibling nodes are both preserved. Fig. 2 shows a matching example.


Figure 2: A general tree mapping example (from [4])

The standard tree matching problem can be solved with a time complexity of \mathcal{O}(n_1n_2h_1h_2) using dynamic programming, where h_1 and h_2 are the heights of the trees. In the standard setting, the mapping can cross levels (node a in tree A and node b in tree B in Fig. 2). The simple tree matching algorithm considers a restricted version in which no node replacement and no level crossing are allowed. The time complexity of this algorithm is \mathcal{O}(n_1n_2). Let A_i be the subtree rooted at the i-th child of A’s root and B_j be the subtree rooted at the j-th child of B’s root. Denote by M[i, j] the size of the maximum matching between \{A_1, \cdots, A_i\} and \{B_1, \cdots, B_j\}. Then the recursive formula is M[i, j] = \max(M[i - 1, j], M[i, j - 1], M[i - 1, j - 1] + W[i, j]), where W[i, j] is the maximum matching score between A_i and B_j. After the matching, we can trace back through the M tables to find the matched nodes between the two trees.
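
The recursion above translates directly into code. Below is a minimal sketch that computes the matching and also records the matched node pairs for the trace-back; the TreeNode class and the plain label-equality test are simplifications I use for illustration (a real implementation compares DOM nodes with the scoring rules suggested later).

import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

class TreeNode {
    String label;                                   // e.g. an HTML tag name
    List<TreeNode> children = new ArrayList<TreeNode>();
    TreeNode(String label) { this.label = label; }
}

public class SimpleTreeMatching {

    /**
     * Returns the size of the maximum matching between trees a and b and
     * records the matched node pairs in `mapping` (a-node -> b-node).
     */
    static int match(TreeNode a, TreeNode b, Map<TreeNode, TreeNode> mapping) {
        if (!a.label.equals(b.label)) {
            return 0;                               // roots do not match, W = 0
        }
        int m = a.children.size(), n = b.children.size();
        int[][] M = new int[m + 1][n + 1];
        int[][] W = new int[m + 1][n + 1];
        @SuppressWarnings("unchecked")
        Map<TreeNode, TreeNode>[][] sub = new Map[m + 1][n + 1];
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                sub[i][j] = new IdentityHashMap<TreeNode, TreeNode>();
                W[i][j] = match(a.children.get(i - 1), b.children.get(j - 1), sub[i][j]);
                M[i][j] = Math.max(Math.max(M[i - 1][j], M[i][j - 1]),
                                   M[i - 1][j - 1] + W[i][j]);
            }
        }
        // Trace back through M to recover which subtree pairs were matched.
        int i = m, j = n;
        while (i > 0 && j > 0) {
            if (W[i][j] > 0 && M[i][j] == M[i - 1][j - 1] + W[i][j]) {
                mapping.putAll(sub[i][j]);
                i--; j--;
            } else if (M[i][j] == M[i - 1][j]) {
                i--;
            } else {
                j--;
            }
        }
        mapping.put(a, b);                          // the two roots are matched as well
        return M[m][n] + 1;                         // +1 for the matched roots
    }
}

Calling match(template, page, mapping) fills mapping with pairs from the first tree to the second, which is exactly what the merge step below and the extraction in section 3.2 rely on.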

Partial tree alignment works on the result of the simple tree matching algorithm: it merges two trees based on their matched nodes. Just code it the way your intuition suggests; anything that preserves the order of the sibling nodes will work. After repeating the algorithm enough times, all the trees are merged into one tree, which corresponds to the underlying template of the pages.
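
Here is one such merge, written as a rough sketch against the TreeNode class and the mapping produced by match(source, template, mapping) above. Unmatched children are simply grafted after their nearest matched left sibling; a faithful implementation of [4] would additionally postpone nodes whose position cannot yet be decided (the "partial" in partial tree alignment).

import java.util.Map;

public class PartialTreeAlignment {

    /**
     * Merge `source` into the template tree. `mapping` maps nodes of
     * `source` to their matched nodes in the template; the source root
     * must be matched to the template root.
     */
    static void mergeInto(TreeNode source, Map<TreeNode, TreeNode> mapping) {
        TreeNode target = mapping.get(source);
        if (target == null) {
            return;                        // unmatched subtree: grafted as a whole by its parent
        }
        int insertAt = 0;                  // position in the template's child list
        for (TreeNode child : source.children) {
            TreeNode matched = mapping.get(child);
            if (matched != null) {
                insertAt = target.children.indexOf(matched) + 1;
                mergeInto(child, mapping); // recurse into matched subtrees
            } else {
                // Graft the unmatched child (with its whole subtree),
                // keeping the order of siblings intact.
                target.children.add(insertAt, child);
                insertAt++;
            }
        }
    }
}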

The above discussion works for general trees. For DOM trees, the node type also matters in the matching. Some common node types are element, attribute, text and comment. I suggest the following rules for DOM tree matching (comment nodes are removed first); a small scoring sketch follows the list.

  1. Two nodes match only if their node types are the same.
  2. Two element nodes match iff they have the same tag name. A stricter rule is that two element nodes match iff they have the same tag name and the same class name.
  3. A mismatch scores 0 points and a match scores 1 point, except for the special cases below.
  4. Two text nodes always match, regardless of their content.
  5. If two text nodes share the same prefix, the score is 2. If the texts of the two nodes belong to the same ontology (this requires NLP techniques), the score is higher than 1.
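
A possible scoring method following these rules is sketched below. The concrete numbers, the prefix length and the choice of the stricter class-name check are my own assumptions, and the ontology rule is omitted.

import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class NodeScore {

    static double matchScore(Node a, Node b) {
        if (a.getNodeType() != b.getNodeType()) {
            return 0;                             // rule 1: different node types never match
        }
        if (a.getNodeType() == Node.ELEMENT_NODE) {
            Element ea = (Element) a, eb = (Element) b;
            if (!ea.getTagName().equalsIgnoreCase(eb.getTagName())) {
                return 0;                         // rule 2: tag names must agree
            }
            if (!ea.getAttribute("class").equals(eb.getAttribute("class"))) {
                return 0;                         // rule 2, stricter variant: class names must agree
            }
            return 1;                             // rule 3: a plain match scores 1
        }
        if (a.getNodeType() == Node.TEXT_NODE) {
            String ta = a.getNodeValue() == null ? "" : a.getNodeValue().trim();
            String tb = b.getNodeValue() == null ? "" : b.getNodeValue().trim();
            int prefix = Math.min(5, Math.min(ta.length(), tb.length()));
            if (prefix > 0 && ta.regionMatches(0, tb, 0, prefix)) {
                return 2;                         // rule 5: shared prefix scores 2
            }
            // rule 5's ontology case (which needs NLP) is omitted in this sketch
            return 1;                             // rule 4: text nodes always match
        }
        return 1;                                 // other node types: match on type alone
    }
}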

Here, the goal of tree matching is to find the matching with the maximal score, and two trees match if their matching score is above a specific threshold. Given a forest, define a score for each tree as a function of the number of trees in the forest matched with it and the sum of the corresponding matching scores. The tree with the highest score is the reference tree of the forest, and its score is the matching score of the forest. Denoting the two quantities by n and s respectively, a possible scoring function is n + s / n, which combines the number of matched trees with the average matching score.
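
Choosing the reference tree is then a double loop over the forest. In the sketch below the match threshold is an arbitrary assumption that needs tuning per site, and the tree-level score is taken from the simple tree matching sketch above.

import java.util.IdentityHashMap;
import java.util.List;

public class ReferenceTree {

    static final int MATCH_THRESHOLD = 10;        // assumption: tune per site

    static TreeNode findReferenceTree(List<TreeNode> forest) {
        TreeNode best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (TreeNode candidate : forest) {
            int n = 0;                            // number of trees matched with the candidate
            double s = 0;                         // sum of their matching scores
            for (TreeNode other : forest) {
                if (other == candidate) {
                    continue;
                }
                int score = SimpleTreeMatching.match(candidate, other,
                        new IdentityHashMap<TreeNode, TreeNode>());
                if (score >= MATCH_THRESHOLD) {
                    n++;
                    s += score;
                }
            }
            double forestScore = (n == 0) ? 0 : n + s / n;   // the n + s/n scoring function
            if (forestScore > bestScore) {
                bestScore = forestScore;
                best = candidate;
            }
        }
        return best;
    }
}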

For a list page, find the node whose children (viewed as a forest) have the highest matching score; merging all the other children into the reference tree yields the template tree. The DOM tree of the list page, that node and the template tree are called the list tree, the record zone tree and the record tree, respectively. If the scoring function does not lead to the desired record zone tree, one can choose the record zone tree manually and compute its corresponding record tree. For a group of record pages (also a forest), compute the reference tree among these pages and merge all the other pages into it; the resulting template tree is called the record detail tree.

3.2 Data extraction using the template tree

A template tree represents the underlying template of a series of Web pages; it is also a wrapper. The typical way to use a wrapper is to convert it into an automaton and use the automaton to match new pages. Regular expressions seem natural here, since a regular expression describes a deterministic finite automaton. However, the regular expression implementation in Java is not robust enough and can effectively fall into an endless loop (catastrophic backtracking), and Java does not provide a way to monitor the matching procedure and terminate it on timeout. So regular expressions are not a good choice here, as the expression corresponding to a template tree would be extremely complicated.

An alternative approach is to run the simple tree matching between the template tree and the DOM tree of the new page, and trace back to recover the node mapping. Then, for any node in the template tree, we can find its corresponding node in the DOM tree of the new page. As the time complexity is \mathcal{O}(n_1n_2) and a page usually has only a few thousand nodes, the matching is rather fast. This approach turns out to be very efficient in practice.
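
In code, extraction then reduces to one tree matching plus a few map look-ups. The sketch below reuses the TreeNode type and the match() method from the earlier sketch; toTreeNode() stands for a DOM-to-TreeNode converter that is assumed rather than shown, and the labelled template nodes are collected during the labelling phase described next.

import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

import org.w3c.dom.Document;

public abstract class TemplateExtractor {

    /** Convert a parsed page into the TreeNode form used by the matcher (assumed, not shown). */
    protected abstract TreeNode toTreeNode(Document page);

    /** Return, for each labelled template node, its corresponding node in the new page. */
    public Map<TreeNode, TreeNode> extract(TreeNode templateTree, Document newPage,
                                           List<TreeNode> labelledNodes) {
        TreeNode pageTree = toTreeNode(newPage);
        Map<TreeNode, TreeNode> mapping = new IdentityHashMap<TreeNode, TreeNode>();
        SimpleTreeMatching.match(templateTree, pageTree, mapping);

        Map<TreeNode, TreeNode> result = new IdentityHashMap<TreeNode, TreeNode>();
        for (TreeNode labelled : labelledNodes) {
            TreeNode found = mapping.get(labelled);   // null if the page deviates from the template
            if (found != null) {
                result.put(labelled, found);
            }
        }
        return result;
    }
}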

The real data extraction system works as follows.

  1. Labelling phase.
    1. Feed the target list page to the system; the system shows the detected record zone tree and record tree. If the detected record zone tree is the one you need, move to the next step; otherwise tell the system which record zone tree is correct and it shows the corresponding record tree.
    2. Label the node in the record tree that contains the “next page” link. Label the nodes in the record tree that contain the needed data. If the data in the record tree is enough, go to the next phase.
    3. Label the node in the record tree that contains the link to the record page.
    4. The system finds the corresponding node of the labelled node in all the siblings of the record tree and visits all the retrieved record links to get the forest of record pages.
    5. The system calculates the record detail tree of the forest and shows it to the user. If the tree is not the desired one, the user may tell the system which one is desired and the system calculates the record detail tree again.
    6. Label the nodes in the record detail tree that contain the needed data.
  2. Extracting phase.
    1. Visit the target list page and convert it into a DOM tree. Calculate the mapping from the list tree to the DOM tree. For each child of the node corresponding to the record zone tree, calculate the mapping from the record tree to the subtree of that child and extract the labelled nodes (including the link node).
    2. If the information on the record pages is needed, visit all the extracted record page links and calculate the mapping from the record detail tree to the DOM tree of each record page. Extract the corresponding node of every labelled node.
    3. Follow the corresponding node of the “next page” node to the next list page and loop until no more pages are available.

After you get the information in the nodes, all you need to do is convert it into your desired format. Writing some regular expressions to match things like money, time and cellphone numbers will be necessary. A crawler using this method crawls about 0.1 million houses for rent or sale from the Internet every day at www.fang.com. The crawler runs on a dual-core machine with 4 GB of memory and a 4 Mbps Internet connection; it uses up all the Internet bandwidth with an average CPU usage of about 25%.
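
For instance, field normalization can start from a few simple patterns like the ones below. These are only illustrative sketches of my own; every site needs its own variants.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldNormalizer {

    // e.g. "3500元/月" or "120万": an amount followed by an optional Chinese unit
    static final Pattern MONEY = Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*(万|元)?");

    // e.g. "2010-10-26", "2010/10/26" or "2010年10月26日"
    static final Pattern DATE = Pattern.compile("(\\d{4})[-/年](\\d{1,2})[-/月](\\d{1,2})");

    // mainland China cellphone numbers: 11 digits starting with 1
    static final Pattern CELLPHONE = Pattern.compile("1\\d{10}");

    /** Return the first match of a pattern in the extracted text, or null. */
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }
}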

References
[1] A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 337-348. ACM, 2003.
[2] V. Crescenzi, G. Mecca, P. Merialdo, et al. Roadrunner: Towards automatic data extraction from large web sites. In Proceedings of the international conference on very large data bases, pages 109-118. Citeseer, 2001.
[3] K.C. Tai. The tree-to-tree correction problem. Journal of the ACM (JACM), 26(3):433, 1979.
[4] Y. Zhai and B. Liu. Web data extraction based on partial tree alignment. In Proceedings of the 14th international conference on World Wide Web, pages 76-85. ACM, 2005.

Categories: information retrieval

Notes on Chinese Web Data Extraction in Java (part 2)

October 26, 2010

2. Parsing an HTML Page

There are a lot of choices when it comes to parsing an HTML page. For a complete list of possible parsers, you can search Google for the query “html parser”. Basically, there are two different ways to parse an HTML page: with an embedded parser, or with a local web browser.

For the first method, you can write your own HTML parser or use a library from the Internet, such as HTML Parser, JTidy or NekoHTML. These parsers simply parse the content of the HTML page using the encoding you provide and correct syntax errors in the HTML tags. You can output the parsed HTML page or get the parsed DOM tree, and you can manipulate the DOM tree directly or convert it into your own memory model. This kind of parser is usually fast but not very robust. I have tried the three parsers above and would recommend HTML Parser.

For the second method, you may choose the JDesktop Integration Components (JDIC) or the Standard Widget Toolkit (SWT). Both provide a way to access a local web browser (IE or Firefox). With SWT you can use XULRunner even if you don't have a web browser installed; XULRunner is a Mozilla runtime package that can be used to bootstrap XUL + XPCOM applications that are as rich as Firefox and Thunderbird. Since the web page is rendered by the local browser, all the resources used by the page (such as styles and images) are loaded from the remote server and the JavaScript inside the page is executed as well. The time needed to parse a page is much longer than with the first method, but this method is extremely robust and the rendering quality is very high: what you get after parsing is exactly what you would see when browsing the page in a web browser. This characteristic is especially important when you try to extract data from pages that are mainly generated by JavaScript, where the first method simply does not work. This method is also easy to use: create an instance of the local browser and pass it the target URL, and the browser will give you a parsed DOM tree or the parsed document. A small problem is that the parsed page keeps its original encoding, so you still have to detect the encoding and convert the page into your project encoding.

In the following subsections, I will show one example for each of the above two methods.

2.1 Example for Working with HTML Parser

To use HTML Parser, you need to add htmllexer.jar and htmlparser.jar to your Java build path. The following snippet parses an HTML page and traverses the body of the DOM tree. Note that the class of “node” in the code is org.htmlparser.Node, which does not implement the standard W3C node interface org.w3c.dom.Node. You should write your own adapter if you want to use the W3C node type.

try {
    // "content" holds the HTML text of the page (see part 1 for how to load it).
    Parser parser = Parser.createParser(content, "utf-8");
    HtmlPage page = new HtmlPage(parser);
    parser.visitAllNodesWith(page);
    NodeList list = page.getBody();
    for (NodeIterator iterator = list.elements(); iterator.hasMoreNodes();) {
        Node node = iterator.nextNode();
        // process the node here (note: this is org.htmlparser.Node, not org.w3c.dom.Node)
    }
} catch (ParserException e) {
    e.printStackTrace();
}

2.2 Example for Working with SWT & XULRunner

Simply add swt.jar to your Java build path and you will be able to work with SWT. Using XULRunner is a little more complicated: you have to download the right version of XULRunner for your platform and execute the registration command in the XULRunner directory.
To make XULRunner available for all users, use the command:

  • Windows XULRunner --register-global
  • Linux ./XULRunner --register-global
  • Mac ./XULRunner-bin --register-global

To make it available only for the current user, use the command:

  • Windows XULRunner --register-user
  • Linux ./XULRunner --register-user
  • Mac ./XULRunner-bin --register-user

You also need to add MozillaGlue.jar and MozillaInterfaces.jar into your Java build path. These two packages can be found in the xulrunner-sdk archive. Now you can use XULRunner in your application.

The following snippet loads a URL using XULRunner. You can disable the browser cache by passing the header “Cache-Control: no-cache” when calling the setUrl() method. The completed() event is fired when page loading is finished, so you can put the code that processes the page there. The sample code in the snippet gets the DOM tree of the page; note that this DOM tree is not the standard W3C DOM tree.

Display display = new Display();
final Shell shell = new Shell(display);
FillLayout layout = new FillLayout();
shell.setLayout(layout);

// SWT.MOZILLA forces the XULRunner (Mozilla) backend instead of the system browser.
final Browser browser = new Browser(shell, SWT.BORDER | SWT.MOZILLA);
browser.setUrl(url, null, new String[] { "Cache-Control: no-cache" });
browser.addProgressListener(new ProgressAdapter() {
    public void completed(ProgressEvent event) {
        // Page loading has finished; grab the (non-W3C) DOM tree here.
        nsIWebBrowser webBrowser = (nsIWebBrowser) browser.getWebBrowser();
        nsIDOMWindow domWindow = webBrowser.getContentDOMWindow();
        nsIDOMDocument document = domWindow.getDocument();
        documentElement = document.getDocumentElement();   // documentElement is a field of the enclosing class
    }
});
shell.open();
// Standard SWT event loop; keeps dispatching events until the shell is closed.
while (!shell.isDisposed()) {
    if (!display.readAndDispatch())
        display.sleep();
}
display.dispose();

Notes on Chinese Web Data Extraction in Java (part 1)

October 21, 2010

Note. The code is developed with Eclipse and tested under JDK 1.6. To make the code run correctly, you need to set the encoding of the project to utf-8 and include some necessary libraries. All the code will be available at http://sourceforge.net/projects/ptawebdataextra.

1. Correctly Loading a Chinese Web Page

Correctly loading a Chinese Web page in Java is not a trivial task. Given a target URL, you need to read the content from the URL and then decode it using the right encoding. Chinese Web pages can be encoded in utf-8, gbk, gb2312, gb18030, big5, etc. If you do not use the right encoding when decoding a page, you will only get meaningless characters mixed with HTML tags. The same holds for most Web pages not written in English. Java does not handle this encoding issue automatically, so your code is responsible for detecting the encoding of the Web pages it loads.

The first step is to get an HTTP connection to the target URL. This can be done with the following code.

public static HttpURLConnection getConnection(URL url)
    throws IOException
{
    HttpURLConnection httpurlconnection = null;
    try {
        URLConnection urlconnection = url.openConnection();
        urlconnection.setConnectTimeout(60000);
        urlconnection.setReadTimeout(60000);
        urlconnection.connect();

        if (!(urlconnection instanceof HttpURLConnection)) {
            return null;
        }

        httpurlconnection = (HttpURLConnection) urlconnection;
        int responsecode = httpurlconnection.getResponseCode();
        switch (responsecode) {
        case HttpURLConnection.HTTP_OK:
        case HttpURLConnection.HTTP_MOVED_PERM:
        case HttpURLConnection.HTTP_MOVED_TEMP:
            break;
        default:
            System.err.println("Invalid response code: " +
                responsecode + " " + url);
            httpurlconnection.disconnect();
            return null;
        }
    } catch (IOException ioexception) {
        System.err.println("unable to connect: " + ioexception);
        if (httpurlconnection != null) {
            httpurlconnection.disconnect();
        }
        throw ioexception;
    }
    return httpurlconnection;
}

The code first obtains a URLConnection instance and then sets the timeout parameters; these must be set before calling the connect() method. It then calls getResponseCode() to get the response code, and if the code is valid, it returns the connection cast to HttpURLConnection.

The next step is to get an InputStream from the HttpURLConnection. The method retries up to three times before giving up and returning null.

public static InputStream getInputStream(HttpURLConnection connection)
{
    InputStream inputstream = null;
    for (int i = 0; i < 3; ++i) {
        try {
            inputstream = connection.getInputStream();
            break;
        } catch (IOException e) {
            System.err.println("error opening connection " + e);
        }
    }
    return inputstream;
}

The third step is the most important part: it reads the content of the connection and detects the encoding at the same time. The code is as follows.

public static final int STREAM_BUFFER_SIZE = 4096;
public static final String DEFAULT_ENCODING = "utf-8";
public static String[] getContent(HttpURLConnection connection)
    throws IOException
{
    InputStream inputstream = null;
    try {
        LinkedList<byte[]> byteList = new LinkedList<byte[]>();
        LinkedList<Integer> byteLength = new LinkedList<Integer>();
        inputstream = getInputStream(connection);
        if (inputstream == null) {
            return null;
        }
        UniversalDetector detector = new UniversalDetector(null);
        byte[] buf = new byte[STREAM_BUFFER_SIZE];
        int nread = 0, nTotal = 0;
        while ((nread = inputstream.read(buf, 0, STREAM_BUFFER_SIZE)) > 0) {
            byteList.add(buf.clone());
            byteLength.add(nread);
            nTotal += nread;
            detector.handleData(buf, 0, nread);
            if (detector.isDone())
                break;
        }
        detector.dataEnd();
        String encoding = detector.getDetectedCharset();
        detector.reset();
        if (encoding == null) {
            encoding = DEFAULT_ENCODING;
        }
        while ((nread = inputstream.read(buf, 0, STREAM_BUFFER_SIZE)) > 0) {
            byteList.add(buf.clone());
            byteLength.add(nread);
            nTotal += nread;
        }
        byte[] contentByte = new byte[nTotal];
        int offSet = 0, l = byteList.size();
        for (int i = 0; i < l; ++i) {
            byte[] bytes = byteList.get(i);
            int length = byteLength.get(i);
            System.arraycopy(bytes, 0, contentByte, offSet, length);
            offSet += length;
        }
        return new String[] { encoding, new String(contentByte, encoding) };
    } catch (IOException ioe) {
        throw ioe;
    } finally {
        if (inputstream != null) {
            inputstream.close();
        }
    }
}

The encoding detection is achieved using a library called ‘juniversalchardet’. It is a Java implementation of ‘universalchardet’ which is the encoding detector library of Mozilla. To use the library, you need to construct an instance of org.mozilla.universalchardet.UniversalDetector and feed some data to the detector by calling UniversalDetector.handleData(). After notifying the detector of the end of data by calling UniversalDetector.dataEnd(), you can get the detected encoding name by calling UniversalDetector.getDetectedCharset(). Please refer to http://code.google.com/p/juniversalchardet for more details.

The getContent function reads bytes from the input stream and feeds them to the encoding detector, while also storing them in a list. When the encoding detection is done, the function reads the remaining bytes and concatenates everything into one array, which it then decodes using the detected encoding. There is a small trick here: you should not decode the remaining bytes separately with the detected encoding, because the detection may stop in the middle of a multi-byte character. In that case the remaining bytes start with the tail of a partially read character, and decoding them on their own produces unreadable characters.

Finally, putting the above three functions together, we get the following function, which reads the content of a given URL.

public static String getContent(URL url)
{
    HttpURLConnection connection = null;
    try {
        connection = NetUtilities.getConnection(url);
        if (connection != null) {
            String[] resource = NetUtilities.getContent(connection);
            if (resource != null) {
                return resource[1];
            }
        }
    } catch (Exception e) {
        // swallow the error and fall through to return null
    } finally {
        if (connection != null) {
            connection.disconnect();
        }
    }
    return null;
}
Categories: information retrieval

The National Day Holiday Just Past

October 7, 2010

This is my first post on WordPress. The plan is to keep a technical blog and write about things I build and interesting things I come across. Half a month has passed since I planned to write the first article, on partial tree alignment, and I still have not started...

Before the holiday started I had planned quite a few things, but in the end I touched none of them except one. For my classmates in Shanghai, as of today their holiday is only half over, since another week of Expo holiday starts on the 10th. But even if I were given seven more days off, they would most likely be wasted too.

On the 1st I had dinner with high school classmates, and in the evening we played Sanguosha at the office. Normally for board games people would go to a board game cafe or somewhere like Bifengtang, but in my case the office is actually more suitable than anywhere else. Calling it an office flatters it; it is more like a mansion. The owner bought a roughly 350-square-metre unit on a high floor of an office building and fitted it out luxuriously for doing business, but his company folded and the place has sat idle ever since. A classmate happens to know the owner, so it was lent to us as office space. Although the place is huge, we actually use less than 30 square metres by the windows, and the rest keeps its original furnishings. So for someone walking in for the first time, the impression is of stepping into a mansion rather than a small startup. Back to Sanguosha: I had rarely played board games before and knew only a little of the rules. We were going to play through the night, and I had not pulled an all-nighter in ages, so to last until dawn I played in the most recreational way possible. Every time I picked a character it was an AOE specialist, and the moment I got cards I would unleash area attacks... Bodies everywhere, and the game often turned on my mischief. After who knows how many rounds it was finally getting light, and we wrapped up to head back to our own places. At the lift lobby we found the lifts were off, so we took the stairs to the ground floor, only to find the main door locked too, and in the end we left through the underground garage. When we first moved in a neighbour had told me the building is locked at ten at night, after which the garage is the only way out. I never imagined my first trip through the garage would be not because of overtime but because of an all-night board game session.

On the 4th I made a trip to Fenghuangling, the only one of my holiday plans I actually carried out, though even that did not fully succeed. The main point of going was to look for Liu at Longquan Temple. After an hour and a half on the bus (standing the whole way, to be precise) I finally arrived at the foot of Fenghuangling. I asked a passer-by and it turned out Longquan Temple is inside a park that charges admission: 25 yuan for adults, half price for students. I immediately regretted not bringing my student ID. I bought a ticket, entered the park, and following the signs soon reached the temple gate, which looked exactly like the picture in the newspaper, so I was sure I had come to the right place. Before coming I had searched on Baidu, but there are several places on the map called Longquan Temple, far apart from each other, and I had picked the most likely one based on the newspaper's description. Inside the temple there were quite a few visitors. Walking further in, I came to a small gate marked "no visitors beyond this point", which I guessed led to the monks' living quarters. To the right of the gate was the reception hall. I found a monk and asked, "Is there someone here named xxx?" The monk said, "No." The answer was brief and came without any hesitation; I believe it was rehearsed, or perhaps he was simply used to being asked. After all, monastics do not lie, and he surely would not be stringing along a mere visitor like me.

Leaving the reception hall, I wandered around the other parts of the temple and without meaning to ended up on the hill behind it. Facing me was a three-storey building from which the sound of chanting drifted out. If I calmed down and listened carefully I could make out what they were reciting, but the moment my attention slipped it became a wall of noise again. Because of renovations the back hill was covered in scaffolding; getting in would not have been hard, but out of respect I did not climb over. Figuring that I had paid 25 yuan for the ticket after all, I thought I should at least walk up the hill. Hearing the voices of visitors above, I followed the sound up. I ran the whole way; from the foot to the top took only a few minutes, and then I started back down. Such a small hill is rather dull for a self-punishment enthusiast like me, and then again, how many great masters could such a small hill hold? They say a mountain need not be high, for it is the immortals that make it famous, but I saw no immortals along the way. I really wonder what it is that draws so many accomplished people to practise here.

Back down the hill I returned to the temple and stood in front of the reception hall for about half an hour. People came and went, and so did stink bugs. I have no idea why, but the little courtyard in front of the reception hall was crawling with them. In an ordinary family's yard they would surely have been treated to insecticide plus every other ruthless trick imaginable, yet here the little creatures live entirely at their ease. Apparently because some kind of intensive group practice was starting, the small gate marked "no visitors" was locked, as was the door from the reception hall to the inner courtyard, and after that the visitors thinned out. Catching a moment when no one was around, I asked the same monk one more question: "Was there ever someone named xxx here?" This time he answered just as readily: "Yes." After that I did not know what to say, so I asked for a seat and rested in the reception hall for about another half hour. During that time a nosy visitor really did come to ask the monk whether "that xxx who graduated from Peking University" was here, and the monk sent him away with a single "No." Seeing the visitor's helpless expression when he heard that "No", I almost laughed out loud, but managed to hold it in.

While I was sitting there, a staff member (the hall has many volunteers), bored, started toying with a stink bug on the table and was spotted by the monk from before. The monk said to him quite angrily, "What is there to play with? You are playing with a life." Then he set the bug free. Monastics truly hold compassion at heart and take the life of every creature seriously; the life of an insect and the life of a person are not to be ranked as more or less precious. No wonder the stink bugs here can live so free and easy.

On either side of the reception hall are two small rooms that visitors cannot enter; inside, people were chatting. They spoke loudly and did not seem to care that passers-by could hear, and since I was bored I listened to part of it. One of them seemed to be a lay practitioner who grows vegetables here and apparently worked in media before coming. He told the other, "Don't just see me growing vegetables here; I haven't let my old trade go to waste. While I grow vegetables I'm shooting a documentary. In a few years there will be a good film, and it's sure to bring back a big international award. One prize is worth 300,000, another is 100,000 US dollars; one film and a million comes in." He said this loudly, yet it sounded flat. He went on, "A lot of lay practitioners here think about far too many things. Most of them came because reality disappointed them; they imagined life here might be this or that, so they came, and after arriving found it isn't what they imagined either. The world was never going to be the way you imagine it. If the world ran exactly as you wished, wouldn't that be the end of the world?" Whether one eats vegetarian food and chants every day is not important; that is all a matter of form. What matters is a state of mind. If you always carry a worldly heart, then even living in a temple and reciting sutras a hundred times a day you will never understand the real meaning of practice.

Seen this way, although this is only a small temple, it really does hold quite a few remarkable people. The constant stream of visitors makes it noisy, but for someone seeking a plain and tranquil way of life it is truly a good choice. As for a worldly person like me, great mountains and seas, great winds and waves are more to my taste; on this point Liu's path and mine are completely different.

Finally, I was reminded of a poem from a primary school textbook, Jia Dao's "Looking for a Recluse and Not Finding Him".

Beneath the pines I asked the boy;

he said his master had gone to gather herbs,

somewhere within these mountains,

but the clouds are too deep to know where.


I spent 25 yuan on the ticket, plus 2.5 yuan x 2 for the bus and 2 yuan x 2 for the subway, and still did not get to see Liu. He hardly counts as a recluse anyway; for a while he was probably pestered half to death by those unscrupulous reporters, and now who knows where he has gone. Studying abroad is a form of practice, and chanting sutras is a form of practice too. Whichever he chooses and wherever he is, I believe he will become a great figure in the end. I wish him all the best!

Categories: others

Hello world!

September 28, 2010

Welcome to WordPress.com. This is your first post. Edit or delete it and start blogging!

Categories: Uncategorized

Send a future letter

December 31, 2005
I sent a letter to myself ten years in the future, to see how much I will have changed by then. I hope that by that time my dreams will have come true.
Categories: others