Monday, April 17, 2006

Latest progress...

The crawler was stopped after it had crawled 2.8 million pages, along with their link information, from the IU domain. The page properties (size, type, URL, etc.) and the link data (30 GB of plain text) were then loaded into a PostgreSQL database for further analysis.

For basic indiana.edu web page statistics, I've computed the following distributions: size, MIME type, URL length, etc. Based on the link information, the in- and out-degree distributions have been plotted.
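As an illustration of the degree distributions mentioned above, here is a minimal sketch in Python. It assumes the links are available as (source, destination) pairs; the function name and the toy edge list are made up for the example and are not the queries we actually ran against PostgreSQL.

```python
from collections import Counter

def degree_distributions(edges):
    """Return (in_dist, out_dist), each mapping degree -> number of pages.

    Duplicate links between the same two pages are counted separately;
    pass set(edges) instead if each pair should count only once.
    """
    in_deg = Counter()
    out_deg = Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    # Pages that only link out have in-degree 0, and vice versa.
    nodes = set(in_deg) | set(out_deg)
    in_dist = Counter(in_deg.get(n, 0) for n in nodes)
    out_dist = Counter(out_deg.get(n, 0) for n in nodes)
    return in_dist, out_dist

# Hypothetical toy data: four pages, four links.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
in_dist, out_dist = degree_distributions(edges)
for degree in sorted(in_dist):
    print("in-degree", degree, "pages", in_dist[degree])
```

Plotting the two maps on log-log axes is the usual way to look at such distributions.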

For co-citation analysis, we removed internal links, i.e. we kept only links coming from other hosts. URLs longer than 250 characters were also removed because most of them are dynamically generated. We then ran a co-citation analysis on the fully indexed citation (linkage) database table, i.e. the IU link data. The resulting co-citation network is stored in a database table with on the order of a hundred million records. The next step is to produce a layout for this network and visualize it.
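The filtering and co-citation steps can be sketched roughly as follows. This is only an illustration in Python, not the database procedure we used: the function names and sample URLs are hypothetical, a link is dropped if either of its URLs exceeds 250 characters (an assumption), and two pages count as co-cited once for every page that links to both of them.

```python
from collections import defaultdict
from itertools import combinations
from urllib.parse import urlsplit

MAX_URL_LEN = 250  # threshold stated in the post

def filter_links(links):
    """Drop over-long URLs and links whose source and target share a host."""
    kept = set()
    for src, dst in links:
        if len(src) > MAX_URL_LEN or len(dst) > MAX_URL_LEN:
            continue  # most likely a dynamically generated URL
        if urlsplit(src).hostname == urlsplit(dst).hostname:
            continue  # internal link
        kept.add((src, dst))
    return kept

def cocitation_counts(links):
    """Map each unordered pair of cited pages to its number of co-citing pages."""
    targets_of = defaultdict(set)
    for src, dst in links:
        targets_of[src].add(dst)
    counts = defaultdict(int)
    for targets in targets_of.values():
        for a, b in combinations(sorted(targets), 2):
            counts[(a, b)] += 1
    return dict(counts)

# Hypothetical sample links.
links = [
    ("http://a.indiana.edu/", "http://b.indiana.edu/x"),
    ("http://a.indiana.edu/", "http://c.indiana.edu/y"),
    ("http://d.indiana.edu/", "http://b.indiana.edu/x"),
    ("http://d.indiana.edu/", "http://c.indiana.edu/y"),
    ("http://b.indiana.edu/x", "http://b.indiana.edu/z"),           # internal
    ("http://a.indiana.edu/", "http://e.indiana.edu/" + "q" * 300),  # too long
]
counts = cocitation_counts(filter_links(links))
print(counts)
```

The number of pairs grows quadratically with a page's out-degree, which is one reason the resulting table gets so large.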

In the meantime, we are getting close to analyzing the other network properties listed in the proposal: b) average path length between any two pages, c) clustering coefficients, d) reachability, e) cycles, and f) domain diameter.

Wednesday, March 15, 2006

Now crawling...

Successfully rewrote some of the Heritrix code to save links only. It works nicely now: it crawled 186,907 pages in the first hour. According to Google, indiana.edu has 13 million pages, so it will take a few days to download the indiana.edu link structure.

Tuesday, February 21, 2006

Crawler...

I think the best option is to build our own crawler and retrieve only the link structure (it's hard work, but I think it would be easier!). I was thinking of implementing one in Java...

Monday, February 20, 2006

Crawling...

The crawler downloaded more than 100 GB of pages and ran out of disk space. We have to figure out a way to download the link structure only...

Wednesday, February 15, 2006

Parsing ARC files: detailed info

Four command-line arguments must be supplied, in this order:

"<The path to the source ARC files>"
"<The path to the destination directory>"
"<How many ARC files to process, if set to 0 then it will process all ARC files in the source directory>"
"<Whether to parse text/html content only, 1 = true, 0 = false>"

The help menu can be accessed by running filter with a command-line argument of "-?".

Several files will be placed in the destination directory. After processing, each ARC file will generate an "<ARC file name>.tofrom" file which contains its link structure. Each will also generate an "<ARC file name>.dat" file which contains its zipped contents. Any pages that were skipped will be documented in "SkippedURLs.log". Finally, "FinalStats.log" will contain statistics such as the total size of data parsed, the number of pages, and various skipped file extensions.

The link structure file (".tofrom") contains the following structure:
"<Source URL>|<Destination URL>|<Anchor Text>|<Anchor Link>|<Hash of Contents>|<Byte offset in .dat file>|"

The zipped contents file (".dat") can be read by first unzipping it with gzip, then grabbing the data between the byte offsets given by the link structure file.
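To make the two file formats concrete, here is a rough Python sketch of reading them. It rests on several assumptions that are not confirmed by the tool's documentation: that each record sits on one line, that no field contains a "|" character, that the byte offset refers to a position in the unzipped data, and that a page ends where the next offset begins. All names and sample values are hypothetical.

```python
import gzip
from typing import NamedTuple

class LinkRecord(NamedTuple):
    source_url: str
    destination_url: str
    anchor_text: str
    anchor_link: str
    content_hash: str
    byte_offset: int

def parse_tofrom_line(line):
    """Split one pipe-delimited record into its six fields."""
    fields = line.rstrip("\r\n").split("|")
    if fields and fields[-1] == "":
        fields.pop()  # the format ends with a trailing "|"
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    return LinkRecord(*fields[:5], int(fields[5]))

def read_slice(dat_bytes, start, end):
    """Unzip the .dat contents and return the bytes in [start, end)."""
    return gzip.decompress(dat_bytes)[start:end]

# Hypothetical in-memory stand-ins for a .tofrom line and a .dat file.
sample_line = "http://a.indiana.edu/|http://b.indiana.edu/|B page|http://b.indiana.edu/|abc123|16|\n"
sample_dat = gzip.compress(b"<html>one</html><html>two</html>")

record = parse_tofrom_line(sample_line)
page = read_slice(sample_dat, record.byte_offset, 32)
print(record.destination_url, page)
```

For real ARC output the whole .dat file would not be unzipped into memory at once; streaming through gzip.open and seeking would be the more practical route.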

Parsing ARC files to link structure

Crawling the IU domain (*.iu.edu and *.indiana.edu) is in progress... The following tool will be used to parse the ARC files produced by the IA crawler and generate link structure output for the IU domain.

https://gforge.cis.cornell.edu/projects/wri

"The Web Research Infrastructure is a NSF funded joint project of Cornell University and Internet Archive to provide data & computing facilities for research about the World Wide Web. The computing facilities of Cornell Theory Center will be employed."

Wednesday, February 08, 2006

Open-source crawlers to be used...

The following crawlers look suitable and customizable for our project. Both are Java-based.

Heritrix: the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project:
http://crawler.archive.org/

WebSPHINX: A Personal, Customizable Web Crawler:
http://www.cs.cmu.edu/~rcm/websphinx/

Monday, January 30, 2006

B659 Project Proposal

Web Topology of the Indiana University Domain

Weimao Ke and Tiago Simas

January 28, 2006


1- Overview
The main goal of this project is to study some topological properties of the Web. We restrict ourselves to the Indiana University domain, i.e. all host names ending with iu.edu and its variations (e.g. indiana.edu). The primary topological property we wish to study is the degree distribution of the hyperlink structure, more specifically the in-degree and out-degree distributions. We also want to analyze the co-citation network of the Web domain by treating hyperlinks as citations, and to cluster the network to reveal similarities and influences among the domain's web sites/pages. In the following we present the major aspects of our project plan.

2- Crawling the IU domain
Our first task is to crawl the Indiana University domain. We intend to implement a crawler using a benchmark API (e.g. the W3C API) to fetch all URLs in this domain and store them in a local database. We will collect an initial set of seed URLs within the IU domain for crawling and ignore all URLs that point outside of the IU domain during the process.

3- Creation of the Domain Graph
After crawling is finished, all URLs will be stored in a database together with the hyperlinks between them. Web contents will not be preserved. Based on these inter-linkages, we will be able to create a directed graph that represents the Web topology of the IU domain.

4- Network Analysis
The main task of this project is network analysis. We are going to examine the following topological properties: a) degree distributions (in and out), b) average path length between any two pages, c) clustering coefficients, d) reachability, e) cycles, and f) domain diameter.
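Several of these properties (average path length, reachability, diameter) can be derived from breadth-first search over the directed graph. The following Python sketch shows the idea on a toy graph. The function names and edges are hypothetical, and we assume that only ordered pairs connected by a directed path are counted; on a graph with millions of pages one would likely sample start nodes rather than search from every page.

```python
from collections import deque

def build_adjacency(edges):
    """Build {node: set of successors}, including nodes with no out-links."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set())
    return adj

def bfs_distances(adj, start):
    """Shortest hop counts from start to every page reachable from it."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def path_statistics(edges):
    """Return (average path length, diameter, number of reachable ordered pairs)."""
    adj = build_adjacency(edges)
    total = pairs = diameter = 0
    for start in adj:
        for target, d in bfs_distances(adj, start).items():
            if target != start:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    average = total / pairs if pairs else 0.0
    return average, diameter, pairs

# Hypothetical toy graph: a cycle a -> b -> c -> a plus a dead end c -> d.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
average, diameter, pairs = path_statistics(edges)
print(average, diameter, pairs)
```

Here the number of reachable pairs gives a simple reachability measure: 9 of the 12 possible ordered pairs are connected, because nothing can be reached from the dead-end page d.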

We also plan to do "co-citation" analysis by treating hyperlinks as citations. Co-citations can be easily derived from directed links and are useful for identifying "similarity" and "influence" clues in the given directed graph. If a Web site/page has many strong co-citation connections with others, it might be treated as a "landmark" in its topical community, i.e. an "authority" in Kleinberg's paper. Computing "centrality" values of the nodes in such networks will reveal their importance. If we incorporate year/date information in some way, we will also be able to study how topics become increasingly influential as communities emerge. We are going to use some clustering and network layout algorithms to present this network in a visually understandable way. For instance, a tool called VxOrd/VxInsight has relevant features for converting a huge citation graph to a co-citation network and producing a layout for the network. This analysis will be computationally intensive, though.