Latest progress...
The crawler was stopped after it had collected 2.8 million pages and their link information from IU. The page properties (size, type, URL, etc.) and the link data (30 GB of plain text) were then loaded into a PostgreSQL database for further analysis.
For basic indiana.edu web page statistics, I've computed the following distributions: page size, MIME type, URL length, etc. Based on the link information, the in- and out-degree distributions have also been plotted.
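As a rough illustration of the degree distributions mentioned above, here is a minimal sketch in Python. It assumes the link data is available as deduplicated (source, target) pairs; the function name `degree_distributions` and the in-memory representation are hypothetical and stand in for whatever query was actually run against the database.

```python
from collections import Counter


def degree_distributions(edges):
    """Return (in_dist, out_dist) for an iterable of (source, target) links.

    Each result maps a degree value to the number of pages with that degree.
    Assumes the edge list has already been deduplicated.
    """
    in_deg = Counter()
    out_deg = Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1

    # Every page seen as a source or a target counts, so pages with
    # degree 0 on one side still appear in that side's distribution.
    pages = set(in_deg) | set(out_deg)
    in_dist = Counter(in_deg[p] for p in pages)
    out_dist = Counter(out_deg[p] for p in pages)
    return in_dist, out_dist


if __name__ == "__main__":
    sample = [("a", "b"), ("a", "c"), ("b", "c")]
    in_dist, out_dist = degree_distributions(sample)
    print(sorted(in_dist.items()))
    print(sorted(out_dist.items()))
```

Plotting the two returned mappings on log-log axes is the usual way to inspect such distributions.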
For the co-citation analysis, we removed internal links, i.e. we kept only links from other hosts. URLs longer than 250 characters were also removed because most of them are dynamically generated. The co-citation analysis was then run on a fully indexed citation (linkage) database table (i.e. the IU link data). The resulting co-citation network is stored in a database table with on the order of a hundred million records. The next step is to produce a layout for this network and visualize it.
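The filtering and counting steps described above can be sketched as follows. This is only an in-memory illustration, not the database procedure that was actually used. It assumes absolute URLs, applies the 250-character limit to both ends of a link (the text does not say which end was checked), and treats a link as internal when both URLs share a hostname. The name `cocitation_counts` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from urllib.parse import urlsplit

MAX_URL_LENGTH = 250  # threshold taken from the text


def cocitation_counts(links):
    """Count how often two pages are cited together by the same source page.

    links: iterable of (source_url, target_url) pairs with absolute URLs.
    Returns a dict mapping a sorted (url_a, url_b) pair to the number of
    distinct source pages that link to both.
    """
    cited_by_source = defaultdict(set)
    for src, dst in links:
        # Assumption: drop the link if either URL exceeds the limit.
        if len(src) > MAX_URL_LENGTH or len(dst) > MAX_URL_LENGTH:
            continue
        # Assumption: same hostname means an internal link.
        if urlsplit(src).hostname == urlsplit(dst).hostname:
            continue
        cited_by_source[src].add(dst)

    counts = defaultdict(int)
    for targets in cited_by_source.values():
        # Every unordered pair of targets cited by one source is co-cited once.
        for a, b in combinations(sorted(targets), 2):
            counts[(a, b)] += 1
    return dict(counts)
```

The returned pairs and their counts correspond to the weighted edges of a co-citation network. At the scale of the real link table, the same pairing would more plausibly be done as a self-join inside the database than in memory.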
In the meantime, we are getting close to analyzing the other network properties: b) average path length between any two pages, c) clustering coefficients, d) reachability, e) cycles, and f) domain diameter.
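For a sense of how properties such as average path length, reachability, and diameter could be measured, here is a minimal breadth-first-search sketch. It assumes a directed graph given as an adjacency mapping, averages only over ordered pairs that are actually reachable, and takes "diameter" to mean the longest finite shortest path, which may differ from the definition intended above. The function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, start):
    """Shortest hop counts from start to every page reachable from it."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def path_statistics(adj):
    """Return (average path length, diameter, reachable ordered pairs).

    adj maps each page to an iterable of the pages it links to.
    Unreachable pairs are excluded from the average.
    """
    nodes = set(adj)
    for targets in adj.values():
        nodes.update(targets)

    total = 0
    pairs = 0
    diameter = 0
    for node in nodes:
        for other, d in bfs_distances(adj, node).items():
            if other != node:
                total += d
                pairs += 1
                diameter = max(diameter, d)

    average = total / pairs if pairs else 0.0
    return average, diameter, pairs
```

Running one search per page is quadratic in the worst case, so for millions of pages the searches would likely have to start from a random sample of pages rather than from every page.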
