<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-21724952</id><updated>2011-04-21T14:33:24.459-07:00</updated><title type='text'>Blog B659 Project</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://b659.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21724952/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://b659.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>B659 Project</name><uri>http://www.blogger.com/profile/13585224365130797688</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>8</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-21724952.post-114530656171499385</id><published>2006-04-17T13:27:00.000-07:00</published><updated>2006-04-17T13:42:41.726-07:00</updated><title type='text'>Latest progress...</title><content type='html'>The crawler was stopped after it had crawled 2.8 million pages and their link information from IU. 
The page properties (size, type, URL, etc.) and the link data (30GB of plain text) were then loaded into a PostgreSQL database for further analysis.&lt;br /&gt;&lt;br /&gt;For basic indiana.edu web page statistics, I've computed the following distributions: size, MIME type, URL length, etc. Based on the link information, in- and out-degree distributions have been plotted.&lt;br /&gt;&lt;br /&gt;For co-citation analysis, we removed internal links, i.e. kept only links from other hosts. URLs longer than 250 characters were also removed because most of these are dynamically generated. A co-citation analysis of a fully indexed citation (linkage) database table (i.e. the IU link data) was performed. The resulting co-citation network (stored in a database table with on the order of a hundred million records) has been produced. The next step is to produce a layout for this network and visualize it.&lt;br /&gt;&lt;br /&gt;In the meantime, we are getting close to analyzing the other network properties: b) average path length between any two pages, c) clustering coefficients, d) reachability, e) cycles, and f) domain diameter.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21724952-114530656171499385?l=b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://b659.blogspot.com/feeds/114530656171499385/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21724952&amp;postID=114530656171499385' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21724952/posts/default/114530656171499385'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21724952/posts/default/114530656171499385'/><link rel='alternate' type='text/html' href='http://b659.blogspot.com/2006/04/latest-progress.html' title='Latest 
progress...'/><author><name>B659 Project</name><uri>http://www.blogger.com/profile/13585224365130797688</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21724952.post-114245974010811370</id><published>2006-03-15T13:53:00.000-08:00</published><updated>2006-03-15T13:55:40.120-08:00</updated><title type='text'>Now crawling...</title><content type='html'>&lt;pre wrap=""&gt;Successfully rewrote some of the Heritrix code to save links only. It works nicely now: it crawled 186,907 pages in the first hour. According to Google, indiana.edu has 13 million pages, so it will take a few days to download the indiana.edu link structure.&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21724952-114245974010811370?l=b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://b659.blogspot.com/feeds/114245974010811370/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21724952&amp;postID=114245974010811370' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21724952/posts/default/114245974010811370'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21724952/posts/default/114245974010811370'/><link rel='alternate' type='text/html' href='http://b659.blogspot.com/2006/03/now-crawling.html' title='Now crawling...'/><author><name>B659 Project</name><uri>http://www.blogger.com/profile/13585224365130797688</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21724952.post-114057893997974359</id><published>2006-02-21T19:26:00.000-08:00</published><updated>2006-02-21T19:40:55.303-08:00</updated><title type='text'>Crawler...</title><content type='html'>I think the best option is to build our own crawler and retrieve only the link structure (it's hard work, but I think it would be easier!). I was thinking of implementing one in Java...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21724952-114057893997974359?l=b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://b659.blogspot.com/feeds/114057893997974359/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21724952&amp;postID=114057893997974359' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21724952/posts/default/114057893997974359'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21724952/posts/default/114057893997974359'/><link rel='alternate' type='text/html' href='http://b659.blogspot.com/2006/02/crawler.html' title='Crawler...'/><author><name>B659 Project</name><uri>http://www.blogger.com/profile/13585224365130797688</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21724952.post-114044947678302562</id><published>2006-02-20T07:29:00.000-08:00</published><updated>2006-02-20T07:31:16.806-08:00</updated><title type='text'>Crawling...</title><content type='html'>The crawler downloaded more than 100GB of pages and ran out of space. 
We have to figure out a way to download the link structure only...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21724952-114044947678302562?l=b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://b659.blogspot.com/feeds/114044947678302562/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21724952&amp;postID=114044947678302562' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21724952/posts/default/114044947678302562'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21724952/posts/default/114044947678302562'/><link rel='alternate' type='text/html' href='http://b659.blogspot.com/2006/02/crawling.html' title='Crawling...'/><author><name>B659 Project</name><uri>http://www.blogger.com/profile/13585224365130797688</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21724952.post-114001634119840497</id><published>2006-02-15T07:07:00.000-08:00</published><updated>2006-02-15T07:12:21.200-08:00</updated><title type='text'>Parsing ARC files: detailed info</title><content type='html'>Four command-line arguments must be supplied, in this order:&lt;br /&gt;&lt;br /&gt;"&amp;lt;The path to the source ARC files&amp;gt;"&lt;br /&gt;"&amp;lt;The path to the destination directory&amp;gt;"&lt;br /&gt;"&amp;lt;How many ARC files to process, if set to 0 then it will process all ARC files in the source directory&amp;gt;"&lt;br /&gt;"&amp;lt;Whether to parse text/html content only, 1 = true, 0 = false&amp;gt;"&lt;br /&gt;&lt;br /&gt;The help menu can be accessed by running filter with a 
command-line argument of "-?".&lt;br /&gt;&lt;br /&gt;Several files will be placed in the destination directory. After processing, each ARC file will generate an "&amp;lt;ARC file name&amp;gt;.tofrom" file which contains its link structure. Each will also generate an "&amp;lt;ARC file name&amp;gt;.dat" file which contains its zipped contents. Any pages that were skipped will be documented in "SkippedURLs.log". Finally, "FinalStats.log" will contain statistics such as the total size of data parsed, the number of pages, and various skipped file extensions.&lt;br /&gt;&lt;br /&gt;The link structure file (".tofrom") contains the following structure:&lt;br /&gt;"&amp;lt;Source URL&amp;gt;|&amp;lt;Destination URL&amp;gt;|&amp;lt;Anchor Text&amp;gt;|&amp;lt;Anchor Link&amp;gt;|&amp;lt;Hash of Contents&amp;gt;|&amp;lt;Byte offset in .dat file&amp;gt;|"&lt;br /&gt;&lt;br /&gt;The zipped contents file (".dat") can be read by first unzipping it using G-Zip, then grabbing the data between the byte offsets given by the link structure file.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21724952-114001634119840497?l=b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://b659.blogspot.com/feeds/114001634119840497/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21724952&amp;postID=114001634119840497' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21724952/posts/default/114001634119840497'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21724952/posts/default/114001634119840497'/><link rel='alternate' type='text/html' href='http://b659.blogspot.com/2006/02/parsing-arc-files-detailed-info_15.html' title='Parsing ARC files: detailed info'/><author><name>B659 
Project</name><uri>http://www.blogger.com/profile/13585224365130797688</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21724952.post-114001584426833461</id><published>2006-02-15T07:00:00.000-08:00</published><updated>2006-02-15T07:04:04.280-08:00</updated><title type='text'>Parsing ARC files to link structure</title><content type='html'>Crawling the iu domain (*.iu.edu and *.indiana.edu) is in progress...The following tool will be used to parse the ARC files crawled by the IA crawler and generate link structure output of the IU domain.&lt;br /&gt;&lt;br /&gt;https://gforge.cis.cornell.edu/projects/wri&lt;br /&gt;&lt;br /&gt;"The Web Research Infrastructure is a NSF funded joint project of Cornell University and Internet Archive to provide data &amp;amp; computing facilities for research about the World Wide Web. 
The computing facilities of Cornell Theory Center will be employed."&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21724952-114001584426833461?l=b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://b659.blogspot.com/feeds/114001584426833461/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21724952&amp;postID=114001584426833461' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21724952/posts/default/114001584426833461'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21724952/posts/default/114001584426833461'/><link rel='alternate' type='text/html' href='http://b659.blogspot.com/2006/02/parsing-arc-files-to-link-structure.html' title='Parsing ARC files to link structure'/><author><name>B659 Project</name><uri>http://www.blogger.com/profile/13585224365130797688</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21724952.post-113942919263106510</id><published>2006-02-08T12:01:00.000-08:00</published><updated>2006-02-08T12:06:32.643-08:00</updated><title type='text'>Open-source crawlers to be used...</title><content type='html'>The following crawlers look suitable and customizable for our project. 
Both are Java-based.&lt;br /&gt;&lt;br /&gt;Heritrix: the Internet Archive's open-source, extensible,           web-scale, archival-quality web crawler project:&lt;br /&gt;http://crawler.archive.org/&lt;br /&gt;&lt;br /&gt;WebSPHINX: A Personal, Customizable Web Crawler:&lt;br /&gt;http://www.cs.cmu.edu/~rcm/websphinx/&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21724952-113942919263106510?l=b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://b659.blogspot.com/feeds/113942919263106510/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21724952&amp;postID=113942919263106510' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21724952/posts/default/113942919263106510'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21724952/posts/default/113942919263106510'/><link rel='alternate' type='text/html' href='http://b659.blogspot.com/2006/02/open-source-crawlers-to-be-used.html' title='Open-source crawlers to be used...'/><author><name>B659 Project</name><uri>http://www.blogger.com/profile/13585224365130797688</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21724952.post-113865968628585778</id><published>2006-01-30T14:20:00.000-08:00</published><updated>2006-01-30T14:21:26.296-08:00</updated><title type='text'>B659 Project Proposal</title><content type='html'>B659 Project Proposal&lt;br /&gt;&lt;br /&gt;Web Topology of the Indiana University Domain&lt;br /&gt;&lt;br /&gt;Weimao Ke and Tiago Simas&lt;br /&gt;&lt;br /&gt;January 28, 2006&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;1- Overview&lt;br 
/&gt;The main goal of this project is to study some topological properties of the Web. We restrict ourselves to the Indiana University domain, i.e. all host names ending with &lt;a href="http://www.iu.edu/"&gt;iu.edu&lt;/a&gt; and its variations (e.g. &lt;a href="http://www.indiana.edu/"&gt;indiana.edu&lt;/a&gt;). The primary topological property we wish to study is the degree distribution of the hyperlink structure, more specifically the in-degree and out-degree distributions. We also want to analyze the co-citation network of the Web domain by treating hyperlinks as citations, and to cluster the network to reveal similarities and influences among the domain's web sites/pages. Below we present the major aspects of our project plan.&lt;br /&gt;&lt;br /&gt;2- Crawling the IU domain&lt;br /&gt;Our first task is to crawl the Indiana University domain. We intend to implement a crawler using a benchmark API (e.g. the W3C API) to fetch all URLs in this domain and store them in a local database. We will collect an initial set of seed URLs within the IU domain for crawling and ignore all URLs that point outside of the IU domain during the process.&lt;br /&gt;&lt;br /&gt;3- Creation of the Domain Graph&lt;br /&gt;After crawling is finished, all URLs will be stored in a database along with the hyperlinks between them. Web contents will not be preserved. Based on the inter-linkages, we will be able to create a directed graph that represents the Web topology of the IU domain.&lt;br /&gt;&lt;br /&gt;4- Network Analysis&lt;br /&gt;The main task of this project is network analysis. We are going to examine the following topological properties: a) degree distributions (in and out), b) average path length between any two pages, c) clustering coefficients, d) reachability, e) cycles, and f) domain diameter.&lt;br /&gt;&lt;br /&gt;We also plan to do "co-citation" analysis by treating hyperlinks as citations. 
Co-citations can be easily derived from directed links and are useful for identifying "similarity" and "influence" clues in the given directed graph. If a Web site/page has lots of strong co-citation connections with others, it might be treated as a "landmark" in its topical community, i.e. an "authority" in the sense of Kleinberg's paper. Computing "centrality" values of the nodes in such networks will reveal their importance. If we incorporate year/date information in some way, then we will also be able to study how topics become increasingly influential as communities emerge. We are going to use some clustering and network layout algorithms to present this network in a visually understandable way. For instance, a tool called VxOrd/VxInsight has relevant features for converting a huge citation graph to a co-citation network and producing a layout for the network. This analysis will be computationally intensive, though.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21724952-113865968628585778?l=b659.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://b659.blogspot.com/feeds/113865968628585778/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21724952&amp;postID=113865968628585778' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21724952/posts/default/113865968628585778'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21724952/posts/default/113865968628585778'/><link rel='alternate' type='text/html' href='http://b659.blogspot.com/2006/01/b659-project-proposal.html' title='B659 Project Proposal'/><author><name>B659 Project</name><uri>http://www.blogger.com/profile/13585224365130797688</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' 
height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
