Friday, November 20, 2009

HW3 - Final Project Proposal

Part 1: The problem to solve

It is common for computer users to visit the same resource many times within a short time span. Some examples of resources a user might check often are:

· news webpage

· server status

· job board

· auction or shopping site

· network file

· message board

Moreover, it is often critical to discover updates as soon as they occur. RSS feeds can notify users of updates on sites that support RSS, and third-party websites can even forward such updates by email, but RSS has limitations. Many resources do not support RSS, and RSS reader applications frequently do not support alternative notification protocols, such as SMS. Additionally, RSS fails when the desired resource requires navigation through an authentication page.

RSS also has another obstacle: configuration inflexibility. A full solution to the problem would allow users to specify which feeds are most important and should be checked most frequently, and which feeds can be checked less often. Similarly, different notification protocols should be assignable to different feeds or classes of feeds. While an RSS reader application that provides all this functionality may become available in the future, the primary obstacle remains: RSS cannot be used for all resources. It works only over HTTP, and the feed must be published by the resource’s provider.


There are software packages that allow users to monitor resources that do not provide RSS feeds, but these also have limitations.

· Powerful packages are not free.

· Most are platform-specific

· Limited numbers of supported resource and notification protocols

· Limited or no functional extensibility

· Most advanced tools are GUI based with unnecessary clutter and complexity

· External monitoring services cannot monitor local files unless those files are exposed insecurely. Such services also require users to store their credentials in plain text in order to check secured resources for updates.

Because existing solutions are insufficient, users who need to discover updates to a variety of resources must perform many of the required tasks manually.

Part 2: Study of the existing solution.

The most feature-rich solution on the market is WebSite-Watcher (http://www.aignes.com), a retail, Windows-only, GUI-based application. It has the following features

· Monitor web pages

· Monitor password protected pages

· Monitor forums

· Monitor RSS-Feeds

· Monitor Newsgroups

· Monitor binary files

· Monitor local files

· Powerful yet simple filter system

· Highlight changes

· Monitor pages for specified words

· Monitor whole sites instead of single pages

· Additional actions when updates are detected

· Work with checked pages (Searches, Reports, etc.)

· Archive pages permanently

· Synchronize bookmark files

· Backup and Restore

Limitations that we found in the implementation are:

· When checking for updates, server synchronization issues can generate false positives. When a new version of a resource is published to a single server, and the other servers that offer the resource are only later synchronized with the updated version, the application interprets the previously-current version as a new update. The result is up to 50 update notifications for a single update.

· Creating a simple monitoring task is unnecessarily complex and time-consuming. Even the simplest tasks take as much time to create as the most complex.

· No extensibility to support new resource and notification protocols, or content extraction approaches.

· Minimum monitoring interval is 1 minute

An example of a simple scenario that monitors craigslist for new job postings is shown below.


Internally, WebSite Watcher has a scheduler thread that wakes up every minute to check all enabled tasks. For each task, if the current system time minus the time the task last ran is greater than or equal to the monitoring interval specified in the settings, the scheduler adds the task to a queue of waiting tasks. It then spawns a new thread for each task that needs to run. In each thread, WebSite Watcher requests the resource, receives the response, and writes the response to a local file for later use. If this is the first time the task has run, processing for the task is complete. If the user has chosen to ignore updates that contain certain keywords and the response contains such keywords, the update is ignored. Similarly, if the user has chosen to accept only updates that contain certain keywords and the response does not contain them, the update is ignored.
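The due-task check and the two keyword rules described above can be sketched in Python roughly as follows. This is only an illustration of the described behavior, not WebSite Watcher's actual code: the Task class, its field names, and both function names are hypothetical, and the sketch assumes that matching any one of the required keywords is enough for an update to count.

```python
class Task:
    """A monitoring task; the field names here are hypothetical."""

    def __init__(self, name, interval_seconds, last_run=None):
        self.name = name
        self.interval_seconds = interval_seconds
        self.last_run = last_run  # None means the task has never run


def due_tasks(tasks, now):
    """Return the tasks whose monitoring interval has elapsed at time `now`."""
    return [
        t for t in tasks
        if t.last_run is None or now - t.last_run >= t.interval_seconds
    ]


def passes_keyword_rules(text, ignore_keywords=(), required_keywords=()):
    """Apply the two keyword rules; return False if the update is ignored."""
    if any(word in text for word in ignore_keywords):
        return False
    # Assumption: matching any one of the required keywords is enough.
    if required_keywords and not any(word in text for word in required_keywords):
        return False
    return True
```

A scheduler loop would call due_tasks once per wake-up and hand each returned task to a worker thread.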

WebSite Watcher also applies content filtering as specified by the user. It applies the filters to both the previously saved version of the resource and the new content, and then compares the results. If they do not match, it highlights the changes in the new content and notifies the user as specified in the task settings.
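The filter-then-compare step can be illustrated with a short Python sketch. It assumes that a filter is a regular expression whose matches are removed before comparison, and it uses the standard difflib module to list the added lines; the function names are hypothetical and the real application may filter and highlight differently.

```python
import difflib
import re


def apply_filters(text, ignore_patterns):
    """Remove every match of each pattern so volatile text is not compared."""
    for pattern in ignore_patterns:
        text = re.sub(pattern, "", text)
    return text


def detect_change(previous, current, ignore_patterns=()):
    """Return the lines added in `current`, or [] if nothing relevant changed."""
    old = apply_filters(previous, ignore_patterns)
    new = apply_filters(current, ignore_patterns)
    if old == new:
        return []
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    # ndiff prefixes lines that exist only in the new version with "+ ".
    return [line[2:] for line in diff if line.startswith("+ ")]
```

With a filter such as r"Visitors: \d+", a page whose visitor counter changes but whose listings do not would produce no notification.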

Part 3: Supported features.

The domain of our language is retrieval of updates for local or Internet resources. The first version will support HTTP GET and local file resources. It will also support the following features

· Run on multiple platforms

· Depend only on freely available tools, libraries, and languages

· Monitor 1 or more resources with different refresh periods

· Support email notification

· Provide error logging and error email reporting

· Support regex extraction of content

· Support content extraction plugins

· Support notification plugins

· Support resource retrieval plugins

· Catch cyclical false positives.
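The proposal does not say how cyclical false positives will be caught. One possible approach, shown here purely as an assumption, is to remember a digest of every version already seen and to report an update only when the content has never been seen before, so that an out-of-sync server returning an older version does not trigger a notification. The SeenVersions class and its max_history parameter are hypothetical.

```python
import hashlib


class SeenVersions:
    """Remember digests of recent versions of one resource."""

    def __init__(self, max_history=100):
        self._order = []      # digests, oldest first
        self._seen = set()    # same digests, for fast lookup
        self._max = max_history

    def is_new(self, content):
        """Return True only the first time this exact content is seen."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        self._order.append(digest)
        if len(self._order) > self._max:
            # Forget the oldest digest to bound memory use.
            self._seen.discard(self._order.pop(0))
        return True
```

A bounded history is a trade-off: a version older than max_history entries would be reported again if it reappeared.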

Our language is loosely object-oriented, and has the following major objects

· Tasks

· Actions – such as get, monitor, and notify in the example below

· Configuration blocks – data for a specific method, such as email settings (address, subject, etc.) for a notify method

· Variables

· Literals (numeric, strings)

· Regular expressions


Tasks are created by chaining actions. Actions take configuration blocks, variables, and constants as possible arguments. Internally, actions are implemented as dynamically resolved Python functions, abstracting away complex logic. New plugins are just additional actions, which are, at base, Python functions.

The demo will monitor craigslist job postings and send an email notification upon detecting changes.

Additional features possible in future versions:

  1. Support the following types of resources:
    • HTTP POST
    • Ping requests
    • HTTP authentication
  2. Support exceptions (cases when not to report errors)

Sample program to monitor 2 resources:

In the craigslist task below, the job listings sit between two “Found:” blocks; the extract and remove actions are used to isolate them.

[EmailSettings1]

to = cs164@cs164.com


get("http://sfbay.craigslist.org/search/jjj?query=cool+jobs&catAbbreviation=jjj").removeTags().extract(".*Found: [0-9]* Displaying: [0-9]*").remove(".*Found: [0-9]* Displaying: [0-9]*").monitor(5).notify(EmailSettings1)


get(\logs\firewall.log).extractAllLinesWith("error").monitor(0.1).notify(EmailSettings1)

Part 4: Implementation

Two possible approaches for implementation:

  1. Parse the input into a data structure (effectively an AST), store the input in predefined fields of a custom class, and create a list of instances in which each instance represents a task. Configuration blocks and variables will be stored in the environment.
  2. Run an eval loop that processes each task and returns a tuple of two lambdas: the first is the monitor function, and the second is the resolved chain of calls to the resource retrieval, text extraction, and notification actions. Configuration blocks and variables will be stored in the environment.
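To make the second approach more concrete, here is a minimal Python sketch of what one evaluated task might look like. Everything in it is an assumption made for illustration: build_task, its parameters, and the three stand-in actions are hypothetical, and the monitor lambda is interpreted as a check of whether the task is due.

```python
import re


def build_task(source, pattern, interval_minutes, outbox):
    """Return (monitor, chain) for one task, in the spirit of approach 2."""
    # Stand-in actions; real ones would fetch a URL or file and send email.
    get = lambda: source()
    extract = lambda text: "\n".join(re.findall(pattern, text))
    notify = lambda text: outbox.append(text)

    # First lambda: decides whether the task is due to run.
    monitor = lambda minutes_since_last_run: minutes_since_last_run >= interval_minutes
    # Second lambda: the resolved chain get -> extract -> notify.
    chain = lambda: notify(extract(get()))
    return monitor, chain
```

A driver could keep a list of such tuples and, on each tick, call chain() for every task whose monitor(...) returns True.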

Under these approaches, each aspect of the system will be implemented as follows:

· Frontend

With the first approach, a parser will construct an AST that is passed to the interpreter. With the second approach, the eval loop will interpret each action and incrementally build the two lambdas (monitor and chain call).

· The core language

Python will be our core language, in which the scheduling thread and utility functions will be written. Actions will be implemented as separate Python files, with the convention that both the file and the function inside it have the same name as the action used in the task. For instance, the call get(\localhost\logs.txt) expects a Python file named “get.py” to exist in the current directory and to define a function get that takes one parameter. In addition, plugins will have to implement the following functions to help with syntax and runtime error checking: boolean isNotifier(), boolean isNavigator(), boolean isContentProcessor(), boolean isMonitor(), …
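The naming convention above lends itself to dynamic lookup with Python's standard importlib module. The sketch below is one possible way to resolve an action by name, assuming the plugin directory is on sys.path; resolve_action and ROLE_CHECKS are hypothetical names, only the four role functions named above are checked, and a missing role function is treated as False, which the proposal does not specify.

```python
import importlib

# The four role-check functions named in the proposal (the list is elided there).
ROLE_CHECKS = ("isNotifier", "isNavigator", "isContentProcessor", "isMonitor")


def resolve_action(name):
    """Import module `name` and return (function, roles) for that action."""
    module = importlib.import_module(name)
    func = getattr(module, name, None)
    if not callable(func):
        raise LookupError(f"module {name!r} does not define a function {name!r}")
    # Assumption: a role function that is absent counts as False.
    roles = {
        check for check in ROLE_CHECKS
        if callable(getattr(module, check, None)) and getattr(module, check)()
    }
    return func, roles
```

The returned roles could let the interpreter reject, before monitoring starts, a chain that for example ends in an action that is not a notifier.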

· Internal representation

The first approach will generate an AST as the internal representation of a program. The second approach will construct a list of tuples, each holding the lambdas for the monitor and the chained action calls. In both approaches, variables and configuration blocks will be stored in the environment.

· Interpreter/Compiler

In both approaches, the program will not be compiled; it will be interpreted by a driver written in Python.

· Debugging

The interpreter and parser will provide error details when there is a problem in the program. Syntax errors will be detected before the program starts monitoring. Plugin errors will be detected during a test execution of each task; such an error invalidates the task from further execution, is reported once by email, and is written to the error log, but does not stop the program. Runtime errors in data extraction or resource retrieval will be written to the error log but will neither invalidate the task nor terminate the program.
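The difference between the two error classes could be handled roughly as in the following Python sketch. The PluginError exception type, the task dictionary layout, and the send_error_email callback are all hypothetical; the sketch only illustrates the intended policy, in which a plugin error disables the task and is emailed once, while any other runtime error is only logged.

```python
import logging

logger = logging.getLogger("monitor")


class PluginError(Exception):
    """Hypothetical exception type raised when a plugin itself is faulty."""


def run_task(task, send_error_email):
    """Run one task; return True if it may be scheduled again."""
    if not task["valid"]:
        return False
    try:
        task["chain"]()
    except PluginError as exc:
        # Plugin error: log it, report it once by email, disable the task.
        logger.error("plugin error in task %s: %s", task["name"], exc)
        send_error_email(f"Task {task['name']} disabled: {exc}")
        task["valid"] = False
    except Exception as exc:
        # Runtime error: log only; the task stays valid and the program runs on.
        logger.error("runtime error in task %s: %s", task["name"], exc)
    return task["valid"]
```

Because an invalidated task returns early, the error email for a faulty plugin is sent only once.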
