Saturday, December 19, 2009

Friday, December 18, 2009

Final Writeup

Part 1: The problem to solve

It is a common task for computer users to visit the same resource multiple times during a short time span. Some examples of resources a user might check often are:

· news webpage

· server status

· job board

· auction or shopping site

· network file

· message board

Moreover, it is often critical to discover updates as soon as they occur. RSS feeds can notify users of updates on sites that support RSS, and such updates can even be emailed using third-party websites, but RSS has limitations. Many resources do not support RSS, and RSS reader applications frequently do not support alternative notification protocols, such as SMS. Additionally, RSS fails when the desired resource requires navigation through an authentication page. RSS also has another obstacle: configuration inflexibility. A full solution to the problem would allow users to specify which feeds are most important and should be checked most frequently, and which feeds can be checked less often. Similarly, different notification protocols should be assignable to different feeds or classes of feeds. While an RSS reader application that provides all this functionality may become available in the future, the primary obstacle remains: RSS cannot be used for all resources. The service works only over HTTP, and must be offered by the resource's provider.


There are software packages that allow users to monitor resources that do not provide RSS feeds, but these also have limitations.

· Powerful packages are not free.

· Most are platform-specific

· Limited numbers of supported resource and notification protocols

· Limited or no functional extensibility

· Most advanced tools are GUI based with unnecessary clutter and complexity

· External monitoring services cannot monitor local files without insecure exposure. Also, such services require users to store their credentials in plain text in order to check secured resources for updates.

The insufficiency of existing solutions requires users who need to discover updates to a variety of resources to perform many of the required tasks manually.

Part 2: Study of the existing solution.

The most feature-rich solution on the market is WebSite-Watcher (http://www.aignes.com), a retail, Windows-only, GUI-based application. It has the following features

· Monitor web pages

· Monitor password protected pages

· Monitor forums

· Monitor RSS-Feeds

· Monitor Newsgroups

· Monitor binary files

· Monitor local files

· Powerful yet simple filter system

· Highlight changes

· Monitor pages for specified words

· Monitor whole sites instead of single pages

· Additional actions when updates are detected

· Work with checked pages (Searches, Reports, etc.)

· Archive pages permanently

· Synchronize bookmark files

· Backup and Restore

Limitations that we found in the implementation are:

· When checking for updates, server synchronization issues can generate false positives. When a new version of a resource is published to a single server, and the other servers that offer the resource are only later synchronized with the updated version, the application interprets the previously-current version as a new update. The result is up to 50 update notifications for a single update.

· Creating a simple monitoring task is unnecessarily complex and time-consuming. Even the simplest tasks take as much time to create as the most complex.

· No extensibility to support new resource and notification protocols, or content extraction approaches.

· Minimum monitoring interval is 1 minute

An example of a simple scenario that monitors craigslist for new job postings is shown below.


Internally, WebSite Watcher has a scheduler thread that wakes up every minute to check on all enabled tasks. For each task, if the current system time less the time that the task was last run is equal to or greater than the monitoring interval specified in the settings, it adds the task to a queue of waiting tasks. It then spawns a new thread for each task that needs to run. In each thread, WebSite Watcher requests the resource and receives the response. It writes the response to a local file to be used later. If this is the first time this task has run, processing is complete for this task. If the user has specified to ignore updates that contain keywords and the resource response does contain such keywords, the update is ignored. Similarly, if the user has specified to restrict valid updates based on keywords and the response does not contain the specified keywords, the update is ignored.

WebSite Watcher also applies content filtering as specified by the user. It will apply such filtering on the previously saved version of the resource and the update content and then compare the results. If they do not match, it will highlight changes in the new content and notify the user as specified in the task settings.
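The scheduling test and the two keyword rules described above can be sketched in Python. This is only an illustration of the described behavior, not WebSite Watcher's actual code; the function names, the case-insensitive matching, and the "any required keyword is enough" rule are assumptions made for the example.

```python
from datetime import datetime, timedelta

def is_due(now, last_run, interval):
    """True when the time since last_run is equal to or greater than interval."""
    if last_run is None:  # assumption: a task that never ran is due immediately
        return True
    return now - last_run >= interval

def passes_keyword_filters(text, ignore_words=(), require_words=()):
    """Apply the ignore/restrict keyword rules to a response body."""
    lowered = text.lower()  # assumption: matching is case-insensitive
    if any(word.lower() in lowered for word in ignore_words):
        return False  # update contains a keyword the user asked to ignore
    if require_words and not any(word.lower() in lowered for word in require_words):
        return False  # update lacks every keyword the user asked for
    return True

# A task with a 5-minute interval that last ran at 12:00 is due at 12:05.
last = datetime(2009, 12, 18, 12, 0)
print(is_due(datetime(2009, 12, 18, 12, 5), last, timedelta(minutes=5)))
```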

Part 3: Supported features.

Our solution is a domain-specific language: MUNDane (Monitor for Updates, Deliver Notifications - relieves you of those mundane tasks). The domain of MUNDane is retrieval of updates for local or network resources. The first version is a framework which supports retrieval through HTTP GET, operating system commands, and local files. This framework is designed to be readily extensible, and includes the following features

· Run on multiple platforms

· Depend only on freely available tools/libraries/languages

· Monitor multiple resources with different refresh periods

· Support email notification

· Provide error logging

· Support content extraction plugins

· Support notification plugins

· Support resource retrieval plugins

· Support monitoring plugins

· Catch cyclical false positives.


MUNDane has the following major objects

· Configuration Blocks – optional structures, which contain variable definitions. These blocks can be passed as arguments to actions, as an alternative to literal values.

· Variables – defined in configuration blocks, these contain data used by an action. Variable names must match parameter names from the intended action, but can appear in a configuration block in any order.

· Task Blocks – contain actions, which collectively define a task. Task blocks may also contain variable redefinitions, which reference a configuration block and variable by name. Redefinitions can use the '+' operator for concatenation of values.

· Actions – are defined by plugins, which are of four types: navigation, processing, monitoring, and notification. Actions take arguments: literal values or configuration blocks.

· Literals (numeric, strings)

· Regular expressions


While a MUNDane program may be written without any configuration blocks, using them allows data to be reused without retyping it. Hiding this data in a task block reduces task block complexity and aids in readability. Variable redefinition in a task block extends this functionality. If data from a configuration block is used frequently, and must be altered in a minor fashion for a given task, it may be done so without requiring an entire, additional configuration block. Value concatenation in redefinition allows a user to use templates, and extend them on a per-task basis.

Internally, actions are implemented as plugins, which are, in fact, dynamically resolved Python functions; complex logic is abstracted away from the user program. This structure allows the language to be easily extended by adding new plugins.
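As a rough illustration of what "dynamically resolved Python functions" can mean, the sketch below looks up each action by name at run time and threads the text result from one plugin to the next. The PLUGINS table and run_actions helper are hypothetical names for this example, and the two plugin bodies are simplified stand-ins rather than MUNDane's real implementations.

```python
import re

def removeHTMLtags(text):
    """Naive tag stripper: deletes anything that looks like <...>."""
    return re.sub(r"<[^>]+>", "", text)

def keepEachLineWith(text, needle):
    """Keep only the lines that contain the given substring."""
    return "\n".join(line for line in text.splitlines() if needle in line)

# Hypothetical lookup table: action name -> Python function.
PLUGINS = {func.__name__: func for func in (removeHTMLtags, keepEachLineWith)}

def run_actions(text, actions):
    """Resolve each (name, args) pair by name and chain the results."""
    for name, args in actions:
        plugin = PLUGINS.get(name)
        if plugin is None:
            raise NameError(f"no plugin named {name!r}")
        text = plugin(text, *args)
    return text

result = run_actions(
    "<b>error 1</b>\nok",
    [("removeHTMLtags", ()), ("keepEachLineWith", ("error",))],
)
print(result)
```

Adding a new action in this sketch only requires registering one more function, which is the extensibility property the paragraph above describes.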


Examples:


PROGRAM 1

Here is a MUNDane program to monitor craigslist for software internship postings. First a configuration block is defined to contain the data necessary for notification of the results of this monitoring task. Variables are defined in this block for the data, which will be used by the emailNotify() plugin. Another block is defined for the url data, which might in another task contain a user ID and password. For this task only a url is necessary.
Finally a task is defined and begun using a get plugin. The httpGet() plugin retrieves text from a web address using the data in its argument block, which is passed to the next plugin called. removeHTMLtags() strips away HTML tags, leaving text for processing by the next plugin. keepAfter() passes on the text which comes after the first occurrence of its argument string (optionally using regular expressions), and removeFrom() strips from the text its argument string and everything after it.
The above plugin calls completely define what is to be done in the task, and their result is passed to the monitor() plugin, which defines how often this task is to be performed. The emailNotify() plugin comes last and performs the needed notification of results.


[Config:StdEmail]

to = "afclay@gmail.com"

server = "smtp.gmail.com"

port = 587

userid = "monitordemo@gmail.com"

password = ********

encryption = "tls"


[Config:CraigURL]

url = "http://sfbay.craigslist.org/search/egr?query=internship&catAbbreviation=sof"


[Task:CraigslistInternship]

httpGet(CraigURL)

removeHTMLtags()

keepAfter("Found: [0-9]* Displaying: [0-9]*[ \-0-9]*\s*")

removeFrom("Sort by: most recent best match")

monitor(1)

emailNotify(StdEmail)


Whenever the text result of this task is different from the previous result of the task, the user will be notified.

Note the asterisks used as a value for the password variable in the StdEmail configuration block. The interpreter will respond to this special value by prompting the user for the actual password. That password will be encrypted with the DES algorithm, and the program file will be updated with this encrypted password – see Part 4: Implementation for more details.
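The two trimming plugins used in Program 1 could look roughly like the following in Python. This is a sketch under assumptions: the argument is always treated as a regular expression, and text is returned unchanged when the pattern is not found. Neither detail is specified in this writeup.

```python
import re

def keepAfter(text, pattern):
    """Return the text that follows the first match of pattern."""
    match = re.search(pattern, text)
    return text[match.end():] if match else text  # assumption: no match, no change

def removeFrom(text, pattern):
    """Drop the first match of pattern and everything after it."""
    match = re.search(pattern, text)
    return text[:match.start()] if match else text  # assumption: no match, no change

# Made-up page text shaped like the craigslist listing after tag removal.
page = "Found: 12 Displaying: 1 - 12 Intern A Intern B Sort by: most recent best match footer"
body = keepAfter(page, r"Found: [0-9]* Displaying: [0-9]*[ \-0-9]*\s*")
body = removeFrom(body, "Sort by: most recent best match")
print(repr(body))
```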


PROGRAM 2

This program similarly uses a configuration Block for notification, but the 'to' value is incomplete. It can serve as a template, and be extended with concatenation as needed for each task. The task here is a simple execution of the unix time command, in which the 'to' value from the configuration block is redefined.


[Config:EmailTemplate]

to = "@gmail.com"

server = "smtp.gmail.com"

port = 587

userid = "monitordemo@gmail.com"

password = ********

encryption = "tls"


[Task:Time]

execCmd("time")

monitor(1)

EmailTemplate.to = "monitordemo" + EmailTemplate.to

emailNotify(EmailTemplate)


PROGRAM 3

This program gets a locally stored file. It retains as a result every line in the file containing "error", and notifies the user of new errors.


[Config:StdEmail]

to = "monitordemo@gmail.com"

server = "smtp.gmail.com"

port = 587

userid = "monitordemo@gmail.com"

password = ********

encryption = "tls"


[Task:LogErrors]

getFile("firewall.log")
keepEachLineWith("error")
monitor(1)
emailNotify(StdEmail)


Part 4: Implementation

Our implementation uses four main components: a grammar, a program file (input), an interpreter/monitor loop, and a set of plugins. The program file is the frontend, written by the user. It is passed as an argument to a parser generated from our grammar, and the result is an AST - a Python tuple containing configuration and task data.

This AST is passed to the interpreter, which uses a Read-Eval loop. The interpreter will generate a structure for the global environment, which holds sub-environments for each configuration block. The local environment of each configuration-block may also be mapped to overriding local environments associated with each task block, which contain variable redefinitions. Because variable redefinitions are limited to the scope of a specific task block, they do not overwrite values used by other tasks.
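One way to picture task-scoped redefinitions is with collections.ChainMap from the Python standard library, which layers an overriding mapping over a base mapping without modifying the base. This is a sketch only; the writeup does not say which data structure the interpreter uses, and task_view is a hypothetical helper name.

```python
from collections import ChainMap

# Global environment: one sub-environment per configuration block.
global_env = {
    "EmailTemplate": {"to": "@gmail.com", "server": "smtp.gmail.com", "port": 587},
}

def task_view(env, block_name, overrides):
    """Layer a task's redefinitions over a configuration block without mutating it."""
    return ChainMap(dict(overrides), env[block_name])

# Equivalent of: EmailTemplate.to = "monitordemo" + EmailTemplate.to
time_task_env = task_view(
    global_env,
    "EmailTemplate",
    {"to": "monitordemo" + global_env["EmailTemplate"]["to"]},
)

print(time_task_env["to"])             # value seen by this task
print(global_env["EmailTemplate"]["to"])  # value still seen by other tasks
```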

The global environment, along with a list of tasks stored as instances of a Task class, is passed to the MUNDane driver, which executes a processing loop. This loop determines if the program file has been altered since it was parsed. If it has been altered, it will be parsed again; changes to the program file are implemented dynamically - the MUNDane driver does not need to be closed and restarted. The loop then calls the plugins for each task. For each task, the loop also records the current time in the Task class instance. This Task instance is passed as an additional argument to each plugin - a desugaring element of MUNDane. All plugins support a method which returns their plugin type (navigation, processing, monitoring, notification). The loop will first call the task's monitoring plugin, which will use the time recorded in the Task instance to determine if the task should be run. If so, navigation plugin(s) are called next, in the order they were entered in the task block. Their result is stored implicitly in the Task instance and need not be referenced by the user. Processing plugin(s) are called next, also in order. Their result is compared with the two previous results of the task. If the current result differs from both of these, the notification plugin will be called.
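The call order within one pass of the loop might be sketched as follows. This is a simplified illustration, not the actual driver: plugins are represented as (type, function) pairs instead of objects with a type-reporting method, and the decision to notify is delegated to a should_notify callback. The Task fields and the run_once name are assumptions for the example.

```python
class Task:
    """Minimal stand-in for the Task class; field names are assumptions."""
    def __init__(self, name, plugins):
        self.name = name
        self.plugins = plugins  # list of (plugin_type, function) in program order
        self.now = None         # time recorded by the loop for this pass
        self.last_run = None
        self.result = None      # implicit result passed between plugins

def run_once(task, now, should_notify):
    """Run one pass for one task; return True if the task actually ran."""
    task.now = now  # recorded so the monitoring plugin can consult it
    staged = {
        kind: [func for plugin_type, func in task.plugins if plugin_type == kind]
        for kind in ("monitoring", "navigation", "processing", "notification")
    }
    # 1. The monitoring plugin decides whether the task is due.
    if not all(func(task) for func in staged["monitoring"]):
        return False
    # 2. Navigation, then processing, each in the order written in the task block.
    for kind in ("navigation", "processing"):
        for func in staged[kind]:
            task.result = func(task)
    # 3. Notification only when the result counts as an update.
    if should_notify(task.result):
        for func in staged["notification"]:
            func(task)
    task.last_run = now
    return True
```

Because the stages are selected by type, the order in which plugins of different types appear in the task block does not affect when they run in this sketch.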

Storing two previous results prevents false positives associated with server synchronization. Remote servers are generally not updated simultaneously. One possible result is that a change is detected in the data stored on one server, but the next time the data is visited it is retrieved from a different server, which still stores earlier data. If two previous versions of the data were not stored by the MUNDane driver, a false positive would be generated in this situation, and two notifications would be sent, instead of one.
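The comparison against two stored results can be sketched with a bounded deque. The ChangeDetector class is a hypothetical name, and two choices here are assumptions rather than documented behavior: the first result is treated as a baseline that triggers no notification, and only results that count as new are stored, so the history holds the two most recent distinct versions.

```python
from collections import deque

class ChangeDetector:
    """Report a result as new only if it differs from both stored results."""
    def __init__(self, depth=2):
        self.history = deque(maxlen=depth)  # oldest entry is dropped automatically

    def is_new(self, result):
        if not self.history:       # assumption: first run only sets the baseline
            self.history.append(result)
            return False
        if result in self.history:  # matches a stored version: stale or unchanged
            return False
        self.history.append(result)
        return True

detector = ChangeDetector()
# v2 is published on one server while another still serves v1 for a while.
observed = ["v1", "v2", "v1", "v2", "v3"]
flags = [detector.is_new(version) for version in observed]
print(flags)
```

With only one stored result, the third and fourth observations would each look like a change; with two, only the genuine updates (v2 and v3) are flagged.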

Python was selected as the language underlying MUNDane primarily because of its simplicity and extensive libraries. In Python, many plugins can be written with just a few lines of code. Again for simplicity, and for uniformity, the other elements of MUNDane - the interpreter, driver, and support files - are also written in Python.

Debugging MUNDane programs is done in two steps. Syntax errors in the program file will generate error-notifications immediately. Run-time errors are recorded in the log file, which is configurable. Using command-line arguments, the desired level of log detail can be selected, providing feedback useful in debugging. Critically, errors in one task will not end the MUNDane driver loop. Errors will be recorded for failing tasks, but other tasks will continue to run. Detecting these errors in the log file can be done with MUNDane itself, by defining a task to visit the logfile and send notifications appropriately.
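The two run-time behaviors described above, a log level chosen on the command line and failures that are logged without stopping other tasks, can be sketched with the standard argparse and logging modules. The --log-level flag name, the logger name, and the run_all helper are assumptions for illustration; the writeup does not list MUNDane's actual command-line arguments.

```python
import argparse
import logging

log = logging.getLogger("mundane")  # hypothetical logger name

def parse_log_level(argv):
    """Map a hypothetical --log-level argument to a logging level constant."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--log-level",
        default="WARNING",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
    )
    return getattr(logging, parser.parse_args(argv).log_level)

def run_all(tasks):
    """Run every (name, callable) task; log failures and keep going."""
    failed = []
    for name, run in tasks:
        try:
            run()
        except Exception:
            # Record the traceback for this task, then continue with the rest.
            log.exception("task %s failed", name)
            failed.append(name)
    return failed
```

A driver built this way would call log.setLevel(parse_log_level(sys.argv[1:])) once at startup and run_all on each pass of its loop.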

Friday, December 11, 2009

Final project, milestone 1

PROGRAM 1
Here is a program to monitor craigslist for software internship postings. First a configuration block is defined to contain the data necessary for notification of the results of this monitoring task. Variables are defined in this block for the data, which will be used by the notify(configBlock) plugin. Another block is defined for the url data, which might in another task contain a user ID and password. For this task only a url is necessary.
Finally a task is defined and begun using a get plugin. The getHTML(configBlock) plugin retrieves text from a web address using the data in its argument block, which is passed to the next plugin called. removeHTMLtags() strips away HTML tags, leaving text for processing by the next plugin. keepAfter(strArg) passes on the text which comes after the first occurrence of its argument string (optionally using regular expressions), and removeFrom(strArg) strips from the text its argument string and everything after it.
The above plugin calls completely define what is to be done in the task, and their result is passed to the monitor(timeValue) plugin, which defines how often this task is to be performed. The notify(configBlock) plugin comes last and performs the needed notification of results.

[Config:StdEmail]
to = myname@gmail.com
server = smtp.gmail.com
port = 465
UserID = myname@gmail.com
Password = rumplestiltskin

[Config:CraigURL]
url = "http://sfbay.craigslist.org/search/egr?query=internship&catAbbreviation=sof"

[Task:CraigslistInternship]
getHTML(CraigURL)
removeHTMLtags()
keepAfter("Found: [0-9]* Displaying: [0-9]*")
removeFrom("Found: [0-9]* Displaying: [0-9]*")
monitor(5 minutes)
notify(StdEmail)

Whenever the text result of this task is different from the previous result of the task, the user will be notified.

PROGRAM 2
This program similarly uses a configuration Block for notification, but gets a locally stored file. It retains as a result every line in the file containing "error", and notifies the user of new errors every tenth of a second.

[Config:SecurityEmail]
to = security@mydomain.com
server = smtp.mydomain.com
port = 950
UserID = monitor@mydomain.com
Password = unbreakable

[Task:FirewallLog]
getFile("\logs\firewall.log")
keepEachLineWith("error")
monitor(0.1 second)
notify(SecurityEmail)

PROGRAM 3
This last program combines two tasks: the one from the first program and a unix command. The unix command does its own parsing and extracting, so only the monitor(timeValue) and notify(configBlock) plugins are required.

[Config:StdEmail]
to = myname@gmail.com
server = smtp.gmail.com
port = 465
UserID = myname@gmail.com
Password = rumplestiltskin

[Config:CraigURL]
url = "http://sfbay.craigslist.org/search/egr?query=internship&catAbbreviation=sof"

[Config:CS164Email]
to = cs164-ai@berkeley.edu
server = smtp.gmail.com
port = 465
UserID = myname@gmail.com
Password = rumplestiltskin

[Task:CraigslistInternship]
getHTML(CraigURL)
removeHTMLtags()
keepAfter("Found: [0-9]* Displaying: [0-9]*")
removeFrom("Found: [0-9]* Displaying: [0-9]*")
monitor(5 minutes)
notify(StdEmail)

[Task:Glookup]
execCmd(`glookup|grep final`)
monitor(15 minutes)
notify(CS164Email)

Friday, November 20, 2009

HW3 - Final Project Proposal

Part 1: The problem to solve

It is a common task for computer users to visit the same resource multiple times during a short time span. Some examples of resources a user might check often are:

· news webpage

· server status

· job board

· auction or shopping site

· network file

· message board

Moreover, it is often critical to discover updates as soon as they occur. RSS feeds can notify users of updates on sites that support RSS, and such updates can even be emailed using third-party websites, but RSS has limitations. Many resources do not support RSS, and RSS reader applications frequently do not support alternative notification protocols, such as SMS. Additionally, RSS fails when the desired resource requires navigation through an authentication page. RSS also has another obstacle: configuration inflexibility. A full solution to the problem would allow users to specify which feeds are most important and should be checked most frequently, and which feeds can be checked less often. Similarly, different notification protocols should be assignable to different feeds or classes of feeds. While an RSS reader application that provides all this functionality may become available in the future, the primary obstacle remains: RSS cannot be used for all resources. The service works only over HTTP, and must be offered by the resource's provider.


There are software packages that allow users to monitor resources that do not provide RSS feeds, but these also have limitations.

· Powerful packages are not free.

· Most are platform-specific

· Limited numbers of supported resource and notification protocols

· Limited or no functional extensibility

· Most advanced tools are GUI based with unnecessary clutter and complexity

· External monitoring services cannot monitor local files without insecure exposure. Also, such services require users to store their credentials in plain text in order to check secured resources for updates.

The insufficiency of existing solutions requires users who need to discover updates to a variety of resources to perform many of the required tasks manually.

Part 2: Study of the existing solution.

The most feature-rich solution on the market is WebSite-Watcher (http://www.aignes.com), a retail, Windows-only, GUI-based application. It has the following features

· Monitor web pages

· Monitor password protected pages

· Monitor forums

· Monitor RSS-Feeds

· Monitor Newsgroups

· Monitor binary files

· Monitor local files

· Powerful yet simple filter system

· Highlight changes

· Monitor pages for specified words

· Monitor whole sites instead of single pages

· Additional actions when updates are detected

· Work with checked pages (Searches, Reports, etc.)

· Archive pages permanently

· Synchronize bookmark files

· Backup and Restore

Limitations that we found in the implementation are:

· When checking for updates, server synchronization issues can generate false positives. When a new version of a resource is published to a single server, and the other servers that offer the resource are only later synchronized with the updated version, the application interprets the previously-current version as a new update. The result is up to 50 update notifications for a single update.

· Creating a simple monitoring task is unnecessarily complex and time-consuming. Even the simplest tasks take as much time to create as the most complex.

· No extensibility to support new resource and notification protocols, or content extraction approaches.

· Minimum monitoring interval is 1 minute

An example of a simple scenario that monitors craigslist for new job postings is shown below.


Internally, WebSite Watcher has a scheduler thread that wakes up every minute to check on all enabled tasks. For each task, if the current system time less the time that the task was last run is equal to or greater than the monitoring interval specified in the settings, it adds the task to a queue of waiting tasks. It then spawns a new thread for each task that needs to run. In each thread, WebSite Watcher requests the resource and receives the response. It writes the response to a local file to be used later. If this is the first time this task has run, processing is complete for this task. If the user has specified to ignore updates that contain keywords and the resource response does contain such keywords, the update is ignored. Similarly, if the user has specified to restrict valid updates based on keywords and the response does not contain the specified keywords, the update is ignored.

WebSite Watcher also applies content filtering as specified by the user. It will apply such filtering on the previously saved version of the resource and the update content and then compare the results. If they do not match, it will highlight changes in the new content and notify the user as specified in the task settings.

Part 3: Supported features.

The domain of our language is retrieval of updates for local or Internet resources. The first version will support HTTP GET and local file resources. It will also support the following features

· Run on multiple platforms

· Depend only on the freely available tools/libraries/languages

· Monitor 1 or more resources with different refresh periods

· Support email notification

· Provide error logging and error email reporting

· Support regex extraction of content

· Support content extraction plugins

· Support notification plugins

· Support resource retrieval plugins

· Catch cyclical false positives.

Our language is loosely object-oriented, and has the following major objects

· Tasks

· Actions – such as get, monitor, and notify in the example below

· Configuration blocks – data for a specific method, such as email settings (address, subject, etc.) for a notify method

· Variables

· Literals (numeric, strings)

· Regular expressions


Tasks are created by chaining actions. Actions take configuration blocks, variables, and constants as possible arguments. Internally, actions are implemented as dynamically resolved python functions, abstracting away complex logic. New plugins are just additional actions, which are, at base, python functions.

The demo will monitor craigslist job postings and send an email notification upon detecting changes.

Additional features possible for future resources:

  1. Support the following types of resources:
    • HTTP POST
    • Ping requests
    • HTTP authentication
  2. Support exceptions (cases when not to report errors)

Sample program to monitor 2 resources:

In the craigslist task below, job posting listings are contained between two Found: blocks; to extract the job listings, extract and remove actions are used.

[EmailSettings1]

to = cs164@cs164.com


get("http://sfbay.craigslist.org/search/jjj?query=cool+jobs&catAbbreviation=jjj").removeTags().extract(".*Found: [0-9]* Displaying: [0-9]*").remove(".*Found: [0-9]* Displaying: [0-9]*").monitor(5).notify(EmailSettings1)


get(\logs\firewall.log).extractAllLinesWith("error").monitor(0.1).notify(EmailSettings1)

Part 4: Implementation

Two possible approaches for implementation:

  1. Parsing the input into a data structure (effectively an AST), storing all the input into predefined fields in a custom class and then creating a list of instances where each instance represents a task. Configuration blocks and variables will be stored in the environment.
  2. Eval loop processing each task and returning a tuple of two lambdas: 1st lambda being the monitor function and the 2nd lambda being the resolved chain call to the resource retrieval, text extraction, notification actions. Configuration blocks and variables will be stored in the environment.
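The second approach, in which each task evaluates to a pair of functions, might be sketched like this in Python. The compile_task name, the seconds-based interval, and the convention that each action takes the previous action's value are assumptions made for the example.

```python
def compile_task(interval_seconds, actions):
    """Return (monitor, chain): when to run the task, and what to run."""
    def monitor(now, last_run):
        # Due on the first pass, or once the interval has elapsed.
        return last_run is None or now - last_run >= interval_seconds

    def chain(value=None):
        # Resolved chain call: each action receives the previous action's result.
        for action in actions:
            value = action(value)
        return value

    return monitor, chain

# Stand-in actions; real ones would retrieve, extract, and notify.
monitor, chain = compile_task(
    300,
    [lambda _: " error: disk ", str.strip, str.upper],
)
print(monitor(300, 0), chain())
```

The driver would then only need to keep a list of such pairs and call chain() whenever monitor(...) returns True.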

With the above approaches the following aspects will be implemented as follows:

· Frontend

With the first approach, a parser will be used to construct an AST that will be passed to the interpreter. With the second approach, the eval loop will interpret each action and keep constructing the two lambdas (monitor, chain call).

· The core language

Python will be our core language, in which the scheduling thread and utility functions will be written. Actions will be implemented as separate Python files, with the convention that the file and the function inside it have the same name as the action in the task. For instance, a get(\localhost\logs.txt) call would expect a Python file named "get.py" to exist in the current directory, containing a function get with one parameter. In addition, plugins will have to implement the following functions to help with syntax and runtime error checking: boolean isNotifier(), boolean isNavigator(), boolean isContentProcessor(), boolean isMonitor(), … .

· Internal representation

The first approach will generate an AST as the internal representation of a program. The second approach will construct a list of tuples with lambdas for the monitor and chained action calls. In both approaches, variables and configuration blocks will be stored in the environment.

· Interpreter/Compiler

In both approaches the program will not be compiled but will be interpreted using a driver written in Python.

· Debugging

The interpreter and parser will provide error details when there is a problem in the program. Syntax errors will be detected before the program starts monitoring. Plugin errors will be detected during a test execution of the task; such an error will invalidate the task from further execution, be reported once by email, and be recorded in the error log, but will not stop the program. Runtime errors in data extraction or resource retrieval will be recorded in the error log but will neither invalidate the task nor terminate the program.
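The one-file-per-action convention described under "The core language" could be implemented with importlib from the Python standard library, roughly as below. The load_action helper and its plugin_dir parameter are hypothetical; only the "get.py contains get" naming rule comes from the proposal.

```python
import importlib.util
from pathlib import Path

def load_action(name, plugin_dir="."):
    """Load <plugin_dir>/<name>.py and return (module, function named <name>)."""
    path = Path(plugin_dir) / f"{name}.py"
    if not path.exists():
        raise ImportError(f"no plugin file {path}")
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the plugin file's top-level code
    func = getattr(module, name, None)
    if not callable(func):
        raise ImportError(f"{path} does not define a function named {name!r}")
    return module, func
```

Given a get.py that defines get(path) and isNavigator(), the call load_action("get") would return the module and the get function; the module can then be asked for its type predicates before the task starts running.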