Newsfilter Project

Note: This document is a description and a working reference for the Newsfilter project aimed at creating a public, open-source service for quality-based filtering and recommendation of Usenet articles. For more information please review newsfilter group discussion archives and references at the end of this document. Please contact Sasha Chislenko with any suggestions or criticisms.

Project Goal

The project is intended to create a technical framework for collaborative document filtering and recommendation, an information infrastructure, and a basic set of services that use people's assessments of online documents to improve navigation, and to apply them to Usenet messages.

Approach and aspirations

The project aims to create a set of open standards for storage and transmission of semantic encodings and client user interfaces, and the first implementations of crucial architectural components. This will create an easy-to-use infrastructure that should allow further rapid development of the system and smooth integration of additional services.

So far, Net tools have concentrated on information storage, transmission, and representation functions, while all semantic analysis has been done by humans. Now we can build another level of standards for representing, processing, and targeting documents based on their semantic encoding, and elevate the Web to a new level. In the resulting environment, the development of intelligent agents and symbolic AI may become as profitable for third-party services as text retrieval became with the invention of simple document-storage systems.

Project structure and deliverables

The project aims to design and deliver the first implementations of:

Formats and storage facilities for rating profiles and other user data
Data request functions
Collaborative mechanisms for combining multiple sources of ratings into a recommendation/filtering stream.
Client interface
Algorithms for generating, matching and aggregating ratings profiles
Algorithms and interfaces for intercommunication of distributed recommendation and prediction services.
Help files, online documentation, and readable published code

General scheme and information flows

In the traditional system, the user exchanges content with the content repository, querying it to obtain relevant documents. The content search engine knows nothing about the user. The client interface stores some simple user settings and allows browsing and querying of the content repository and posting of user messages.

The ratings-enriched system introduces additional elements: the user's own ratings profile, a general ratings repository, and a recommendation service. The right (ratings) wing of the following diagram of the ratings-based system is very similar to the left (content) wing, except that the advice it gives is based on the user-expressed semantics of the documents rather than their content.

        
        
             Content             Profile (Ratings)        
            repository            repository         
               |                    |        
               |                    |        
            Content                 |               
            search              recommendation        
            engine                service        
                \                   /          
                 \                 /           
                  \               /        
                   \             /        
                   Client interface        
                         |        
                         |        
                      User [+ user profile]

An important concept in the proposed architecture is one of an advisor. An advisor is a human or machine generator of message ratings that the user decides to rely upon. A user can have multiple advisors whose recommendations may be combined. Each advisor has a named area of expertise and a reputation weight relating the relative utility of their recommendations to this user.

Every user can be an advisor if he

1) enters ratings
and
2) has anybody who is willing to follow his opinions.

Advisors can also be automatic (a kill file or an imported spam filter) or synthetic (e.g., an average of all human advisors is the "community" advisor, and can have a name like "comp.ai.alife.human_community").

The aggregation of advisor ratings (in a given named "sense" or area, e.g. "humor") is relatively simple, except for the confidence calculation.

Suppose that :

Wt(A) is Advisor Weight designated by the user;
Rating(A,M) is advisor A's rating for message M;
Conf(A,M) is advisor A's confidence for message M.

Then the aggregated rating can be computed as an average rating by all advisors, taking weights and confidences into account:

        Sum over A [ Wt(A) * Rating(A,M) * Conf(A,M) ]
R(M) = ------------------------------------------------
             Sum over A [ Wt(A) * Conf(A,M) ]

Confidence computation is more complex, and depends on all advisor confidences, weights, the number of advisors that suggested their ratings, and the diversity/deviation of their opinions. The exact formulas can be selected based on statistical analysis of the recommendations, to optimize the recommendation quality (the accuracy of predicting the user's ratings).
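The aggregated rating formula can be sketched in a few lines of Python. This is only an illustration of the weighted average, not a proposed implementation: the names AdvisorOpinion and aggregate_rating are hypothetical, and returning None when the denominator is zero is an assumption, since the text does not say what should happen in that case. The aggregated confidence is deliberately not computed here because its formula is left open.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class AdvisorOpinion:
    weight: float      # Wt(A): weight the user assigned to advisor A
    rating: float      # Rating(A,M): advisor A's rating for message M
    confidence: float  # Conf(A,M): advisor A's confidence in that rating


def aggregate_rating(opinions: Iterable[AdvisorOpinion]) -> Optional[float]:
    """Weighted, confidence-adjusted average rating for one message.

    Returns None when the denominator is zero (no advisors, or all
    weights/confidences are zero); that convention is an assumption.
    """
    numerator = 0.0
    denominator = 0.0
    for op in opinions:
        numerator += op.weight * op.rating * op.confidence
        denominator += op.weight * op.confidence
    if denominator == 0:
        return None
    return numerator / denominator
```

An advisor with zero weight or zero confidence contributes nothing to either sum, so such opinions drop out of the average naturally.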

Basic operations of the service

The system should allow the following operations:

Selecting the advisor set
The user can directly list good advisors and their weights, or request N most reputable advisors [in a given area], or ask for advisors of his friends, or ask the recommendation service to find appropriate advisors based on analysis of profiles and reputations, or a combination of these.
Getting a general recommendation
The user sends the advisor set [and an area of interest] to the recommendation service. The service retrieves the ratings of the advisors, aggregates them, and returns a selected set of messages that are the best according to the opinion of these advisors (or the worst, or most/least popular, or most controversial). This function can also be performed on the client if it stores the advisor profiles. Then the selected messages are fetched from the content repository.
Getting a restricted recommendation
The user sends a query to the search engine and gets a reply with a list of messages and indicators of their relevance. This list is then passed through the advisor set, and for each message the recommendation system suggests its expected quality, with confidence factors. Then the messages are re-sorted based on all indicators of relevance, quality and confidence.
(This allows taking into account personalized quality expectation every time search is conducted)
Improved navigation
Before presenting the user with a list of messages corresponding to a given newsgroup or a search list, client software should consult the recommendation service, and reorder the list, putting definitely good messages on the top of the list, definitely bad messages on the bottom, and everything else in between.
Rating feedback
The user should be able to enter feedback on each reviewed message. The feedback may include a named rating for the message, a confidence factor, and a free-form comment. The user may also request to see who recommended this message, and adjust the reputation factors for those advisors. This function can also be performed automatically.
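The "improved navigation" operation above can be illustrated with a small three-band sort. This is a sketch under stated assumptions: the text does not define "definitely good" or "definitely bad", so the thresholds good, bad and min_conf below are hypothetical defaults, as are the names Prediction and reorder_for_navigation. Ratings and confidences are assumed to lie in [0, 1].

```python
from typing import List, NamedTuple


class Prediction(NamedTuple):
    message_id: str
    rating: float      # predicted quality, assumed to lie in [0, 1]
    confidence: float  # confidence in the prediction, assumed in [0, 1]


def reorder_for_navigation(
    predictions: List[Prediction],
    good: float = 0.7,      # hypothetical "good" rating threshold
    bad: float = 0.3,       # hypothetical "bad" rating threshold
    min_conf: float = 0.6,  # hypothetical confidence needed for "definitely"
) -> List[str]:
    """Return message ids: definitely good first, definitely bad last."""

    def band(p: Prediction) -> int:
        if p.confidence >= min_conf and p.rating >= good:
            return 0  # definitely good
        if p.confidence >= min_conf and p.rating <= bad:
            return 2  # definitely bad
        return 1      # everything else stays in between

    # sorted() is stable, so messages keep their original relative
    # order within each band.
    return [p.message_id for p in sorted(predictions, key=band)]
```

A message with a high predicted rating but low confidence stays in the middle band, which matches the idea that only "definitely" good or bad messages are moved.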

Data structures

The "semantic data" should represent features of users, advisors, and messages, as well as their relations.
Data will be kept in standard records (database or XML) that allow easy extension.

Sample data formats

The two basic types of data records are object description records and relation (rating) description records.

Object [User/advisor/message] description record

An object profile consists of multiple object records, describing various features of users, advisors, and messages, such as name, age, preferred language, URNs, etc. Each record has the following structure:

Object Id
Field name
Field value
Time Stamp
[possibly, Expiration time]
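As an illustration, the object record could be represented as follows. This is a minimal sketch, not a storage format: the class and function names are hypothetical, and the rule that the most recent unexpired record wins for each field is an assumption that the text does not state.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ObjectRecord:
    object_id: str
    field_name: str
    field_value: str
    time_stamp: datetime
    expiration_time: Optional[datetime] = None  # optional, as in the list above


def build_profile(records: Iterable[ObjectRecord], now: datetime) -> Dict[str, str]:
    """Collapse one object's records into {field name: latest unexpired value}.

    "Latest time stamp wins" is an assumed conflict rule.
    """
    latest: Dict[str, ObjectRecord] = {}
    for rec in records:
        if rec.expiration_time is not None and rec.expiration_time <= now:
            continue  # skip records that have expired
        current = latest.get(rec.field_name)
        if current is None or rec.time_stamp > current.time_stamp:
            latest[rec.field_name] = rec
    return {name: rec.field_value for name, rec in latest.items()}
```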

Relation record

Relation records store user and advisor ratings as well as advisor records.

Object1 Id       User or advisor Id
Object2 Id       Advisor or message Id
Relation type    "Advisor", or "rated message"
Relation name    Area of expertise, or message feature, e.g. "funny"
Relation value   Reputation/weight/rating
Value Confidence
Time Stamp
[Free-form comment]
[possibly, Expiration time]

The confidence reflects the degree to which the source is confident that the relation value is correct. The confidence may be higher if the record was derived by combining a large number of opinions of reliable agents that agreed on this value (low std. dev.), and lower if there were only a few not very reliable agents that disagreed with each other, or if the value was derived implicitly, etc.

The reason for storing confidence explicitly is that different users have different degrees of tolerance to false positive and false negative recommendations.
Also, people sometimes can be interested in messages with low confidence as these indicate controversial or under-researched objects.
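A relation record with an explicit confidence might look like the sketch below. The class name, the field types, the example Ids, and the restriction of confidence to [0, 1] are all assumptions made for illustration; the text only fixes the list of fields.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class RelationRecord:
    object1_id: str        # user or advisor Id
    object2_id: str        # advisor or message Id
    relation_type: str     # "Advisor" or "rated message"
    relation_name: str     # area of expertise or message feature
    relation_value: float  # reputation/weight/rating
    confidence: float      # stored explicitly, as argued above
    time_stamp: datetime
    comment: Optional[str] = None
    expiration_time: Optional[datetime] = None

    def __post_init__(self) -> None:
        # Assumed range; the text does not fix the scale of confidence.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


now = datetime(2000, 1, 1)  # arbitrary example time stamp

# A user names an advisor for the area "science" with weight 0.6.
advisor_link = RelationRecord(
    "user:alice", "advisor:sasha", "Advisor", "science", 0.6, 1.0, now
)

# An advisor rates a message as "funny" with moderate confidence.
message_rating = RelationRecord(
    "advisor:sasha", "msg:123", "rated message", "funny", 0.9, 0.7, now,
    comment="made me laugh",
)
```

Using one record type for both cases mirrors the text: only the relation type distinguishes an advisor link from a message rating.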

Data repositories

The data records may be stored in databases that may serve records on request, or published as standard formatted files.

Data requests and transports

Data transport mechanism transfers semantic data, content, and requests between data repositories, knowledge servers, and user client software.
The transport can be HTTP, a remote database interface, postings on a designated newsgroup (e.g., alt.newsgroups.ratings), or email. Each of these mechanisms has its own advantages in terms of delivery speed, privacy and efficiency. We will start with the Web interface, which appears more immediately useful and easier to implement.

Data request examples

Get/write object description records
Get a list of unrated messages among {message list}.
Get a list of most popular messages among {user set}
(combination of number of ratings and average rating)
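The two list requests above can be sketched over an in-memory list of ratings. Several things here are assumptions rather than part of the design: ratings are simplified to (user Id, message Id, value) tuples, "unrated" is read as "not yet rated by a given user", and popularity is scored as the number of ratings times the average rating, which is only one possible combination of the two.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Rating = Tuple[str, str, float]  # (user_id, message_id, rating value)


def unrated_messages(
    ratings: List[Rating], user_id: str, message_list: List[str]
) -> List[str]:
    """Messages from message_list that user_id has not rated yet."""
    rated = {m for (u, m, _) in ratings if u == user_id}
    return [m for m in message_list if m not in rated]


def most_popular(
    ratings: List[Rating], user_set: Set[str], top_n: int = 10
) -> List[str]:
    """Messages ranked by (number of ratings * average rating) within user_set.

    The scoring rule is an assumed combination, not a specified one.
    """
    values: Dict[str, List[float]] = defaultdict(list)
    for user, message, value in ratings:
        if user in user_set:
            values[message].append(value)

    def score(message: str) -> float:
        vals = values[message]
        return len(vals) * (sum(vals) / len(vals))

    return sorted(values, key=score, reverse=True)[:top_n]
```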

We also need to specify the formats of requests to the data repository. As we have agreed in principle on the structures of requests and data record formats, the request formats seem to be a matter of protocol rather than architecture, so I'll skip them here, except for the opinion that they should also be human-readable in at least one representation.

The communication standard should also allow transparent extensions: if the services on two sides of the interface can use various extensions or subsets of the protocol, they should just get whatever parts of the record are available and process what they can understand.
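One way to read the "transparent extensions" requirement is that a receiver keeps the fields it understands and passes over the rest without failing. The sketch below assumes records arrive as key/value mappings and uses hypothetical field names; neither is fixed by this document.

```python
from typing import Any, Dict, Tuple

# Field names this (hypothetical) service understands.
KNOWN_FIELDS = {
    "object1_id",
    "object2_id",
    "relation_type",
    "relation_name",
    "relation_value",
    "confidence",
    "time_stamp",
}


def split_record(raw: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split a received record into understood fields and unknown extensions.

    Missing known fields are simply absent from the first mapping, and
    unknown fields are kept aside rather than treated as errors.
    """
    understood = {k: v for k, v in raw.items() if k in KNOWN_FIELDS}
    extensions = {k: v for k, v in raw.items() if k not in KNOWN_FIELDS}
    return understood, extensions
```

Keeping the unknown fields, instead of discarding them, lets a service forward a record to another party that may understand more of the protocol.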

We need to specify the exact transport syntax of the above records, as well as field lengths, and then, basically, we'll have the needed interface - at least for the architectural purposes.

Client Software

Client software should improve the users' navigation in the document space. It should allow the user to annotate existing documents (or annotate them automatically, based on the user's reading pattern), communicate annotations to data repositories, and request recommendations from knowledge servers. The recommendations will be used to filter and reorder the documents.

Algorithms

The semantic services (recommendation servers, reputation brokers, etc. - need a better generic name!) aggregate data from multiple users and software agents (this data is received from the data repositories described above) and form recommendations that should be used by the client software to improve selection and presentation of information to the user.

It is also possible to transmit a generic set of data and then perform the last personalization round on the client, such as weighting recommendations according to this user's affinity to the recommenders. This preserves the privacy of user data, reduces message traffic, and shifts part of the computational load to the client.

Requests to semantic servers may include

Predict ratings for a given message by user X (e.g., "funny: 0.6, confidence: 0.7; intelligent: 0.1, confidence: 0.9")
Filter a given list of messages for a user X with given thresholds
Get a set of "like-minded users"/advisors for user X [for criterion C]
Sort a given list of messages by predicted rating/confidence combination
Get a list of most controversial messages among user set {X} (combination of number of ratings, their standard deviation and confidence)
Get a list of messages that a given set of users considers similar to message I
Compute reputation of an advisor X (utility of their advice) among user set {Y}

Some of these functions can be iterative. For example, at the beginning of session a user can request a list of like-minded users, and then use this list repeatedly to filter search results or listings for different groups. The user feedback will be used to adjust the similarity/reputation factors for the selected advisors.

The results of these functions should have the same structure as object and relation records.
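The "filter with given thresholds" request could be sketched as follows. The names and the tie-breaking order (rating first, then confidence) are assumptions; max_messages reflects the per-area limit mentioned later among the user's selection criteria.

```python
from typing import List, NamedTuple


class Predicted(NamedTuple):
    message_id: str
    rating: float      # predicted rating for this user
    confidence: float  # aggregated advisor confidence


def filter_messages(
    predicted: List[Predicted],
    min_rating: float,
    min_confidence: float,
    max_messages: int,
) -> List[Predicted]:
    """Keep messages that pass both thresholds, best first, capped in number."""
    kept = [
        p for p in predicted
        if p.rating >= min_rating and p.confidence >= min_confidence
    ]
    # Assumed ordering: higher rating first, confidence breaks ties.
    kept.sort(key=lambda p: (p.rating, p.confidence), reverse=True)
    return kept[:max_messages]
```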

First stage of the project

The first stage of development should create a collaborative message filtering framework and a basic functional service utilizing it.

This framework service should include:

User registration facility (URF)
A user should be assigned, minimally, a unique Id and a password. The registration can also include a questionnaire.
URF includes client and server sides. Client side is an HTML-form, server side is a database and a CGI-script.
Use spam filters and kill files as advisors. They will also be given names (e.g. a "picture" filter that leaves only pictures). A conversion utility should turn a spam filter's message lists and the results of applying kill files into rating values (e.g., name="picture"; rating="0.1"; confidence="0.85")
Interface functions allowing users to manually select, exchange, and merge advisor sets.
Facility for expressing message selection criteria for a user. The selection criteria should include the maximal number of messages a user wants to see in each area of interest, and threshold values for message quality and aggregated advisor confidence.
A mechanism for aggregating rating streams from several advisors into a collaborative recommendation filter.
A web-based news browsing facility that displays messages based on this filter and collects message ratings from the user that will be used in the system.
Storage and retrieval facilities for messages.
Every message has:
- Id
- Poster
- Date
- Size
- Body
- [optionally, other fields, like keywords]
Storage and retrieval facilities for user profiles and ratings. (The structure of user profile and rating records is listed above.)
Sample utilities converting widely accepted message filters (keyword search, spam filters, kill files) into rating streams.
Online documentation, including description of the project goals, list of contributors, current status, online help, to-do list, and readable published code.
A minimal facility for user feedback (at least, email or a guestbook)
The project should be beta-tested by a limited group of people and their suggestions should be taken into account in the document describing further development plans.
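The conversion of a kill file into a rating stream, listed among the items above, might be sketched like this. It is an illustration only: the kill file is assumed to be a list of regular expressions matched against the poster and body of each message, messages are assumed to be dictionaries with "id", "poster" and "body" keys, and the default rating 0.1 and confidence 0.85 simply reuse the example values given in the list.

```python
import re
from typing import Dict, List, Union

RatingRecord = Dict[str, Union[str, float]]


def killfile_to_ratings(
    advisor_name: str,
    patterns: List[str],
    messages: List[Dict[str, str]],
    rating: float = 0.1,       # example value from the text
    confidence: float = 0.85,  # example value from the text
) -> List[RatingRecord]:
    """Emit one low rating record per message matched by the kill file.

    Messages that match no pattern produce no record, so the automatic
    advisor expresses no opinion about them (an assumed convention).
    """
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    records: List[RatingRecord] = []
    for msg in messages:
        text = msg["poster"] + "\n" + msg["body"]
        if any(rx.search(text) for rx in compiled):
            records.append({
                "name": advisor_name,
                "message_id": msg["id"],
                "rating": rating,
                "confidence": confidence,
            })
    return records
```

The resulting records can then be merged with human ratings in the same way as any other advisor's opinions.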

The first stage of development should result in the creation of a basic, immediately useful service in a short time frame (counting on 2 developers * 3 months of work) that will be scalable and will allow multiple extensions.

The extensions, to be developed and/or integrated into the service during the following stages of the project, should include complex message evaluation schemes, automated selection of advisors for a given person, complex content search utilities in addition to browsing, additional sources of information, etc. Their selection for the next stages of the project will be determined during the implementation of the first stage, depending on its results and people's feedback.

Interface specification for stage 1

Page 1: Welcome screen.

A short text describing the service and latest announcement.

links to:

new user registration
existing user login
(the above may be combined)
online documentation

Page 2: New user registration

Minimally:

name (should be unique)
password (some simple restriction, like at least 4 letters)

Possibly more - a simple questionnaire: Age, gender, education level, a few keywords describing interests, "want to be on update mailing list"?

Page 3: User login

Could be the same as registration. Name, Password. Cookies if we manage.

Page 4: Configuration screen

(people get here from login)

newsgroup selector (User's usual newsgroup set, plus ability to add)
topic selector (user's usual interests, plus ability to add from other topics mentioned in this newsgroup)
advisor selector
The user sees a list of his usual advisors [relevant to these topics and newsgroup(s)] and also a list of community advisors that he can add to his own list. The advisor selector record looks like:
<advisor name>
<checkmark> - check to include, uncheck to exclude
<rating name>
<reputation value> - [0 to 1] User’s assigned reputation, or community reputation.
E.g.:

<Andrei> <x> <general> <0.7>
<Sasha> <x> <science> <0.6>
<Sasha> <-> <Culture> <0.4> (community-suggested value)
The checked fields represent user’s advisors; the rest are taken from the most reputable advisors in the group, for user’s consideration, and may be included.
All selections should be stored in user profile for later use, so that the user doesn't have to re-specify them every time.
Selection conditions: minimal rating, minimal total confidence for each message to be displayed.
Search button. When this button is pressed, the server searches for all messages that match the selected newsgroups and advisors, and the browsing page is loaded.
Links to browsing and documentation pages.
Page 5: Browsing page
Four horizontal areas, from top down:
- Topmost line: newsgroup/topic selector
  fields (editable):
  - Name of browsing newsgroup (e.g. "misc.philosophy")
  - Name of interest (e.g. "religion")
- Top frame (under the top line): message title (scrollable) Contains normal message title fields: poster, date, size[?], subject, and also predicted quality (averaged rating and confidence). (Maybe, also - top advisors?) Sortable by any field - or at least, by date and rating. The frame should have adjustable lower bound. Default height: 30%.
- Bottom frame: Message body (scrollable)
  The message whose title was selected from the top frame.
- Bottom line: Feedback fields.
  - Interest (default: browsing interest from the top field)
  - rating value;
  - confidence factor (with default value, to be stored in user profile)
  - "Reply" button - calls a new screen for writing a reply with parameters taken from the message.
- Links to: configuration page, main page.
Documentation screens:
- Project description - why is it necessary, how it is useful, etc.
- Help (how to use it)
- FAQ
- to do list (known problems and planned changes)
- feedback form - user feedback goes to developer list and/or guestbook.
- code to download
  (one Zip file with all HTML pages, scripts, and installation instructions)

More general notes on communications between parts of the service

In the mature service (beyond the first stage) we will have the following agents producing and consuming ratings data:

rating agency.
This is an agent that takes a single message and produces a rating record (e.g., human; killfile; word search; any other message analysis mechanism). Also known as advisor or expert.
rating repository (profile server)
Stores and serves profiles (groups of rating records) on request
recommendation server
Analyses profiles, matches users and advisors, aggregates ratings values for different profiles, produces composite indicators of quality, popularity, controversialness, etc. of messages. Also, it should use statistical analysis to optimize its algorithms for various metrics of service quality.
This is the most complex part of the ratings processing mechanism, and the one that will be barely present in the first stage (except, mostly, for merging advisor profiles)
User Client
(somewhat overlaps with rating agency where a human user is concerned; the emphasis here is on consumption, rather than production, of ratings) Issues requests for retrieval of most appropriate advisors and messages in given categories.

Each of these agencies may be viewed as an Interactive Agent that can exchange requests with others. The request types may partially overlap between these agents. I can suggest the following types of communications (no claim about completeness of this list):

1. Requests [typically] directed at the record server:
- profile data query (directed at the profiles/ratings database)
  For descriptive (non-rating) part of user profile, the query may specify a UserId, and receive user data, or specify a condition on user data, and receive a set of qualifying records.
  For ratings part, the query can specify any condition on rater Id, User Id, rating name, value, confidence, and time, and receive matching ratings records.
  For example, a query may request all ratings by the given set of advisors for a selected message.
  We can also have pending queries, with an expiration time, for agents who want to be notified when new data appears that matches their request.
- Data storage request.
  Sends an attribute or rating record for storage (typically, this can be sent by a rater or recommendation server to a rating server)
  Receives a confirmation.
- Data removal request.
  Sends a condition on the data to be removed, and authorization.
  Receives a confirmation.
2. Requests to the recommendation/computation server
- 2a. Simple requests
  - profile merger request
    sends a set of rating profiles; receives a single profile with combined ratings and confidence factors.
    The simplest case here is that for computation of a predicted message rating by a set of advisors.
  - ?
- 2b. Complex requests (trigger a series of consecutive operations)
  - advisor set [re]computation
    sends a user profile, existing advisor set, relation name, and number of advisors required.
    Updates advisor set with the advisors taken from the existing advisors' lists, most reputable community advisors, advisors with highest affinity to the user, etc., and returns a combined set with optimized weights.
  - prediction request:
    Starts with a user Id, a message Id, and rating relation name. Gets a list of advisors for the user.
    If there are any, gets their ratings of the message.
    If there are any, merges them.
    If either advisors or their ratings for the message are missing, starts the "fallback method": gets an average rating for the message. If there are no ratings, returns average rating value with confidence 0.
  - message list reordering:
    Sends a user Id, a list of message Ids, and rating relation name. Goes through a series of operations similar to those described above. Returns an ordered message list (by quality, in descending order, or by any other factor if one is interested in controversial, popular, unknown, etc. messages).
- 2c. Special computation requests:
  Calls to special-purpose functions.
  For example, consistency check on the database, calculation of average prediction efficiency, optimization of algorithm parameters, reclustering of synthetic profiles, etc.
3. Other action requests:
request to perform certain actions, such as publish data, back up the database, etc.
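The prediction request in 2b, including its fallback method, can be sketched as below. This is a simplified illustration under several assumptions: ratings for one message and one relation name are passed in as a mapping from rater Id to (rating, confidence); the merged confidence is a weight-averaged placeholder, since the real confidence formula is left open in this document; and default_rating stands in for the "average rating value" returned when nothing is known.

```python
from typing import Dict, List, Optional, Tuple

Opinion = Tuple[float, float]           # (rating, confidence)
Weighted = Tuple[float, float, float]   # (weight, rating, confidence)


def merge(opinions: List[Weighted]) -> Optional[Tuple[float, float]]:
    """Merge weighted opinions into (rating, confidence), or None if empty."""
    denominator = sum(w * c for w, _, c in opinions)
    if denominator == 0:
        return None
    rating = sum(w * r * c for w, r, c in opinions) / denominator
    # Placeholder confidence: weight-averaged advisor confidence.
    confidence = denominator / sum(w for w, _, _ in opinions)
    return rating, confidence


def predict(
    advisor_weights: Dict[str, float],
    message_ratings: Dict[str, Opinion],
    default_rating: float = 0.5,  # assumed "average rating value"
) -> Tuple[float, float]:
    """Predict (rating, confidence) for one message and one user."""
    # Steps 1-3: take the user's advisors, get their ratings, merge them.
    own = [
        (advisor_weights[rater], r, c)
        for rater, (r, c) in message_ratings.items()
        if rater in advisor_weights
    ]
    merged = merge(own)
    if merged is not None:
        return merged
    # Fallback: average over everyone who rated the message.
    merged = merge([(1.0, r, c) for (r, c) in message_ratings.values()])
    if merged is not None:
        return merged
    # No ratings at all: default value with confidence 0.
    return default_rating, 0.0
```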

Note that the above communications are all called requests. In fact, a request only initiates a [sequence of] communication(s) listed above.

The sequence of communications usually starts with a request and goes through action, transfer of stored or computed data, and completion confirmation. These communication sequences may loop, chain or extend in time to form dialogs and other transaction sequences.

We should be able to define transaction syntax after we settle on Agent Interaction Protocol, Webmind SQL interface, and other related things.

Of course, we do not expect to implement all of these things in the first stage of Newsfilter. Hopefully though, this discussion can help us define the service structure that can be extended into more complex services and be compatible with Webmind.

Appendix: Web resources related to the project

Current Interface spec
Stage 2 spec
newsfilter group discussion archives.
Analysis of predictive algorithms
An Experiment in writing Collaborative Filtering Software
Select - European project
Jester - Collaborative jokes recommendation project
The Knowsys project
SHOE - Simple HTML Ontology Extensions
Collaborative Information Filtering and Semantic Transports
Semantic Web - position paper
Information Filtering Resources at Medlab
Platform for Privacy Preferences
Upcoming conferences on Information filtering
RDF - Resource Description Framework
W3c document on RDF
Meta-Content Framework using XML
Hypereconomy development group page
Internet filtering software
Spam mail filtering
Usenet spam filters
Freely Available Information Filtering Systems
realize.com - a new Collaborative Usenet Spam filter
Intelligenesis Corporation - developer of Webmind
KQML (Knowledge Query and Manipulation Language) specification
http://www.sims.berkeley.edu/resources/collab/ - an extensive collection of references to ACF-related resources on the Web compiled by Hal R. Varian.
Project Aristotle(sm): Automated Categorization of Web Resources
Consumer Democracy - consumer opinion and product ratings site.
Qualitative Decision Theory page
Machine learning and applied statistics group's page at Microsoft