Bag O' Regex

Bag O' Regex (BOR) is a document explorer that uses regular expressions and text mining techniques to search a set of documents for content. The results show matching documents, number of matches and the significance of the matches. You can also look for correlations between different content categories. This is done by comparing the content in a particular document or cluster of documents with the rest of the set.

How to use BOR

BOR works with sets of plain text documents: docsets. Before you do anything else you must select a folder containing text documents and add them all from the settings panel (folder button with a '+'). Click the Add button and browse to a folder with your set of documents. Select a document and click the Open button. Now you are ready to do some basic searches.

Basic searches

Regex

Plugins

Plugin reports run multiple regular expressions defined in files in the plugins folder. Select "Characters" from the Plugins menu. The report shows results for 20 categories of character-related content. To see the matches in context, click on a category name. When run from a cluster, the results compare the cluster to the rest of the documents in the set and highlight significant differences.

If a line in a plugin file starts with an image icon name and the icon is located in /settings/img then the icon is displayed in the report.

Format example:
dog.gif,(dogs|pupp(y|ies)|canine)
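A minimal sketch of how such a line could be split into an icon name and an expression. It is written in Python as an illustration only; the function name parse_plugin_line and the list of image extensions are assumptions, and BOR's own plugin parser may behave differently.

```python
def parse_plugin_line(line):
    """Split a plugin line into (icon_name, regex).

    icon_name is None when the line does not start with an image file name.
    The accepted extensions are an assumption made for this sketch.
    """
    head, sep, rest = line.partition(",")
    if sep and head.lower().endswith((".gif", ".png", ".jpg")):
        return head, rest
    return None, line


icon, regex = parse_plugin_line("dog.gif,(dogs|pupp(y|ies)|canine)")
print(icon)   # dog.gif
print(regex)  # (dogs|pupp(y|ies)|canine)
```

Only the first comma is treated as a separator, so commas inside the expression itself are left alone.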

Profile

For sets of dreams, diaries, biographies and related genres, select the Profile item from the Profile menu. The report attempts to guess information about the author's family, gender, age and era.

Names

The Names report (also found under the Profile menu) simply tries to identify all of the names in a docset.

Network

The Network report links the matches for the entered regex by their co-occurrence within a document. This is especially useful for lists of names, places and roles, where the report generates a social network. If you don't enter a regex, the report uses a special expression that searches for capitalized words and initials that do not occur at the start of a sentence, filtered by the stop_words.txt list. Available under the Net menu. (More details.)

Concordance

The Concordance report is available under the Net menu. The result window shows you the matches for your regular expression in context.

Proximity

The Proximity report, also under the Net menu, shows you the co-occurrence of matches for your regular expression within a window of about ten words. Documents are scored by the number of paired matches and the closeness of the matches.
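The idea of scoring by paired matches and closeness can be sketched as follows. This is an illustration in Python, not BOR's actual formula: the function name proximity_score, the 1/gap weighting and the exact ten-word window are all assumptions.

```python
import re


def proximity_score(text, pattern, window=10):
    """Score a text by pairs of matching words that lie within `window` words.

    Each pair adds 1/gap, so closer pairs count more (an assumed weighting).
    """
    words = text.split()
    # Positions of the words that match the pattern
    hits = [i for i, word in enumerate(words) if re.search(pattern, word)]
    score = 0.0
    for a in range(len(hits)):
        for b in range(a + 1, len(hits)):
            gap = hits[b] - hits[a]
            if gap > window:
                break  # later hits are even farther away
            score += 1.0 / gap
    return score


text = "the dog chased the cat while a puppy slept"
print(proximity_score(text, r"(?i)\b(dog|cat|puppy)\b"))
```

A document with many matches packed closely together scores higher than one where the same matches are spread out.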

Concept net

The Concept net report, also under the Net menu, finds all neighbors (within ± 25 characters) of the entered regex (usually a single word). It then runs a proximity search on non-stopwords in the result and orders them by proximity score, with an additional boost if a pair of terms occurs in both directions (e.g., "I parked my car" and "my car was parked"). The results are displayed in a table where link terms trigger a new search.

wet <> feet 1.76
wet <  getting 1.67
wet <> sand 1.28
wet <  pants 1.17
wet <  soaking 1.0
wet <> dripping 0.83
wet <> surface 0.75

==>

sand <> dune 5.42
sand <> bar 4.44
sand <> yellow 3.82
sand <> gray 2.56
sand  > dunes 2.5
sand <> piles 2.15
sand <> beach 1.49

==>

gray <  dark 6.67
gray <  light 3.09
gray  > hair 2.63
gray <> sand 2.56
gray <  short 2.0
gray  > metal 1.5
gray  > granite 1.33

Three steps along a concept network
In this partial result you can see that "getting wet" and "soaking wet" occur in one direction only, but that "wet sand" and "sand (was|is) wet" are both present.

If you don't supply a regex a random seed word is chosen for you.

Related

The Related search finds the key terms of a document and then matches other documents as similar based on a proximity search of the same terms. Run this report from the search menu of individual documents.

Cloud

The Cloud report is also run from individual docs. It shows words in a font size proportional to their importance in the doc.

Field Guide

Like a naturalist's field guide, this report gives you a compact list of distinguishing features of a document or word.
For documents the report lists For single terms, select the term and right-click to pick "Field Guide" from the pop-up menu. This version shows the term, its closest neighbors, words similar in spelling or by sound and idiomatic usage. It also lists all documents containing the term.

Pictogram

The Pictogram report searches for about 250 content categories using a list of regular expressions found in settings/img/iconList.txt. The same directory contains 250 small image files (icons) that are displayed in the report. When run from the Search tab, the report shows a table of icons and counts, which indicate the percent of words that matched the expression. When run from a cluster or single document, the percentages found in just those docs are also displayed. If the number of matches is lower or higher than expected by chance, the count is shown in blue or red. You can edit or add categories and icons to suit your needs.

Illustrate

Available from the Search menu of individual documents. Illustrate redisplays the document text but inserts icons wherever the text matched Pictogram categories. This lets you see how well the categories are matching.

Matches as text

Available on the Search Panel's Sampling menu. Searches for the entered expression and displays the matches (only) in a plain text document window. Useful if you want to compare the matches to the document set as a whole. (First save the result and add it to the document set.) One idea is to search for a particular feature in the context of plus or minus 10 words, save and add the text results and compare it to the document set as a whole. If you wanted to answer a question like "Do introverts carry their waking aversion to social interactions into their dreams?", you could target text neighborhoods where (I|me) occurs with (he|him|she|they|man|woman|etc.) and then look for fear and avoidance in the match doc. Then compare with results for extroverts.

"social neighborhood" expression: \b.{0,25}\b((I|me)\b.{0,25}\b(they|them|he|him|she|man|woman|boy|girl)|(they|them|he|him|she|man|woman|boy|girl)\b.{0,25}\b(I|me))\b.{0,25}\b

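As a rough illustration, the "social neighborhood" expression can be tried outside BOR. The sketch below uses Python's re module as a stand-in for BOR's own (Java) regex engine, which is an assumption; the function name social_neighborhoods and the sample sentences are made up for the example.

```python
import re

OTHERS = r"(they|them|he|him|she|man|woman|boy|girl)"

# The expression from the text, assembled from its two halves
SOCIAL = (
    r"\b.{0,25}\b((I|me)\b.{0,25}\b" + OTHERS
    + r"|" + OTHERS + r"\b.{0,25}\b(I|me))\b.{0,25}\b"
)


def social_neighborhoods(text):
    """Return every stretch of text where I/me occurs near another person."""
    return [m.group(0) for m in re.finditer(SOCIAL, text)]


for hit in social_neighborhoods("Then I saw that she was waving at the crowd."):
    print(hit)
```

Saving the printed neighborhoods to a text file gives you the kind of "match doc" described above.
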
Thirds

Thirds allows you to test for content by position in a text. It is located in the Sampling menu. Thirds concatenates all current documents into a single file while adding document markers so that each document text is divided into three equal-sized pieces. Save the "thirds" document in a new directory and then use the Split command to produce individual documents. (You'll end up with three times as many documents as you started with. Remember to discard the large thirds document before adding the new set.) Each document is marked with a special term: _FIRST, _MIDDLE or _LAST.

Picker

The picker is found on the Search tab (small button on the lower right). When you click the button, a window opens displaying all of the pictogram icons listed in settings/img/iconList.txt. Click on an icon to trigger a search for the category. If the Dual RE tab is selected, clicking on an icon enters its expression into the top search field.

Checkbox options on the Search tab

G2 (a.k.a. log-likelihood): Ted Dunning, Accurate Methods for the Statistics of Surprise and Coincidence.
(http://www.comp.lancs.ac.uk/ucrel/papers/tedstats.pdf)
Chi square versus log-likelihood statistic for word significance in different corpora.
Using G2: Rayson and Garside, Comparing Corpora using Frequency Profiling.
(http://www.comp.lancs.ac.uk/computing/users/paul/publications/rg_acl2000.pdf)
Using log-likelihood of semantic categories to find the key words in a set of documents.
The h score: Schneider and Domhoff, Our Statistical Approach.
(http://www.dreamresearch.net/Info/statistics.html)
Using percent difference in features found in dream reports.
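For readers who want to see the arithmetic, here is a minimal Python sketch of the two statistics. The G2 function follows the two-corpus form described by Rayson and Garside; the h function is the usual arcsine effect size for two proportions. The function names and the sign convention of h are assumptions, and BOR's own implementation may differ in detail.

```python
import math


def g2(count_a, total_a, count_b, total_b):
    """Log-likelihood (G2) for one feature counted in two corpora."""
    combined = count_a + count_b
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    ll = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing
            ll += observed * math.log(observed / expected)
    return 2 * ll


def h_score(p_a, p_b):
    """Effect size h for two proportions (each between 0 and 1)."""
    return 2 * math.asin(math.sqrt(p_a)) - 2 * math.asin(math.sqrt(p_b))


# A word seen 20 times in 1000 words versus 10 times in 1000 words
print(round(g2(20, 1000, 10, 1000), 2))
# A feature found in 50% of one set of reports and 25% of another
print(round(h_score(0.50, 0.25), 2))
```

G2 grows with both the size of the difference and the amount of text, while h depends only on the two percentages.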

Search Panel Menus

Commands (upper menu bar)

Regular expressions (lower menu bar)

These contain single regular expressions to identify a specific content category.

Regular expressions (regex)

A regular expression or regex is a pattern that matches text in a document. The goal is to match everything you want without matching anything you don't.

The simplest regular expression is a single word like television. If you enter this expression in the Search field and then click Find, BOR will return all documents that contain matches.

Although a single word is a regular expression, most searches involve two or more related words. If you were doing a search for "pets" on the internet you might type in "pets cat dog". In BOR's search field, this would be treated as a single regular expression and you would only match documents containing that literal string. To search for the individual terms you need to put them in an alternation.

Alternation

Alternations are the most common type of RE used in BOR. An alternation is simply a big "or" statement.

Enter a list of words you are searching for:

cat,dog,pet,kitten,puppy,rex,fluffy

Put these into alternating form by selecting Alternation from the "RE" menu.

(cat|dog|pet|kitten|puppy|rex|fluffy)

If you search for this regular expression you will match documents containing one or more of the 7 strings in the alternation.

Notice that you will find some matches for things you didn't intend:

cat matches "catch"

pet matches "carpet"
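The conversion and the unintended matches can be reproduced in a few lines of Python. This is only a sketch of what the Alternation command does as described above; the function name to_alternation is hypothetical and Python's re module stands in for BOR's regex engine.

```python
import re


def to_alternation(word_list):
    """Turn a comma- or space-separated word list into (a|b|c) form."""
    words = [w for w in re.split(r"[,\s]+", word_list) if w]
    return "(" + "|".join(words) + ")"


pattern = to_alternation("cat,dog,pet,kitten,puppy,rex,fluffy")
print(pattern)

# Without word borders the pattern also matches inside longer words
for sample in ["catch the ball", "a red carpet"]:
    print(sample, "->", re.search(pattern, sample).group(0))
```

The sketch does not escape special characters, so a word containing something like "." or "(" would need extra care.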

Word borders

To match only on word boundaries add "\b" to the start and end of the RE.

\b(cat|dog|pet|kitten|puppy|rex|fluffy)\b

Since fluffy and rex are names you should capitalize them in the RE. (Regular expressions are case-sensitive.)

\b(cat|dog|pet|kitten|puppy|Rex|Fluffy)\b

Case insensitivity

You can also add (?i) to the start of your RE.

(?i)\b(cat|dog|pet|kitten|puppy|rex|fluffy)\b

This tells the search engine to match without case-sensitivity. You will match "the fluffy pillow" which you probably don't want.

If you use this RE on a set of documents on pets you'll notice that you match "cat" but not "cats".

To match plurals add an optional suffix:

(?i)\b(cat|dog|pet|kitten|puppy|rex|fluffy)s?\b

The question mark tells the search engine that the s is optional.

This won't match "puppies", so add a special case:

(?i)\b(cat|dog|pet|kitten|pupp(y|ies)|rex|fluffy)s?\b
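To see the effect of each refinement side by side, the expressions can be run against one sample sentence. The sketch uses Python's re module as a stand-in for BOR's engine (an assumption); the helper name matches and the sample sentence are made up.

```python
import re


def matches(pattern, text):
    """Return the full text of every match."""
    return [m.group(0) for m in re.finditer(pattern, text)]


text = "My cats chased Fluffy across the fluffy carpet while two puppies slept."

# No word borders: partial words such as "pet" in "carpet" are matched
print(matches(r"(cat|dog|pet|kitten|puppy|rex|fluffy)", text))

# Word borders and capitalized names: only the name Fluffy is left
print(matches(r"\b(cat|dog|pet|kitten|puppy|Rex|Fluffy)\b", text))

# Case-insensitive with plural handling
print(matches(r"(?i)\b(cat|dog|pet|kitten|pupp(y|ies)|rex|fluffy)s?\b", text))
```

The last pattern finds "cats" and "puppies", but it also picks up the "fluffy" that describes the carpet, which is the trade-off mentioned above.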

Example: flying dreams

In order to identify all dream reports with flying you need to find phrases like "I'm flying", "I fly", "I flew", "I was floating", etc.

A simple solution is:

\b(I|'m| was) fl(ew|y|oat)(ing)?\b

But this could miss cases like "I began to fly", "I started to float" and "I seemed to be flying". So try this:

\bI (\w+\s+){0,3}fl(ew|y|oat)(ing)?\b

The subexpression (\w+\s+){0,3} matches up to 3 repetitions of "word followed by space". A "word" is defined by \w+: one or more word characters.
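A quick way to compare the two flying expressions is to run both over the example phrases. The sketch below uses Python's re module in place of BOR's engine, which is an assumption; the variable names are made up.

```python
import re

simple = r"\b(I|'m| was) fl(ew|y|oat)(ing)?\b"
wider = r"\bI (\w+\s+){0,3}fl(ew|y|oat)(ing)?\b"

phrases = [
    "I'm flying",
    "I fly",
    "I flew",
    "I was floating",
    "I began to fly",
    "I started to float",
    "I seemed to be flying",
]

simple_hits = [p for p in phrases if re.search(simple, p)]
wider_hits = [p for p in phrases if re.search(wider, p)]

print("simple:", simple_hits)
print("wider: ", wider_hits)
```

In this sketch the wider expression picks up the three longer phrases but no longer matches "I'm flying", because it requires a space directly after "I". Combining both expressions in an alternation covers all seven phrases.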

Character sets

[a-z]: any lowercase letter

[a-zA-Z]: any letter

[aeiuo]: just the lowercase vowels

[^aeiou]: any character except a lowercase vowel (including digits and punctuation)

[a-z&&[^aeiou]]: any lowercase letters, but not vowels

[aeiou]+: any string of 1 or more vowels

[0-9[^\w]]: any digit or non-word character (digits, spaces, punctuation)

[a-z]{2,4}: any string of 2 to 4 lowercase letters

Lookahead

x(?=exp) matches x only if followed by exp

x(?!exp) matches x if not followed by exp

Lookbehind

(?<=exp)x matches x only if it is preceded by exp

(?<!exp)x matches x if it is not preceded by exp
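The four forms can be tried on a short phrase. This sketch uses Python's re module as a stand-in for BOR's engine (an assumption); note that Python requires the lookbehind expression to have a fixed length. The helper name starts is made up.

```python
import re

text = "soaking wet pants and wet sand"


def starts(pattern):
    """Return the start position of every match in the sample text."""
    return [m.start() for m in re.finditer(pattern, text)]


print(starts(r"wet(?= sand)"))      # "wet" followed by " sand"
print(starts(r"wet(?! sand)"))      # "wet" not followed by " sand"
print(starts(r"(?<=soaking )wet"))  # "wet" preceded by "soaking "
print(starts(r"(?<!soaking )wet"))  # "wet" not preceded by "soaking "
```

In every case only "wet" itself is part of the match; the lookahead or lookbehind text is checked but not consumed.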

Quantifiers

The following regexes, when applied to a text containing "Tiff and Tiffany", match:

(Tiff[a-z]+) only matches Tiffany

(Tiff[a-z]?) matches Tiff and Tiffa (the first five letters of Tiffany)

(Tiff[a-z]{0,3}) matches both Tiff and Tiffany
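These three quantifier examples are easy to check directly. The sketch uses Python's re module as a stand-in for BOR's engine, which is an assumption.

```python
import re

text = "Tiff and Tiffany"

one_or_more = re.findall(r"(Tiff[a-z]+)", text)       # at least one more letter
zero_or_one = re.findall(r"(Tiff[a-z]?)", text)       # at most one more letter
zero_to_three = re.findall(r"(Tiff[a-z]{0,3})", text)  # up to three more letters

print(one_or_more)
print(zero_or_one)
print(zero_to_three)
```

Adding \b to the end of an expression, as in (Tiff[a-z]?)\b, keeps it from stopping in the middle of a longer word.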

A few more regular expressions

my (mother|father|mom|dad)
meaning: any sequence starting with "my ", followed by one of "mother" or "father" or "mom" or "dad".
matches: "my mother", "my father", "my mom", "my dad"

fl(y|oat){1}(ing|s){0,1}
meaning: any sequence starting with "fl", followed by exactly one of "y" or "oat", followed by zero or one of "ing" or "s".
matches: "fly", "flys", "float", "floats", "flying" and "floating"

[^a-c]at
meaning: any single character other than "a", "b" or "c", followed by "at".
matches: "rat" and "hat" but doesn't match "cat" or "bat"

[mf][oa]ther(.in.law)?
meaning: any sequence starting with "m" or "f", followed by "o" or "a", followed by "ther", followed by an optional ".in.law", where the "." character can be any character.
matches: "mother", "mother-in-law", "father", "father in law" (no hyphens) and even "fatherXinXlaw"

\bdog\b
meaning: the sequence "dog", but only if found on a word boundary.
matches: "dog" but not "doggerel" or "hotdog"
(To match the last two use \bdog\B and \Bdog\b.)

\b((switch|change|turn)s{0,1} (out|into|to))|((bec(a|o)me)s{0,1}\b (a|an|the))\b|\b(now (is|a|an))\b
matches: phrases that describe a metamorphosis, such as "turns into" or "becomes a"

\b(I|I'm) .{2,34}[.?!]("|')?
matches: short first-person statements ("I do X.")

\b(she|he) says.{2,51}[.?!]("|')?
matches: reported speech ("he says ..." or "she says ...")

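\b(\w+) \1\b
meaning: a word, followed by a space and the same word again. \1 refers back to the text captured by the first set of parentheses.
matches: a repeated word, such as "the the"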
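A short check of the repeated-word idea, with the word written as \w+ (one or more word characters). Python's re module is used as a stand-in for BOR's engine, which is an assumption; the sample sentence is made up.

```python
import re

repeated = r"\b(\w+) \1\b"
text = "I saw the the dog and it was was big"

doubles = [m.group(0) for m in re.finditer(repeated, text)]
print(doubles)
```

The trailing \b matters: without it, "the then" would count as a repeat because "the" is found again at the start of "then".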

Because regular expression searches can return anything (the whole text of a document, parts of words, all of the punctuation characters in a document), there's no simple way to score documents for how well they matched the search pattern. I've opted to report matches as a percent of total words in a document or set. All of the statistical measures I use make this assumption too. Beware if your pattern matches non-wordlike features.
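The percent-of-words idea, and the reason for the warning, can be shown with a small Python sketch. How BOR counts words is not stated here, so counting runs of word characters is an assumption, as is the function name percent_of_words.

```python
import re


def percent_of_words(pattern, text):
    """Number of matches as a percent of the number of words in the text."""
    total_words = len(re.findall(r"\w+", text))
    hits = sum(1 for _ in re.finditer(pattern, text))
    return 100.0 * hits / total_words if total_words else 0.0


# A word-like pattern gives a sensible percentage
print(percent_of_words(r"\b(dog|cat)\b", "the dog and the cat and the bird"))

# A pattern that matches punctuation is not counting words at all
print(percent_of_words(r"[.,!?]", "Hi, there. Stop!"))
```

The second call reports 100 percent even though no word matched, which is why patterns for non-wordlike features need careful interpretation.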

A regular expression tutorial.


Formatter ("RE") menu

The Formatter is the small "RE" menu next to the search regex fields. It performs several functions.

The most important function is to put a comma- or space-separated list of words into an alternating regular expression.

Some of the formatter commands also stem the list words and then add a pattern to match common suffixes.

cats,dogs,pet,fur,feed

becomes

\b(cat|dog|pet|fur|feed)(s|es|...)?

This will match dog, dogs, pet, feeding, etc.

The suffix formatters add specialized suffix patterns to your regular expression.

Prefix formatters prefix your RE with adjective classes.
Farah Benamara et al, Sentiment Analysis: Adjectives and Adverbs are better than Adjectives Alone
(http://oasys.umiacs.umd.edu/oasys/papers/icwsmV2.pdf)
Source for the idea of the prefixes.
M. F. Porter, An algorithm for suffix stripping.
(http://www.eden.rutgers.edu/~luliu/nlp/porter/porter.txt)
Source for the idea of the suffixes.

Cluster tab

The Cluster tab contains reports that automatically cluster a set of documents according to related features.

Pictogram

The Pictogram cluster uses the pictogram report categories as feature space. These are listed in settings/img/iconList.txt. The report weights each feature by its expected frequency (p = 0.0 to 1.0) in the document set and adjusted by the size of a given document. Missing features are given zero weight unless the Lack checkbox is checked. In this case, lack of a very common feature is given a weight of -1 * (1-1/p), adjusted for document size. This tries to give missing features importance when they are common enough to be expected in a particular document. The clusters for this and the next two reports are found using a form of the k-means algorithm. You should enter a "Number of clusters to find" that is around 5 to 20% of the total number of documents.

Key terms

This report identifies the key terms in a document set by a form of TFIDF (term frequency, inverse document frequency). These are words that occur in more than just a few documents but not in most of them. The TFIDF weighting gives more weight to terms that are rare across the document set or frequent in a given document. Terms that don't occur in a document are given zero weight.
If the Lack option is checked, missing terms that are very common but lacking in a given document are given a negative weight.
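For orientation, here is a minimal Python sketch of the textbook TFIDF weighting. BOR uses "a form of" TFIDF, so its exact formula, its filtering of very rare and very common terms, and the Lack weighting are not reproduced here; the function name tfidf and the sample documents are made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents each term appears in
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))
    weights = []
    for words in tokenized:
        counts = Counter(words)
        weights.append({
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the dog barked", "the cat slept", "the dog slept"]
for term, weight in sorted(tfidf(docs)[0].items(), key=lambda item: -item[1]):
    print(term, round(weight, 3))
```

In the first document "barked" gets the highest weight because it appears nowhere else, while "the" gets zero because it appears in every document.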

Word list

This cluster uses a user-supplied list of words found in settings/wordlist.txt. Like the Pictogram cluster, it uses the expected frequency (p) of each word as a feature weight. Lack of a feature is weighted as -1*(1-1/p).
The Lack checkbox has no effect on the weighting.

R-cluster

The R-cluster is a little different from the previous clusters. It uses the feature space of the last run cluster report to create a similarity matrix of all documents. Next it sets all similarity scores that are below the user-set r-threshold, to zero. This breaks the similarity matrix into clusters of closely related documents which are then displayed.
The Lack checkbox has no effect on the weighting.
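The thresholding step can be sketched in Python. The sketch assumes that, once scores below the r-threshold are dropped, a cluster is any group of documents still connected by the remaining links; that reading, the function name r_clusters and the sample matrix are assumptions and not taken from BOR's code.

```python
def r_clusters(similarity, threshold):
    """Group document indexes that stay linked after thresholding.

    `similarity` is a square matrix (list of lists) of similarity scores.
    """
    n = len(similarity)
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            i = stack.pop()
            group.append(i)
            for j in range(n):
                # Scores below the threshold are treated as zero (no link)
                if j not in seen and similarity[i][j] >= threshold:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(group))
    return clusters


sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
print(r_clusters(sim, 0.5))
```

Raising the threshold breaks the set into more and smaller clusters; lowering it merges them.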

All of the cluster reports feature Pictogram and Best terms links. Click on them to see which features were notable in a particular cluster. The Detach cluster links "undock" a cluster from the results so you can run other reports available under the Search menu.


Dual RE tab

The Dual RE tab allows you to combine, compare and search using two regular expressions at the same time.

Add (+) finds documents that match both expressions.

Subtract (-) finds documents that match the top RE (RE 1) but not RE 2.

Percent (%) shows the matches for RE 1 as a percentage of those for RE 2.

You can use this to do things like get a rough idea of the percent of male characters:
Set RE 1 to "(he|him|guy|boy|men|man)" and RE 2 to "(he|him|guy|boy|men|man|she|her|woman|women|girl|lady)". Turn on word borders!
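The percent calculation for this example can be sketched in Python. Word borders are written into the expressions as \b, and Python's re module stands in for BOR's engine; both the function names and the sample sentence are made up for the illustration.

```python
import re


def count(pattern, text):
    """Number of matches for the pattern in the text."""
    return sum(1 for _ in re.finditer(pattern, text))


def percent(re1, re2, text):
    """Matches for re1 as a percentage of the matches for re2."""
    total = count(re2, text)
    return 100.0 * count(re1, text) / total if total else 0.0


male = r"\b(he|him|guy|boy|men|man)\b"
everyone = r"\b(he|him|guy|boy|men|man|she|her|woman|women|girl|lady)\b"

text = "he said she saw him with her and a girl"
print(percent(male, everyone, text))
```

Without the word borders, "he" would also be counted inside "she" and "her", which would inflate the male percentage.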

The button labeled "(1)|(2)" replaces RE 2 with an alternation of RE 1 and RE 2. For most expressions this should match everything that the two expressions matched by themselves.

Clicking "(1)near(2)" will set RE 1 to "((RE 1) ([\w,\-]+ ){0,5}(RE 2))|((RE 2) ([\w,\-]+ ){0,5}(RE 1))", with no change to RE 2.
Use this to create a "neighborhood expression"...
Set RE 2 to \b(?i)(her|she|girl|gal|lady|woman|women|he|him|boy|guy|man|men)s?\b [non-specific character expression].

Set RE 1 to (afraid|alarmed|angry|anxiety|anxious|appalled|apprehensive|arguing|bad|badly|cried|cry|danger|dead|death|depressed|die|dirty|disappointed|disgusted|distraught|embarrassed|enemy|evil|exhausted|fear|furious|hate|hurt|ill|kill|lifeless|mad|nervous|numb|sad|scared|scary|scream|shit|shoot|sick|sorry|suspicious|tears|terrible|terrified|upset|worried) [most common negative words].

Click (1)near(2) and RE 1 will be set to an expression which might be described as "bad social interactions". You could then click the % button to see the percent of bad interactions in different dreamers.
Then start over, but use a family character expression like (my (\w+ )?(mother|mom|father|dad|brother|sister|son|daughter|husband|wife)).
Does the dreamer have more unpleasant interactions with family members in particular?
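The expression built by "(1)near(2)" follows the template given above, so it can be assembled by hand as well. The Python sketch below is only an illustration: the function name near is hypothetical, the word lists are shortened, and Python's re module stands in for BOR's engine.

```python
import re


def near(re1, re2):
    """Build the neighborhood expression: re1 within five words of re2."""
    gap = r"([\w,\-]+ ){0,5}"
    return ("((" + re1 + ") " + gap + "(" + re2 + "))"
            "|((" + re2 + ") " + gap + "(" + re1 + "))")


negative = r"\b(afraid|angry|scared|upset)\b"   # shortened word list
people = r"\b(she|he|man|woman)\b"              # shortened word list

bad_social = near(negative, people)
match = re.search(bad_social, "she was very angry with me")
print(match.group(0))
```

Because the template tries both orders, "angry at the man" and "the man was angry" are both found by the same expression.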

The "Applies to" selector allows you to select the directory that you want the report to test. Useful when you have more than one set of documents added.

You can use the picker to enter expressions directly into the RE 1 field of the Dual RE panel. Note that you can use the swap button (↑↓) to move the expression from RE 1 to RE 2.

Tutorials

This tutorial shows you how to work with multiple directories and compare multiple sets of dreams.

This tutorial shows you how to profile a dreamer.

A short study on factors that can identify the gender and age of a dreamer: Blogfactors.

How to use pictograms to visually identify content.


Download

Download, unzip and double-click the bor.jar file to run.

Requires Java JRE (http://www.java.com/getjava/).

Optional thesaurus files: download thesaurusfiles.zip, unzip and place in the settings folder.


Build History