Bag O' Regex
Bag O' Regex (BOR) is a document explorer that uses regular expressions and text mining techniques to
search a set of documents for content.
The results show matching documents, number of matches and the significance of the matches.
You can also look for correlations between different content categories.
This is done by comparing the content in a particular document or cluster of documents with the rest of the set.
How to use BOR
BOR works with sets of plain text documents: docsets. Before you do anything else you must
select a folder containing text documents and add them all from the settings panel (folder button with a '+').
Click the Add button and browse to a folder with your set of documents. Select a document and click the Open button.
Now you are ready to do some basic searches.
Basic searches
Regex
- Enter a regular expression (regex) in the text field of the search tab.
- Set options for the case sensitivity, word borders, and the format of the result.
- Click the Regex button (the one with the magnifying glass icon).
- The results are sorted by either the h or G2 statistic with a visual indicator for the strength of the matches in each document.
- Links in the result window show you the matches in each document.
- Result windows can be saved as html documents.
- Run another report from the result window to restrict its results to the documents found in the first search.
Plugins
Plugin reports run multiple regular expressions defined in files in the plugins folder.
Select "Characters" from the Plugins menu. The report shows results for 20 categories of character-related content.
To see the matches in context, click on a category name. When run from a cluster the results compare the cluster to
the rest of the documents in the set and highlight significant differences.
If a line in a plugin file starts with an image icon name and the icon is located in /settings/img then the icon is
displayed in the report.
format example
dog.gif,(dogs|pupp(y|ies)|canine)
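As a rough illustration of the line format above, the following Python sketch splits a plugin line into an optional icon name and its regex. The helper name `parse_plugin_line` and the rule for recognizing an icon name (a leading comma-terminated field ending in a common image extension) are assumptions made for this example; BOR is a Java application and its own parsing may differ.

```python
import re

# Hypothetical helper, not part of BOR. Assumes an icon name is a leading
# field that ends in a common image extension and is followed by a comma.
ICON_PREFIX = re.compile(r"^([\w.-]+\.(?:gif|png|jpe?g)),(.*)$")

def parse_plugin_line(line):
    """Return (icon_name_or_None, regex_text) for one plugin line."""
    line = line.strip()
    m = ICON_PREFIX.match(line)
    if m:
        return m.group(1), m.group(2)
    # No icon prefix: treat the whole line as the regex.
    return None, line

icon, pattern = parse_plugin_line("dog.gif,(dogs|pupp(y|ies)|canine)")
```

Under these assumptions, `icon` names the image to look up in /settings/img, and `pattern` is the expression that would be searched.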
Profile
For sets of dreams, diaries, biographies and related genres, select the Profile item from the Profile menu.
The report attempts to guess information about the author's family, gender, age and era.
Names
The Names report (also found under the Profile menu) simply tries to identify all of the names in a docset.
Network
The Network report links the matches for the entered regex by their co-occurrence within a document.
This is especially useful for lists of names, places and roles, where the report generates a social network.
If you don't enter a regex the report uses a special expression that searches for capitalized words and initials, not occurring at the start of a sentence and filtered by the stop_words.txt list.
Available under the Net menu. (More details.)
Concordance
The Concordance report is available under the Net menu.
The result window shows you the matches for your regular expression in context.
Proximity
The Proximity report, also under the Net menu, shows you the co-occurrence of matches for your regular expression
within a window of about ten words. Documents are scored by the number of paired matches and the closeness of the matches.
Concept net
The Concept net report, also under the Net menu, finds all neighbors (within ± 25 characters) of the entered regex (usually a single word).
It then runs a proximity search on non-stopwords in the result and orders them by proximity score with an additional boost
if a pair of terms occurs in both directions (e.g., "I parked my car" and "my car was parked").
The results are displayed in a table where link terms trigger a new search.
[Figure: Three steps along a concept network]
In this partial result you can see that "getting wet" and "soaking wet" occur in one direction only, but that "wet sand" and "sand (was|is) wet" are both present.
If you don't supply a regex a random seed word is chosen for you.
Related
The Related search finds the key terms of a document and then matches other documents as similar
based on a proximity search of the same terms. Run this report from the search menu of individual documents.
Cloud
The Cloud report is also run from individual docs. It shows words in a font size proportional to their importance in the doc.
Field Guide
Like a naturalist's field guide, this report gives you a compact list of distinguishing
features of a document or word.
For documents the report lists
- Key terms (by term frequency, inverse document frequency),
- The classes of the key terms,
- Names and places (capitalized words not at the start of a sentence or stopwords),
- Arabic and Roman numerals,
- Emotional highlights of the document,
- Pictogram showing iconic features of the document,
- Related documents.
For single terms, select the term and right-click to pick "Field Guide" from the pop-up menu.
This version shows the term, its closest neighbors, words similar in spelling or by sound and idiomatic usage.
It also lists all documents containing the term.
Pictogram
The Pictogram report searches for about 250 content categories using a list of regular expressions found in settings/img/iconList.txt.
The same directory contains 250 small image files (icons) that are displayed in the report. When run from the Search tab, the report
shows a table of icons and counts, which indicate the percent of words that matched the expression. When run from a cluster or single document, the percents found in
just those docs are also displayed. If the number of matches is lower or higher than expected by chance, the count is shown in blue or red.
You can edit or add categories and icons to suit your needs.
Illustrate
Available from the Search menu of individual documents.
Illustrate redisplays the document text, inserting icons wherever the text matched a Pictogram category.
This lets you see how well the categories are matching.
Matches as text
Available on the Search Panel's Sampling menu.
Searches for the entered expression and displays the matches (only) in a plain text document window.
Useful if you want to compare the matches to the document set as a whole. (First save the result and add it to the document set.)
One idea is to search for a particular feature in the context of plus or minus 10 words, save and add the text results and compare it to the document set as a whole.
If you wanted to answer a question like "Do introverts carry their waking aversion to social interactions into their dreams?",
you could target text neighborhoods where (I|me) occurs with (he|him|she|they|man|woman|etc.) and then look for fear and avoidance in the match doc. Then compare with results for extroverts.
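The neighborhood search described here can be tried outside BOR. The sketch below uses Python's `re` module as a stand-in for BOR's Java regex engine (an assumption; the two agree on the constructs used) and applies the "social neighborhood" expression given in this section. The names `SOCIAL` and `social_neighborhoods` are made up for the example.

```python
import re

# The alternation of "other" characters, as used in the expression in this section.
OTHERS = r"(they|them|he|him|she|man|woman|boy|girl)"

# Self followed by other, or other followed by self, each within 25 characters,
# padded with up to 25 characters of context on both sides.
SOCIAL = re.compile(
    r"\b.{0,25}\b((I|me)\b.{0,25}\b" + OTHERS + r"|"
    + OTHERS + r"\b.{0,25}\b(I|me))\b.{0,25}\b"
)

def social_neighborhoods(text):
    """Return every text neighborhood where self and another character co-occur."""
    return [m.group(0) for m in SOCIAL.finditer(text)]

hits = social_neighborhoods("Then she looked at me and smiled.")
```

Saving such neighborhoods as their own documents is what makes the follow-up search for fear and avoidance terms possible.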
"social neighborhood" expression: \b.{0,25}\b((I|me)\b.{0,25}\b(they|them|he|him|she|man|woman|boy|girl)|(they|them|he|him|she|man|woman|boy|girl)\b.{0,25}\b(I|me))\b.{0,25}\b
Thirds
Thirds allows you to test for content by position in a text. It is located in the Sampling menu. Thirds concatenates all current documents into a single file while adding document markers so that
each document text is divided into three equal-sized pieces. Save the "thirds" document in a new directory and then use the
Split command to produce individual documents. (You'll end up with three times as many documents as you started with. Remember
to discard the large thirds document before adding the new set.) Each document is marked with a special term: _FIRST, _MIDDLE or
_LAST.
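A minimal sketch of the splitting idea, in Python. It assumes the pieces are measured in words and that the marker term is placed at the start of each piece; the text above does not say which unit BOR uses or where it puts the marker, so both are assumptions, as is the function name.

```python
def split_into_thirds(text):
    """Split text into three roughly equal word-count pieces, each tagged."""
    words = text.split()
    n = len(words)
    cut1, cut2 = n // 3, (2 * n) // 3
    pieces = [words[:cut1], words[cut1:cut2], words[cut2:]]
    markers = ["_FIRST", "_MIDDLE", "_LAST"]
    # Prefix each piece with its marker so it can later be found by a search.
    return [" ".join([marker] + piece) for marker, piece in zip(markers, pieces)]

thirds = split_into_thirds("one two three four five six")
```

Once each piece carries its marker, a search for a content category can be compared across the _FIRST, _MIDDLE and _LAST documents.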
Picker
The picker is found on the Search tab (small button on the lower right). When you click the button, a window opens displaying all of the
pictogram icons listed in settings/img/iconList.txt. Click on an icon to trigger a search for the category. If the Dual RE tab is
selected, clicking on an icon enters its expression into the top search field.
Checkbox options on the Search tab
- Anti (anti-regex) Show documents where the regex did not match.
- Counts Show only the counts for each match.
- (?i) Makes your search case insensitive.
- \b Assumes you want word borders around your regex.
- Limits Use date and document size limits.
- Batch Uncheck if you want to shut off batch searching of plugin reports. You get just the categories so you can search for the ones you're interested in.
- a/A Toggles the font size of reports and some text components.
- h/G2 Select either the h or G2 statistic.
- Significant results For some searches, limits results to the most significant matches.
G2 (a.k.a. log-likelihood): Ted Dunning, Accurate Methods for the Statistics of Surprise and Coincidence.
(http://www.comp.lancs.ac.uk/ucrel/papers/tedstats.pdf)
Chi square versus log-likelihood statistic for word significance in different corpora.
Using G2: Rayson and Garside, Comparing Corpora using Frequency Profiling.
(http://www.comp.lancs.ac.uk/computing/users/paul/publications/rg_acl2000.pdf)
Using log-likelihood of semantic categories to find the key words in a set of documents.
The h score: Schneider and Domhoff, Our Statistical Approach.
(http://www.dreamresearch.net/Info/statistics.html)
Using percent difference in features found in dream reports.
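For readers who want to see the two statistics concretely, here is a sketch in Python. The G2 function follows the two-corpus log-likelihood formula described by Rayson and Garside; the h function is Cohen's h, which to my understanding is the effect size used in the Schneider and Domhoff approach. How BOR itself computes and scales these scores is not specified in this document, so treat the code as an illustration only.

```python
import math

def g2(a, b, c, d):
    """Log-likelihood for a feature seen a times in c words vs. b times in d words."""
    e1 = c * (a + b) / (c + d)  # expected count in the first set
    e2 = d * (a + b) / (c + d)  # expected count in the second set
    total = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # a zero count contributes nothing
            total += observed * math.log(observed / expected)
    return 2 * total

def cohens_h(p1, p2):
    """Effect size for the difference between two proportions (0.0 to 1.0)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))
```

G2 grows with the amount of evidence, so it favors large sets; h depends only on the two proportions, which makes it easier to compare across sets of different sizes.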
Search Panel Menus
Commands (upper menu bar)
- File Add a set of docs, Open a doc, etc.
- Plugins A set of user editable batch searches for related content categories.
- Profile Reports that identify something about the author of the docs, or global features of the docs.
- Docs Displays information about the document set.
- Net Reports based on the co-occurrence of terms, concordance, proximity and network.
- Sampling Take random samplings of the documents and terms.
- Tools Stand-alone jar files and utility functions.
Regular expressions (lower menu bar)
These contain single regular expressions to identify a specific content category.
- Modes Emotions, Sex, and various other cognitive categories.
- Characters Find characters by category.
- Body References to body and sensation.
- Settings Indoor, outdoor, weather and related features.
- Typical Typical dream scenarios
- Personal User entered themes or features.
- Style Non-content features.
- User More user-defined categories.
Regular expressions (regex)
A regular expression or regex is a pattern that matches text in a document. The goal is to match everything you want without matching anything you don't.
The simplest regular expression is a single word like television.
If you enter this expression in the Search field and then click Find, BOR will return all documents that contain matches.
Although a single word is a regular expression, most searches involve two or more related words. If you were doing a search for "pets" on the internet you might type in "pets cat dog".
In BOR's search field, this would be treated as a regular expression and you would only match a document containing the literal string. To search for the individual terms you need to put them in an alternation.
Alternation
Alternations are the most common type of RE used in BOR. An alternation is simply a big "or" statement.
Enter a list of words you are searching for:
cat,dog,pet,kitten,puppy,rex,fluffy
Put these into alternating form by selecting Alternation from the "RE" Menu.
(cat|dog|pet|kitten|puppy|rex|fluffy)
If you search for this regular expression you will match documents containing one or more of the 7 strings in the alternation.
Notice that you will find some matches for things you didn't intend:
cat matches "catch"
pet matches "carpet"
Word borders
To match only on word boundaries add "\b" to the start and end of the RE.
\b(cat|dog|pet|kitten|puppy|rex|fluffy)\b
Since fluffy and rex are names you should capitalize them in the RE. (Regular expressions are case-sensitive.)
\b(cat|dog|pet|kitten|puppy|Rex|Fluffy)\b
Case insensitivity
You can also add (?i) to the start of your RE.
(?i)\b(cat|dog|pet|kitten|puppy|rex|fluffy)\b
This tells the search engine to match without case-sensitivity. You will match "the fluffy pillow" which you probably don't want.
If you use this RE on a set of documents on pets you'll notice that you match "cat" but not "cats".
To match plurals add an optional suffix:
(?i)\b(cat|dog|pet|kitten|puppy|rex|fluffy)s?\b
The question mark tells the search engine that the s is optional.
This won't match "puppies" so add a special case:
(?i)\b(cat|dog|pet|kitten|pupp(y|ies)|rex|fluffy)s?\b
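The steps above can be replayed on a short sample sentence. This sketch uses Python's `re` module rather than BOR's Java engine (an assumption; the constructs used here behave the same way in both), and the sample text and the helper `matches` are made up for illustration.

```python
import re

def matches(pattern, text):
    """Return the full text of every match, in order."""
    return [m.group(0) for m in re.finditer(pattern, text)]

text = "My cat chased the carpet. Two cats and three puppies met Fluffy."

# Plain alternation: also hits "pet" inside "carpet" and "cat" inside "cats".
plain = matches(r"(cat|dog|pet|kitten|puppy|rex|fluffy)", text)

# Word borders: only whole words survive.
bordered = matches(r"\b(cat|dog|pet|kitten|puppy|rex|fluffy)\b", text)

# Case-insensitive, with optional plural and the "puppies" special case.
final = matches(r"(?i)\b(cat|dog|pet|kitten|pupp(y|ies)|rex|fluffy)s?\b", text)
```

Comparing the three lists shows the trade-off at each step: the plain form over-matches, the bordered form under-matches, and the final form recovers plurals and capitalized names.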
Example: flying dreams
In order to identify all dream reports with flying you need to find phrases like "I'm flying", "I fly", "I flew", "I was floating", etc.
A simple solution is:
\b(I|'m| was) fl(ew|y|oat)(ing)?\b
But this could miss cases like "I began to fly", "I started to float", "I seemed to be flying". So try this:
\bI (\w+\s+){0,3}fl(ew|y|oat)(ing)?\b
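To compare the two flying expressions side by side, here is a small check written with Python's `re` module (an assumption: BOR uses Java's regex engine, which treats these constructs the same way). The phrase list and variable names are invented for the example.

```python
import re

SIMPLE = r"\b(I|'m| was) fl(ew|y|oat)(ing)?\b"
FLEXIBLE = r"\bI (\w+\s+){0,3}fl(ew|y|oat)(ing)?\b"

phrases = ["I fly", "I was floating", "I began to fly", "I seemed to be flying"]

# Keep the phrases in which each expression finds at least one match.
simple_hits = [p for p in phrases if re.search(SIMPLE, p)]
flexible_hits = [p for p in phrases if re.search(FLEXIBLE, p)]
```

Note that the flexible form requires a space directly after "I", so a contraction such as "I'm flying" is caught only by the simple form; a real search may want both.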
The subexpression (\w+\s+){0,3} matches up to 3 repetitions of "word followed by whitespace".
"word" is defined by \w+: one or more word characters.
Character sets
[a-z]: any lowercase letter
[a-zA-Z]: any letter
[aeiuo]: just the lowercase vowels
[^aeiou]: anything but lowercase vowels (even numbers and punctuation)
[a-z&&[^aeiou]]: any lowercase letters, but not vowels
[aeiou]+: any string of 1 or more vowels
[0-9[^\w]]: any digits or non-word characters (spaces, punctuation)
[a-z]{2,4}: any string of 2 to 4 lowercase letters
Lookahead
x(?=exp) matches x only if followed by exp
x(?!exp) matches x if not followed by exp
Lookbehind
(?<=exp)x matches x only if it is preceded by exp
(?<!exp)x matches x if it is not preceded by exp
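The four lookaround forms can be seen at work on one sample string. The sketch uses Python's `re` module as a stand-in for BOR's Java engine (an assumption; note that Python requires a lookbehind to have a fixed width, as it does here). The sample text and the helper `starts` are invented for the example.

```python
import re

text = "my mother, his mother, my father"

def starts(pattern):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

my_before_father = starts(r"my(?= father)")      # lookahead
my_not_before_father = starts(r"my(?! father)")  # negative lookahead
mother_after_my = starts(r"(?<=my )mother")      # lookbehind
mother_not_after_my = starts(r"(?<!my )mother")  # negative lookbehind
```

In every case only x itself is part of the match; the lookaround text is tested but not consumed, which is why the offsets point at "my" or "mother" alone.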
Quantifiers
The following regexes, when applied to a text containing "Tiff and Tiffany", match as follows:
(Tiff[a-z]+) only matches Tiffany
(Tiff[a-z]?) matches Tiff and Tiffa (the first five letters of Tiffany)
(Tiff[a-z]{0,3}) matches both Tiff and Tiffany
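The behavior of the three quantifiers is easy to check directly. This sketch runs them through Python's `re` module (an assumption; BOR's Java engine treats these quantifiers the same way) on the same sample text.

```python
import re

text = "Tiff and Tiffany"

# "+" needs at least one more lowercase letter after "Tiff".
plus = re.findall(r"(Tiff[a-z]+)", text)

# "?" takes zero or one more letter, so it stops after "Tiffa".
optional = re.findall(r"(Tiff[a-z]?)", text)

# "{0,3}" takes up to three more letters, enough to reach "Tiffany".
ranged = re.findall(r"(Tiff[a-z]{0,3})", text)
```

Quantifiers are greedy by default, which is why each one takes as many letters as its limit allows.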
A few more regular expressions
my (mother|father|mom|dad)
meaning: any sequence starting with "my ", followed by one of "mother" or "father" or "mom" or "dad".
matches: "my mother", "my father", "my mom", "my dad"
fl(y|oat){1}(ing|s){0,1}
meaning: any sequence starting with "fl", followed by exactly one of "y" or "oat", followed by zero or one of "ing" or "s".
matches: "fly", "flys", "float", "floats", "flying" and "floating"
[^a-c]at
meaning: any single character other than "a", "b" or "c", followed by "at".
matches: "rat" and "hat" but doesn't match "cat" or "bat"
[mf][oa]ther(.in.law)?
meaning: any sequence starting with "m" or "f", followed by "o" or "a", followed by "ther", followed by an optional ".in.law", where the "." character can be any character.
matches: "mother", "mother-in-law", "father", "father in law" (no hyphens) and even "fatherXinXlaw"
\bdog\b
meaning: the sequence "dog", but only if found on a word boundary.
matches: "dog" but not "doggerel" or "hotdog"
(To match the last two use \bdog\B and \Bdog\b.)
\b((switch|change|turn)s{0,1} (out|into|to))|((bec(a|o)me)s{0,1}\b (a|an|the))\b|\b(now (is|a|an))\b
matches: metamorphosis
\b(I|I'm) .{2,34}[.?!]("|')?
I do X.
\b(she|he) says.{2,51}[.?!]("|')?
he says/she says
\b(\w+) \1\b
\1 refers back to the text captured by the first set of parentheses.
repeated word
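A short check of the repeated-word idea, assuming the intended pattern is `\b(\w+) \1\b` (a word, a space, then the same word again). It is written with Python's `re` module as a stand-in for BOR's Java engine; the function name is made up for the example.

```python
import re

# Group 1 captures a word; \1 then requires the very same text to follow.
REPEATED = re.compile(r"\b(\w+) \1\b")

def repeated_words(text):
    """Return each word that is immediately repeated in text."""
    return [m.group(1) for m in REPEATED.finditer(text)]

doubles = repeated_words("I saw the the dog and and then left")
```

The trailing \b matters: without it, "the then" would count as a repeat because "the" is a prefix of "then".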
Because regular expression searches can return anything (the whole text of a document,
parts of words, all of the punctuation characters in a document) there's no simple way to
score documents for how well they matched the search pattern.
I've opted to report matches as a percent of total words in a document or set. All of the
statistical measures I use make this assumption too. Beware if your pattern matches non-wordlike features.
A regular expression tutorial.
Formatter ("RE") menu
The Formatter is the small "RE" menu next to the search regex fields. It performs several functions.
The most important function is to put a comma or space separated list of words into an alternating regular expression.
Some of the formatter commands also stem the list words and then add a pattern to match common suffixes.
cats,dogs,pet,fur,feed
becomes
\b(cat|dog|pet|fur|feed)(s|es|...)?
This will match dog, dogs, pet, feeding, etc.
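The core of the formatter, turning a word list into an alternation, can be sketched in a few lines of Python. This is only an illustration: the function name and the `word_borders` parameter are hypothetical, and the stemming and suffix patterns described above are deliberately left out because their exact rules are not given here.

```python
import re

def to_alternation(word_list, word_borders=False):
    """Turn a comma- or space-separated word list into an alternation regex."""
    words = [w for w in re.split(r"[,\s]+", word_list.strip()) if w]
    # Escape each word so characters like "." are matched literally.
    body = "(" + "|".join(re.escape(w) for w in words) + ")"
    return r"\b" + body + r"\b" if word_borders else body

pets = to_alternation("cat,dog,pet,kitten,puppy,rex,fluffy")
bordered = to_alternation("cat dog", word_borders=True)
```

Escaping the words is a design choice for this sketch; it keeps a pasted list from being misread as regex syntax, at the cost of not allowing patterns inside the list.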
The suffix formatters add specialized suffix patterns to your regular expression:
- -most common suffixes matches most common suffixes
- -verb suffixes matches verb forms
- -object matches object forms
- -modifiers matches adjective and adverb forms
Prefix formatters prefix your RE with adjective classes
- Intensifiers- really, very, absolutely
- Doubt- somewhat, almost, kind of, seemingly, a little
- Minimizers- scarcely, hardly
- Negators- never, not
- Alternate puts your list in a plain alternation without stemming.
- Stem and alternate puts your list in a plain alternation with stemming.
- In context pads your regular expression to match ± 20 characters. Follow this with a Regex search and you get your original matches in context.
- ±Word pads your regular expression to match ± word.
- Proximity duplicates your regular expression and inserts a pattern that matches 0 to 4 words between.
- Neighbors finds all non-stopwords within 25 characters of the entered regex.
- Expand does a query expansion
Cluster tab
The Cluster tab contains reports that automatically cluster a set of documents according to related features.
Pictogram
The Pictogram cluster uses the pictogram report categories as feature space. These are listed in settings/img/iconList.txt.
The report weights each feature by its expected frequency (p = 0.0 to 1.0) in the document set, adjusted for the size of a given document.
Missing features are given zero weight unless the Lack checkbox is checked. In this case, lack of a very common feature is given a weight of
-1 * (1-1/p), adjusted for document size. This tries to give missing features importance when they are common enough to be expected in a particular document.
The clusters for this and the next two reports are found using a form of the k-means algorithm.
You should enter a "Number of clusters to find" that is around 5 to 20% of the total number of documents.
Key terms
This report identifies the key terms in a document set by a form of TFIDF (term-frequency, inverse document frequency).
These are words that occur in more than just a few documents, but less than most documents.
The TFIDF weighting gives more weight to terms that are rare in all documents or frequent in a given document.
Terms that don't occur in a document are given zero weight.
If the Lack option is checked, missing terms that are very common but lacking in a given document are given a negative weight.
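The text above only says that BOR uses "a form of" TFIDF, so the following Python sketch shows the textbook weighting (term frequency times the log of inverse document frequency) purely as a reference point. The function name and the toy documents are invented, and the negative weights of the Lack option are not modeled.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [["dog", "dog", "park"], ["cat", "park"], ["park", "rain"]]
w = tfidf(docs)
```

In this toy set "park" occurs in every document and so gets zero weight, while "dog" is frequent in one document and absent elsewhere, which makes it that document's key term.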
Word list
This cluster uses a user-supplied list of words found in settings/wordlist.txt.
Like the Pictogram cluster, it uses the expected frequency (p) of each word as a feature weight.
Lack of a feature is weighted as -1*(1- 1/p).
The Lack checkbox has no effect on the weighting.
R-cluster
The R-cluster is a little different from the previous clusters. It uses the feature space of the last run cluster report to create a similarity matrix of all documents.
Next it sets all similarity scores that are below the user-set r-threshold, to zero. This breaks the similarity matrix into clusters of closely related documents which are then displayed.
The Lack checkbox has no effect on the weighting.
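One way to picture the thresholding step is as a graph problem: keep only the links at or above the r-threshold, then read off the groups of documents that are still connected. The Python sketch below does exactly that. Treating the surviving links as connected components is an assumption of this example (the text does not say how BOR extracts the clusters), and the function name and sample matrix are invented.

```python
def r_clusters(similarity, threshold):
    """Group document indices linked by similarity scores >= threshold."""
    n = len(similarity)
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first walk over the links that survive the threshold.
        stack, group = [start], []
        seen.add(start)
        while stack:
            i = stack.pop()
            group.append(i)
            for j in range(n):
                if j not in seen and similarity[i][j] >= threshold:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(group))
    return clusters

sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
groups = r_clusters(sim, 0.5)
```

Raising the threshold breaks the set into more, smaller clusters; at a high enough value every document stands alone.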
All of the cluster reports feature Pictogram and Best terms links. Click on them to see which features were notable in a particular cluster.
The Detach cluster links "undock" a cluster from the results so you can run other reports available under the Search menu.
Dual RE tab
The Dual RE tab allows you to combine, compare and search using two regular expressions at the same time.
Add (+) finds docs that match both expressions.
Subtract (-) finds documents that match the top RE (RE 1) but not RE 2.
Percent (%) shows the matches for RE 1 as a percentage of those for RE 2.
You can use this to do things like get a rough idea of the percent of male characters:
Set RE 1 to "(he|him|guy|boy|men|man)" and RE 2 to "(he|him|guy|boy|men|man|she|her|woman|women|girl|lady)"; Turn on word borders!
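The three Dual RE reports can be sketched in Python as follows. This is an illustration under assumptions: the function name `dual_re`, the dictionary of sample documents and the choice to count the percent over all matches in the set are mine, and Python's `re` module stands in for BOR's Java engine. Word borders are written into the expressions explicitly.

```python
import re

def dual_re(docs, re1, re2):
    """docs: {name: text}. Returns the add, subtract and percent results."""
    p1, p2 = re.compile(re1), re.compile(re2)
    add = [n for n, t in docs.items() if p1.search(t) and p2.search(t)]
    subtract = [n for n, t in docs.items() if p1.search(t) and not p2.search(t)]
    n1 = sum(1 for t in docs.values() for _ in p1.finditer(t))
    n2 = sum(1 for t in docs.values() for _ in p2.finditer(t))
    percent = 100 * n1 / n2 if n2 else 0.0
    return {"add": add, "subtract": subtract, "percent": percent}

MALE = r"\b(he|him|guy|boy|men|man)\b"
ALL = r"\b(he|him|guy|boy|men|man|she|her|woman|women|girl|lady)\b"

docs = {"a.txt": "he met her and she thanked him", "b.txt": "the lady left"}
result = dual_re(docs, MALE, ALL)
```

Without the word borders, "he" would also match inside "her", "she" and "the", which is why the example above insists on turning them on.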
The button labeled "(1)|(2)" replaces RE 2 with an alternation of RE 1 and RE 2. For most expressions this should match everything that the two expressions matched by themselves.
Clicking "(1)near(2)" will set RE 1 to "((RE 1) ([\w,\-]+ ){0,5}(RE 2))|((RE 2) ([\w,\-]+ ){0,5}(RE 1))", with no change to RE 2.
Use this to create a "neighborhood expression"...
Set RE 2 to \b(?i)(her|she|girl|gal|lady|woman|women|he|him|boy|guy|man|men)s?\b [non-specific character expression].
Set RE 1 to (afraid|alarmed|angry|anxiety|anxious|appalled|apprehensive|arguing|bad|badly|cried|cry|danger|dead|death|depressed|die|dirty|
disappointed|disgusted|distraught|embarrassed|enemy|evil|exhausted|fear|furious|hate|hurt|ill|kill|lifeless|mad|nervous|numb|sad|scared|scary|scream|shit|shoot|sick|sorry|
suspicious|tears|terrible|terrified|upset|worried) [most common negative words.]
Click (1)near(2) and RE 1 will be set to an expression which might be described as "bad social interactions". You could then click the % button to see the percent of bad interactions in different dreamers.
Then start over, but use a family character expression like (my (\w+ )?(mother|mom|father|dad|brother|sister|son|daughter|husband|wife)).
Does the dreamer have more unpleasant interactions with family members in particular?
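The expression that "(1)near(2)" builds follows a fixed template, so it can be reproduced with simple string formatting. The Python sketch below fills in the template quoted above; the function name `near` and the `max_gap` parameter are hypothetical, and the short word lists are stand-ins for the longer expressions in the example.

```python
import re

def near(re1, re2, max_gap=5):
    """Build ((RE 1) gap (RE 2))|((RE 2) gap (RE 1)) with up to max_gap words between."""
    gap = r"([\w,\-]+ ){0,%d}" % max_gap
    return "((%s) %s(%s))|((%s) %s(%s))" % (re1, gap, re2, re2, gap, re1)

# Shortened stand-ins for the negative-word and character expressions.
pattern = near("afraid|scared", "he|she")

forward = re.search(pattern, "scared of what he said")
backward = re.search(pattern, "she was very scared")
```

Because the template wraps each input in its own parentheses, alternations can be passed in without adding an outer group first.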
The "Applies to" selector allows you to select the directory that you want the report to test. Useful when you have more than one set of documents added.
You can use the picker to enter expressions directly into the RE 1 field of the Dual RE panel. Note that you can use the swap button (↑↓) to move the expression from RE 1 to RE 2.
Tutorials
This tutorial shows you how to work with multiple directories and compare multiple sets of dreams.
This tutorial shows you how to profile a dreamer.
A short study on factors that can identify the gender and age of a dreamer: Blogfactors.
How to use pictograms to visually identify content.
Download
Download, unzip and double-click the bor.jar file to run.
Requires Java JRE (http://www.java.com/getjava/).
Optional thesaurus files: download thesaurusfiles.zip, unzip and place in the settings folder.
Build History
- 11-25-2009: fixes/changes
- The Dual Expression tab's percent report can now show matches as % of words or % of documents matched. Useful for "at least one dream with" measures.
- I added expected results back to the pictogram report - in results from cluster context. Also removed the global counts from cluster results.
- The pictogram report - saved as html and viewed in a browser - was displaying all of the matched document links when they should have been hidden. Fixed.
- 11-20-2009: fixes/changes
- The pictogram report now shows matching results as a percent of words in a set or cluster. It no longer shows the expected count.
I did this because percent of words is, ultimately, what is being compared in this report.
- The pictogram would show the wrong global counts for a category after a set of dreams was removed (when using more than one set) - Fixed.
- The plain Regex reports were showing matches as percent of words incorrectly in some instances - Fixed.
- Moved many of the Verbal plugin categories to the pictogram since these seem to be most revealing of age, gender and (probably) personality.
- Added several new categories;
- Nom: expression for finding Names: capitalized words and single letters, not occurring at the start of a sentence.
- Male words, female words: the words listed in the Profile report.
- Young words, adult words: also from the Profile report, though not visible to the user there.
- Body image: matches fat, skinny, diet, etc.
- Removed categories; ufo and tubes.
- Added "RE as plugin" to the search tab (under the plugin menu). This shows the matches of a regular expression in plugin report format.
Useful here, now that you can select an "Applies to" set on the search tab.
- 11-05-2009: fixes/changes
- Fixed more reports that didn't work if date/size limits were active. I think I got them all now!
- 10-29-2009: fixes/changes
- The pictogram shows fewer matches when "show only significant results" is checked.
- The Names report was broken. Fixed it.
- 10-23-2009: fixes/changes
- The pictogram version of the automatic cluster report was needlessly recalculating all document counts whenever new documents were added. Same fix as in previous build.
- 10-20-2009: fixes/changes
- The pictogram report was needlessly recalculating all document counts whenever new documents were added. Now it only updates the new docs.
- The global count links on the pictogram were incorrectly limited to the selected "Applies to" set. Fixed.
- Eliminated or consolidated pictogram categories that didn't match very much. Consolidated categories; food, shapes, clothes (except for underwear), containers.
Eliminated; mirror, rug, oven, washing machine, fountain, pier, keyboard (under computer), key (under lock), spider (under bugs).
- Added four new emotion categories for positive and negative emotion occurring close to self and other characters.
- 10-06-2009: fixes/changes
- You can now use BOR to gather dreams from various internet sites by entering special commands or specific URLs in the text field at the top of any BOR html window and then selecting Find from the Search menu.
1) google:author:[author]
Searches the newsgroup alt.dreams for all dreams posted by the user [author].
Results are returned in your web browser - not ready for use in BOR.
Example:
google:author:Robert E. Lewis
2) Sawlogs online dreamlog (http://www.sawlogs.net)
Enter the URL below to get 100 dreams starting at id = 200
http://www.sawlogs.net/explore/dream.php?dream_id=200
This will create a text document with dreams number 200 - 299 (with some gaps) posted at sawlogs.
The document can be saved in a new directory and split into individual documents.
Use the Split command under the Tools menu on the BOR Search tab.
3) DreamJournal online (http://www.dreamjournal.net)
Get up to 100 dreams from DreamJournal, starting at dream 200 as in the previous example.
http://www.dreamjournal.net/index.cfm/do/journal.getdream/dream_id/200
As in the previous example most, but not all, html tags and characters are removed or
replaced and the resulting document can be saved and split into individual dreams.
4) DreamBank (http://www.dreambank.net)
The base URL for DreamBank searches is: "http://www.dreambank.net/random_sample.cgi".
You have to add the series name and max and min document length. Set n = 0 to get
all of the documents in the set.
The following example retrieves all of the Hall/Van de Castle female norms set.
http://www.dreambank.net/random_sample.cgi?series=norms-f&min=1&max=5000&n=0
[The series parameter can be determined by looking at the html source for the frame that
contains the list of available dream series.]
- 09-26-2009: fixes/changes
- Set a maximum width for "Applies to" selector on the Search tab so the trash/delete button will always show.
Note that the trash/delete button only applies to the "Applies to" selection.
- Fixed a bug in the Docs by directory report.
- New pictogram icon category for old man or woman.
- 09-20-2009: fixes/changes
- Added an "Applies to" selector to the Search tab. If you have multiple added directories and select one here, the report only applies to those documents.
- Added a Picker button to the Dual RE tab.
- Fixed some reports that didn't work if date/size limits were active.
- 09-16-2009: fixes/changes
- The Subtract tab has been renamed "Dual RE" (dual regular expression).
- Added new % (percent) report, + (add) report and - (subtract) reports.
- Subtract is the old Negation report from the Subtract tab. It finds documents that match the top RE (RE 1) but not RE 2.
- Add finds docs that match both expressions.
- Percent shows the matches for RE 1 as a percentage of those for RE 2.
You can use this to do things like get a rough idea of the percent of male characters;
Set RE 1 to "(he|him|guy|boy|men|man)" and RE 2 to "(he|him|guy|boy|men|man|she|her|woman|women|girl|lady)"; Turn on word borders!
- There's also a new button labeled "(1)|(2)". It replaces RE 2 with an alternation of RE 1 and RE 2. For most expressions this should match everything that the two expressions matched by themselves.
- "(1)near(2)" will set RE 1 to "((RE 1) ([\w,\-]+ ){0,5}(RE 2))|((RE 2) ([\w,\-]+ ){0,5}(RE 1))", with no change to RE 2.
Use this to create a neighborhood expression...
Set RE 2 to \b(?i)(her|she|girl|gal|lady|woman|women|he|him|boy|guy|man|men)s?\b [non-specific character expression].
Set RE 1 to (afraid|alarmed|angry|anxiety|anxious|appalled|apprehensive|arguing|bad|badly|cried|cry|danger|dead|death|depressed|die|dirty|
disappointed|disgusted|distraught|embarrassed|enemy|evil|exhausted|fear|furious|hate|hurt|ill|kill|lifeless|mad|nervous|numb|sad|scared|scary|scream|shit|shoot|sick|sorry|
suspicious|tears|terrible|terrified|upset|worried) [most common negative words.]
Click (1)near(2) and RE 1 will be set to an expression which might be described as "bad social interactions". You could then click the % button to see the percent of bad interactions in different dreamers.
Then start over, but use a family character expression like (my (\w+ )?(mother|mom|father|dad|brother|sister|son|daughter|husband|wife)).
Does the dreamer have more unpleasant interactions with family members in particular?
- New "Applies to" selector allows you to select the directory that you want the report to test. Useful when you have more than one set of documents added.
- You can now use the picker to enter expressions directly into the RE 1 field of the Dual RE panel. Note that you can use the swap button to move the expression from RE 1 to RE 2.
- 09-02-2009: fixes/changes
- The Dirs menu has been folded into the File menu.
- Added new icons; military, agriculture, medical. Also expanded the school expression to include more academic terms.
- 08-31-2009: fixes/changes
- The Names report is now searchable. In particular the Network link now applies only to the documents from the cluster from which it was run.
- The recently added directories under the Dirs menu now behave a little differently. If you select a previously added (but now grayed out) set you are no longer prompted to select a document from the directory. The docs are just added directly.
- The Verbal plugin's pronoun_subject was actually reporting on pronouns as object. I added a new category for this and fixed the expression for both categories.
- 08-28-2009: fixes/changes
- New "Matches by set" report. With multiple document sets added, this report shows matches as a % of words across each set. Highest and lowest scores are highlighted in red and blue respectively.
The report also displays a similarity matrix based on the cosine similarity of the matches' frequencies.
Useful for comparing multiple sets. Available in the Search tab's Sampling menu.
- Added an "Add set..." item to the Dirs menu on the Search tab.
- Still updating Pictogram/picker categories and expressions.
- 08-14-2009: fixes/changes
- The icons that show up in most clustered results (after you've run a Pictogram) are now search links. Click on an icon to see the matches within the cluster.
- Moved a few menu commands around on the Search panel.
- Updated and consolidated several pictogram categories.
- 07-31-2009: fixes/changes
- The picker now uses the iconList.txt file when the picker.txt file isn't found. This way you don't have to maintain a separate file for each.
- The Illustrate command (Search menu of individual documents) now asks you to select a plugin file. It only uses the categories found in the file for the report.
I did this so that I could narrow down the number of categories when testing a new plugin.
If you want to illustrate using all categories, select the /settings/img/iconList.txt file.
Note: this only works for categories that occur in the iconList.txt file.
- The Find button on the Search tab is now smaller so that the search field is a bit longer.
- Plugin menus are now identified by a plug icon. On document and result windows the plugins now have their own menu.
- New pictogram categories mostly trying to identify "story structure" elements;
- stage direction: identifies many places where a character arrives or leaves a scene.
- new scene: identifies some change of scene: "Now, we're at my father's house".
- causation: causation words: how tolerant is the dream narrative to unexplained elements? (Consider results of agnosis and accident categories.)
- voice: one simply hears a voice, sometimes of a narrator who explains the dream's events.
The following found in the new Story plugin.
- parenthetical: remarks and asides that the dreamer makes to explain the dream to an imagined audience.
- IRL: "in real life" a direct reference or comparison to events outside of the dream.
- duck-rabbit: ambiguous identification: "a path or hallway".
- The Story plugin is an attempt to compare dreams to multi-authored improvised dramas. See Garrett:
Offer, Accept, Block, Yield: the poetics of open scene additive improvisation.
The Story plugin might begin to help answer questions like the following:
If dreams are created on the fly, rather than received as whole visions, then why does it seem like their production never stalls?
Are dreams produced in multiple drafts, with the occasional fixing in consciousness of the most "likely" narrative, or are they
all put together at the moment of waking?
Dreams don't seem to have a lot of reincorporation (Johnston's term for tying up loose ends - where seemingly meaningless details from early
in the story are found to be significant in light of events that followed). Are there reincorporations at a symbolic level?
Dreams do seem to be very accepting of "offers" - we're thrown into a dream and a lot of background information about what has been going on
is assumed without question: "for some reason" is a typical phrase found in dreams.
Do some dreamers require more specific explanations for the state of affairs in the dream?
Do their dreams reflect this in expressions of definite causation?
Why don't we guess that we are dreaming more often? Are dreams "involving" in the way that a good work of fiction is?
We can ignore a weak plot if the acting is very compelling. Over-use of special effects and plot twists make us aware
that we are being manipulated and decrease our involvement in a film. Do dreams that are experienced as "like a movie" reflect
the dreamer's tacit awareness of the dream's artificiality?
- 07-15-2009: fixes/changes
- Plugin profile reports can now use the settings/img/iconList.txt categories directly. In a plugin file, if you supply a valid icon file name (one that occurs in iconList.txt) and don't follow it with a regular expression, then the expression from the iconList.txt file will be used.
This eliminates the problem of editing category expressions in two places.
If you supply an expression in a plugin file it is used instead of the one in iconList.txt.
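The lookup described above can be sketched roughly as follows. This is a Python illustration, not BOR's own code. It assumes that plugin files and iconList.txt both use the "icon name,expression" line format shown elsewhere in these notes and that lines starting with # are commented out. The function names parse_lines and resolve_plugin are hypothetical.

```python
def parse_lines(lines):
    """Parse 'name,expression' lines into a dict; the expression may be empty."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out categories
        # Split at the first comma only, so commas inside the regex survive.
        name, _, expr = line.partition(",")
        entries[name.strip()] = expr.strip()
    return entries


def resolve_plugin(plugin_lines, icon_list_lines):
    """Prefer the plugin's own expression; otherwise fall back to iconList."""
    icon_exprs = parse_lines(icon_list_lines)
    resolved = {}
    for name, expr in parse_lines(plugin_lines).items():
        if expr:
            resolved[name] = expr
        elif name in icon_exprs:
            resolved[name] = icon_exprs[name]
    return resolved


# Hypothetical file contents, for illustration only.
icon_list = ["dog.gif,(dogs|pupp(y|ies)|canine)", "#cat.gif,(cats?|kittens?)"]
plugin = ["dog.gif", "bird.gif,(birds?|sparrows?)"]
resolved = resolve_plugin(plugin, icon_list)
```

Here dog.gif picks up its expression from the icon list, while bird.gif keeps the expression supplied in the plugin file.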
- Revised several Pictogram icons and their expressions. Deprecated a few as well. Added word borders to a few expressions that needed them.
- The relative size icon has been replaced by icons for larger and smaller.
- Vase/urn and scissors have been deprecated.
- Fixed a typo in religion: "budda" is now "buddha".
- Added mirror category.
- Leaf now includes vegetation.
- ... and many other small tweaks.
- The Picker is now available from a button on the Search tab. I revised the default list (in settings/picker.txt) to include all of the current Pictogram categories.
This gives you a compact visual search palette.
- Moved some menus on the Search panel.
- The N^2 report incorrectly showed a "matches" link when nothing was matched. Fixed.
- 06-19-2009: fixes/changes
- Pictogram now runs faster the second time you use it from cluster context.
- The "Show only significant matches" option now shows matching categories only for Pictogram reports. Previously it showed only statistically significant matches.
- Redesigned a few icons.
- 06-15-2009: fixes/changes
- Added "best icons" to most of the reports which return clusters of documents. These only show after you've run a Pictogram or an automatic cluster Pictogram.
You can think of the Pictogram report as adding content tags to each document. Compare to "best terms".
The "best icons" are sorted by relevance.
- All of the automatic cluster reports now have "Cloud" links.
- Best terms links would show irrelevant terms when the highest ranked terms were all ranked the same (and low). Fixed.
- 06-02-2009: fixes/changes
- The Pictogram version of the automatic cluster report now shows the icons of the most relevant matches.
- 05-17-2009: fixes/changes
- 05-09-2009: fixes/changes
- Best terms in the autocluster reports have been improved. In most cases there should be more terms and they should be more relevant to the content of a cluster.
- Added Best terms links to the bottom of the basic Regex searches.
- The automatic cluster reports enforce the maximum cluster size (rather than just not show > max size clusters).
- 05-06-2009: fixes/changes
- Changed the First occurrence report. When run from an individual document with no search expression, the report shows the position of words graphically.
fly
(For possible uses see Carpena et al http://bioinfo2.ugr.es/publi/PRE09.pdf)
Only shows words which occur more than once.
- New Shuffle report available from the Search menu of individual documents. Creates a new document with all of the original document's words in random order. Useful for testing methods which rely on the order of words in a document.
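A minimal sketch of the shuffle idea, in Python rather than BOR's own code. It assumes words are whitespace-separated tokens; the function name shuffle_words and the seed parameter are hypothetical and only there to make the example repeatable.

```python
import random
import re


def shuffle_words(text, seed=None):
    """Return the words of text in random order, joined by single spaces."""
    words = re.findall(r"\S+", text)  # whitespace-separated tokens
    rng = random.Random(seed)         # a seed makes the shuffle repeatable
    rng.shuffle(words)
    return " ".join(words)


original = "I was in a large factory with many machines"
shuffled = shuffle_words(original, seed=1)
```

The shuffled text keeps exactly the same words, so any report that depends only on word counts should give the same result for both versions, while order-sensitive reports should not.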
- 04-18-2009: fixes/changes
- Changes to Classifier reports.
- "Classifier" renamed "FV classifier". FV stands for feature vector.
- New NB classifier (Naive Bayes).
- Run a plugin (or RE as plugin) from cluster context to train a classifier. For NB your target directory should be about the same size as non-target.
- Use the links at the bottom of the plugin result to run the classifier.
- NB works best with load words.
- NB responds to "% of word" checkbox by using % of words as probability. Check % of total to use % of document with a feature instead.
- Docs by size report is now sorted by size!
- Key terms report now has a Search link at the bottom of the report, listing all matches.
- Run plugin item moved to the top of the Plugin menu.
- 03-29-2009: fixes/changes
- New First occurrence report on the Search tab's Net menu and the Search menu of results windows. Lists first occurrence of matches in each document. Also shows average first and last occurrence.
The position of each match is shown as a % of characters in the document.
Might be useful for exploring narrative features and style. Consider the first occurrence of "I" in the following four dream openings:
- "I dreamed I was in a large factory..."
- "I was in a large factory..."
- "I'm in a large factory..."
- "In a large factory..."
A novice dream journalist might record the unnecessary "I dreamed". Later, the same dreamer might switch from past to present tense - a more direct style. Finally, the now experienced dreamer can omit "I", since most dreams are recounted as first person narratives. "I" decreases and arrives later in the dream text. On the other hand, leaving out self reference could indicate a more hurried style - the dreamer skipping narrative niceties, trying to list as many details as possible before they fade from memory.
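The position measure can be sketched as below, using the four openings listed above. This is a Python illustration under my own assumptions, not BOR's implementation: the function name first_occurrence_pct is hypothetical, and the word-border pattern \bI\b stands in for a search on "I" with word borders switched on.

```python
import re


def first_occurrence_pct(text, pattern):
    """Position of the first match as a % of characters, or None if no match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return 100.0 * match.start() / len(text)


openings = [
    "I dreamed I was in a large factory...",
    "I was in a large factory...",
    "I'm in a large factory...",
    "In a large factory...",
]
positions = [first_occurrence_pct(text, r"\bI\b") for text in openings]
```

The first three openings match at the very start of the text. The fourth has no "I" in its opening at all, so in a full dream text the first occurrence would land somewhere later.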
- The Split command was naming documents incorrectly. Fixed.
- 03-22-2009: fixes/changes
- New Classifier menu on the Search tab. Lists recently used classifiers as a shortcut for running them.
- Classifier report (renamed from Cluster by results) has been redesigned.
- Now shows confusion matrix near the top.
- New table showing the contribution of each feature.
- The "Classify without" link now removes features that make a negative or low contribution to classification.
- Matching documents now listed at the bottom of the report.
- 03-17-2009: fixes/changes
- Added a confusion matrix to the Classifier report.
- Checking "Show only significant results" no longer filters Classifier results. All results above 0 are always shown.
- Added "Matched docs" item to the File menu of clustered results. Not all reports show the docs that matched. "Matched docs" lets you see/save them as a cluster.
- Changed the name "Import classifier" to "Run classifier".
- Added "Run plugin" item to the Search tab's Plugin menu.
- Global-context plugin reports now show only matched documents.
- 03-10-2009: fixes/changes
- Added the Bigrams and Trigrams to all search contexts.
- Increased minimum number of matches to 4 when run from Search tab or clusters.
- The Significant only checkbox limits results to the top 100.
- AutoAdd would sometimes prompt you to add the wrong set - fixed.
- URL links weren't working under Unix OS. These now work if you have Firefox installed.
- I added a few new measures to the Readability report and changed the output format when doing comparisons: cluster to "other docs" instead of "all docs".
- 03-04-2009: fixes/changes
- Added the Bigrams report to clustered results. Bigrams lists all two word phrases which occur more than two times in the cluster.
- New Trigrams report for clustered results. Lists all three word phrases occurring more than twice.
The report includes a link: "Matches as plugin", which runs an RE as plugin report on the cluster's bigrams.
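A rough sketch of counting two and three word phrases in a cluster, written in Python. The tokenizer (lowercase letters and apostrophes) and the choice not to let phrases span document boundaries are my assumptions; the function name ngrams is hypothetical.

```python
import re
from collections import Counter


def ngrams(texts, n, min_count=3):
    """Count n-word phrases across texts; keep those seen at least min_count times."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        # Slide a window of n words over each document separately.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


cluster = [
    "I was in a large factory.",
    "In a large factory I saw a dog.",
    "We walked to a large factory.",
]
bigrams = ngrams(cluster, 2)   # phrases occurring more than twice
trigrams = ngrams(cluster, 3)
```

With min_count=3 only phrases occurring more than twice are kept, which matches the threshold described for both reports.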
- 03-03-2009: fixes/changes
- The Dirs menu now acts as a recently used directory list.
- If you Add a directory it shows in the list.
- If you select a directory it lists the documents in the directory.
- If the directory's documents are no longer added then the menu item is shown in gray.
- If you select a gray item you are prompted to add the directory.
- Select the "Clear unadded" to remove unadded items from the menu list.
- The menu items are saved in settings/recentdirs.txt.
- I also moved the "Doc by directory" report to the Dirs menu.
- 03-01-2009: fixes/changes
- New Dirs (directories) menu on the Search tab. Shows the directories of all added documents.
When you select one the directory's documents are shown in a cluster.
- 02-27-2009: fixes/changes
- Changed the Crud factor report to use a file (settings/crudlist.txt) as a source for the words used in the report. This allows you to customize the crud file to match the vocabulary of the regular expressions you are testing. It turns out that sampling from multiple added document sets gives an uneven crud result when one author has a larger vocabulary. The default crudlist.txt file consists of rather common load words. (More on this here.)
- Added average score to RE as plugin report. Since plugin category results report an aggregate score it is important to run an RE as plugin on each significant result to see if a single match is carrying the whole weight of the score. Averaging the scores can help reveal weak terms in category expressions.
- 02-22-2009: fixes/changes
- The scores shown on plugin reports are now links. Click and you get plugin results for the matches in the selected category.
- The plugin report is now a bit faster, when run from a large document cluster.
- Changed the Pictogram report and two plugins; Emotion and cognitive, and Body and Clothes.
- I added a negative look-behind expression to several of the emotion categories so that negated expressions like "not happy" won't match in the wrong category.
The categories for ear, eye and eye, now match only those body parts. The senses of sight, hearing, taste and smell are now separate categories.
The "brain" icon has been renamed "thinking".
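As an illustration of the negative look-behind idea, here is a small Python sketch. The pattern is a hypothetical stand-in for one term of an emotion category, not the expression actually used in the plugin files.

```python
import re

# "happy" as a whole word, unless it is directly preceded by "not ".
# Illustrative only; the real category expressions contain many more terms.
HAPPY = re.compile(r"(?<!not )\bhappy\b", re.IGNORECASE)


def count_happy(text):
    """Count non-negated occurrences of 'happy' in text."""
    return len(HAPPY.findall(text))


count = count_happy("I was happy, but she was not happy.")
```

Only the first "happy" is counted. A look-behind of this kind catches the direct "not happy" form but would miss looser negations such as "not very happy".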
- 02-16-2009: fixes/changes
- The Doc counts report (Search tab/Profile menu) behaves differently when you run it with an expression. When you enter an expression it reports the doc count for any words (but not phrases) matched. If you leave the search field blank, it works as before: showing all words' doc counts.
- 02-14-2009: fixes/changes
- Both the Plugin and RE as plugin reports now show the actual G2 or h scores in a new column. I did this because the blue and red significance bars were log-scaled to keep the width of reports reasonably small and so their size doesn't fully show the strength of a match.
Showing the scores can also help you decide how significant a match is, when compared to the background correlation noise present in any large sampling. (See: Meehl.)
- New "Run plugin..." item from the Search menu of any clustered result. Use it to run a plugin format search file.
- I've created a "Crud factor" report (Search tab/Plugin menu) that can give you an idea about how much background correlation exists in a set of documents.
It generates lists of random terms of varying lengths in a plugin file format. If you save and run the file as a plugin from a cluster of documents you will see how often even randomly selected terms result in significant matches.
Use the |max score| as a reasonable minimum significance level - when compared to the results of a "real" plugin report.
Run the Crud report as a plugin from a cluster of interest using the same settings (word borders, case sensitivity, G2 or h, etc.).
May not apply well if "% total" is checked unless the Crud categories are mutually exclusive.
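The generation step of the Crud factor report might look something like this Python sketch. It assumes the "name,expression" plugin line format shown elsewhere in these notes. The category names (crud1, crud2, ...), the list sizes and the sample word list are all hypothetical.

```python
import random


def crud_plugin(words, sizes=(1, 2, 4, 8), seed=None):
    """Build plugin-format lines, each an alternation of randomly chosen words."""
    rng = random.Random(seed)
    lines = []
    for i, size in enumerate(sizes, start=1):
        terms = rng.sample(words, size)  # random terms, no repeats within a line
        lines.append(f"crud{i},({'|'.join(terms)})")
    return lines


# Hypothetical stand-in for a list of common words.
crud_words = ["house", "car", "water", "door", "street",
              "room", "friend", "night", "school", "tree"]
plugin_lines = crud_plugin(crud_words, seed=0)
```

Each line is a category made of randomly picked terms, so any significant score it earns when run against a cluster reflects background correlation rather than meaningful content.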
- "Import..." has been renamed "Import classifier...". (See the build note for 01-10-2009 to see what Import does.)
- 02-05-2009: fixes/changes
- The Concordance report, when run from a cluster, only shows results for documents in the cluster. (Previously it showed global results.)
- The Plugin reports and "RE as plugin" now list the number of documents matched. They also show a "Complement" link to display the non-matching documents in a cluster.
When you do a search from either of these reports, the results are now limited to the documents that were matched.
- The Pictogram report now shows a Complement link when run from a cluster.
- Most results that show links to matching documents now show the full path of each. I did this because I am working with multiple document sets with automatically generated names like "103.txt".
It becomes difficult to tell one from another without the path.
- The Network report now shows the sum of degree and degree/match. (I'm looking at this with named character results as a possible approximate measure of sociability.)
- 02-01-2009: fixes/changes
- The Split command would sometimes not create the first document in a series. After this the remaining documents were misnamed. Fixed.
- 01-28-2009: fixes/changes
- New % words|% total checkbox option for plugin reports on the Search panel.
If you check % words, results are calculated as a % of total words.
If % total, results are calculated as % of total matches.
(Use the second option if the categories in your plugin file are mutually exclusive.)
Only applies to plugin reports and the RE as plugin report.
- Got rid of the "short" checkbox, which used to change the format of the plugin reports, because I never use it.
- 01-23-2009: fixes/changes
- New "Doc counts" report under the Profile menu. Shows all words in a document set, ranked by the number of documents they occur in. An illustration of Zipf's law.
- 01-16-2009: fixes/changes
- Changed "Load words" report to only show words that occur in less than 1/3 of documents.
- Added "Structure words" report under Search tab/Profile menu. Shows words that occur in greater than 1/3 of documents.
- 01-10-2009: fixes/changes
- Added "Significant only" link to the bottom of "Cluster by results" reports. This removes less relevant terms from the "next" search.
- Added Import and Export for Cluster by result.
- To export: use the Export... link at the bottom of reports that show a Cluster by results link. Save as a file.
- To import: select the Import... item from the Plugins menu. Browse to and select the saved file. When you open it a Cluster by results report is run.
- 12-30-2008: fixes/changes
- The RE as plugin report only shows matches with p < 0.01 when "Show only significant results" is checked.
- It also displays a "Cluster by results" link at the bottom of the report.
Note: If you've added multiple document sets and then run a "Load words" report (Search tab/Profile menu) the results make a pretty good (though slow) classifier when filtered through the RE as plugin/Cluster by results link.
- Add two samples of 50 or 100 documents from two larger sets. (training sets)
- Run a Load words report.
- Copy the "Search" term expression from the report.
- Run a "Docs by directory" report (Search tab/Docs menu).
- Display either of the document samples in a cluster (click one of the links in the report).
- Paste in the Load terms expression in the cluster's search field.
- Check "Show only significant results" on the Search tab.
- Run a RE as plugin report from the cluster.
- Clear your document sets.
- Add All documents from both sets.
- Click the "Cluster by results" link at the bottom of the RE as plugin report.
- The results should list significantly more documents from the target set.
Next step: a way to save and reuse the "classifier"? Right now you can save the html document containing the link and reuse it. It would be nice if you could export the information in the link to a text file which could then be used like a plugin.
Use a winnowing algorithm to filter out irrelevant terms/categories. For example, in an attempt to distinguish male from female dreamers I used 10 dreams each from 5 male and 5 female dreamers (50 dreams for each sex) and found that the load words contained proper names, which would never match in new test sets.
- 12-28-2008: fixes/changes
- New In situ report under the Net menu. Shows matches in sentence context.
- Added illustrate links to the document Field guide report.
- 12-23-2008: fixes/changes
- Cluster by results would fail when run from a plugin result that didn't use an icon file name - fixed.
- View Source now shows the real html code of each report. Previously it showed the Java.editorPane's version of the html, which is slightly different.
- You can now do a random sampling of 500 documents.
- Deprecated several pictogram categories (increase, decrease, consume, create and all of the individual colors). The icons are still present, but their lines in the iconList.txt file have been commented out.
- 12-20-2008: fixes/changes
- New Cluster by results report.
Run from a link at the bottom of Plugin reports and Pictogram reports with significant matches.
Clusters documents according to how well they match a pattern of category matches.
Example
The Barb Sanders dream sets (1 and 2) have a different pattern of verbal style. (See results near the end of this document.)
Add both dream sets.
Run the Verbal plugin report on the second set.
Run a Cluster by results from the Verbal results.
This returns, mostly, documents from the second set.
Explore these documents for other content categories.
How it works
Both Plugin and Pictogram reports score multiple categories at once. The scores are put in a vector - one result per category.
example from a cluster of flying dreams
[flying, 23.8]
[collective, 9.7]
[conflict, -11.0]
[male characters, -8.2]
[streets, 12.7]
[above, 17.1]
[neg. emotion, -6.8]
Next, every document is scored for the same categories.
The document vectors are compared to the target vector by cosine similarity.
The scored documents are sorted and displayed.
The top documents (in this example) should have more flying, collective characters, streets, and less of conflict, male characters, etc.
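The comparison step can be sketched in Python as follows. The target vector is the flying dreams example above. The two document vectors and their file names are made up for illustration, and the code is a sketch of cosine similarity in general, not BOR's own implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Order: flying, collective, conflict, male characters, streets, above, neg. emotion
target = [23.8, 9.7, -11.0, -8.2, 12.7, 17.1, -6.8]

# Hypothetical per-document scores for the same categories.
docs = {
    "doc_a.txt": [20.0, 5.0, -9.0, -4.0, 10.0, 15.0, -3.0],
    "doc_b.txt": [-5.0, 0.0, 12.0, 9.0, -2.0, -6.0, 8.0],
}
ranked = sorted(docs, key=lambda name: cosine(docs[name], target), reverse=True)
```

Cosine similarity compares the direction of the vectors rather than their length, so a short document with the same pattern of high and low categories can still rank near the top.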
- New Complement report (link at the bottom of most reports that return a list of documents).
Click on the link to get the documents that didn't match.
Allows you to more easily gather and characterize documents that are missing a feature.
- New Picker (Tools menu). A little window with 21 common content category icons.
Click on an icon to run a search for the category.
Useful for storing commonly used queries.
Uses the regular expression associated with an icon in the /settings/img/iconList.txt file.
Customize by editing the list of icon names in /settings/picker.txt.
- New plugin: Three Rs (reading, writing and math).
See: Ernest Hartmann, We Do Not Dream of the 3 Rs Implications for the Nature of Dreaming and Mentation (http://www.tufts.edu/~ehartm01/We Do Not Dream of the 3 Rs Implications for the Nature of Dreaming and Mentation 2000 Dreaming 10 103to111.doc)
I often see discussions which start out with "Some scientist says that we never dream of reading, writing or math...". They never give a source for this assertion. I have heard that Carl Sagan speculated about this in one of his books. Hartmann is one of the only researchers to actually look at this question in some detail. He doesn't claim that the three Rs are never reported, only rarely. This plugin should help you find most instances of the three Rs.
- 11-24-2008: fixes/changes
- Added a Swap button to the Subtract panel. This simply swaps the text of the two text fields.
- 11-21-2008: fixes/changes
- Some search link reports indicated positive significance (using h-score) when they should have indicated a significantly negative result. Fixed.
- Links on the Docs by directory report returned the wrong docs with some directory paths. Fixed.
- 11-18-2008: fixes/changes
- Removed Doc words report.
- Added Repetition measure to the Readability report. It looks at the first 50 words of each document (of at least 50 words) and reports the average ratio of words to unique words.
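A small Python sketch of the Repetition measure as described. The tokenizer and the function name repetition are my assumptions; the 50 word window and the rule of skipping shorter documents come from the note above.

```python
import re


def repetition(texts, window=50):
    """Average ratio of words to unique words in the first `window` words."""
    ratios = []
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        if len(words) < window:
            continue  # documents shorter than the window are ignored
        first = words[:window]
        ratios.append(len(first) / len(set(first)))
    return sum(ratios) / len(ratios) if ratios else None


docs = ["the cat sat " * 20, "too short"]
score = repetition(docs)
```

A score of 1.0 would mean no word repeats in the first 50 words; higher values mean more repetition.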
- 11-16-2008: fixes/changes
- Removed some verbal categories from the pictogram icon list. These have been moved to a new plugin file. The icons have been modified - most are in a speech bubble.
- Rearranged, updated and added new pictogram icons and categories. Commented out a few as well.
- The standard plugin files are all icon format.
- Changed the test for significance in the Pictogram. It won't mark a result as significant unless either the expected or the observed count is > 20.
- The flag (checkbox) for case insensitivity was flipped in a few reports - fixed.
- New Doc words report under the Docs menu. Shows ratio of words to unique words in each document. This is a placeholder for a report that will measure the amount of repetition in a document set.
- 11-07-2008: fixes/changes
- Plugin report didn't accept icon names with spaces - fixed.
- Exception when you used the return key to Add a set of documents at startup on Ubuntu (Linux) - fixed.
- 11-03-2008: fixes/changes
- Added Pictogram to the Search menu of individual documents.
- Related report was accidentally removed a couple of builds back - restored it.
- When using h score the Pictogram report would not show significantly absent results - fixed.
- Progress indicator windows would hang when a report was run from a zero size cluster on Ubuntu (Linux) - fixed.
- Search menus now deactivated for Matches as text and Thirds reports. These reports do not return clusters of documents, so shouldn't be searchable.
- 11-01-2008: fixes/changes
- Added Gender Genie section to bottom of the Profile report.
Uses the method of Gender Genie (http://www.bookblog.net/gender/genie.html) to guess the gender of an author.
It uses a weighted count of two lists of non-content words that can distinguish male and female authors with about 85% accuracy for 25+ dreams.
- 10-29-2008: fixes/changes
- New Illustrate report, available from document windows under the Search menu.
This report redisplays the document text but inserts icons, where they matched Pictogram categories.
Allows you to see how well the categories are matching.
- New Matches as text report. Available on the Search Panel's Sampling menu.
Searches for the entered expression and displays the matches (only) in a plain text document window.
Useful if you want to compare the matches to the document set as a whole. (First save the result and add it to the document set.)
One idea is to search for a particular feature in the context of plus or minus 10 words, save and add the text results and compare it to the document set as a whole.
If you wanted to answer a question like "Do introverts carry their waking aversion to social interactions into their dreams?",
you could target text neighborhoods where (I|me) occurs with (he|him|she|they|man|woman|etc.) and then look for fear and avoidance in the match doc. Then compare with results for extroverts.
- 10-25-2008: fixes/changes
- The Pictogram report now responds to the "Show only significant results" checkbox. When checked only icons for significant matches are shown. This works only when run from a result window.
- New Thirds report, under the Sampling menu. This concatenates all current documents into a single file while adding document markers so that each document text is divided into three equal sized pieces.
Save the "thirds" document in a new directory and then use the Split command to produce individual documents. (You'll end up with three times as many documents as you started with. Remember to discard the large thirds document before adding the new set.)
Each document is marked with a special term; _FIRST, _MIDDLE or _LAST.
Do a search for one of these followed by a Pictogram report to see if any categories occur in a preferred position. In my own dreams dialog seems to be more frequent at the end of dreams.
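The dividing step can be sketched like this in Python. I'm assuming the split is made on word boundaries (it could just as well be by characters) and that each piece simply starts with its marker term; the document markers used by the Split command are not shown. The function name thirds is hypothetical.

```python
def thirds(text):
    """Split text into three roughly equal pieces, each tagged with a marker."""
    words = text.split()
    n = len(words)
    cuts = [0, n // 3, 2 * n // 3, n]  # word indexes where each piece starts/ends
    markers = ["_FIRST", "_MIDDLE", "_LAST"]
    return [
        markers[i] + " " + " ".join(words[cuts[i]:cuts[i + 1]])
        for i in range(3)
    ]


parts = thirds("one two three four five six seven eight nine")
```

Because the marker is an ordinary term inside each piece, a plain search for _LAST gathers all of the final thirds into one cluster, ready for a Pictogram report.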
- 10-23-2008: fixes/changes
- Pictogram report: you can comment out lines in the pictogram file (settings/iconList.txt) for categories that you don't want. Just put a # in front of the line to exclude it.
- The Era section of the Profile report now just shows the latest era matched. Also added post-2000 era.
- Changes to a few reports that weren't acting well with Date/Size limits.
- Minor tweaks to pictogram icons and expressions.
- 10-12-2008: fixes/changes
- The Key terms version of the automatic cluster report (on the Cluster panel) has been slightly improved. The settings that determine which words are included in the model and the term weighting have been tweaked. The number of "Best terms" shown has been increased: helpful for smaller clusters.
- Some of the pictogram icons and regular expressions have been redesigned. New icon for "radio".
- 10-06-2008: fixes/changes
- The Pictogram report now uses word counts instead of document counts to determine significance of matching. I believe these are more appropriate for the log-likelihood and h-scores used.
- The calculation of the total number of words in the document set was wrong if you added multiple sets of documents. Fixed.
- 10-04-2008: fixes/changes
- New Subtract tab and panel. This is a place for me to try some ideas on using vector negation to disambiguate search expressions (see Widdows).
If you enter expressions in the two search fields and then click "Find" dual searches are run. These are scored for relevance to the two expressions. The final score is the target minus the "without" score.
If you wanted to search for "trunk" without "tree" you'd get results for suitcases or cars only.
Click "Copy" to copy the target expression to the Search panel.
You can use the Neighbors or Expand button to expand the expressions in both search fields.
- New Load words report on the Search panel, under the Profile menu. This simply lists words that occur in less than half but more than 5% of the documents.
For most document sets, this produces a relatively small list of words that covers the basic content present.
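A minimal Python sketch of the document-frequency filter behind the Load words report. The tokenizer and the function name load_words are assumptions; the 5% and 50% limits are the ones given above and are left as parameters.

```python
import re


def load_words(texts, low=0.05, high=0.5):
    """Words found in more than `low` but less than `high` of the documents."""
    doc_counts = {}
    for text in texts:
        # set() so that each word is counted once per document
        for word in set(re.findall(r"[a-z']+", text.lower())):
            doc_counts[word] = doc_counts.get(word, 0) + 1
    n = len(texts)
    return sorted(w for w, c in doc_counts.items() if low < c / n < high)


docs = ["the dog ran", "the cat sat", "the dog sat",
        "the bird flew", "the fish swam"]
words = load_words(docs)
```

Very common words like "the" fall out at the upper limit, and in a larger set words that appear in only a handful of documents fall out at the lower one.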
- 09-23-2008: fixes/changes
- The document version of the Field Guide report now shows the document pictograms in the order found.
Showing the pictograms sequentially can often give you some idea about a dream's content - at a glance.
- Check "Show only most significant results" to see the pictograms with no repeated categories.
- Changed the Names report so that it matches a run of capitalized words as a single name.
Previously "Little Red Riding Hood" would have resulted in 4 matches.
- The regular expression used to match names will occasionally match quoted capitalized words such as Yes, Oh, etc. If you add these to the /settings/stop_words.txt file (in lower case) they will not be matched.
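To illustrate matching a run of capitalized words as one name, here is a Python sketch. The pattern is a simplified guess, not the expression the Names report actually uses. The stop word handling follows the note above (lower case entries); the function name find_names is hypothetical.

```python
import re

# One capitalized word, optionally followed by more capitalized words.
NAME = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b")


def find_names(text, stop_words=()):
    """Return runs of capitalized words, skipping any listed stop words."""
    stops = {w.lower() for w in stop_words}
    return [m for m in NAME.findall(text) if m.lower() not in stops]


text = 'she read Little Red Riding Hood and said "Yes" to Tom'
names = find_names(text, stop_words=["yes"])
```

"Little Red Riding Hood" comes back as a single match, and "Yes" is dropped because it is in the stop list. A pattern this simple will also pick up ordinary capitalized words at the start of sentences.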
- 09-14-2008: fixes/changes
- Fixed 3 or 4 reports that weren't following the font size settings.
- Made the Dictionary panel responsive to the font size settings.
- 09-12-2008: fixes/changes
- Plugin reports now support Pictogram icons.
If a line in a plugin file starts with an image icon name and the icon is located in /settings/img then the icon is
displayed in the report.
format example
dog.gif,(dogs|pupp(y|ies)|canine)
The Pictogram now contains close to 250 categories. If you split the iconList.txt file into separate categories (settings, characters, emotions, etc.), you can run each one more quickly than the whole Pictogram report.
In addition, the Plugin reports use a better statistical test of a category's significance - based on word counts instead of document counts.
I've included some sample plugins, derived from iconList.txt. Drop them in the plugins folder and restart BOR to use them.
- Plugin reports now use the word border checkbox's setting.
- The Concordance report's font size is no longer fixed so it can change size with the checkbox setting.
- 09-05-2008: fixes/changes
- The font size option is now a checkbox on the Search panel.
- Removed "Font: big/normal" from document and report windows.
- 09-04-2008: fixes/changes
- New font size option for smaller screen devices. Select "Font: big" from the File menu of any document or result window to increase the size of some text items.
The text of document and report windows is increased to 18pt. The Search field text on the search tab and on doc and result windows is also increased.
- Select "Font: normal" to change it back to the default size.
- 08-16-2008: fixes/changes
- New "Curve type" measurement added to the Query Analysis report. The report scores each document for how well it matches the entered regular expression. The scores are divided into six ranges. The plot shows the number of documents that match in each range.
My hope is that curve type might be useful in characterizing a query or in deciding when to filter irrelevant matches from the results.
Three "curve types" are defined as follows:
- Peak yield is in the 1st range. Characteristic of a query consisting of common terms (Family category).
- Peak yield is in the 2nd or 3rd range. Characteristic of a query of a few related terms (Metamorphosis category).
- Peak yield is beyond the 3rd range. Characteristic of a query of related and/or specific terms ("red").
Any other patterns are marked as "unknown type".
Curve Type
| ≤score | # of docs |
| 5.2 | 6 |
| 8.5 | 28 |
| 11.8 | 13 |
| 15.1 | 10 |
| 18.4 | 1 |
| 21.7 | 1 |
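The curve type rules can be sketched in Python as below, using the document counts listed above for the six score ranges. How "other patterns" are recognized is not spelled out here, so treating a tie for the peak (or no matches at all) as unknown is purely my assumption. The function name curve_type is hypothetical.

```python
def curve_type(counts):
    """Classify a list of per-range document counts by where the peak falls."""
    peak = max(counts)
    # Assumption: an empty result or a tie for the peak has no clear curve type.
    if peak == 0 or counts.count(peak) > 1:
        return "unknown type"
    position = counts.index(peak) + 1  # 1-based range number
    if position == 1:
        return "type 1"
    if position <= 3:
        return "type 2"
    return "type 3"


example = [6, 28, 13, 10, 1, 1]  # # of docs in each of the six score ranges
result = curve_type(example)
```

For the example counts the peak (28 documents) falls in the 2nd range.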
- Added pictogram for sleep paralysis. Updated a few other icons and category expressions.
- Pictogram now shows number of icons found.
- 08-06-2008: fixes/changes
- I left out the "img" files in the previous build, which broke all Pictogram reports. Fixed.
- Updated the Pictogram layout. Now wider.
- Added about 20 new icons and categories. Including 9-11, American politics, World issues, Economic, Xenophobia, Counterfactuals, Pregnancy, Meat, Wood, Plastic, Mechanical, Horse, Seeds, Dreaming, Non-participation and Roles.
- Added Poem report to Sampling menu. Generates a short Dadaesque word poem of about 50 words. (Really just the random sampling report in a different format.)
- 07-13-2008: fixes/changes
- Added a 2nd pass to the N^2 report which removes very high-frequency words from consideration as neighbors. This speeds up the search slightly.
- 07-06-2008: fixes/changes
- New N^2 report (Neighbors squared): available under the Net menu. Uses ideas from Widdows and Dorow's paper,
A Graph Model for Unsupervised Lexical Acquisition (http://infomap.stanford.edu/papers/lexical-graphs.pdf),
to rank the neighbors of a search target (usually a single word) according to the intersection of their neighbors and the original target's neighbors.
To use, enter a single word, then select N^2 from the Search tab's Net menu.
The results show words which have a high affinity to the neighbors of the target.
affinity = 100 * |N(n) ∩ N(target)| / |N(n)|
where N(x) are the neighbors of x,
X ∩ Y is the intersection of X and Y,
|X| is the size of X, and
n is a neighbor of the target.
In practice, words with a score of > 90 are closely related to the target word.
The "matches" link at the bottom of the report is an alternation of only these terms.
The algorithm is slow. (A faster version is likely in a future build.)
Check "Significant results only" to reduce the number of neighbors considered in the algorithm.
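As a rough illustration of the affinity formula above, here is a minimal Python sketch. The `neighbors` dictionary and its words are made up for the example, and the way BOR actually collects neighbor sets (window size, stop words, the 2nd pass) is not modeled.

```python
def affinity_scores(neighbors, target):
    """Score each neighbor n of target: 100 * |N(n) & N(target)| / |N(n)|."""
    target_neighbors = neighbors.get(target, set())
    scores = {}
    for n in target_neighbors:
        n_neighbors = neighbors.get(n, set())
        if not n_neighbors:
            continue  # no neighbors recorded, so the ratio is undefined
        shared = n_neighbors & target_neighbors
        scores[n] = 100.0 * len(shared) / len(n_neighbors)
    return scores


# Hypothetical neighbor sets (assumption: neighborhood is symmetric).
neighbors = {
    "sea": {"boat", "wave", "shore"},
    "boat": {"sea", "wave", "shore"},
    "wave": {"sea", "boat", "shore", "hand"},
    "shore": {"sea", "boat", "wave", "road"},
}
scores = affinity_scores(neighbors, "sea")
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word}: {score:.1f}")
```

A neighbor scores high when most of its own neighbors are also neighbors of the target, which is why words that appear next to almost everything tend to score low.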
- 06-26-2008: fixes/changes
- The Card report now shows runes.
- Consolidated the image files into the img folder to get rid of the extra folders of the previous release.
This means you need to update your "/settings/img" directory.
- Added runes icons.
- Changed the regular expression that is used to generate captions for the cards.
- Example card and interpretation to show that even randomly assembled elements can be seen as meaningful.
 |
Top images;
Toothbrush with three icons; day, beach, night,
suggests cyclic, recurring features - especially if beach is tide.
Then the toothbrush may suggest that this is about
"the routine": you brush your teeth every AM and PM.
Aquarius icon = water: supports beach and washing/brushing.
Bottom image is a fence. Read in the light of the top part
of the card (routine) this suggests borders, boundaries and order.
Time icon supports periodic events.
Generic animal icon suggests;
impulsive, active, untamed, domesticated, instinctive.
Animal and fence together suggests: domestic animal pen.
Boat icon supports the card's title "the sailor".
Hexagram is "parting" or "resoluteness" and
sailors often "depart".
So a possible message is "break out of your routine"
or "don't be fenced in by a routine". |
- 06-17-2008: fixes/changes
- New Card report: for all of your divination needs. Select Card from the Sampling menu. It displays a randomly generated divination card with a caption selected from the current document set.
- New Query Analysis report: analyzes a user entered regular expression and tries to suggest a better one.
The analysis shows the yield of the query and the redundancy of matching, and removes low-matching and subsumed terms.
Found under the Sampling menu on the Search panel.
- New Subsumption report: tests all of the matches of a regular expression and reports on any which co-occur in more than 70% of documents.
Found under the Net menu on the Search panel.
Example: in the regular expression (her|she|girl|woman), "woman" and "girl" will seldom occur without "her".
Note: the subsumption test used in the Query Analysis report uses a higher threshold of 85%.
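The subsumption test described above can be sketched roughly as follows. This is one possible reading, not BOR's actual code: term A is reported as subsumed by term B when B appears in more than `threshold` of the documents that contain A. The function name and sample documents are hypothetical.

```python
import re
from itertools import permutations


def subsumed_pairs(docs, pattern, threshold=0.70):
    """Return (a, b, share) where b occurs in more than `threshold` of a's documents."""
    regex = re.compile(pattern, re.IGNORECASE)
    docs_with = {}  # matched term -> set of document indexes containing it
    for index, text in enumerate(docs):
        for match in regex.finditer(text):
            docs_with.setdefault(match.group(0).lower(), set()).add(index)
    pairs = []
    for a, b in permutations(sorted(docs_with), 2):
        share = len(docs_with[a] & docs_with[b]) / len(docs_with[a])
        if share > threshold:
            pairs.append((a, b, share))
    return pairs


# Hypothetical mini document set.
docs = [
    "The girl waved and I followed her.",
    "A woman gave her coat to me.",
    "She said the girl lost her way.",
    "I saw her at the station.",
]
pairs = subsumed_pairs(docs, r"\b(her|she|girl|woman)\b")
for a, b, share in pairs:
    print(f'"{a}" rarely occurs without "{b}" ({share:.0%} of its documents)')
```

Passing `threshold=0.85` would mimic the stricter setting mentioned for the Query Analysis report.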
- 05-19-2008: fixes/changes
The term version of the Field guide report now shows Transpositions and Common expressions related to the search word.
- Transpositions shows words that use the same letters as the selected word but in different orders.
- Common expressions uses a list of over 3000 common English idiomatic expressions (found in settings/idiom.txt).
- Reminder: to run this version of the Field guide, select a word in any result window, right-click and then select the report from the pop-up menu.
- The term Field guide is now displayed in the same window from which it was run (instead of a new window).
This cuts down the number of open windows and might be applied to other reports in the future.
- Because you might want to get back to your original report, undo is now supported for this report.
- The Concept Net report also now updates in the same result window.
- 05-09-2008: fixes/changes
- New "R-cluster" (related cluster) report on the Cluster panel. This one creates a similarity matrix between all documents.
The similarity measure is based on the last automatic cluster report that was run. (e.g. If you run a Pictogram cluster,
the similarity is based on the 225 Pictogram categories.)
Once the matrix is created, all similarities below the user-entered "r-threshold" are set to zero.
This cuts the matrix into pieces: clusters of similar documents.
Finally, the resulting clusters are displayed in the report.
The r-threshold field should be set to a value between 10 and 30. Higher values produce smaller clusters.
The "R-cluster" report button is not enabled until one of the other three automatic cluster reports is run.
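The R-cluster steps above (build a similarity matrix, zero out weak entries, read off the connected pieces) can be sketched like this. The similarity measure here is an assumption: Jaccard overlap of each document's feature set scaled to 0-100, since the notes do not say which measure BOR uses. The feature sets are made up.

```python
def r_clusters(features, r_threshold):
    """Group documents whose pairwise similarity is at least r_threshold."""
    def similarity(a, b):
        # Assumed measure: Jaccard overlap as a percentage.
        union = a | b
        return 100.0 * len(a & b) / len(union) if union else 0.0

    count = len(features)
    linked = {i: set() for i in range(count)}
    for i in range(count):
        for j in range(i + 1, count):
            # Similarities below the threshold are treated as zero (no link).
            if similarity(features[i], features[j]) >= r_threshold:
                linked[i].add(j)
                linked[j].add(i)

    clusters, seen = [], set()
    for start in range(count):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first walk over the remaining links
            node = stack.pop()
            component.append(node)
            for neighbor in linked[node] - seen:
                seen.add(neighbor)
                stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters


# Hypothetical category sets for five documents.
features = [
    {"water", "boat", "fear"},
    {"water", "boat", "night"},
    {"dog", "house"},
    {"dog", "house", "food"},
    {"money"},
]
print(r_clusters(features, 20))
print(r_clusters(features, 60))
```

Raising the threshold removes more links, so the pieces get smaller, which matches the note that higher r-threshold values produce smaller clusters.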
- 04-26-2008: fixes/changes
- The Pictogram report categories have been regrouped a bit. There are several new and redesigned icons. Icons now show in rows of nine.
- A few minor tweaks to the automatic cluster reports.
- 04-12-2008: fixes/changes
- The Key terms Automatic cluster report now uses a form of TFIDF (term frequency, inverse document frequency) scoring.
- The upper and lower bounds for feature inclusion have been changed slightly. These now show in the summary section of the reports.
- The Pictogram Automatic cluster now has "Best terms" links. (These are only calculated on demand, unlike the other two Auto-clusters.)
- Two new icons for the Pictogram reports for visual quality; darkness and brightness.
- Two new icons; anger and fear.
- 04-02-2008: fixes/changes
- Automatic cluster reports now show "Spotlight" documents: a list of the first (highest scoring) document in each cluster.
- 03-28-2008: fixes/changes
- New Cluster tab/panel with three automatic clustering reports;
- Pictogram cluster: cluster by the 200+ categories defined in the Pictogram report.
- Key terms cluster: cluster by the automatically determined Key terms of a set of documents.
- Word list cluster: user defined list of words or regular expressions.
The clusters are determined by the presence or absence of each feature using k-means. More details here.
- The Pictogram report now shows over and under-represented categories in pink or blue. Select the G2 or h statistic as a test of significance.
- The Pictogram counts are now hyperlinks. Click to run a Regex report for the category in a cluster of documents or the document set.
- 03-13-2008: fixes/changes
- The Pictograms results could get out of sync with the icons displayed if you had unused icons in the img folder - fixed.
- Added a few more icons to the Pictogram and a few more Emotional highlights terms.
- 03-04-2008: fixes/changes
- I've added a "Clear all documents" button to the Search panel as an alternative to the File menu's "Clear set" item.
- The Emotional highlights section of the Field Guide now matches a few dozen more terms.
- 02-27-2008: fixes/changes
- New sections in the document version of the Field Guide.
- Names and Places section, shows capitalized words not occurring at the start of a sentence. These usually indicate names of people, places, etc.
- Numbers section lists Arabic and Roman numerals.
- Emotional highlights section, displays most sentences containing emotion words, intensifiers, unusual and dramatic content.
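The idea behind the Names and Places section (capitalized words that do not start a sentence) can be approximated as below. This is only a sketch: the sentence splitting is naive and the function name is hypothetical, so BOR's own expression probably differs.

```python
import re


def names_and_places(text):
    """Return capitalized words that are not the first word of a sentence."""
    found = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[A-Za-z']+", sentence)
        for word in words[1:]:  # skip the sentence-initial word
            if re.fullmatch(r"[A-Z][a-z]+", word):
                found.append(word)
    return found


sample = "Yesterday I met Ellie in Boston. She waved. Then Dwight left."
print(names_and_places(sample))
```

Single capital letters such as "I" are ignored because the pattern requires at least one lowercase letter after the capital.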
- 02-19-2008: fixes/changes
- New, "stand-alone" version of the Pictogram report from the Profile menu of the Search tab.
Also available from result windows' Search menu.
- The Search tab version gives global document counts for each category.
- The results window version compares the counts for the result docs with the global count and uses G2 to score significance (red or blue highlight color).
- 02-10-2008: fixes/changes
- The layout of both of the Field Guide reports has changed a little. For the document version, pictograms are now shown in the top right.
- There are a total of 190 icons now.
- 02-07-2008: fixes/changes
- The Pictogram icons are now displayed with a bit more padding.
- There are a total of 180 icons now.
- 02-04-2008: fixes/changes
- I broke the Plugin reports for individual documents a couple of builds ago - fixed.
- 02-02-2008: fixes/changes
- The "key" link in the Pictogram now displays only the icons found in the document.
- Search links on the key page allow you to see what matched.
- Revised and added new icons - over 160 total.
- 01-29-2008: fixes/changes
- The Pictogram section of the document Field guide report now has a "key" link that displays a table of all of the icons and their regular expressions.
- 01-27-2008: fixes/changes
- Pictograms now use regular expressions instead of comma-separated words for searching.
- It also searches the complete text of a document instead of just the "key terms".
These changes mean you can search for phrases and more complex expressions.
- The icons are now grouped by category.
- I've added another 30 or so icons since the initial release.
- I've bundled the "img" folder with the regular release so you no longer have to download it separately.
- 01-20-2008: fixes/changes
- New Pictograms section added to the doc version of the "Field Guide" report.
- Matches about 2400 words to 99 content categories represented by icons:
- Read more about it here.
- 01-15-2008: fixes/changes
- The term version of the "Field Guide" report now shows the Porter stem of the term - if it is different.
This is a reminder that the lookup of Class and Idiom usually only works for the root form of a term.
- Fixed a bug in the Cloud reports that prevented you from using the popup search menu on the first and last words of the report.
- 01-12-2008: fixes/changes
- The document version of the "Field Guide" report now shows the key terms that matched each of the "classes" found.
- 01-05-2008: fixes/changes
- New document version of the "Field Guide" report. Shows the key terms of the document and related documents in Field Guide format.
Available in document window Search menus and as a link (fg) in Field Guide reports.
- Added cloud and intersections links to the documents section of Field Guide.
- Fixed thesaurus lookup so that the matches are a little more focused.
- 12-31-2007: fixes/changes
- New "Field Guide" report. Displays the "ecosystem" of a selected single word.
- Lists documents containing the word,
- The content class of the word,
- Common neighbors of the word,
- Words with similar spelling,
- Words that sound similar, and
- Idiom (thesaurus look-up).
- Available in document and result windows in the popup menu.
- Select a word, then right-click to display the popup.
- Requires that the thesaurus files be present in the settings/thesaurusfiles folder: download thesaurusfiles.zip, unzip and place in the settings folder.
- Replaced the "Content words" report with a global scope Cloud report.
- 12-22-2007: fixes/changes
- The "Content words" report now uses the "Show only significant matches" checkbox to raise the threshold for words to be considered content words.
- Fixed a bug in the html code of the "Regex by date" report, which was visible if you opened a saved report in a web browser.
- 12-19-2007: fixes/changes
- New "Content words" report: lists words that occur in at least two documents but also less than half of all documents. Also excludes words in the stop list.
- The count shown is the number of documents with the term.
- Very rare and very common terms are excluded.
- Available under the Sampling menu on the Search panel.
- Also in results windows under the Search menu.
- When run from a cluster the document count is limited to the cluster's docs.
- For single document windows, the Term table report gives equivalent results.
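The selection rule for the "Content words" report (at least two documents, fewer than half of all documents, not in the stop list) can be sketched as follows. The tokenizer and the sample documents are assumptions made for the example.

```python
import re
from collections import Counter


def content_words(docs, stop_words=frozenset()):
    """Map each content word to the number of documents that contain it."""
    doc_counts = Counter()
    for text in docs:
        # Count each word once per document.
        doc_counts.update(set(re.findall(r"[a-z]+", text.lower())))
    half = len(docs) / 2
    return {
        word: count
        for word, count in doc_counts.items()
        if 2 <= count < half and word not in stop_words
    }


# Hypothetical mini document set.
docs = [
    "the river was cold",
    "the river ran fast",
    "the house was empty",
    "the house was dark",
    "the house had a door",
]
print(content_words(docs, stop_words={"the"}))
```

Here "river" is kept (2 of 5 documents), while "house" and "was" are dropped as too common and single-document words as too rare.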
- 12-13-2007: fixes/changes
- The "Short" format plugin report was not giving results when called from the Search panel: fixed.
- 12-12-2007: fixes/changes
- The Intersections report now shows document links with the found intersections.
- The Search menu on Intersections result windows is now active. Searches apply to the docs listed in the results.
- 12-11-2007: fixes/changes
- Added the Intersections report to document and result windows.
- 12-07-2007: fixes/changes
- The Intersections report now lists duplicate intersections only once.
- 12-03-2007: fixes/changes
- New "Intersections" report: shows words that co-occur in two or more documents.
- Finds key words of each document, then lists any matches of 2 or more words in 2 or more documents.
- Useful for identifying recurring content and for general profiling.
- Checking the "Show only significant matches" shows intersections of at least 3 words.
- Located under the Net menu.
- The "Key terms" report now uses a better method to determine the importance of terms.
- Consolidated "TFIDF" and "Most sig. terms" into "Key terms" report on document Search menus.
- The "Regex by date" report now shows documents and matches in the last column.
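For the Intersections report described above, a minimal sketch of the matching step might look like this. It assumes the key words of each document are already known (the `key_terms` dictionary is hypothetical) and compares documents pairwise; BOR may do this differently.

```python
from itertools import combinations


def intersections(key_terms, min_words=2):
    """Map each shared word set to the documents that share it."""
    found = {}
    for doc_a, doc_b in combinations(sorted(key_terms), 2):
        shared = frozenset(key_terms[doc_a] & key_terms[doc_b])
        if len(shared) >= min_words:
            # Using the word set as the key lists duplicate intersections once.
            found.setdefault(shared, set()).update((doc_a, doc_b))
    return found


# Hypothetical key words per document.
key_terms = {
    "d1": {"boat", "water", "fear"},
    "d2": {"boat", "water", "dog"},
    "d3": {"boat", "water", "house"},
    "d4": {"dog", "house"},
}
for words, doc_names in intersections(key_terms).items():
    print(sorted(words), "in", sorted(doc_names))
```

Calling `intersections(key_terms, min_words=3)` corresponds to the "Show only significant matches" setting of at least 3 words.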
- 11-30-2007: fixes/changes
- New "Regex by date" report: shows matches in all docs, sorted by date. Also indicates the number of days between docs.
This requires that the documents have dates (mm-dd-yyyy) on the title line of each document.
Could be useful if you're trying to study the dream lag effect.
Located under the Sampling menu.
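A small sketch of how a report like "Regex by date" could be assembled: read the mm-dd-yyyy date from the title line, sort by date, and compute the gap in days. The function name, document texts and output shape are assumptions for illustration only.

```python
import re
from datetime import datetime

DATE = re.compile(r"\b(\d{2}-\d{2}-\d{4})\b")


def regex_by_date(docs, pattern):
    """Return (iso_date, days_since_previous_doc, matches), oldest first."""
    rows = []
    for text in docs:
        lines = text.splitlines()
        found = DATE.search(lines[0]) if lines else None
        if not found:
            continue  # documents without a dated title line are skipped
        date = datetime.strptime(found.group(1), "%m-%d-%Y").date()
        rows.append((date, re.findall(pattern, text)))
    rows.sort(key=lambda row: row[0])

    report, previous = [], None
    for date, matches in rows:
        gap = (date - previous).days if previous is not None else None
        report.append((date.isoformat(), gap, matches))
        previous = date
    return report


# Hypothetical documents; the first line is the title line.
docs = [
    "Dream 11-30-2007\nI saw a dog.",
    "Dream 11-24-2007\nThe dog and the cat ran.",
    "No date here\nA dog.",
]
report = regex_by_date(docs, r"\b(?:dog|cat)\b")
for row in report:
    print(row)
```

The pattern uses a non-capturing group so that `re.findall` returns the whole match rather than the group contents.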
- 11-24-2007: fixes/changes
- 11-22-2007: fixes/changes
- Right click text for popup search menu. Select some text in any window and then right-click to see a popup menu with three search items.
- Regex
- Concordance
- Concept net
Runs the selected report in global context with the current selection as the search regex.
- The Dictionary lookup was broken - fixed.
- 11-06-2007: fixes/changes
- New "Link rank" report. This one applies the PageRank algorithm to the entered regular expression terms, where proximity is treated as linkage.
Example: a very important name (husband, wife, children) will occur in proximity to other names.
For the Barb Sanders set (http://www.dreambank.net/) I ran a Names report followed by a Link rank to get
the top names:
| Link rank: C:\build\Dreamers\barbsanders1 |
| term | rank | in links | out links |
| Ellie | 0.0054 | 29.0 | 41.0 |
| Ginny | 0.0040 | 20.0 | 30.0 |
| Paulina | 0.0030 | 20.0 | 14.0 |
| Dwight | 0.0030 | 20.0 | 19.0 |
| Charla | 0.0030 | 22.0 | 10.0 |
| Ernie | 0.0025 | 17.0 | 12.0 |
| Bonnie | 0.0025 | 15.0 | 15.0 |
| Lucy | 0.0021 | 12.0 | 13.0 |
| Dovre | 0.0020 | 11.0 | 19.0 |
| Lydia | 0.0018 | 11.0 | 8.0 |
| Howard | 0.0017 | 8.0 | 16.0 |
| Darryl | 0.0017 | 8.0 | 5.0 |
| Merle | 0.0015 | 6.0 | 6.0 |
| Nate | 0.0014 | 9.0 | 8.0 |
| Derek | 0.0014 | 7.0 | 9.0 |
| Tyler | 0.0013 | 6.0 | 9.0 |
| Abner | 0.0011 | 5.0 | 7.0 |
| Mary | 0.0011 | 6.0 | 5.0 |
|
| The top names are her daughters, her brother, sister, closest friends, ex-husband, etc.
See the cast of characters page for Barb Sanders at Dreambank for comparison.
"in links" is the number of names that occur before the target name - "out links" is the count of names which follow the target.
The rank is automatically reduced if a term's out link count is less than 4 to exclude "link sinks".
Because the direction of links matters, you can sometimes see an effect where a "secondary" character has an
elevated rank when the character is always mentioned last. This "Mutt and Jeff" effect can reveal parts of speech; nouns have low in to out linkage, verbs have high in to out.
|
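To make the Link rank idea above more concrete, here is a hedged sketch of PageRank over proximity links. The assumptions are mine, not taken from BOR: consecutive matches within `window` characters form a link from the earlier term to the later one, the damping factor is the conventional 0.85, and the "link sink" reduction for terms with fewer than 4 out links is left out.

```python
import re


def link_rank(text, pattern, window=60, damping=0.85, iterations=50):
    """Rank matched terms with PageRank, linking each match to the next nearby match."""
    matches = [(m.start(), m.group(0)) for m in re.finditer(pattern, text)]
    terms = sorted({term for _, term in matches})
    if not terms:
        return {}
    out_links = {term: set() for term in terms}
    for (pos_a, a), (pos_b, b) in zip(matches, matches[1:]):
        if a != b and pos_b - pos_a <= window:
            out_links[a].add(b)  # a precedes b, so a links out to b

    count = len(terms)
    rank = {term: 1.0 / count for term in terms}
    for _ in range(iterations):
        new_rank = {term: (1.0 - damping) / count for term in terms}
        for term in terms:
            # A term with no out links spreads its rank over all terms.
            targets = out_links[term] or set(terms)
            share = damping * rank[term] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical text; the names are borrowed from the table above.
text = ("Ellie and Ginny went out. Dwight met Ellie. "
        "Ginny called Ellie. Paulina saw Ellie.")
rank = link_rank(text, r"\b(?:Ellie|Ginny|Dwight|Paulina)\b")
for name, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.4f}")
```

Because links point from the earlier name to the later one, a name that is usually mentioned last collects rank from the names before it, which is the direction effect described above.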
- 11-03-2007: fixes/changes
- The automatic neighbors expansion used in the Concept net and cluster reports now responds to the "show only significant" checkbox.
If you check it, only neighbors occurring 3 or more times are shown.
If you don't check it, the threshold is 2 or more.
Check it if you want to reduce the number of clusters in a Concept cluster.
- Created a new menu: "Net". Moved reports from the Sampling menu to the Net menu if they relied on proximity, co-occurrence or graph-related methods.
- The Concept cluster report now tries to merge smaller clusters while eliminating larger ones.
- 11-01-2007: fixes/changes
- 10-22-2007: fixes/changes
- Changes to the "Concept net" report.
- More weight is given to closeness between two terms.
- More weight is given to bi-directional matches.
- Less weight is given to multiple occurrences of a pair.
- The linkage direction arrows are now links which run a regex showing the words between two terms.
- 10-18-2007: fixes/changes
- New "Concept net" report. This is a combination of the Neighborhood report and the counts-only Proximity report.
It finds the neighbors of the entered term and then does a Proximity search to show related words, scored, by proximity.
Each found term is a link that, when clicked, sets off a new Concept net. Use this report to explore a network of related
concepts.
If you don't supply a search term, a random term is selected.
Arrows in the results show the direction of linkage between terms.
- 10-15-2007: fixes/changes
- Changed the behavior of the Network and Proximity reports when "Show only significant results" is checked.
When it is checked the reports only show results for the first item in the regular expression.
- Proximity report (Counts checked) now displays a table of "symmetric" terms. Two terms are symmetric when they
co-occur in a window in both orders:
term1 ... term2 and
term2 ... term1.
This is a possible first step towards a method for discovering concept classes and relationships using the methods
of Davidov et al.
- Fixed a bug in the "Expand" feature: it was failing because it included the semicolon as a word character.
- 10-11-2007: fixes/changes
- "Neighbors" command now available under all "RE" menus (next to search fields).
This version replaces the search regex with the neighbors found.
- 10-10-2007: fixes/changes
- New "Neighbors" report: shows words occurring within a +/- 30 character window of the entered regex.
Available under the Net menu on the Search tab.
- New "proximity" links in the results of Network reports. If "Counts" is checked, these links show only the
matches where the node's name is present.
- The "Network" link in the Names report now includes a few more roles: (girl|boy)?friend and grand(mother|father).
- Fixed a bug in the Proximity report: when both "Show only significant results" and "Counts" were checked, the
results displayed the scores for low-scoring matches - but not the matches. Now shows neither.
- 10-08-2007: fixes/changes
- The Proximity report now has a "counts only" option. If you check the Counts checkbox the results show
the matches, ordered by the sum of proximity scores.
- Added Proximity and Concordance reports to result window Search menus.
- 10-04-2007: fixes/changes
- Another tweak to the Expand terms function.
- Modified a couple of categories in the "Blog Factors" plugin.
- 10-02-2007: fixes/changes
- 09-22-2007: fixes/changes
- New "Similar" item in the Dictionary tab. Finds the closest matches for the entered word using the Levenshtein distance (a measure of how many letters would have to be changed to transform one word into another).
It only looks at words with the same first letter as the entered word. Useful when you don't know the exact spelling of a word.
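The "Similar" lookup described above can be illustrated with a standard Levenshtein distance and a same-first-letter filter. The vocabulary, function names and `limit` parameter are hypothetical; this is a sketch of the technique, not BOR's code.

```python
def levenshtein(a, b):
    """Minimum number of single-letter insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similar_words(word, vocabulary, limit=5):
    """Closest words by edit distance, restricted to the same first letter."""
    candidates = [w for w in vocabulary if w[:1] == word[:1] and w != word]
    return sorted(candidates, key=lambda w: (levenshtein(word, w), w))[:limit]


# Hypothetical dictionary entries.
vocabulary = ["dream", "dread", "drum", "cream", "dreamer"]
print(similar_words("dreem", vocabulary, limit=3))
```

Restricting candidates to the same first letter keeps the search small, at the cost of missing words such as "cream" whose first letter differs.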
- 09-18-2007: fixes/changes
- Added "delete" links to the Docs by directory report. Click on one to remove all of the docs in that directory from the docset.
- When you added a doc to the docset and it was already open, running a report from its Search menu would open a second copy of the doc: fixed.
- 09-11-2007: fixes/changes
- Expand terms inserted a word border (\b) at the end of the regex you entered, even if you didn't ask it to - fixed.
- The Dictionary panel combo box only worked if you changed the selection - fixed.
- 09-08-2007: fixes/changes
- The Network report now shows a list of the top (20) words found - rated by degree.
- 09-07-2007: fixes/changes
- The result links in the Network report now run a regex report that shows matches of all docs where the first name on the list occurs. (Previously there were two separate result links, for the name and its neighbors.)
- Fixed the default search of the Network report to always be case sensitive.
- Results of the Network report that have lots of neighbors are now highlighted.
- 09-04-2007: fixes/changes
- The "Network" report now uses a default regular expression if you don't supply it with one. This one searches for all capitalized words that don't occur at the beginning of a sentence. It will find last names and place names that the Names report doesn't.
- Added a Network report link to the Name report result window. Click it to run a Network report with all of the Names results and a regex for common roles: my (mother|father|son|daughter|boss|workplace|etc.).
- 08-28-2007: fixes/changes
- New "Network" report finds common neighbors of the regex's matches. Especially useful for lists of names, places and roles.
- 08-20-2007: fixes/changes
- New "RE as plugin" report (regular expression as plugin report) added to the Search menu of clustered results.
This one takes the regular expression entered in the text field of a clustered result and checks each individual match for significance in the cluster.
The result uses the same format and statistical tests as a plugin report.
- Removed the "Least sig. terms report" from individual docs because I didn't think it was useful.
- 08-16-2007: fixes/changes
- Added the "breadcrumbs" (call history) to a few more reports. If you call a report from the Search panel where it will apply to all docs it uses the first added directory as the starting place.
- Plugin reports, when called from a cluster, now show counts and percents for the other docs - everything not in the cluster. Previously it showed counts and percents for all docs.
- Added "Initials" category to the Names report to find characters identified by their initials only.
- A new tutorial shows you how to profile a dreamer.
- 08-12-2007: fixes/changes
- Added "Profile", "Names" and "Key terms" reports to clustered results.
- 08-11-2007: fixes/changes
- The "key" header in most reports now wraps at about 60 characters. Previously, very long regular expressions, which didn't wrap, caused the dream text to display on one line per paragraph. This made it hard to see the whole text. Fixed.
- New tutorial shows you how to work with multiple directories and compare multiple sets of dreams.
- 08-10-2007: fixes/changes
- You can now add multiple directories of documents to BOR. This makes it much easier to compare two authors. You can also add 25 (50, 100,...) docs from a variety of directories to generate a random corpus.
- New "Docs by directory" report under the Docs menu. Lists all of the directories added to BOR. Click on a path link in the report to get a cluster of the docs in that directory. Plugin reports run from a directory result compare those docs with the rest of the set.
- New "Clear set" item under the File menu.
- New "breadcrumbs" feature in header of plugin and other reports. This lets you track where the report was called from.
For example: you add 50 docs each from three directories, you run a Docs by directory and then select one of the three directories (say "C:\Dreamers\Joe"), run an Emotions plugin, then run a Characters plugin from that result and you get a breadcrumbs header: "Characters:Emotions:C:\Dreamers\Joe".
- Some "tag matches" results failed if the match occurred at the start of a document: fixed.
- The Proximity search now uses the stop words list to remove overly common words.
- 08-04-2007: fixes/changes
- Concordance report now has a "ref" link next to each match. Click it and the match is displayed in the original document.
- 07-31-2007: fixes/changes
- Fixed a bug in the Regex report: some results were not being displayed in the non-counts-only version of the report.
- Added Random terms items to the "Sampling" menu. Returns an alternation of 10, 20 or 30 random terms. (Uses the stop list to exclude very common words.)
- 07-25-2007: fixes/changes
- Added "Sampling" menu.
- Story: grabs random snippets from a document set and displays them as a fragmentary recollection.
- Random (1, 10, 50, 100): these items show a random document or cluster.
- Concordance: displays a visual concordance of a regex.
- 07-21-2007: fixes/changes
- Re-added G2 and h-score indicators to plain Regex searches.
- 07-16-2007: fixes/changes
- "Avg. sentence length" in the Readability report was wrong: Fixed.
- 07-15-2007: fixes/changes
- If you check both "Counts only" and "Anti" on a Regex search the results attempt to show what didn't match. This is only useful if your regular expression is a simple alternation.
- Minor adjustments to the Profile report. Added "What am I" link: finds most self-declarations; "I'm a bird", "I am a police officer", "I'm an Asian boy", etc.
- 07-05-2007: fixes/changes
- Added the Profile report under the Profile menu of the Search panel. This one tries to gather some basic information about the author; family structure, age, gender, etc.
- 07-04-2007: fixes/changes
- Fixed the layout of the Dictionary panel text area. It was clipped off at the top.
- 07-01-2007: fixes/changes
- The percent column was wrong in the Regex report when "Counts only" was checked. It now shows the correct % of words.
- 06-22-2007: fixes/changes
- Replaced the Search panel's "Regex" menu with a "Style" menu. Everything is a regular expression so there's no need for the old menu.
The Style menu should be used for stylistic features of a document rather than content.
- 06-21-2007: fixes/changes
- Selecting Alternate or Stem and alternate from the RE menu converted the list of terms to lower case. Fixed.
- 06-20-2007: fixes/changes
- Saving a search menu item was causing the RE menu to stop working. Fixed.
- The Settings panel is gone. Add a document set from the bottom row of Search panel now, or use the File menu.
- Swapped the two menu bars on the Search panel.
- Secondary searches from the Charset report searched all dreams. This was incorrect for single documents and clustered results. Fixed.
- Known bug of Charset report: it can't show counts for the following characters: # > <.
I probably won't fix this.
- 06-18-2007: fixes/changes
- New Charset report, under the Profiles menu on the Search panel (also in document and cluster results).
Shows a table of 98 characters as percent of all characters.
Results are color-highlighted for significance as in the "Short" plugin report.
- 06-14-2007: fixes/changes
- Removed G2/h statistic indicators from "plain" Regex reports. These were not reporting the correct result for non-global searches.
Results are now sorted by % of words.
- Most of the checkbox options on the Search panel are now saved and restored between starts. Not the Anti or Limits options.
- New short format for plugin reports. This shows results' significance by the background color of the category's table cell. Higher scores are dark red. White is neutral. Blue is low.
- New "Short" checkbox, to select the above.
- Plugin reports' search links did not work for some result categories: fixed.
- Changed the Related report. It now uses the 25 best terms by TFIDF score in a proximity search.
- 06-08-2007: fixes/changes
- Plugin report changes.
- If you run a non-batch plugin report and then click on one of the result links a "mini" result table is shown for the one category only. This way you can see if the category is significantly different in a doc or cluster.
- Previously a plain regex report would be run. You can still run the plain regex report from the link in the mini-plugin result.
- 06-06-2007: fixes/changes
- New word borders checkbox option \b: if checked, assumes you want to match on word borders.
If you search for cat without borders you'll match:
category, cataclysm, catch, etc.
- The document Term table report now reads "Vocabulary" instead of the misleading "Unique terms".
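The effect of the word borders option can be seen with Python's `re` module, which uses the same `\b` syntax. The sample sentence is made up, and `\w*cat\w*` is used only to show the whole word in which an unbordered "cat" would match.

```python
import re

text = "The cat sat near the catalog; what a catch."

# Without borders, "cat" matches inside longer words as well.
loose = re.findall(r"\w*cat\w*", text)

# With word borders, only the standalone word matches.
strict = re.findall(r"\bcat\b", text)

print(loose)
print(strict)
```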
- 05-30-2007: fixes/changes
- The RE menu is now editable. You can add prefixes and suffixes to the menusRE.txt file in the settings directory. The format is:
<RE>
menu name
regular expression
example:
<RE>
Negators-
\b(never|not( so| very)?|hardly|scarcely)\b
- The menu names for prefixes must be followed by a hyphen.
- Suffix names must be preceded by a hyphen.
- Don't edit the other items (Alternate, Expand, Save).
- The RE menu items now have full names.
- 05-28-2007: fixes/changes
- Added Cloud report to document windows only. This one shows words in a font size that is proportional to their tfidf scores.
- 05-26-2007: fixes/changes
- Added Batch search checkbox for plugin reports. Uncheck if you don't want batch searches: instead you will get category search links only.
- 05-25-2007: fixes/changes
- Fixed the message that tells you when your regex is incorrect. It wasn't showing.
- Some regex links now carry over to the search field of report windows that they triggered.
- 05-22-2007: fixes/changes
- Added four adjective prefixes to the RE menus.
- Changed the behavior of the RE, suffix commands: they no longer change lists into alternations.
- Added two versions of alternation (|) to the RE menu; one stems the terms before alternation s(|).
- The Thesaurus now works even if no documents have been added.
- Added "tfidf" (term frequency - inverse document frequency) report to documents. This shows the important words of a document compared to a set of documents.
Example: the Unabomber's manifesto (placed in a set of pet-theorists documents) gives the following top 20 terms by tfidf:
society industrial technology leftist technological social leftists leftism
surrogate goals Paragraph system autonomy freedom revolutionaries psychological problems
revolution process paragraphs
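As a rough guide to what a tfidf ranking like the one above computes, here is a minimal sketch using the textbook form tf * log(N / df). Which variant BOR uses is not stated, so the weighting, the tokenizer and the sample documents are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def top_terms(doc_index, docs, count=5):
    """Rank the terms of one document by tf * log(N / df) against the whole set."""
    tokenized = [tokenize(doc) for doc in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # one count per document
    term_freq = Counter(tokenized[doc_index])
    total = sum(term_freq.values())
    if not total:
        return []
    scores = {
        term: (freq / total) * math.log(len(docs) / doc_freq[term])
        for term, freq in term_freq.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:count]


# Hypothetical mini document set.
docs = [
    "technology shapes society and technology limits freedom",
    "the dog chased the cat and the bird",
    "society and the weather",
]
print(top_terms(0, docs))
```

A word that appears in every document, such as "and" here, gets a score of zero, which is why common words drop out of the top terms.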
- 05-12-2007: fixes/changes
- Added "Export as "plugins" file" link to the end of the Key terms report.
Click on it to get a plain text file that has each of the key terms regexes on a separate line.
Save it in the plugins folder and restart BOR to load the file under the Plugins and Search menus.
You should edit out lines that are repetitive or unthematic. The more lines you keep, the slower the report will run.
- Results for least, most and unique terms reports now include links for the terms found.
- 05-10-2007: fixes/changes
- Added Most sig. terms report to documents.
- Added Least sig. terms report to documents.
- Disable File menu on Search panel when running a report.
- 05-09-2007: fixes/changes
- The indicators for significance levels were not showing when the "h" statistic was being used. This caused the "show significant only" option to fail.
- Added indicator message (in red) for above when no results were found above significance level.
- 05-08-2007: fixes/changes
- For real this time! Added Bigrams report for documents.
- New checkbox option for significant results. When you check it, low scoring results are removed from the following reports;
- Regex: p value must be <= 0.05.
- Bigrams: only bigrams seen more than once.
- Proximity: stops showing when/if the score drops below 25% of the best matching document.
- Related: same as above.
- 05-07-2007: fixes/changes
- Term table was reporting twice as many word counts.
- Added Bigrams report for documents.
- Added Unique terms report for documents.
- Adjusted layout for option checkboxes which were obscured on WinXP.
- Formatter combos replaced with menus: "RE" menu.
- File menu now has "Add set" and "Add one doc" commands.
- 05-05-2007: Initial release.