16. Keyword Statistics

The Keywords tab gives detailed statistics about the keywords in a keyword list. The workflow is as simple as choosing a keyword list, specifying several calculation options, and clicking Calculate. This will produce a table showing the keyword list and several statistics for every keyword query in the list.

The nature of the information shown here potentially goes beyond what can be established in the Search tab.

Keyword stats

16.1. Configuration

All controls for configuring the calculation are placed on the left side of the tab. The options are divided into four groups:

  • Keyword list to use.
  • Filter to be applied.
  • Document fields to search in.
  • Statistics to be calculated.

At the top, the user can choose a previously uploaded keyword list or add one here. This uses the same collection of keyword lists as the Keyword Lists facet in the Search tab. Any list added in the facet can be used here and vice versa.

The second panel lets the user filter the searches that are performed to calculate the keyword statistics. When a saved search is chosen as a filter, the saved search is evaluated and its result is intersected with the result of each keyword search. For example, take the keyword “letter” and a saved search that combines an OCR search, an inclusion of PDF documents and an exclusion of a custodian. The statistics are then calculated for items that contain the keyword “letter”, have had OCR performed on them, are PDF documents and do not come from that custodian.
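The intersection with a saved search can be pictured with plain sets. This is only a minimal sketch and not how Intella Connect is implemented; the item IDs and the `apply_filter` helper are hypothetical.

```python
def apply_filter(keyword_hits: set[int], saved_search_hits: set[int]) -> set[int]:
    """Keep only the keyword hits that are also in the saved search result."""
    return keyword_hits & saved_search_hits


# Hypothetical item IDs returned by the keyword "letter".
letter_hits = {1, 2, 3, 4, 5}
# Hypothetical item IDs returned by the saved search used as a filter.
saved_search_hits = {2, 4, 6, 8}

filtered = apply_filter(letter_hits, saved_search_hits)
print(sorted(filtered))
```

Only the items present in both results contribute to the statistics for that keyword.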

Although we call this functionality “keyword statistics”, the user can use the complete full-text search syntax here: wildcards, Boolean operators, phrase queries, etc. are all available. Field-specific searches are also possible. When used in a query, these override the field settings chosen in the third panel.

The third panel offers the available search fields. These are the same as offered in the Search tab. By default, all fields are searched, but the user can choose to restrict searches to e.g. the document text, email headers, etc. Any combination of fields can be used.

The last panel offers five checkboxes that determine what information the table will contain:

  • The Items option adds columns indicating:

    • number of items containing the keyword,
    • corresponding percentage of items,
    • deduplicated number of items, and
    • exclusive items of a keyword, i.e. the deduplicated number of items not returned by any of the other keywords. It shows how many extra items are returned when a keyword is added, or how many items are lost if it is removed. This can be used to measure the impact of a search keyword on the length of the review process.

  • The Hits option counts the number of occurrences of the search term in the texts. For example, when a keyword produces a document that contains the keyword 3 times and another document that contains the keyword 5 times, this column will show 8. The hits are counted across all the selected search fields, but only on the deduplicated items.

  • The Custodians option adds a column for every custodian in the case. Each custodian column indicates how many of the matching items originate from that custodian.

  • The Families option adds two columns: “Families” and “Family items”. A family is an item set consisting of a top-level item (e.g. a mail in a PST file) and all its nested items (e.g. attachments, embedded images, archive entries). Families are detected by traversing an item’s location upwards in the hierarchy tree until the family root is found. Items with the same family root are part of the same family. Certain types of items are skipped when determining the family root, namely all folders, mail containers, disk images, load files and cellphone reports. The meaning of the two columns is then as follows:

    • The Families column shows in how many families the keyword occurs. For example, if a mail and two of its attachments all contain the keyword, that counts as a single family.
    • The Family items column shows the total number of items that are contained in these families. This may (and usually will) include items that do not contain the keyword at all; they just belong to a family that has a hit in one of its other items. In cases where you are not directly exporting search results but rather their top-level parents (the default setting when exporting to PST), this will tell you how much of the case is conceptually being exported this way. This may give an indication of how well a certain search filters items in a case.
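The family detection and the two family columns can be illustrated as follows. This is a sketch under assumptions, not the product's actual algorithm: the `Item` class, the type names in `SKIPPED_TYPES` and the rule "the family root is the highest ancestor whose type is not skipped" are hypothetical readings of the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Item types that are skipped when determining the family root (per the text).
SKIPPED_TYPES = {"folder", "mail container", "disk image", "load file",
                 "cellphone report"}


@dataclass(eq=False)  # eq=False keeps identity hashing, so items fit in sets
class Item:
    name: str
    kind: str
    parent: Optional["Item"] = None


def family_root(item: Item) -> Item:
    """Walk upwards; remember the highest ancestor that is not a skipped type."""
    root = item
    node = item.parent
    while node is not None:
        if node.kind not in SKIPPED_TYPES:
            root = node
        node = node.parent
    return root


def family_columns(matched: list[Item], case_items: list[Item]) -> tuple[int, int]:
    """Return (Families, Family items) for the items matching one keyword."""
    roots = {family_root(item) for item in matched}
    family_items = [item for item in case_items if family_root(item) in roots]
    return len(roots), len(family_items)


# Hypothetical case: a PST with one mail that has two attachments, plus a memo.
pst = Item("mail.pst", "mail container")
inbox = Item("Inbox", "folder", pst)
mail = Item("Re: letter", "email", inbox)
att1 = Item("letter.pdf", "document", mail)
att2 = Item("logo.png", "image", mail)
memo = Item("Memo", "email", inbox)
case_items = [pst, inbox, mail, att1, att2, memo]

# The keyword occurs in the mail and in one attachment only.
families, family_items = family_columns([mail, att1], case_items)
print(families, family_items)
```

In this example the mail and its attachment count as a single family, while the Family items value also includes the image attachment that does not contain the keyword.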

16.2. Calculation

When the Calculate button is clicked, Intella Connect will populate the table after finishing all calculations.

The time required for the calculation is dependent on several factors, including the size of the keyword list, the hardware, the chosen search options and the storage location and size of the case. While most options can benefit from indices that make the calculation fast regardless of case size, the Hits option will have a considerable impact on the search speed.

The progress of the calculation will be shown in the status panel above the table.

During calculation, the Calculate button changes into a Stop button, which can be used to terminate the process manually.

When clicking Calculate again, the previous results will be discarded and the table will be populated from scratch, using the (possibly changed) configuration options.

16.3. Results

The table order is the same as the order in the keyword list.

The last row shows the totals for each column. The Exclusive and Hits columns show the total as the sum of all rows.

The remaining totals are calculated from the union of the results of all keyword searches. It is important to note that they are not the sum of all rows.
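The difference between summed totals and union-based totals can be shown with a small example. It is a simplified sketch: the item IDs, hit counts and function names are hypothetical, and the result sets are assumed to be deduplicated already.

```python
def exclusive_counts(results: dict[str, set[int]]) -> dict[str, int]:
    """Per keyword, count the items not returned by any other keyword."""
    counts = {}
    for keyword, items in results.items():
        others = set().union(
            *(v for k, v in results.items() if k != keyword)
        )
        counts[keyword] = len(items - others)
    return counts


def totals_row(results: dict[str, set[int]], hits: dict[str, int]) -> dict[str, int]:
    """Exclusive and Hits are summed over the rows; Items uses the union."""
    union = set().union(*results.values())
    return {
        "items": len(union),
        "exclusive": sum(exclusive_counts(results).values()),
        "hits": sum(hits.values()),
    }


# Hypothetical results: item 3 is returned by both keywords.
results = {"letter": {1, 2, 3}, "invoice": {3, 4}}
hits = {"letter": 8, "invoice": 2}

totals = totals_row(results, hits)
print(totals)
```

Here the rows report 3 and 2 items, yet the total is 4 rather than 5, because the item matched by both keywords is counted once in the union.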

16.4. Exporting the results

Once the calculation has completed, the table can be exported by clicking the Export button above the results table.

This will show an optional description field, which applies to the PDF and DOCX exports, along with four buttons with the following actions:

  • CSV and XLSX - export the table as comma-separated values or as a table in an Excel document.
  • PDF and DOCX - create a PDF or DOCX document containing the keyword statistics report.

16.5. Keyword statistics report

The keyword statistics report contains information that can easily be given to general counsel.

Each page has a header composed of the case name, the keyword list name, and the date and time at which the report was created. Optionally, a description can be added. The description can be anything that the user wants to disclose in the report.

The first page contains an overview in the form of a bar chart showing how the keyword list compares to all items in the case.

It contains the following values:

  • all items that contain any of the keywords
  • deduplicated number of all items that contain any of the keywords
  • deduplicated number of all items that contain any of the keywords, including all items contained in the families in which the keywords occur
  • items without hits - equal to the result of the all items search in the Features facet minus all items that contain any of the keywords

The following pages contain one or more bar charts showing the Deduplicated and Exclusive values for each keyword in the list.

The last pages contain a table showing the following:

  • keyword
  • number of items containing the keyword
  • deduplicated number of items
  • total number of items contained in the families in which the keyword occurs
  • deduplicated number of items not returned by any of the other keywords