Performing a Query

An overview of the Apache Lucene query syntax as implemented in the Datastreamer API

The results returned by the Datastreamer API depend heavily on the query passed in the query parameter. A query is broken up into terms and operators. The sections below show how terms and operators can be used to retrieve the required results from the API.

Term & Phrase

  • A Single Term is a single word such as "test" or "hello" (quotes not required).
  • A Phrase is a group of words surrounded by double quotes, such as "hello john". When the query is sent as a JSON string, each double quote is escaped with a backslash: \"hello john\".
  • Multiple terms can be combined together with Boolean operators to form a more complex query.
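
As a minimal sketch of how a phrase ends up inside the query parameter, the Python snippet below builds a phrase and serializes it into a JSON body. The helper name `phrase` is hypothetical, and the body contains only the query field shown in this document; any other fields the API may need are left out.

```python
import json

def phrase(words):
    """Wrap a group of words in double quotes to form a phrase."""
    return '"' + " ".join(words) + '"'

# Combine a phrase and a single term with a Boolean operator.
query = phrase(["hello", "john"]) + " AND test"

# json.dumps adds the backslash before each double quote inside the string.
body = json.dumps({"query": query})
print(body)  # {"query": "\"hello john\" AND test"}
```

Letting the JSON serializer add the backslashes avoids escaping the quotes by hand.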

Operator

Operators combine terms using Boolean logic. The search query supports the following operators:

AND / &&: The AND operator matches documents where both terms exist anywhere in the text of a single document. This is equivalent to an intersection using sets.

+ : The + or required operator requires that the term after the + symbol exist somewhere in the field of a single document.

OR / || : The OR operator is the default conjunction operator. This means that if there is no Boolean operator between two terms, the OR operator is used. The OR operator links two terms and finds a matching document if either of the terms exists in a document.

NOT / ! : The NOT operator excludes documents that contain the term after NOT. This is equivalent to a difference using sets. The symbol ! can be used in place of the word NOT.

- : The - or prohibit operator excludes documents that contain the term after the - symbol.
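
The set analogy used above can be illustrated locally. The sketch below runs on a toy in-memory corpus (made-up documents, not Datastreamer data) and only mimics the matching logic; it does not call the API.

```python
# Toy corpus: document id -> text. Purely illustrative.
docs = {
    1: "hello john",
    2: "hello world",
    3: "goodbye john",
}

def matching(term):
    """Return the ids of documents whose text contains the term as a word."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}

hello = matching("hello")
john = matching("john")

and_result = hello & john   # hello AND john: intersection
or_result = hello | john    # hello OR john: union
not_result = hello - john   # hello NOT john: difference

print(and_result, or_result, not_result)
```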

📘

Escape Special Characters

The current list of special characters is + - && || ! ( ) { } [ ] ^ " ~ * ? : \

To escape one of these characters, place a backslash (\) before it.
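
A minimal sketch of escaping user input before placing it in a query, assuming the standard Lucene convention of a backslash before each special character or sequence. The function name `escape_term` is hypothetical, and the character list is the one given above.

```python
# Single special characters from the list above.
SPECIAL_CHARS = set('+-!(){}[]^"~*?:\\')
# Two-character sequences from the list above.
SPECIAL_PAIRS = ("&&", "||")

def escape_term(text):
    """Return text with a backslash before every special character or pair."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in SPECIAL_PAIRS:
            out.append("\\" + pair)
            i += 2
        elif text[i] in SPECIAL_CHARS:
            out.append("\\" + text[i])
            i += 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

print(escape_term("(1+1):2"))
```

Escaping is only needed when a special character should be matched literally; operators and wildcards that are meant to act as syntax must be left unescaped.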

Term Modifiers

Apache Lucene supports modifying query terms to provide a wide range of search options. The following options are available for designing a request that retrieves the most relevant results.

Wildcard Search

Single and multiple character wildcard searches are supported within single terms. To perform a single character wildcard search, use the ? symbol. To perform a multiple character wildcard search, use the * symbol.

The single character wildcard search looks for terms that match the pattern with the ? replaced by any single character. For example, to search for "text" or "test" you can use the search:

te?t

The multiple character wildcard search looks for 0 or more characters. For example, to search for test, tests or tester, you can use the search:

test*

You can also use the wildcard searches in the middle of a term.

te*t

🚧

Illegal Usage

You cannot use a * or ? symbol as the first character of a search.
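
The wildcard rules can be tried out locally before sending a query. The sketch below uses Python's `fnmatch` module, whose ? and * have the same single-character and zero-or-more-character meaning, as a stand-in for the matching done by the API. The function names are hypothetical, and the leading-wildcard check mirrors the restriction above.

```python
from fnmatch import fnmatchcase

def check_wildcard_term(term):
    """Raise ValueError if the term starts with a wildcard symbol."""
    if term[:1] in ("*", "?"):
        raise ValueError("a search term cannot start with * or ?")
    return term

def local_match(pattern, word):
    """Approximate locally whether a word would match a wildcard term."""
    return fnmatchcase(word, check_wildcard_term(pattern))

print(local_match("te?t", "text"))     # ? stands for exactly one character
print(local_match("test*", "tester"))  # * stands for zero or more characters
print(local_match("te*t", "tempt"))    # wildcard in the middle of a term
```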

Fuzzy Search

Our API supports fuzzy searches, which are written by adding a tilde (~) to the end of a single term. For example, to search for a term similar in spelling to "roam", use the fuzzy search:

roam~

This search will find terms like foam and roams.
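
Fuzzy matching in Lucene is based on edit distance, the number of single-character changes needed to turn one term into another. The sketch below computes the plain Levenshtein distance to show why foam and roams count as close to roam. It is an illustration only; the exact distance threshold applied by the API is not stated here.

```python
def edit_distance(a, b):
    """Return the Levenshtein distance between strings a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]

for candidate in ("foam", "roams", "table"):
    print(candidate, edit_distance("roam", candidate))
```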

Proximity Search

The API supports finding words that are within a specific distance of each other. To do a proximity search, use the tilde (~) symbol at the end of a Phrase, followed by the maximum distance in words. For example, to search for "apache" and "jakarta" within 10 words of each other in a document, use the search:

"jakarta apache"~10

Date Range Search

Date Range Queries filter results by a date range on any date/time field. By default, all searches are filtered to the last 30 days. In this example, results are filtered to content with a published date/time between November 1st, 2021 and November 7th, 2021, inclusive.

"query": "content.published:[2021-11-01 TO 2021-11-07]",

📘

Searching beyond 30 days

You can search up to 180 days with the Datastreamer Search API, using the Document Date (doc_date) as described in the next section "Historical Searching".

Square brackets [] perform a range search that includes the boundary values given within the brackets. Curly brackets {} perform a range search that excludes the boundary values.

You can also add a time in addition to the date.

"query": "content.published:[2021-12-01T00:00:00 TO 2021-12-07T11:15:37]",
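
Range clauses like the ones above can be generated from date objects instead of being typed by hand. This is a minimal sketch; the helper name `range_clause` and its `inclusive` parameter are hypothetical, and the output format simply follows the examples in this section.

```python
from datetime import date, datetime

def range_clause(field, start, end, inclusive=True):
    """Build a range clause; [] includes the bounds, {} excludes them."""
    open_b, close_b = ("[", "]") if inclusive else ("{", "}")
    # isoformat() gives YYYY-MM-DD for dates and YYYY-MM-DDTHH:MM:SS for datetimes.
    return f"{field}:{open_b}{start.isoformat()} TO {end.isoformat()}{close_b}"

by_day = range_clause("content.published", date(2021, 11, 1), date(2021, 11, 7))
by_time = range_clause(
    "content.published",
    datetime(2021, 12, 1, 0, 0, 0),
    datetime(2021, 12, 7, 11, 15, 37),
)
print(by_day)
print(by_time)
```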

📘

Query in API

The following APIs use the query parameter:

  • Search API

Historical Searching

The Search API supports searching historical content.

By default, all search queries return results from the last 30 days. To search up to the full extent of the data sources, include a "document date" (doc_date) field in your query, specifying the date range you would like to view documents from.

"doc_date" is a required field for every data source ingested.

"query": "content.body: movies coming next summer AND doc_date:[2021-10-01 TO 2022-01-31]"

🚧

Historical Search Limitations

The farther back a historical search reaches and the more complex the query, the slower it will run. A query that takes more than 10 minutes will not complete. It is recommended to apply a sort other than the default relevance sort.
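
A minimal sketch of adding a doc_date filter for the last N days to an existing query. The function name `historical_query` is hypothetical, the 180-day cap is the figure quoted earlier in this document, and the parentheses around the base query are an assumption that standard Lucene grouping is accepted.

```python
from datetime import date, timedelta

MAX_DAYS = 180  # search window quoted in this document

def historical_query(base_query, days_back, today=None):
    """Append a doc_date range covering the last days_back days."""
    if not 1 <= days_back <= MAX_DAYS:
        raise ValueError(f"days_back must be between 1 and {MAX_DAYS}")
    today = today or date.today()
    start = today - timedelta(days=days_back)
    # Group the base query so the AND applies to all of it.
    return f"({base_query}) AND doc_date:[{start.isoformat()} TO {today.isoformat()}]"

print(historical_query("content.body:movies", 30, today=date(2022, 1, 31)))
```

Keeping the window as narrow as the use case allows helps stay clear of the time limit described above.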

Highlighting

The Search API supports highlighting the terms in a response that match the query.

To enable highlighting, add the following field to the request:

"highlight": true

Within the results, a new field, content.body.highlight, will contain the content with tags placed before and after each piece of content that matches the query.
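
As a small sketch, the request body with highlighting enabled can be built in Python as shown below. Only the query and highlight fields named in this document are included; the query value is a placeholder, and any other fields the Search API may require are omitted.

```python
import json

# Placeholder query; "highlight" is the flag described above.
payload = {
    "query": "content.body:movies",
    "highlight": True,
}

# Python's True is serialized as the JSON literal true.
body = json.dumps(payload)
print(body)  # {"query": "content.body:movies", "highlight": true}
```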