Elasticsearch Series
| Content | Link |
|---|---|
| Elasticsearch Basic Operations | This Article |
| Elasticsearch Query Operations | https://blog.yexca.net/archives/227 |
| RestClient Basic Operations | https://blog.yexca.net/archives/228 |
| RestClient Query Operations | https://blog.yexca.net/archives/229 |
| Elasticsearch Data Aggregation | https://blog.yexca.net/archives/231 |
| Elasticsearch Autocompletion | https://blog.yexca.net/archives/232 |
| Elasticsearch Data Sync | https://blog.yexca.net/archives/234 |
| Elasticsearch Cluster | https://blog.yexca.net/archives/235 |
Elasticsearch is a super powerful open-source search engine. It helps us quickly find what we need in massive datasets. Combined with Kibana, Logstash, and Beats, it forms the Elastic Stack (ELK). It’s widely used in log data analysis, real-time monitoring, and more.
Elasticsearch is the core of the Elastic Stack, handling data storage, search, and analysis.
Under the hood, Elasticsearch is built on Lucene, a Java search engine library.
Forward Index
Traditional databases (like MySQL) use a forward index. Take this table:
| id | title | price |
|---|---|---|
| 1 | 小米手机 (Xiaomi phone) | 3499 |
| 2 | 华为手机 (Huawei phone) | 4999 |
| 3 | 华为小米充电器 (Huawei/Xiaomi charger) | 49 |
| 4 | 小米手环 (Xiaomi band) | 239 |
For exact queries based on id, an index makes it super fast.
But for fuzzy queries on title, you’re stuck with a row-by-row scan. Here’s how it goes:
- User searches for 手机 (phone); the database builds the condition `%手机%`.
- Fetch data row by row, e.g., the row with `id` 1.
- Check whether the `title` in that row matches the condition.
- If it matches, keep it; otherwise, discard it and move to the next row.
As data grows, row-by-row scanning gets less and less efficient.
Inverted Index
The term “inverted index” is named in contrast to the forward index used by databases like MySQL.
Elasticsearch uses an inverted index. Key concepts:
- Document: Each piece of data is a document.
- Term: Words produced by tokenizing (segmenting) the documents’ content.
Building an inverted index is a specific way to process a forward index. The steps:
- Tokenize each document’s data into individual terms using an algorithm.
- Create a table where each row includes the term, document ID(s) where it appears, position, etc.
- Since terms are unique, you can index them, perhaps using a hash table structure.
For example, the table above could have an inverted index like this:
| Term | Doc ID |
|---|---|
| 小米 (Xiaomi) | 1, 3, 4 |
| 手机 (phone) | 1, 2 |
| 华为 (Huawei) | 2, 3 |
| 充电器 (charger) | 3 |
| 手环 (band) | 4 |
Inverted index search flow:
- User searches for 小米手机 (Xiaomi phone).
- Tokenize the search query into the terms 小米 (Xiaomi) and 手机 (phone).
- Look up each term in the inverted index, collecting the doc IDs that contain them: 1, 2, 3, 4.
- Use those doc IDs to fetch the actual documents from the forward index.
Document
Elasticsearch is document-oriented. A document can be a product record, an order, or similar data from a database. Document data is serialized into JSON format and stored in Elasticsearch.
The JSON for the forward index table mentioned above would look like this:
| |
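The elided block above presumably showed the table rows serialized as JSON documents; for the first row it would look something like this:

```json
{
    "id": 1,
    "title": "小米手机",
    "price": 3499
}
```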
A JSON document contains many fields, similar to columns in a database.
Index and Mapping
An index is a collection of documents of the same type.
Mapping defines the field constraints for documents within an index, similar to table structure constraints.
You can think of an index as a database table. Database tables have constraints defining their structure, field names, types, etc. Similarly, an index has a mapping, which describes the field constraints for its documents, much like a table’s schema.
MySQL vs. Elasticsearch
| MySQL | Elasticsearch | Explanation |
|---|---|---|
| Table | Index | An index is a collection of documents, similar to a database table. |
| Row | Document | A document is a single piece of data, like a database row. All documents are in JSON format. |
| Column | Field | A field is a key within a JSON document, similar to a database column. |
| Schema | Mapping | Mapping defines constraints for documents within an index, like field type constraints. It’s similar to a database schema. |
| SQL | DSL | DSL (Domain Specific Language) is Elasticsearch’s JSON-based query language used for CRUD operations. |
In enterprises, these two are often used together:
- For write operations requiring high security, use MySQL.
- For search needs requiring high query performance, use Elasticsearch.
- Some form of data synchronization between the two keeps them consistent.
Pros & Cons
Forward Index:
- Pros:
- Can create indexes on multiple fields.
- Searches and sorting based on indexed fields are very fast.
- Cons:
- Searching by non-indexed fields or partial terms within indexed fields requires a full table scan.
Inverted Index:
- Pros:
- Term-based and fuzzy searches are extremely fast.
- Cons:
- Can only create indexes on terms, not on entire fields directly.
- Cannot sort directly by fields.
Installation
Typically, Elasticsearch alone is sufficient. Kibana provides a visual interface for Elasticsearch, making it easier to learn and write DSL queries.
Elasticsearch
To link Elasticsearch and Kibana containers, first create a network.
| |
There are multiple ways to link containers, such as Docker Compose or direct IP (e.g., 172.17.0.1).
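The network-creation command itself was elided above; with Docker it is typically the following (assuming the network name `es-net` used later in this post):

```shell
docker network create es-net
```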
Pull Elasticsearch.
| |
Single-node deployment.
| |
Remember to adjust mapping directories. The above uses Docker volumes. Here’s a partial explanation:
- `-e "cluster.name=es-docker-cluster"`: Sets the cluster name.
- `-e "http.host=0.0.0.0"`: The listening address, allowing external access.
- `-e "ES_JAVA_OPTS=-Xms512m -Xmx512m"`: Memory allocation.
- `-e "discovery.type=single-node"`: Single-node mode (not a cluster).
- `-v es-data:/usr/share/elasticsearch/data`: Mounts a volume for the ES data directory.
- `-v es-logs:/usr/share/elasticsearch/logs`: Mounts a volume for the ES logs directory.
- `-v es-plugins:/usr/share/elasticsearch/plugins`: Mounts a volume for the ES plugins directory.
- `--privileged`: Grants access rights to the volumes.
- `--network es-net`: Joins the network named `es-net`.
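Putting those flags together, the elided `docker run` command likely resembled the sketch below. The container name `es`, the published ports, and the image tag `7.12.1` are assumptions; use the version you actually pulled:

```shell
docker run -d \
    --name es \
    -e "cluster.name=es-docker-cluster" \
    -e "http.host=0.0.0.0" \
    -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
    -e "discovery.type=single-node" \
    -v es-data:/usr/share/elasticsearch/data \
    -v es-logs:/usr/share/elasticsearch/logs \
    -v es-plugins:/usr/share/elasticsearch/plugins \
    --privileged \
    --network es-net \
    -p 9200:9200 \
    -p 9300:9300 \
    elasticsearch:7.12.1
```

Port 9200 is the HTTP API (visited in the next step); 9300 is the transport port used between cluster nodes.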
Visit `localhost:9200`. If you see output similar to the block below, Elasticsearch started successfully.
| |
Kibana
Pull the same version image.
| |
Run.
| |
Here, `-e ELASTICSEARCH_HOSTS=http://es:9200` sets the Elasticsearch address. Since Kibana and Elasticsearch are on the same Docker network, Kibana can reach Elasticsearch directly by its container name (`es`).
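The elided run command likely resembled the following sketch; the container name `kibana` and the `7.12.1` tag are assumptions (the tag must match your Elasticsearch version):

```shell
docker run -d \
    --name kibana \
    -e ELASTICSEARCH_HOSTS=http://es:9200 \
    --network es-net \
    -p 5601:5601 \
    kibana:7.12.1
```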
Kibana typically takes a while to start. Wait a bit, and check the logs. If you see the port number, it’s successfully launched.
| |
Visit localhost:5601 to see the result.
IK Analyzer
ES needs to tokenize documents when creating an inverted index, and tokenize user input during searches. However, the default tokenization rules aren’t very friendly for Chinese. For example, test this:
| |
Syntax explanation:
- POST: Request method.
- /_analyze: Request path. `http://localhost:9200` is omitted here; Kibana fills it in.
- Request parameters, in JSON:
  - analyzer: Analyzer type, `standard` by default.
  - text: Content to be tokenized.
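Assembled from the syntax notes above, the elided test request was presumably of this shape (the sample sentence is illustrative):

```
POST /_analyze
{
    "analyzer": "standard",
    "text": "小米手机是不错的手机"
}
```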
Result:
| |
As you can see, the tokenization isn’t great. For Chinese tokenization, we usually use the IK Analyzer.
IK Analyzer Github: https://github.com/medcl/elasticsearch-analysis-ik
Online Installation
Ensure the installed version matches your ES version.
| |
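The elided install command typically looks like the following, run against the Elasticsearch container. The container name `es` and the `7.12.1` version in the release URL are assumptions; the version must match your ES version:

```shell
docker exec -it es ./bin/elasticsearch-plugin install \
    https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.12.1/elasticsearch-analysis-ik-7.12.1.zip

# Restart for the plugin to take effect
docker restart es
```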
Offline Installation
To install plugins, you need to know the Elasticsearch plugins directory. The above setup uses a volume mounted locally; you can check its location with this command:
| |
The Mountpoint in the output JSON is the directory.
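The elided command is presumably the standard Docker volume inspection, assuming the `es-plugins` volume mounted earlier:

```shell
docker volume inspect es-plugins
```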
Unzip the archive downloaded from GitHub, rename the folder to ik, and place it in the plugins directory.
Restart container.
| |
Test Effect
IK Analyzer has two modes:
- `ik_smart`: Minimal (coarse-grained) segmentation.
- `ik_max_word`: Maximum (fine-grained) segmentation.
Using the same example:
| |
Result:
| |
In this example, both tokenization modes yield the same result. You can test with longer sentences to see the difference.
Extend Dictionary
With the internet’s evolution, new words constantly emerge that aren’t in existing vocabulary lists. Thus, these lists need continuous updates. To extend the IK Analyzer dictionary, simply modify the IKAnalyzer.cfg.xml file in the config directory within your IK Analyzer installation.
| |
As shown above, custom words go into ./ext.dic, and stop words into ./stopwords.dic.
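The elided configuration presumably followed the standard IKAnalyzer.cfg.xml layout, roughly like this (the file names match the ones mentioned above; each dictionary file lists one word per line):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Custom extension words -->
    <entry key="ext_dict">./ext.dic</entry>
    <!-- Custom stop words -->
    <entry key="ext_stopwords">./stopwords.dic</entry>
</properties>
```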
Stop words are typically meaningless words like 的 (de), 啊 (a), etc. (common Chinese particles).
After configuration, restart ES.
DSL Index Operations
An index is like a database table. Before storing data in ES, you must first create an index and its mapping (like a table schema).
Mapping Properties
Mapping defines constraints for documents within an index. Common mapping properties include:
- `type`: Field data type. Common simple types include:
  - String: `text` (tokenized text), `keyword` (exact values, e.g., brand, country, IP address).
  - Numeric: `long`, `integer`, `short`, `byte`, `double`, `float`.
  - Boolean: `boolean`.
  - Date: `date`.
  - Object: `object`.
- `index`: Whether to create an index for the field. Defaults to `true`.
- `analyzer`: Which analyzer to use.
- `properties`: Sub-fields of this field.
Create Index
- Request method: `PUT`.
- Request path: `/index-name` (customizable).
- Request parameters: the `mapping` definition.
| |
For example:
| |
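Combining the mapping properties described above, the elided example was presumably something like the following. The index name `/test` and the fields are illustrative (an `email` field is included since later examples update one):

```
PUT /test
{
    "mappings": {
        "properties": {
            "info": {
                "type": "text",
                "analyzer": "ik_smart"
            },
            "email": {
                "type": "keyword",
                "index": false
            },
            "name": {
                "type": "object",
                "properties": {
                    "firstName": { "type": "keyword" },
                    "lastName": { "type": "keyword" }
                }
            }
        }
    }
}
```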
If the response is similar to below after running, it’s successful.
| |
Get Index
- Request method: `GET`.
- Request path: `/index-name`.
- Request parameters: None.
Format:
| |
For example:
| |
Result:
| |
Update Index
Once an index and its mapping are created, they cannot be modified. However, you can add new fields.
| |
For example:
| |
If you encounter a read-only-allow-delete error, it’s usually because free disk space has fallen below the flood-stage watermark (by default, less than 5% free), which switches indices to read-only. You can resolve it with this request:
| |
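The elided fix is presumably the standard request that clears the read-only block on all indices (after freeing up disk space):

```
PUT /_all/_settings
{
    "index.blocks.read_only_allow_delete": null
}
```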
Delete Index
- Request method: `DELETE`.
- Request path: `/index-name`.
- Request parameters: None.
Format:
| |
For example:
| |
Result:
| |
Index Operations Summary
- Create index: `PUT /index-name`
- Get index: `GET /index-name`
- Delete index: `DELETE /index-name`
- Add field: `PUT /index-name/_mapping`
DSL Document Operations
Add Document
| |
Example:
| |
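Following the summary syntax `POST /index-name/_doc/doc-id`, the elided example was presumably of this shape; the index name, ID, and field values are illustrative:

```
POST /test/_doc/1
{
    "info": "A test document",
    "email": "zhangsan@example.com",
    "name": {
        "firstName": "San",
        "lastName": "Zhang"
    }
}
```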
Result:
| |
Get Document
| |
Example:
| |
Result:
| |
Update Document
There are two ways to update a document: full update and partial update.
Full Update
A full update overwrites the original document. Essentially, it:
- Deletes the document with the specified ID.
- Adds a new document with the same ID.
If the ID doesn’t exist, it will still perform step 2, turning an update into an add (upsert).
| |
For example:
| |
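Since a full update sends a complete replacement document with `PUT /index-name/_doc/doc-id`, the elided example presumably resembled the following (index name, ID, and values illustrative; the new email matches the “email has been updated” check below):

```
PUT /test/_doc/1
{
    "info": "A test document",
    "email": "new@example.com",
    "name": {
        "firstName": "San",
        "lastName": "Zhang"
    }
}
```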
Result:
| |
Check the document; the email has been updated.
Partial Update
A partial update modifies only specific fields within the document matching the given ID.
| |
For example:
| |
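A partial update sends only the changed fields under a `doc` key, so the elided example was presumably of this shape (index name, ID, and the new email value are illustrative):

```
POST /test/_update/1
{
    "doc": {
        "email": "newer@example.com"
    }
}
```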
Result:
| |
Check the document; the email has been updated.
Delete Document
| |
For example:
| |
Result:
| |
Document Operations Summary
- Create document: `POST /{index-name}/_doc/doc-id { json-document }`
- Get document: `GET /{index-name}/_doc/doc-id`
- Delete document: `DELETE /{index-name}/_doc/doc-id`
- Update document:
  - Full update: `PUT /{index-name}/_doc/doc-id { json-document }`
  - Partial update: `POST /{index-name}/_update/doc-id { "doc": { field: value } }`