Elasticsearch 部分匹配

时间：2017-12-16 20:59:55

参考：

部分匹配

环境:

系统版本：Win10 64位
Elasticsearch版本:6.0.0

部分匹配#

前缀匹配
实时搜索，根据输入信息实时搜索
长组合词

构造数据#

指定字段数据类型以及分析器

PUT /my_index/
{
"settings": {
    "number_of_shards": 1,  
    "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
               }
             },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" 
                    ]
                }
            }
        }
    },
    "mappings": {
        "address": {
            "properties": {
                "postcode": {
                    "type":  "keyword"
                },
                "name": {
                    "type":     "text",
                    "analyzer": "autocomplete"
                }
            }
        }
    }
}

注：autocomplete 分析器会把单词按如下形式分割 quick 会分解成 q qu qui quic quick。

插入数据

POST /my_index/address/_bulk
{ "index": { "_id": 1 }}
{ "postcode": "W1V 3DG","title": "Sue ate the alligator", "name": "Brown foxes"}
{ "index": { "_id": 2 }}
{ "postcode": "W2F 8HW" ,"title": "The alligator ate Sue", "name":"Kobe Briant"  }
{ "index": { "_id": 3 }}
{ "postcode": "WC1N 1LZ","title" : "Sue never goes anywhere without her alligator skin purse", "name": "Yellow furballs"}
{ "index": { "_id": 3 }}
{ "postcode": "SW5 0BE","title":"The alligator eat Sue", "name":"Sun Feilong" }

1	数据含义：这些数据是美国邮编，以 `W1V 3DG` 为例：

W1V: 这是邮编的外部，它定义了邮件的区域和行政区：

W 代表区域（ 1 或 2 个字母）
1V 代表行政区（ 1 或 2 个数字，可能跟着一个字符）

3DG: 内部定义了街道或建筑：

3 代表街区区块（ 1 个数字）
DG 代表单元（ 2 个字母）

查询示例#

前缀查询 prefix。根据前缀进行匹配。

GET /my_index/address/_search
{
    "query": {
        "prefix": {
            "postcode": "W"
        }
    }
}

模糊查询 wildcard。下面查询匹配以W开头，后面一个任意字符，后面是F，再后面是任意个任意字符后面跟着 HW 的数据。

GET /my_index/address/_search
{
    "query": {
        "wildcard": {
            "postcode": "W?F*HW" 
        }
    }
}

正则查询 regexp。下面查询匹配以W开头，后面跟一个数字，后面至少有一个任意字符的数据， W1 不会匹配。

GET /my_index/address/_search
{
    "query": {
        "regexp": {
            "postcode": "W[0-9].+" 
        }
    }
}

短语前缀匹配 match_phrase_prefix。下面查询会匹配 title 包含 the alligator at* 的内容。

GET /my_index/address/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "The alligator at"
    }
  }
}

短语前缀匹配 match_phrase_prefix 指定 slop，。只有最后一个词才能当前缀使用。

GET /my_index/address/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": {
        "query": "The alligator at",
        "slop" :10
      }
    }
  }
}

查询结果：

"hits": [
  {
    "_index": "my_index",
    "_type": "address",
    "_id": "2",
    "_score": 0.8630463,
    "_source": {
      "postcode": "W2F 8HW",
      "title": "The alligator ate Sue"
    }
  },
  {
    "_index": "my_index",
    "_type": "address",
    "_id": "1",
    "_score": 0.3273624,
    "_source": {
      "postcode": "W1V 3DG",
      "title": "Sue ate the alligator"
    }
  }
]

短语前缀匹配 match_phrase_prefix 指定 slop，同时限制前缀查询查询数据量。

GET /my_index/address/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": {
        "query": "The alligator at",
        "max_expansions": 50,
        "slop" :10
      }
    }
  }
}

n-grams 边界查询。对单词进行特殊的切分，如 quick 会分解成 q qu qui quic quick。

GET /my_index/address/_search
{
    "query": {
        "match": {
            "name": "brown fo"
        }
    }
}

查询结果：

"hits": [
  {
    "_index": "my_index",
    "_type": "address",
    "_id": "1",
    "_score": 2.0325346,
    "_source": {
      "postcode": "W1V 3DG",
      "title": "Sue ate the alligator",
      "name": "Brown foxes"
    }
  },
  {
    "_index": "my_index",
    "_type": "address",
    "_id": "2",
    "_score": 1.2379456,
    "_source": {
      "postcode": "W2F 8HW",
      "title": "The alligator ate Sue",
      "name": "Kobe Briant"
    }
  },
  {
    "_index": "my_index",
    "_type": "address",
    "_id": "3",
    "_score": 0.5361201,
    "_source": {
      "postcode": "SW5 0BE",
      "title": "The alligator eat Sue",
      "name": "Sun Feilong"
    }
  }
]

匹配结果和预期的不一样，是因为我们查询参数也是用了 autocomplete 分析器，使用下面的语句进行验证： json GET /my_index/address/_validate/query?explain { "query": { "match": { "name": "brown fo" } } }

验证结果：

"explanations": [
    {
      "index": "my_index",
      "valid": true,
      "explanation": "+(Synonym(name:b name:br name:bro name:brow name:brown) Synonym(name:f name:fo)) #*:*"
    }
  ]

为了查找到想要的结果，我们可以在查询时指定其他分析器或者再定义索引的时候，配置相应的查询解析器

方案一：查询时，指定查询参数的分析器

GET /my_index/address/_search
{
    "query": {
        "match": {
            "name": {
              "query": "brown fo",
              "analyzer": "standard" 
            }
        }
    }
}

方案二（推荐）：创建索引时指定对应的查询分析器。

PUT /my_index/address/_mapping
{
    "address": {
        "properties": {
            "name": {
                "type":            "string",
                "index_analyzer":  "autocomplete", 
                "search_analyzer": "standard" 
            }
        }
    }
}

解决长组合词问题，长组合词时多个单词包含连接起来组成的一个单词，例如 MyNameIsSunFeiLong 在分词的时候，会被分割成一个完整的单词，如果查询 name 或其包含的任意一个单词，并不会与之匹配，我们可以通过分词器把长词组合分割成一段段的单词，如 myn yna nam ame ... ，我们查询的词也被分割成 nam ame，这样就能和结果匹配，为了达到这种效果，我们需要自定义的分析器。

定义分析器：

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "trigrams_filter": {
                    "type":     "ngram",
                    "min_gram": 3,
                    "max_gram": 3
                }
            },
            "analyzer": {
                "trigrams": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter":   [
                        "lowercase",
                        "trigrams_filter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "my_type": {
            "properties": {
                "text": {
                    "type":     "text",
                    "analyzer": "trigrams" 
                }
            }
        }
    }
}

插入数据

POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "text": "Aussprachewörterbuch" }
{ "index": { "_id": 2 }}
{ "text": "Militärgeschichte" }
{ "index": { "_id": 3 }}
{ "text": "Weißkopfseeadler" }
{ "index": { "_id": 4 }}
{ "text": "Weltgesundheitsorganisation" }
{ "index": { "_id": 5 }}
{ "text": "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz" }

测试查询：

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "text": "Adler"
        }
    }
}

查询结果：

"hits": [
  {
    "_index": "my_index",
    "_type": "my_type",
    "_id": "3",
    "_score": 0.56437,
    "_source": {
      "text": "Weißkopfseeadler"
    }
  }
]