Elasticsearch Server: Second Edition
By Rafał Kuć and Marek Rogoziński
About this ebook
If you are a web developer or a user who wants to learn more about Elasticsearch, then this is the book for you. You do not need to know anything about Elasticsearch, Java, or Apache Lucene in order to use this book, though basic knowledge of databases and queries is required.
Table of Contents
Elasticsearch Server Second Edition
Credits
About the Author
Acknowledgments
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Started with the Elasticsearch Cluster
Full-text searching
The Lucene glossary and architecture
Input data analysis
Indexing and querying
Scoring and query relevance
The basics of Elasticsearch
Key concepts of data architecture
Index
Document
Document type
Mapping
Key concepts of Elasticsearch
Node and cluster
Shard
Replica
Gateway
Indexing and searching
Installing and configuring your cluster
Installing Java
Installing Elasticsearch
Installing Elasticsearch from binary packages on Linux
Installing Elasticsearch using the RPM package
Installing Elasticsearch using the DEB package
The directory layout
Configuring Elasticsearch
Running Elasticsearch
Shutting down Elasticsearch
Running Elasticsearch as a system service
Elasticsearch as a system service on Linux
Elasticsearch as a system service on Windows
Manipulating data with the REST API
Understanding the Elasticsearch RESTful API
Storing data in Elasticsearch
Creating a new document
Automatic identifier creation
Retrieving documents
Updating documents
Deleting documents
Versioning
An example of versioning
Using the version provided by an external system
Searching with the URI request query
Sample data
The URI request
The Elasticsearch query response
Query analysis
URI query string parameters
The query
The default search field
Analyzer
The default operator
Query explanation
The fields returned
Sorting the results
The search timeout
The results window
The search type
Lowercasing the expanded terms
Analyzing the wildcard and prefixes
The Lucene query syntax
Summary
2. Indexing Your Data
Elasticsearch indexing
Shards and replicas
Creating indices
Altering automatic index creation
Settings for a newly created index
Mappings configuration
Type determining mechanism
Disabling field type guessing
Index structure mapping
Type definition
Fields
Core types
Common attributes
String
Number
Boolean
Binary
Date
Multifields
The IP address type
The token_count type
Using analyzers
Out-of-the-box analyzers
Defining your own analyzers
Analyzer fields
Default analyzers
Different similarity models
Setting per-field similarity
Available similarity models
Configuring DFR similarity
Configuring IB similarity
The postings format
Configuring the postings format
Doc values
Configuring the doc values
Doc values formats
Batch indexing to speed up your indexing process
Preparing data for bulk indexing
Indexing the data
Even quicker bulk requests
Extending your index structure with additional internal information
Identifier fields
The _type field
The _all field
The _source field
Exclusion and inclusion
The _index field
The _size field
The _timestamp field
The _ttl field
Introduction to segment merging
Segment merging
The need for segment merging
The merge policy
The merge scheduler
The merge factor
Throttling
Introduction to routing
Default indexing
Default searching
Routing
The routing parameters
Routing fields
Summary
3. Searching Your Data
Querying Elasticsearch
The example data
A simple query
Paging and result size
Returning the version value
Limiting the score
Choosing the fields that we want to return
The partial fields
Using the script fields
Passing parameters to the script fields
Understanding the querying process
Query logic
Search types
Search execution preferences
The Search shards API
Basic queries
The term query
The terms query
The match_all query
The common terms query
The match query
The Boolean match query
The match_phrase query
The match_phrase_prefix query
The multi_match query
The query_string query
Running the query_string query against multiple fields
The simple_query_string query
The identifiers query
The prefix query
The fuzzy_like_this query
The fuzzy_like_this_field query
The fuzzy query
The wildcard query
The more_like_this query
The more_like_this_field query
The range query
The dismax query
The regular expression query
Compound queries
The bool query
The boosting query
The constant_score query
The indices query
Filtering your results
Using filters
Filter types
The range filter
The exists filter
The missing filter
The script filter
The type filter
The limit filter
The identifiers filter
If this is not enough
Combining filters
A word about the bool filter
Named filters
Caching filters
Highlighting
Getting started with highlighting
Field configuration
Under the hood
Configuring HTML tags
Controlling the highlighted fragments
Global and local settings
Require matching
The postings highlighter
Validating your queries
Using the validate API
Sorting data
Default sorting
Selecting fields used for sorting
Specifying the behavior for missing fields
Dynamic criteria
Collation and national characters
Query rewrite
An example of the rewrite process
Query rewrite properties
Summary
4. Extending Your Index Structure
Indexing tree-like structures
Data structure
Analysis
Indexing data that is not flat
Data
Objects
Arrays
Mappings
Final mappings
Sending the mappings to Elasticsearch
To be or not to be dynamic
Using nested objects
Scoring and nested queries
Using the parent-child relationship
Index structure and data indexing
Parent mappings
Child mappings
The parent document
The child documents
Querying
Querying data in the child documents
The top children query
Querying data in the parent documents
The parent-child relationship and filtering
Performance considerations
Modifying your index structure with the update API
The mappings
Adding a new field
Modifying fields
Summary
5. Make Your Search Better
An introduction to Apache Lucene scoring
When a document is matched
Default scoring formula
Relevancy matters
Scripting capabilities of Elasticsearch
Objects available during script execution
MVEL
Using other languages
Using our own script library
Using native code
The factory implementation
Implementing the native script
Installing scripts
Running the script
Searching content in different languages
Handling languages differently
Handling multiple languages
Detecting the language of the documents
Sample document
The mappings
Querying
Queries with the identified language
Queries with unknown languages
Combining queries
Influencing scores with query boosts
The boost
Adding boost to queries
Modifying the score
The constant_score query
The boosting query
The function_score query
The structure of the function query
Deprecated queries
Replacing the custom_boost_factor query
Replacing the custom_score query
Replacing the custom_filters_score query
When does index-time boosting make sense?
Defining field boosting in input data
Defining boosting in mapping
Words with the same meaning
The synonym filter
Synonyms in the mappings
Synonyms stored in the filesystem
Defining synonym rules
Using Apache Solr synonyms
Explicit synonyms
Equivalent synonyms
Expanding synonyms
Using WordNet synonyms
Query- or index-time synonym expansion
Understanding the explain information
Understanding field analysis
Explaining the query
Summary
6. Beyond Full-text Searching
Aggregations
General query structure
Available aggregations
Metric aggregations
Min, max, sum, and avg aggregations
Using scripts
The value_count aggregation
The stats and extended_stats aggregations
Bucketing
The terms aggregation
The range aggregation
The date_range aggregation
IPv4 range aggregation
The missing aggregation
Nested aggregation
The histogram aggregation
The date_histogram aggregation
Time zones
The geo_distance aggregation
The geohash_grid aggregation
Nesting aggregations
Bucket ordering and nested aggregations
Global and subsets
Inclusions and exclusions
Faceting
The document structure
Returned results
Using queries for faceting calculations
Using filters for faceting calculations
Terms faceting
Ranges based faceting
Choosing different fields for an aggregated data calculation
Numerical and date histogram faceting
The date_histogram facet
Computing numerical field statistical data
Computing statistical data for terms
Geographical faceting
Filtering faceting results
Memory considerations
Using suggesters
Available suggester types
Including suggestions
The suggester response
The term suggester
The term suggester configuration options
Additional term suggester options
The phrase suggester
Configuration
The completion suggester
Indexing data
Querying the indexed completion suggester data
Custom weights
Percolator
The index
Percolator preparation
Getting deeper
Getting the number of matching queries
Indexed documents percolation
Handling files
Adding additional information about the file
Geo
Mappings preparation for spatial search
Example data
Sample queries
Distance-based sorting
Bounding box filtering
Limiting the distance
Arbitrary geo shapes
Point
Envelope
Polygon
Multipolygon
An example usage
Storing shapes in the index
The scroll API
Problem definition
Scrolling to the rescue
The terms filter
Terms lookup
The terms lookup query structure
Terms lookup cache settings
Summary
7. Elasticsearch Cluster in Detail
Node discovery
Discovery types
The master node
Configuring the master and data nodes
The master-election configuration
Setting the cluster name
Configuring multicast
Configuring unicast
Ping settings for nodes
The gateway and recovery modules
The gateway
Recovery control
Additional gateway recovery options
Preparing Elasticsearch cluster for high query and indexing throughput
The filter cache
The field data cache and circuit breaker
The circuit breaker
The store
Index buffers and the refresh rate
The index refresh rate
The thread pool configuration
Combining it all together – some general advice
Choosing the right store
The index refresh rate
Tuning the thread pools
Tuning your merge process
The field data cache and breaking the circuit
RAM buffer for indexing
Tuning transaction logging
Things to keep in mind
Templates and dynamic templates
Templates
An example of a template
Storing templates in files
Dynamic templates
The matching pattern
Field definitions
Summary
8. Administrating Your Cluster
The Elasticsearch time machine
Creating a snapshot repository
Creating snapshots
Additional parameters
Restoring a snapshot
Cleaning up – deleting old snapshots
Monitoring your cluster's state and health
The cluster health API
Controlling information details
Additional parameters
The indices stats API
Docs
Store
Indexing, get, and search
Additional information
The status API
The nodes info API
The nodes stats API
The cluster state API
The pending tasks API
The indices segments API
The cat API
Limiting returned information
Controlling cluster rebalancing
Rebalancing
Cluster being ready
The cluster rebalance settings
Controlling when rebalancing will start
Controlling the number of shards being moved between nodes concurrently
Controlling the number of shards initialized concurrently on a single node
Controlling the number of primary shards initialized concurrently on a single node
Controlling types of shards allocation
Controlling the number of concurrent streams on a single node
Controlling the shard and replica allocation
Explicitly controlling allocation
Specifying node parameters
Configuration
Index creation
Excluding nodes from allocation
Requiring node attributes
Using IP addresses for shard allocation
Disk-based shard allocation
Enabling disk-based shard allocation
Configuring disk-based shard allocation
Cluster wide allocation
Number of shards and replicas per node
Moving shards and replicas manually
Moving shards
Canceling shard allocation
Forcing shard allocation
Multiple commands per HTTP request
Warming up
Defining a new warming query
Retrieving the defined warming queries
Deleting a warming query
Disabling the warming up functionality
Choosing queries
Index aliasing and using it to simplify your everyday work
An alias
Creating an alias
Modifying aliases
Combining commands
Retrieving all aliases
Removing aliases
Filtering aliases
Aliases and routing
Elasticsearch plugins
The basics
Installing plugins
Removing plugins
The update settings API
Summary
Index
Elasticsearch Server Second Edition
Copyright © 2014 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: February 2013
Second edition: April 2014
Production Reference: 1170414
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78398-052-9
www.packtpub.com
Cover Image by Kannan PM Palanisamy (<kannan.pmp@gmail.com>)
Credits
Authors
Rafał Kuć
Marek Rogoziński
Reviewers
John Boere
Jettro Coenradie
Clive Holloway
Surendra Mohan
Alberto Paro
Lukáš Vlček
Commissioning Editor
Anthony Alburqueque
Acquisition Editor
Neha Nagwekar
Content Development Editor
Shaon Basu
Technical Editors
Indrajit Das
Menza Mathew
Shali Sasidharan
Copy Editors
Dipti Kapadia
Insiya Morbiwala
Aditya Nair
Adithi Shetty
Project Coordinator
Amey Sawant
Proofreaders
Simran Bhogal
Maria Gould
Bernadette Watkins
Indexer
Priya Subramani
Graphics
Abhinash Sahu
Production Coordinator
Sushma Redkar
Cover Work
Sushma Redkar
About the Author
Rafał Kuć is a born team leader and software developer. He currently works as a consultant and a software engineer at Sematext Group, Inc., where he concentrates on open source technologies such as Apache Lucene and Solr, Elasticsearch, and Hadoop stack. He has more than 12 years of experience in various branches of software, from banking software to e-commerce products. He focuses mainly on Java but is open to every tool and programming language that will make the achievement of his goal easier and faster. Rafał is also one of the founders of the solr.pl site, where he tries to share his knowledge and help people with the problems they face with Solr and Lucene. Also, he has been a speaker at various conferences around the world, such as Lucene Eurocon, Berlin Buzzwords, ApacheCon, and Lucene Revolution.
Rafał began his journey with Lucene in 2002, and it wasn't love at first sight. When he came back to Lucene in late 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then, Solr came along and this was it. He started working with Elasticsearch in the middle of 2010. Currently, Lucene, Solr, Elasticsearch, and information retrieval are his main points of interest.
Rafał is also the author of Apache Solr 3.1 Cookbook, and the update to it, Apache Solr 4 Cookbook. Also, he is the author of the previous edition of this book and Mastering ElasticSearch. All these books have been published by Packt Publishing.
Acknowledgments
The book you are holding in your hands is an update to ElasticSearch Server, published at the beginning of 2013. Since that time, Elasticsearch has changed a lot; there are numerous improvements and massive additions in terms of functionalities, both when it comes to cluster handling and searching. After completing Mastering ElasticSearch, which covered Version 0.90 of this great search server, we decided that Version 1.0 would be a perfect time to release the updated version of our first book about Elasticsearch. Again, just like with the original book, we were not able to cover all the topics in detail. We had to choose what to describe in detail, what to mention, and what to omit in order to have a book not more than 1,000 pages long. Nevertheless, I hope that by reading this book, you'll easily learn about Elasticsearch and the underlying Apache Lucene, and that you will get the desired knowledge easily and quickly.
I would like to thank my family for the support and patience during all those days and evenings when I was sitting in front of a screen instead of being with them.
I would also like to thank all the people I'm working with at Sematext, especially Otis, who took the time to convince me that Sematext is the right company for me.
Finally, I would like to thank all the people involved in creating, developing, and maintaining Elasticsearch and Lucene projects for their work and passion. Without them, this book wouldn't have been written and open source search would be less powerful.
Once again, thank you all!
About the Author
Marek Rogoziński is a software architect and consultant with more than 10 years of experience. He has specialized in solutions based on open source search engines such as Solr and Elasticsearch, and also the software stack for Big Data analytics including Hadoop, HBase, and Twitter Storm.
He is also the cofounder of the solr.pl site, which publishes information and tutorials about Solr and the Lucene library. He is also the co-author of some books published by Packt Publishing.
Currently, he holds the position of the Chief Technology Officer in a new company, designing architecture for a set of products that collect, process, and analyze large streams of input data.
Acknowledgments
This is our third book on Elasticsearch and the second edition of the first book, which was published a little over a year ago. This is quite a short period but this is also the year when Elasticsearch changed. Not more than a year ago, we used Version 0.20; now, Version 1.0.1 has been released. This is not only a number. Elasticsearch is now a well-known, widely used piece of software with built-in commercial support and ecosystem—just look at Logstash, Kibana, or any additional plugins. The functionality of this search server is also constantly growing. There are some new features such as the aggregation framework, which opens new use cases—this is where Elasticsearch shines. This development caused the previous book to get outdated quickly. It was also a great challenge to keep up with these changes. The differences between the beta release candidates and the final version caused us to introduce changes several times during the writing.
Now, it is time to say thank you.
Thanks to all the people involved in creating Elasticsearch, Lucene, and all of the libraries and modules published around these projects or used by these projects.
I would also like to thank the team working on this book. First of all, a thank you to the people who worked on the extermination of all my errors, typos, and ambiguities. Many thanks to all the people who send us remarks or write constructive reviews. I was surprised and encouraged by the fact that someone found our work useful.
Last but not least, thanks to all my friends who withstood me and understood my constant lack of time.
About the Reviewers
John Boere is an engineer with 22 years of experience in geospatial database design and development and 13 years of web development experience. He is the founder of two successful startups and has consulted at many others. He is the founder and CEO of Cliffhanger Solutions Inc., a company that offers a geospatial search engine for the companies that need mapping solutions.
John lives in Arizona with his family and enjoys the outdoors—hiking and biking. He can also solve a Rubik's cube.
Jettro Coenradie likes to try out new stuff. That is why he got his motorcycle driver's license recently. On a motorbike, you tend to explore different routes to get the best experience out of your bike and have fun while doing the things you need to do, such as going from A to B. In the past 15 years, while exploring new technologies, he has tried out new routes to find better and more interesting ways to accomplish his goal. Jettro rides an all-terrain bike; he does not like riding on the same ground over and over again. The same is true for his technical interests; he knows about backend (Elasticsearch, MongoDB, Axon Framework, Spring Data, and Spring Integration), as well as frontend (AngularJS, Sass, and Less), and mobile development (iOS and Sencha Touch).
Clive Holloway is a web application developer based in New York City. Over the past 18 years, he has worked on a variety of backend and frontend projects, focusing mainly on Perl and JavaScript.
He lives with his partner, Christine, and his cat, Blueberry (who would have been called Blackberry except for the intervention of his daughter, Abbey, after she pointed out that they could not name a cat after a phone).
In his spare time, he is involved as a part of Thisoneisonus, an international collective of music fans who work together to produce fan-created live show recordings. You can learn more about him at http://toiou.org.
Surendra Mohan, who has served a few top-notch software organizations in varied roles, is currently a freelance software consultant. He has been working on various cutting-edge technologies such as Drupal, Moodle, Apache Solr, and Elasticsearch for more than 9 years. He also delivers technical talks at various community events such as Drupal Meetups and Drupal Camps. To learn more about him, his write-ups, technical blogs, and more, visit http://www.surendramohan.info/.
He has also authored the titles, Administrating Solr and Apache Solr High Performance, published by Packt Publishing, and there are many more in the pipeline to be published soon. He also contributes technical articles to a number of portals, for instance, sitepoint.com.
Additionally, he has reviewed other technical books, such as Drupal 7 Multi Sites Configuration and Drupal Search Engine Optimization, both by Packt Publishing. He has also reviewed titles on Drupal commerce, Elasticsearch, Drupal-related video tutorials, a title on OpsView, and many more.
I would like to thank my family and friends who supported and encouraged me to complete this book on time with good quality.
Alberto Paro is an engineer, project manager, and software developer. He currently works as a chief technology officer at The Net Planet Europe and as a freelance consultant on software engineering for Big Data and NoSQL solutions. He loves studying emerging solutions and applications, mainly those related to Big Data processing, NoSQL, natural language processing, and neural networks. He started programming in BASIC on a Sinclair Spectrum when he was 8 years old, and over the years he has gained a lot of experience by using different operating systems and applications, and by programming.
In 2000, he graduated with a degree in Computer Science Engineering from Politecnico di Milano with a thesis on designing multiuser and multidevice web applications. He worked as a professor's helper at the university for about one year. Then, having come in contact with The Net Planet company and loving their innovative ideas, he started working on knowledge management solutions and advanced data-mining products.
In his spare time, when he is not playing with his children, he likes working on open source projects. When he was in high school, he started contributing to projects related to the Gnome environment (gtkmm). One of his preferred programming languages was Python, and he wrote one of the first NoSQL backends for Django MongoDB (django-mongodb-engine). In 2010, he started using Elasticsearch to provide search capabilities for some Django e-commerce sites and developed PyES (a pythonic client for Elasticsearch) and the initial part of Elasticsearch MongoDB River. Now, he mainly works on Scala, using the Typesafe Stack and the Apache Spark project.
He is the author of ElasticSearch Cookbook, Packt Publishing, published in December 2013.
I would like to thank my wife and children for their support.
Lukáš Vlček is a professional open source fan. He has been working with Elasticsearch nearly from the day it was released and still enjoys it today. Currently, Lukáš works for Red Hat, where he uses Elasticsearch hand-in-hand with various JBoss Java technologies on a daily basis. He has spoken about Elasticsearch and his work at several conferences around Europe. He is also heavily involved in client-side JavaScript and building frontends for full-text search services.
www.PacktPub.com
Support files, eBooks, discount offers, and more
You might want to visit www.PacktPub.com for support files and downloads related to your book.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
Preface
Welcome to Elasticsearch Server Second Edition. In the second edition of the book, we decided not only to update the content to match the latest version of Elasticsearch but also to add some important sections that we didn't think of while writing the first book. While reading this book, you will be taken on a journey through the wonderful world of full-text search provided by the Elasticsearch server. We will start with a general introduction to Elasticsearch, which covers how to start and run Elasticsearch, what the basic concepts of Elasticsearch are, and how to index and search your data in the most basic way.
This book will also discuss the query language, the so-called Query DSL, that allows you to create complicated queries and filter the returned results. In addition to all this, you'll see how you can use faceting to calculate aggregated data based on the results returned by your queries, and how to use the newly introduced aggregation framework (an analytics engine that allows you to give meaning to your data). We will implement autocomplete functionality together and learn how to use Elasticsearch spatial capabilities and prospective search.
Finally, this book will show you Elasticsearch administration API capabilities with features such as shard placement control and cluster handling.
What this book covers
Chapter 1, Getting Started with the Elasticsearch Cluster, covers what full-text searching, Apache Lucene, and text analysis are, how to run and configure Elasticsearch, and finally, how to index and search your data in the most basic way.
Chapter 2, Indexing Your Data, shows how indexing works, how to prepare an index structure and what data types we are allowed to use, how to speed up indexing, what segments are, how merging works, and what routing is.
Chapter 3, Searching Your Data, introduces the full-text search capabilities of Elasticsearch by discussing how to query, how the querying process works, and what type of basic and compound queries are available. In addition to this, we will learn how to filter our results, use highlighting, and modify the sorting of returned results.
Chapter 4, Extending Your Index Structure, discusses how to index more complex data structures. We will learn how to index tree-like data types, index data with relationships between documents, and modify the structure of an index.
Chapter 5, Make Your Search Better, covers Apache Lucene scoring and how to influence it in Elasticsearch, the scripting capabilities of Elasticsearch, and language analysis.
Chapter 6, Beyond Full-text Searching, shows the details of the aggregation framework functionality, faceting, and how to implement spellchecking and autocomplete using Elasticsearch. In addition to this, readers will learn how to index binary files, work with geospatial data, and efficiently process large datasets.
Chapter 7, Elasticsearch Cluster in Detail, discusses the node discovery mechanism, the recovery and gateway Elasticsearch modules, templates, and cluster preparation for high indexing and querying use cases.
Chapter 8, Administrating Your Cluster, covers the Elasticsearch backup functionality, cluster monitoring, rebalancing, and moving shards. In addition to this, you will learn how to use the warm-up functionality, work with aliases, install plugins, and update cluster settings with the update API.
What you need for this book
This book was written using Elasticsearch server Version 1.0.0, and all the examples and functions should work with it. In addition to this, you'll need a command-line tool that allows you to send HTTP requests, such as cURL, which is available for most operating systems. Please note that all the examples in this book use cURL. If you want to use another tool, please remember to format the requests in a way that the tool of your choice understands.
Some chapters may also require additional software, such as Elasticsearch plugins; this is explicitly mentioned wherever such software is needed.
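For readers who prefer a tool other than cURL, the following is a minimal sketch of how a request such as curl -XGET http://localhost:9200/blog/article/1 might be expressed with Python's standard urllib module. The helper name build_get_request and the BASE_URL constant are illustrative assumptions, not part of the book's examples, and the request is only built here, not sent.

```python
import urllib.request

# Assumed address of a local Elasticsearch node; adjust to your setup.
BASE_URL = "http://localhost:9200"


def build_get_request(index, doc_type, doc_id):
    """Build (but do not send) a GET request for a single document.

    Hypothetical helper, used only for illustration.
    """
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    return urllib.request.Request(url, method="GET")


request = build_get_request("blog", "article", 1)
print(request.get_method(), request.full_url)

# Sending the request requires a node running at BASE_URL:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

The commented-out lines show how the request could be sent once a node is available; any HTTP client that can set the method and URL in the same way should work equally well.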
Who this book is for
If you are a beginner to the world of full-text search and Elasticsearch, this book is for you. You will be guided through the basics of Elasticsearch, and you will learn how to use some of the advanced functionalities.
If you know Elasticsearch and have worked with it, you may find this book interesting as it provides a nice overview of all the functionalities with examples and descriptions.
If you know the Apache Solr search engine, this book can also be used to compare some functionalities of Apache Solr and Elasticsearch. This may help you decide which tool is more appropriate for your use case.
Conventions
In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows:
The postings format is a per-field property, just like type or name.
A block of code is set as follows:
{
  "status" : 200,
  "name" : "es_server",
  "version" : {
    "number" : "1.0.0",
    "build_hash" : "a46900e9c72c0a623d71b54016357d5f94c8ea32",
    "build_timestamp" : "2014-02-12T16:18:34Z",
    "build_snapshot" : false,
    "lucene_version" : "4.6"
  },
  "tagline" : "You Know, for Search"
}
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
{
  "mappings" : {
    "post" : {
      "properties" : {
        "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
        "name" : { "type" : "string", "store" : "yes", "index" : "analyzed",
          "similarity" : "BM25" },
        "contents" : { "type" : "string", "store" : "no", "index" : "analyzed",
          "similarity" : "BM25" }
      }
    }
  }
}
Any command-line input or output is written as follows:
curl -XGET http://localhost:9200/blog/article/1
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to <feedback@packtpub.com>, and mention the book title via the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
Questions
You can contact us at <questions@packtpub.com> if you are having a problem with any aspect of the book, and we will do our best to address it.
Chapter 1. Getting Started with the Elasticsearch Cluster
Welcome to the wonderful world of Elasticsearch, a great full-text search and analytics engine. It doesn't matter if you are new to Elasticsearch and full-text search in general or if you have experience. We hope that by reading this book you'll be able to learn and extend your knowledge of Elasticsearch. As this book is also dedicated to beginners, we decided to start with a short introduction to full-text search in general and, after that, a brief overview of Elasticsearch.
The first thing we need to do with Elasticsearch is install it. With many applications, you start with the installation and configuration and usually forget the importance of those steps. We will try to guide you through these steps so that it becomes easier to remember. In addition to this, we will show you the simplest way to index and retrieve data without getting into too many details. By the end of this chapter, you will have learned the following topics:
Full-text searching
Understanding Apache Lucene
Performing text analysis
Learning the basic concepts of Elasticsearch
Installing and configuring Elasticsearch
Using the Elasticsearch REST API to manipulate data
Searching using basic URI requests
Full-text searching
Back in the days when full-text searching was a term known to a small percentage of engineers, most of us used SQL databases to perform search operations. Of course, that works, at least to some extent. However, as you go deeper and deeper, you start to see the limits of such an approach: lack of scalability, not enough flexibility, and lack of language analysis, to mention just a few (of course, there were additions that introduced full-text searching to SQL databases). These were the reasons why Apache Lucene (http://lucene.apache.org) was created: to provide a library of full-text search capabilities. It is very fast, scalable, and provides analysis capabilities for different languages.
The Lucene glossary and architecture
Before going into the details of the analysis process, we would like to introduce you to the glossary for Apache Lucene and the overall architecture of Apache Lucene. The basic concepts of the mentioned library are as follows:
Document: This is the main data carrier used during indexing and searching, comprising one or more fields that contain the data we put in and get from Lucene.
Field: This is a section of the document that is built of two parts: the name and the value.
Term: This is a unit of search representing a word from the text.
Token: This is an occurrence of a term in the text of the field. It consists of the term text, start and end offsets, and a type.
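To make the glossary concrete, here is a small sketch of how a document with a single title field might be broken into tokens, each carrying the term text, start and end offsets, and a type. This is not Lucene's analysis chain: the regular-expression tokenizer, the lowercasing step, and the "word" type label are simplifying assumptions made only for illustration.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    """One occurrence of a term in the text of a field."""
    term: str        # the term text
    start: int       # start offset in the field value
    end: int         # end offset (exclusive) in the field value
    token_type: str  # hypothetical type label, not a Lucene constant


def tokenize(text):
    """Split text on non-word characters and lowercase each term."""
    return [
        Token(match.group().lower(), match.start(), match.end(), "word")
        for match in re.finditer(r"\w+", text)
    ]


# A document is a set of fields; each field has a name and a value.
document = {"title": "Mastering Elasticsearch"}

tokens = tokenize(document["title"])
for token in tokens:
    print(token)
```

Each printed token corresponds to one occurrence of a term, so the same term appearing twice in a field would produce two tokens with different offsets.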
Apache Lucene writes all the information to a structure called the inverted index. It is a data structure that maps the terms in the index to the documents, and not the other way around as a relational database does in its tables. You can think of an inverted index as a data structure where data is term-oriented rather than document-oriented. Let's see what a simple inverted index looks like. For example, let's assume that we have documents with only the title field to be indexed, and they look as follows:
Elasticsearch Server 1.0 (document 1)
Mastering Elasticsearch (document 2)
Apache Solr 4 Cookbook (document 3)
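The term-oriented mapping for these three titles can be sketched in a few lines of code. This is a simplified illustration, not how Lucene stores its index: terms are obtained by splitting on whitespace with no further analysis, and the function name build_inverted_index is a hypothetical choice.

```python
from collections import defaultdict

# The three example documents, keyed by document number.
titles = {
    1: "Elasticsearch Server 1.0",
    2: "Mastering Elasticsearch",
    3: "Apache Solr 4 Cookbook",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: whitespace splitting stands in for real text analysis.
        for term in text.split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


inverted_index = build_inverted_index(titles)
for term, doc_ids in inverted_index.items():
    print(term, doc_ids)
```

Looking up a term is then a single dictionary access that returns the matching document numbers, which is the property that makes this structure well suited to searching.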
So, the index (in a very simplified way) can be visualized as follows: