Solr in Action
Ebook · 1,234 pages · 12 hours


About this ebook

Summary

Solr in Action is a comprehensive guide to implementing scalable search using Apache Solr. This clearly written book walks you through well-documented examples ranging from basic keyword searching to scaling a system for billions of documents and queries. It will give you a deep understanding of how to implement core Solr capabilities.

About the Book

Whether you're handling big (or small) data, managing documents, or building a website, it is important to be able to quickly search through your content and discover meaning in it. Apache Solr is your tool: a ready-to-deploy, Lucene-based, open source, full-text search engine. Solr can scale across many servers to enable real-time queries and data analytics across billions of documents.

Solr in Action teaches you to implement scalable search using Apache Solr. This easy-to-read guide balances conceptual discussions with practical examples to show you how to implement all of Solr's core capabilities. You'll master topics like text analysis, faceted search, hit highlighting, result grouping, query suggestions, multilingual search, advanced geospatial and data operations, and relevancy tuning.

This book assumes basic knowledge of Java and standard database technology. No prior knowledge of Solr or Lucene is required.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

What's Inside
  • How to scale Solr for big data
  • Rich real-world examples
  • Solr as a NoSQL data store
  • Advanced multilingual, data, and relevancy tricks
  • Coverage of versions through Solr 4.7

About the Authors

Trey Grainger is a director of engineering at CareerBuilder. Timothy Potter is a senior member of the engineering team at LucidWorks. The authors work on the scalability and reliability of Solr, as well as on recommendation engine and big data analytics technologies.

Table of Contents
    PART 1 MEET SOLR
  1. Introduction to Solr
  2. Getting to know Solr
  3. Key Solr concepts
  4. Configuring Solr
  5. Indexing
  6. Text analysis
    PART 2 CORE SOLR CAPABILITIES
  7. Performing queries and handling results
  8. Faceted search
  9. Hit highlighting
  10. Query suggestions
  11. Result grouping/field collapsing
  12. Taking Solr to production
    PART 3 TAKING SOLR TO THE NEXT LEVEL
  13. SolrCloud
  14. Multilingual search
  15. Complex query operations
  16. Mastering relevancy
Language: English
Publisher: Manning
Release date: Mar 25, 2014
ISBN: 9781638351238
Author

Timothy Potter

Timothy Potter is an architect on the Big Data team at Dachis Group, where he focuses on large-scale machine learning, text mining, and social network analysis. Tim has worked extensively with Lucene and Solr technologies and has been a speaker at Lucene Revolution. He is a contributing author to Taming Text (Manning 2012) and holds several US Patents related to J2EE-based enterprise application integration. He blogs at thelabdude.blogspot.com.


    Book preview


    Copyright

    For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

        Special Sales Department

        Manning Publications Co.

        20 Baldwin Road

        PO Box 261

        Shelter Island, NY 11964

        Email: orders@manning.com

    ©2014 by Manning Publications Co. All rights reserved.

    No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.


    Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

    Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

    ISBN: 9781617291029

    Printed in the United States of America

    1 2 3 4 5 6 7 8 9 10 – MAL – 19 18 17 16 15 14

    Brief Table of Contents

    Copyright

    Brief Table of Contents

    Table of Contents

    Foreword

    Preface

    Acknowledgments

    About this Book

    1. Meet Solr

    Chapter 1. Introduction to Solr

    Chapter 2. Getting to know Solr

    Chapter 3. Key Solr concepts

    Chapter 4. Configuring Solr

    Chapter 5. Indexing

    Chapter 6. Text analysis

    2. Core Solr capabilities

    Chapter 7. Performing queries and handling results

    Chapter 8. Faceted search

    Chapter 9. Hit highlighting

    Chapter 10. Query suggestions

    Chapter 11. Result grouping/field collapsing

    Chapter 12. Taking Solr to production

    3. Taking Solr to the next level

    Chapter 13. SolrCloud

    Chapter 14. Multilingual search

    Chapter 15. Complex query operations

    Chapter 16. Mastering relevancy

    Appendix A. Working with the Solr codebase

    Appendix B. Language-specific field type configurations

    Appendix C. Useful data import configurations

    Index

    List of Figures

    List of Tables

    List of Listings

    Table of Contents

    Copyright

    Brief Table of Contents

    Table of Contents

    Foreword

    Preface

    Acknowledgments

    About this Book

    1. Meet Solr

    Chapter 1. Introduction to Solr

    1.1. Why do I need a search engine?

    1.1.1. Managing text-centric data

    1.1.2. Common search-engine use cases

    1.2. What is Solr?

    1.2.1. Information retrieval engine

    1.2.2. Flexible schema management

    1.2.3. Java web application

    1.2.4. Multiple indexes in one server

    1.2.5. Extendable (plugins)

    1.2.6. Scalable

    1.2.7. Fault-tolerant

    1.3. Why Solr?

    1.3.1. Solr for the software architect

    1.3.2. Solr for the system administrator

    1.3.3. Solr for the CEO

    1.4. Features overview

    1.4.1. User-experience features

    1.4.2. Data-modeling features

    1.4.3. New features in Solr 4

    1.5. Summary

    Chapter 2. Getting to know Solr

    2.1. Getting started

    2.1.1. Installing Solr

    2.1.2. Starting the Solr example server

    2.1.3. Understanding Solr home

    2.1.4. Indexing the example documents

    2.2. Searching is what it’s all about

    2.2.1. Exploring Solr’s query form

    2.2.2. What comes back from Solr when you search

    2.2.3. Ranked retrieval

    2.2.4. Paging and sorting

    2.2.5. Expanded search features

    2.3. Tour of the Solr administration console

    2.4. Adapting the example to your needs

    2.5. Summary

    Chapter 3. Key Solr concepts

    3.1. Searching, matching, and finding content

    3.1.1. What is a document?

    3.1.2. The fundamental search problem

    3.1.3. The inverted index

    3.1.4. Terms, phrases, and Boolean logic

    3.1.5. Finding sets of documents

    3.1.6. Phrase queries and term positions

    3.1.7. Fuzzy matching

    3.1.8. Quick recap

    3.2. Relevancy

    3.2.1. Default similarity

    3.2.2. Term frequency

    3.2.3. Inverse document frequency

    3.2.4. Boosting

    3.2.5. Normalization factors

    3.3. Precision and Recall

    3.3.1. Precision

    3.3.2. Recall

    3.3.3. Striking the right balance

    3.4. Searching at scale

    3.4.1. The denormalized document

    3.4.2. Distributed searching

    3.4.3. Clusters vs. servers

    3.4.4. The limits of Solr

    3.5. Summary

    Chapter 4. Configuring Solr

    4.1. Overview of solrconfig.xml

    4.1.1. Common XML data-structure and type elements

    4.1.2. Applying configuration changes

    4.1.3. Miscellaneous settings

    4.2. Query request handling

    4.2.1. Request-handling overview

    4.2.2. Search handler

    4.2.3. Browse request handler for Solritas: an example

    4.2.4. Extending query processing with search components

    4.3. Managing searchers

    4.3.1. New searcher overview

    4.3.2. Warming a new searcher

    4.4. Cache management

    4.4.1. Cache fundamentals

    4.4.2. Filter cache

    4.4.3. Query result cache

    4.4.4. Document cache

    4.4.5. Field value cache

    4.5. Remaining configuration options

    4.6. Summary

    Chapter 5. Indexing

    5.1. Example microblog search application

    5.1.1. Representing content for searching

    5.1.2. Overview of the Solr indexing process

    5.2. Designing your schema

    5.2.1. Document granularity

    5.2.2. Unique key

    5.2.3. Indexed fields

    5.2.4. Stored fields

    5.2.5. Preview of schema.xml

    5.3. Defining fields in schema.xml

    5.3.1. Required field attributes

    5.3.2. Multivalued fields

    5.3.3. Dynamic fields

    5.3.4. Copy fields

    5.3.5. Unique key field

    5.4. Field types for structured nontext fields

    5.4.1. String fields

    5.4.2. Date fields

    5.4.3. Numeric fields

    5.4.4. Advanced field type attributes

    5.5. Sending documents to Solr for indexing

    5.5.1. Indexing documents using XML or JSON

    5.5.2. Using the SolrJ client library to add documents from Java

    5.5.3. Other tools for importing documents into Solr

    5.6. Update handler

    5.6.1. Committing documents to the index

    5.6.2. Transaction log

    5.6.3. Atomic updates

    5.7. Index management

    5.7.1. Index storage

    5.7.2. Segment merging

    5.8. Summary

    Chapter 6. Text analysis

    6.1. Analyzing microblog text

    6.2. Basic text analysis

    6.2.1. Analyzer

    6.2.2. Tokenizer

    6.2.3. Token filter

    6.2.4. StandardTokenizer

    6.2.5. Removing stop words with StopFilterFactory

    6.2.6. LowerCaseFilterFactory—lowercase letters in terms

    6.2.7. Testing your analysis with Solr’s analysis form

    6.3. Defining a custom field type for microblog text

    6.3.1. Collapsing repeated letters with PatternReplaceCharFilterFactory

    6.3.2. Preserving hashtags, mentions, and hyphenated terms

    6.3.3. Removing diacritical marks using ASCIIFoldingFilterFactory

    6.3.4. Stemming with KStemFilterFactory

    6.3.5. Injecting synonyms at query time with SynonymFilterFactory

    6.3.6. Putting it all together

    6.4. Advanced text analysis

    6.4.1. Advanced field attributes

    6.4.2. Per-language text analysis

    6.4.3. Extending text analysis using a Solr plugin

    6.5. Summary

    2. Core Solr capabilities

    Chapter 7. Performing queries and handling results

    7.1. The anatomy of a Solr request

    7.1.1. Request handlers

    7.1.2. Search components

    7.1.3. Query parsers

    7.2. Working with query parsers

    7.2.1. Specifying a query parser

    7.2.2. Local params

    7.3. Queries and filters

    7.3.1. The fq and q parameters

    7.3.2. Handling expensive filters

    7.4. The default query parser (Lucene query parser)

    7.4.1. Lucene query parser syntax

    7.5. Handling user queries (eDisMax query parser)

    7.5.1. eDisMax query parser overview

    7.5.2. eDisMax query parameters

    7.5.3. Searching across multiple fields

    7.5.4. Boosting queries and phrases

    7.5.5. Field aliasing

    7.5.6. User-accessible fields

    7.5.7. Minimum match

    7.5.8. eDisMax benefits and drawbacks

    7.6. Other useful query parsers

    7.6.1. Field query parser

    7.6.2. Term and Raw query parsers

    7.6.3. Function and Function Range query parsers

    7.6.4. Nested queries and the Nested query parser

    7.6.5. Boost query parser

    7.6.6. Prefix query parser

    7.6.7. Spatial query parsers

    7.6.8. Join query parser

    7.6.9. Switch query parser

    7.6.10. Surround query parser

    7.6.11. Max Score query parser

    7.6.12. Collapsing query parser

    7.7. Returning results

    7.7.1. Choosing a response format

    7.7.2. Choosing fields to return

    7.7.3. Paging through results

    7.8. Sorting results

    7.8.1. Sorting by fields

    7.8.2. Sorting by functions

    7.8.3. Fuzzy sorting

    7.9. Debugging query results

    7.9.1. Returning debug information

    7.10. Summary

    Chapter 8. Faceted search

    8.1. Navigating your content at a glance

    8.2. Setting up test data

    8.3. Field faceting

    8.4. Query faceting

    8.5. Range faceting

    8.6. Filtering upon faceted values

    8.6.1. Applying filters to your facets

    8.6.2. Safely filtering on faceted values

    8.7. Multiselect faceting, keys, and tags

    8.7.1. Keys

    8.7.2. Tags, excludes, and multiselect faceting

    8.8. Beyond the basics

    8.9. Summary

    Chapter 9. Hit highlighting

    9.1. Overview of hit highlighting

    9.2. How highlighting works

    9.2.1. Set up a new Solr core for UFO sightings

    9.2.2. Preprocess UFO sightings before indexing

    9.2.3. Exploring the UFO sightings dataset

    9.2.4. Hit highlighting out of the box

    9.2.5. Nuts and bolts

    9.2.6. Refining highlighter results

    9.3. Improving performance using FastVectorHighlighter

    9.4. PostingsHighlighter

    9.5. Summary

    Chapter 10. Query suggestions

    10.1. Spell-check

    10.1.1. Indexing Wikipedia articles

    10.1.2. Spell-check example

    10.1.3. Spell-check search component

    10.2. Autosuggesting query terms

    10.2.1. Autosuggest request handler

    10.2.2. Autosuggest search component

    10.3. Suggesting document field values

    10.3.1. Using n-grams for suggestions

    10.3.2. N-gram-driven request handler

    10.4. Suggesting queries based on user activity

    Schema design

    Find most popular query

    Boosting more recent popularity

    10.5. Summary

    Chapter 11. Result grouping/field collapsing

    11.1. Result grouping vs. field collapsing

    11.2. Skipping duplicate documents

    11.3. Returning multiple documents per group

    11.4. Grouping by functions and queries

    11.4.1. Grouping by function

    11.4.2. Grouping by query

    11.5. Paging and sorting grouped results

    11.6. Grouping gotchas

    11.6.1. Faceting upon result groups

    11.6.2. Distributed result grouping

    11.6.3. Returning a flat list

    11.6.4. Grouping on multivalued and tokenized fields

    11.6.5. Grouping performance

    11.7. Efficient field collapsing with the Collapsing query parser

    11.8. Summary

    Chapter 12. Taking Solr to production

    12.1. Developing a Solr distribution

    12.2. Deploying Solr

    12.2.1. Building your Solr distribution

    12.2.2. Embedded Solr

    12.3. Hardware and server configuration

    12.3.1. RAM and SSDs

    12.3.2. JVM settings

    12.3.3. The index shuffle

    12.3.4. Useful system tricks

    12.4. Data acquisition strategies

    Update Formats, Indexing Time, and Batching

    Data Import Handler

    Extracting text from files with Solr Cell

    12.5. Sharding and replication

    12.5.1. Choosing to shard

    12.5.2. Choosing to replicate

    12.6. Solr core management

    Defining cores

    Creating cores through the Core Admin API

    Reloading cores

    Renaming and swapping cores

    Unloading and deleting cores

    Splitting and merging indexes

    Getting the status of cores

    12.7. Managing clusters of servers

    12.7.1. Load balancers and Solr health check

    12.7.2. Generic vs. customized configuration

    12.8. Querying and interacting with Solr

    12.8.1. REST API

    12.8.2. Available Solr client libraries

    12.8.3. Using SolrJ from Java

    12.9. Monitoring Solr’s performance

    12.9.1. Solr’s Plugins / Stats page

    12.9.2. Solr cache performance

    12.9.3. Pulling stats from request handlers and MBeans

    12.9.4. External monitoring options

    12.9.5. Solr logs

    12.9.6. Load testing

    12.10. Upgrading between Solr versions

    12.11. Summary

    3. Taking Solr to the next level

    Chapter 13. SolrCloud

    13.1. Getting started with SolrCloud

    13.1.1. Starting Solr in cloud mode

    13.1.2. Motivation behind the SolrCloud architecture

    13.2. Core concepts

    13.2.1. Collections vs. cores

    13.2.2. ZooKeeper

    13.2.3. Choosing the number of shards and replicas

    13.2.4. Cluster-state management

    13.2.5. Shard-leader election

    13.2.6. Important SolrCloud configuration settings

    13.3. Distributed indexing

    13.3.1. Document shard assignment

    13.3.2. Adding documents

    13.3.3. Near real-time search

    13.3.4. Node recovery process

    13.4. Distributed search

    13.4.1. Multistage query process

    13.4.2. Distributed search limitations

    13.5. Collections API

    13.5.1. Create a collection

    13.5.2. Collection aliasing

    13.6. Basic system-administration tasks

    13.6.1. Configuration updates

    13.6.2. Rolling restart

    13.6.3. Restarting a failed node

    13.6.4. Is node X active?

    13.6.5. Adding a replica

    13.6.6. Offsite backup

    13.7. Advanced topics

    13.7.1. Custom hashing

    13.7.2. Shard splitting

    13.8. Summary

    Chapter 14. Multilingual search

    14.1. Why linguistic analysis matters

    14.2. Stemming vs. lemmatization

    14.3. Stemming in action

    14.4. Handling edge cases

    14.4.1. KeywordMarkerFilterFactory

    14.4.2. StemmerOverrideFilterFactory

    14.5. Available language libraries in Solr

    14.5.1. Language-specific analyzer chains

    14.5.2. Dictionary-based stemming (Hunspell)

    14.6. Searching content in multiple languages

    14.6.1. Separate field per language

    14.6.2. Separate index per language

    14.6.3. Multiple languages in one field

    14.6.4. Creating a field type to handle multiple languages per field

    14.7. Language identification

    14.7.1. Update processors for language identification

    14.7.2. Dynamically assigning detected language analyzers within a field

    14.8. Summary

    Chapter 15. Complex query operations

    15.1. Function queries

    15.1.1. Function syntax

    15.1.2. Searching on functions

    15.1.3. Returning functions like fields

    15.1.4. Sorting on functions

    15.1.5. Available functions in Solr

    15.1.6. Implementing a custom function

    15.2. Geospatial search

    15.2.1. Searching near a single point

    15.2.2. Advanced geospatial search

    15.3. Pivot faceting

    Pivot-faceting limitations

    Future improvements to pivot faceting

    15.4. Referencing external data

    Using Solr’s ExternalFileField

    15.5. Cross-document and cross-index joins

    Cross-document joins

    Cross-core joins

    15.6. Big data analytics with Solr

    15.7. Summary

    Chapter 16. Mastering relevancy

    16.1. The impact of relevancy tuning

    16.2. Debugging the relevancy calculation

    16.3. Relevancy boosting

    16.3.1. Per-field boosting

    16.3.2. Per-term boosting

    16.3.3. Payload boosting

    16.3.4. Function boosting

    16.3.5. Term-proximity boosting

    16.3.6. Elevating the relevancy of important documents

    16.4. Pluggable Similarity class implementations

    16.5. Personalized search and recommendations

    16.5.1. Search vs. recommendations

    16.5.2. Attribute-based matching

    16.5.3. Hierarchical matching

    16.5.4. More Like This

    16.5.5. Concept-based matching

    16.5.6. Geographical matching

    16.5.7. Collaborative filtering

    16.5.8. Hybrid approaches

    16.6. Creating a personalized search experience

    16.7. Running relevancy experiments

    16.8. Summary

    Appendix A. Working with the Solr codebase

    A.1. Pulling the right version of Solr

    A.2. Setting up Solr in your IDE

    Importing Lucene/Solr into Eclipse

    Importing Lucene/Solr into IntelliJ IDEA

    A.3. Debugging Solr code

    Attaching your IDE to a running Solr instance

    A.4. Downloading and applying Solr patches

    A.5. Contributing patches

    Appendix B. Language-specific field type configurations

    Appendix C. Useful data import configurations

    C.1. Indexing Wikipedia

    C.2. Indexing Stack Exchange

    Index

    List of Figures

    List of Tables

    List of Listings

    Foreword

    Solr has had a long and successful history, but a major new chapter began recently with the advent of Solr 4 and SolrCloud. This is the perfect time for Solr in Action. With clear examples, enlightening diagrams, and coverage from key concepts through the newest features, Solr in Action will have you successfully using Solr in no time!

    Solr was born out of necessity in 2004, at CNET Networks (now CBS Interactive), to replace a commercial search engine being discontinued by the vendor. Even though I had no formal search background when I started writing Solr, it felt like a very natural fit, because I have always enjoyed making software go fast. I viewed Solr more as an alternate type of datastore designed around an inverted index than as a full-text search engine, and that has helped Solr extend beyond the legacy enterprise search market.

    By the end of 2005, Solr was powering the search and faceted navigation of a number of CNET sites, and soon it was made open source. Solr was contributed to the Apache Software Foundation in January 2006 and became a subproject of the Lucene PMC (with Lucene Java as its sibling). There had always been a large degree of overlap with Lucene (the core full-text search library used by Solr) committers, and in 2010 the projects were merged. Separate Lucene and Solr downloads would still be available, but they would be developed by a single unified team. Solr’s version number jumped to match that of Lucene, and the releases have since been synchronized.

    The recent Solr 4 release is a major milestone, adding SolrCloud—the set of highly scalable features including distributed indexing with no single points of failure. The NoSQL feature set was also expanded to include transaction logs, update durability, optimistic concurrency, and atomic updates. Solr in Action, written by longtime Solr power users and community members, Trey and Timothy, covers these important recent Solr features and provides an excellent starting point for those new to Solr.

    Solr is now used in more places than I could ever have imagined—from integrated library systems to e-commerce platforms, analytics and business intelligence products, content-management systems, internet searches, and more. It’s been rewarding to see Solr grow from a few early adopters to a huge global community of helpful users and active volunteers cooperatively pushing development forward.

    Solr in Action gives you the knowledge and techniques you need to use Solr’s features that have been under development since 2004. With Solr in Action in hand, you too are now well equipped to join the global community and help take Solr to new heights!

    YONIK SEELEY

    CREATOR OF SOLR

    Preface

    In 2008, I was asked to take over leadership of CareerBuilder’s search technology team. We were using the Microsoft FAST search platform at the time, but realized that search was too important to the success of our business for us to continue relying on a commercial vendor instead of developing the domain expertise internally. I immediately began investigating open source alternatives such as Solr, which seemed to provide most of the key features needed for our products. By the summer of 2009, we decided that we were ready to bring our search expertise in-house and convert our systems to Solr.

    The timing was great. Lucene, the open source search library upon which Solr is built, had become a full top-level Apache project in February 2005, and Solr, which had been contributed to the Apache Software Foundation in 2006, had become a top-level Apache project in January of 2007. Both technologies were reaching critical mass and would soon be merged (in March 2010) into a unified project.

    By the summer of 2010, our entire platform was converted to Solr. In the process, we increased the speed of our searches, significantly reduced the number of servers necessary to support our search infrastructure, dropped expensive licensing fees, increased platform stability, and in-sourced much of the search expertise for which we had previously been dependent on a commercial vendor.

    Little did we know at that time how much additional value we would gain by bringing search in-house. We have been able to build entirely new suites of search-based products—from traditional keyword and semantic search, to big data analytics products, to real-time recommendation engines—utilizing Solr as a scalable search architecture to handle billions of documents and millions of queries an hour across hundreds of servers. We have entered the era of cloud services, elastic scalability, and an explosion of data that we strive to make meaningful for society, and with Solr we are able to tackle each of these challenges head-on.

    When Manning approached me about writing Solr in Action, I was hesitant because I knew it would be a large undertaking. My one requirement was that I needed a strong coauthor, and that is exactly what I found in Timothy Potter. Tim also has years of experience developing search-based solutions with Lucene and Solr. He has a wealth of expertise building text analysis systems for social data and architecting real-time analytics solutions using Solr and other cutting-edge big data technologies. With both of us having received so much help from the Solr community over the years and with such a clear need for an example-driven guide to Solr, Tim and I are excited to be able to provide Solr in Action to help the next generation of search engineers. It’s the book we wish we’d had five years ago when we started with Solr, and we hope that you find it to be useful, whether you are just getting introduced to Solr or are looking to take your knowledge to the next level.

    TREY GRAINGER

    Acknowledgments

    Much like Solr, this book would not have been possible without the support of a large community of dedicated people:

    Lucene/Solr committers who not only write amazing code but also provide invaluable expertise and advice, all the while demonstrating patience with new members of the community

    Active Lucene/Solr community members who contribute code, update the wiki and other documentation, and answer questions on the Lucene and Solr mailing lists

    Yonik Seeley, original creator of Solr, who contributed the foreword to our book

    Our Manning Early Access Program (MEAP) readers who posted comments in the Author Online forum

    The reviewers who provided valuable feedback throughout the development process: Alexandre Madurell, Ammar Alrashed, Brandon Harper, Chris Nauroth, Craig Smith, Edward Welker, Gregor Zurowski, John Viviano, Leo Cassarani, Robert Petersen, Scott Anthony, Sopan Shewale, and Uma Maheshwar Rao Gunuganti

    Ivan Todorović and John Guthrie who provided a detailed technical proofread of the manuscript shortly before it went into production

    Our Manning editors, Elizabeth Lexleigh, Susan Conant, Melinda Rankin, Elizabeth Martin, and Janet Vail

    Bert Bates at Manning for helping us improve the instructional quality of our writing

    Family and friends who supported us through the many hours of research and writing

    Trey Grainger

    First and foremost, I would like to thank my amazing wife, Lindsay, for her support and patience during the many long days and nights it took to write this book. Without her understanding and help throughout the journey, this book would have never been possible (especially with the birth of our daughter midway through the project).

    I would also like to thank Paula and Steven Woolf for the countless hours they spent watching Melodie so that I could push this project to completion. Finally, I would like to thank the team at CareerBuilder—both the company leadership and my Search team—for giving me the opportunity to work with such great people and to build a cutting-edge search platform that benefits society in such a clear way.

    Timothy Potter

    I would like to thank Sharon Russom, my mother, for instilling a love of learning and books early in my childhood, and David Potter, my father, for all of his support throughout college and my career. This book would not have been possible without the help of Lori Joy. Thank you for your support and for being understanding during the late evenings and missed weekends, and for being a sounding board early in the writing process.

    I also thank my former team at the Dachis Group. I could not have done this without their insightful questions about Solr and their giving me the opportunity to build a large-scale search solution using Solr.

    About this Book

    Whether handling big data, building cloud-based services, or developing multitenant web applications, it’s vital to have a fast, reliable search solution. Apache Solr is a scalable and ready-to-deploy open source full-text search engine powered by Lucene. It offers key features like multilingual keyword searching, faceted search, intelligent matching, content clustering, and relevancy weighting right out of the box.

    Solr in Action is the definitive guide to implementing fast and scalable search using Apache Solr. It uses well-documented examples ranging from basic keyword searching to scaling a system for billions of documents and queries. With this book, you’ll gain a deep understanding of how to implement core Solr capabilities such as faceted navigation through search results, matched snippet highlighting, field collapsing and search results grouping, spell-checking, query autocomplete, querying by functions, and more. You’ll also see how to take Solr to the next level, with deep coverage of large-scale production use cases, sophisticated multilingual search, complex query operations, and advanced relevancy tuning strategies.

    Roadmap

    Solr in Action is divided into three parts: "Meet Solr," "Core Solr capabilities," and "Taking Solr to the next level." If you are new to Solr and to search in general, we strongly recommend that you read the chapters in part 1 in order, as many of the concepts presented in these chapters build on each other.

    The concepts covered in part 2 were chosen because they are common features of most search applications. You can safely skip any chapter in part 2 that may not apply to your current needs. For example, result grouping is a common feature in many search engines, but if your data doesn’t require grouping, then you can safely skip chapter 11.

    The four chapters (13–16) in part 3 are the most challenging as they introduce advanced topics, including multilingual search, running Solr in a large-scale cluster environment, advanced data operations, and relevancy tuning.

    Most of the chapters use hands-on activities to help you work through the material. Our goal for each example was that it be easy to use but cover the chapter topic thoroughly. In many examples, we used data from real-world datasets so that you would get exposure to working with realistic use cases.

    Chapter 1 introduces the type of data and use cases Solr was designed to handle. You’ll learn about the kinds of problems you can solve with Solr and gain an overview of its key features. Solr 4 is a significant milestone for the Lucene/Solr project, so even if you’re an expert on previous versions of Solr, we encourage you to read chapter 1 to get a sense for all the new and exciting features in Solr 4.

    Chapter 2 shows how to install and run Solr on your local workstation. After starting Solr, we demonstrate how to index and query a set of example documents that ship with Solr. We also take a brief tour of Solr’s web-based administration console.

    Chapter 3 introduces general search theory and how Solr implements that theory in practice. Most interestingly, this chapter covers the inverted search index and how relevancy scoring works to present the most relevant documents at the top of search results. Even if you have worked with Solr in the past, we recommend reading this chapter to refresh your understanding of the fundamental operations in a search engine.

    Chapter 4 shows the basics of Solr’s configuration, primarily focused on Solr’s main configuration file: solrconfig.xml. Our aim in this chapter is to introduce the most important configuration settings for Solr, particularly those that impact how Solr processes requests from client applications. The knowledge you gain in this chapter will be applied throughout the rest of the book.

    Chapter 5 teaches how Solr indexes documents, starting with a discussion of another important configuration file: schema.xml. You’ll learn how to define fields to represent structured data like numbers, dates, prices, and unique identifiers. We also cover how update requests are processed and configured using solrconfig.xml.

    Chapter 6 builds on the material in chapter 5 by showing how to index text fields using text analysis. Solr was designed to efficiently search and rank documents requiring full-text search. Text analysis is an important part of the search process because it normalizes the linguistic variations between indexed text and queries so that they can still match.

    At this point in the book, you’ll have a solid foundation and will be ready to put Solr to work on your own search needs. As your knowledge of search and Solr grows, so too will your need to go beyond basic keyword searching and implement common search features such as advanced query parsing, hit highlighting, spell-checking, autosuggest, faceting, and result grouping.

    In chapter 7, we cover how to construct queries and how they are executed. You’ll learn about Solr’s many query parsers, as well as how to sort, format, return, and debug search results.

    In chapter 8, you’ll learn about one of the most powerful and popular features of Solr—faceting. Solr’s faceting provides tools to refine search criteria and helps users discover more information by categorizing search results into subgroups.

    Chapter 9 explains how to highlight query terms in search results in order to improve the user experience with your search solution.

    In chapter 10, we cover spell-checking and autosuggestions. Solr’s autosuggest features allow a user to start typing a few characters and receive a list of suggested queries as they type.

    Chapter 11 explores Solr’s result grouping and field collapsing support to help you return an optimal mix of search results when your index includes many similar documents, such as multiple locations of the same restaurant in a city.

    Chapter 12 helps you prepare to deploy Solr in a production environment. This chapter will help you plan your hardware and resource needs, as well as whether you need to consider sharding and replication to handle a large number of documents and query requests.

    Chapter 13 covers a set of distributed features known as SolrCloud. You’ll learn how to run Solr in cloud mode so that you can scale your search application to support a large volume of users and documents. You’ll come away from this chapter having a solid understanding of how Solr achieves scalability and fault tolerance by distributing indexes across multiple servers.

    Chapter 14 builds upon the text analysis concepts covered in chapter 6 by teaching you how to handle multilingual text in your search engine. If you need to work with non-English text or support multiple languages in the same index, this chapter is a must-read.

    Chapter 15 explores advanced query features, including function queries, geospatial search, multilevel faceting, and cross-document and cross-index joins.

    In chapter 16, you’ll learn techniques for improving the relevancy of your results, such as boosting, scoring based upon functions, alternate similarity algorithms, and debugging relevancy scores. In addition, we provide an in-depth discussion of using Solr for personalized search and recommendations.

    There are three appendixes, which cover a number of subtopics from earlier chapters in greater depth. Appendix A focuses on working with the Solr codebase and how you can create your own custom Solr distribution if you need features or bug fixes not available in an official release. This is an extension of some of the material from the beginning of chapter 12.

    Appendix B lists, in table format, out-of-the-box configurations for many of the languages Solr supports. This material is an extended version of the language configurations covered in chapter 14.

    Appendix C highlights the Data Import Handler (DIH) in more detail (extending coverage from chapters 10 and 12), demonstrating the steps necessary for importing a number of large, publicly available datasets.

    How to use this book

    Solr in Action is designed to be accessible for any software engineer—no previous experience working with search engines is assumed. The topics covered rise in expertise level throughout the book, and even the most seasoned Solr professionals are likely to learn something from the last few chapters. The scope of the book is massive—coming in at over 600 pages—but the engaging and practical real-world examples and careful balance between theory and practice make the book a real asset to anyone using Solr, whether you are just getting started or have years of experience.

    As mentioned above, the chapters in part 1 provide the foundation upon which the rest of the book will be built, and they will be critical for anyone new to Solr. These chapters should be read in sequence to give you the best overview of Solr and search in general. If you are new to Solr, chapter 2 will show you how to start and use Solr for the first time, and chapter 3 will provide the key search theory that the rest of the book builds upon. Configuring your Solr server and setting up field types to properly analyze your content round out the search topics needed to understand Solr’s fundamentals.

    Many of the chapters in part 2 can be skipped if your work does not include the features discussed. In particular, chapters 9, 10, and 11 are largely standalone topics that are not important for understanding later chapters, so you can skip them if you are not planning on implementing hit highlighting, query suggestions, or result grouping/field collapsing any time soon. Chapters 7 and 8 cover some of the most commonly used features of many search applications, so you will want to at least skim through them before putting the book away.

    The remaining chapters cover some of the advanced topics surrounding Solr. Tough challenges will be tackled, including scaling a cluster of servers, multilingual search, complex query operations, and advanced relevancy techniques. While all chapters in parts 2 and 3 build on part 1, chapter 13 (SolrCloud) additionally builds on chapter 12 (Taking Solr to production), chapter 15 (Complex query operations) builds on chapters 7 (Performing queries and handling results) and 8 (Faceted search), and chapter 16 (Mastering relevancy) further builds on chapter 15. In order to get the most benefit out of the book, be mindful not to skip any earlier chapters that provide the necessary background for your understanding of these more advanced topics.

    Many of the chapters include executable examples that you can run as you read along. These examples demonstrate new topics and provide you with the opportunity for hands-on exploration of Solr’s capabilities—often through just hitting a running Solr server from your web browser. While you do not have to run all of the examples and can simply use them as reference configurations in many cases, running the examples will provide you with hands-on experience that may help some of the more challenging topics sink in.

    Whether you plan to work your way through the whole book—going from first-time Solr user to Solr expert—is up to you. If not, you can always refer to the book over time as your interest and need for more advanced Solr capabilities continue to grow.

    Code conventions and downloads

    Java code, configuration snippets, executable commands, contents of files, and server requests/responses (subsequently referred to as source code) in this book are in a fixed-width font, which sets them apart from the surrounding text. In many listings, the source code is annotated to point out the key concepts. In some cases, source code is in bold fixed-width font for emphasis. We have tried to format the source code so it fits within the available page space in the book by adding line breaks and using indentation carefully. Sometimes, however, very long lines include line-continuation markers.

    Throughout the book you will find references to files that are included with Solr or with the examples that come with the book. File names will typically be in italics, except when they are referenced within source code, where they will still use a fixed-width font.

    Source code examples appear throughout this book, with longer listings appearing under clear listing headers and shorter listings appearing between lines of text. Source code for all the working examples in the book is available for download from the publisher’s website at www.manning.com/SolrinAction or www.manning.com/grainger.

    A README.txt file is provided in the root folder of the accompanying source code, providing details on how to compile and run the examples. We chose to use Java as the development language for this book because it is the language used within the Lucene/Solr project, and we thought it would be easiest for readers to deal with one, consistent programming language.

    After you download Solr in chapter 2, we will refer to the folder in which you installed Solr as $SOLR_INSTALL in the rest of the book. Similarly, we will refer to the folder into which you download and extract the source code accompanying this book as $SOLR_IN_ACTION. Wherever you see either of these, you should substitute the actual folder name on your system.

    Author Online

    Purchase of Solr in Action includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the authors and from other users. To access the forum and subscribe to it, point your browser to www.manning.com/SolrinAction or www.manning.com/grainger. The page provides information on how to get on the forum once you’re registered, what kind of help is available, and the rules of conduct on the forum.

    Manning’s commitment to our readers is to provide a venue where a meaningful dialog between individual readers and between readers and the authors can take place. It’s not a commitment to any specific amount of participation on the part of the authors, whose contribution to the forum remains voluntary (and unpaid). We suggest you ask the authors challenging questions lest their interest stray!

    The Author Online forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.

    About the cover illustration

    The figure on the cover of Solr in Action is captioned A Gothscheer woman, or a woman from a Gothic tribe. The Goths were a northern people that came from Scandinavia to Europe 2000 years ago, and originally settled around the Baltic Sea. They played an important role in the fall of the Roman Empire and the emergence of Medieval Europe. They eventually separated into two branches, with the Visigoths becoming federates of the Romans and then moving west to France and Spain, and the Ostrogoths moving to northern Italy, the Balkans, and as far east as the Black Sea. Over time, their language and culture disappeared as they assimilated in the regions where they had settled.

    This illustration is taken from a recent reprint of Balthasar Hacquet’s Images and Descriptions of Southwestern and Eastern Wenda, Illyrians, and Slavs published by the Ethnographic Museum in Split, Croatia, in 2008. Hacquet (1739–1815) was an Austrian physician and scientist who spent many years studying the botany, geology, and ethnography of many parts of the Austrian Empire, as well as the Veneto, the Julian Alps, and the western Balkans, inhabited in the past by peoples of many different tribes and nationalities. Hand-drawn illustrations accompany the many scientific papers and books that Hacquet published.

    The rich diversity of the drawings in Hacquet’s publications speaks vividly of the uniqueness and individuality of Alpine and Balkan regions just 200 years ago. This was a time when the dress codes of two villages separated by a few miles identified people uniquely as belonging to one or the other, and when members of an ethnic tribe, social class, or trade could be easily distinguished by what they were wearing. Dress codes have changed since then and the diversity by region, so rich at the time, has faded away. It is now often hard to tell the inhabitant of one continent from another, and today’s inhabitants of the towns and villages on the shores of the Baltic or Mediterranean or Black Seas are not readily distinguishable from residents of other parts of Europe.

    We at Manning celebrate the inventiveness, the initiative, and the fun of the computer business with book covers based on costumes from two centuries ago brought back to life by illustrations such as this one.

    Part 1. Meet Solr

    Our primary focus in these first six chapters will be to explore Solr’s two most important functions: indexing data and executing queries. After reading part 1, you should have a solid understanding of Solr’s query and indexing capabilities, including how to perform analysis of text and other data types, and how to execute searches across that data.

    As with every new subject, first we must start with the basics—learning how to install Solr and run it locally.

    If you are new to the full-text search space, some of the terminology may be unfamiliar, so consider chapter 3 a dictionary of sorts. What are the key differentiators between a search engine and a database? What is an inverted index? What is relevancy ranking and how does Solr implement it?

    With the basics out of the way, starting with chapter 4, we begin looking under the hood of the Solr engine to see how requests are executed and to get an idea of the configuration settings that govern request processing. The main configuration file in Solr, solrconfig.xml, contains numerous settings, some of which (such as cache management settings) are useful when just starting out, while others are intended for advanced users.

    A search engine is not very interesting until it has some documents indexed. In chapters 5 and 6, we focus on how documents get indexed, covering document schema design, field types, and text analysis. Understanding these core aspects of indexing will help you throughout the rest of the book.

    Chapter 1. Introduction to Solr

    This chapter covers

    Characteristics of data handled by search engines

    Common search engine use cases

    Key components of Solr

    Reasons to choose Solr

    Feature overview

    With fast-growing technologies such as social media, cloud computing, mobile applications, and big data, these are exciting, and challenging, times to be in computing. One of the main challenges facing software architects is handling the massive volume of data consumed and produced by a huge, global user base. In addition, users expect online applications to always be available and responsive. To address the scalability and availability needs of modern web applications, we’ve seen a growing interest in specialized, nonrelational data storage and processing technologies, collectively known as NoSQL (Not only SQL). These systems share a common design pattern of matching storage and processing engines to specific types of data rather than forcing all data into the once-standard relational model. In other words, NoSQL technologies are optimized to solve a specific class of problems for specific types of data. The need to scale has led to hybrid architectures composed of a variety of NoSQL and relational databases; gone are the days of the one-size-fits-all data-processing solution.

    This book is about Apache Solr, a specific NoSQL technology. Solr, just as its nonrelational brethren, is optimized for a unique class of problems. Specifically, Solr is a scalable, ready-to-deploy enterprise search engine that’s optimized to search large volumes of text-centric data and return results sorted by relevance. That was a bit of a mouthful, so let’s break that statement down into its basic parts:

    Scalable: Solr scales by distributing work (indexing and query processing) to multiple servers in a cluster.

    Ready to deploy: Solr is open source, is easy to install and configure, and provides a preconfigured example to help you get started.

    Optimized for search: Solr is fast and can execute complex queries at subsecond speed, often in only tens of milliseconds.

    Large volumes of documents: Solr is designed to deal with indexes containing many millions of documents.

    Text-centric: Solr is optimized for searching natural-language text, like emails, web pages, resumes, PDF documents, and social messages such as tweets or blogs.

    Results sorted by relevance: Solr returns documents in ranked order based on how relevant each document is to the user’s query.

    In this book, you’ll learn how to use Solr to design and implement scalable search solutions. You’ll begin by learning about the types of data and use cases Solr supports. This will help you understand where Solr fits into the big picture of modern application architectures and which problems Solr is designed to solve.

    1.1. Why do I need a search engine?

    Because you’re looking at this book, we suspect that you already have an idea about why you need a search engine. Rather than speculate on why you’re considering Solr, we’ll get right down to the hard questions you need to answer about your data and use cases in order to decide if a search engine is right for you. In the end, it comes down to understanding your data and users and picking a technology that works for both. Let’s start by looking at the properties of data that a search engine is optimized to handle.

    1.1.1. Managing text-centric data

    A hallmark of modern application architectures is matching the storage and processing engine to your data. If you’re a programmer, you know to select the best data structure based on how you use the data in an algorithm; that is, you don’t use a linked list when you need fast random lookups. The same principle applies with search engines. Search engines like Solr are optimized to handle data exhibiting four main characteristics:

    Text-centric

    Read-dominant

    Document-oriented

    Flexible schema

    A possible fifth characteristic is having a large volume of data to deal with—that is, big data—but our focus is on what makes a search engine special among other NoSQL technologies. It goes without saying that Solr can deal with large volumes of data.

    Although these are the four main characteristics of data that search engines like Solr handle efficiently, you should think of them as rough guidelines, not strict rules. Let’s dig into each to see why they’re important for search. For now, we’ll focus on the high-level concepts; we’ll get into the how in later chapters.

    Text-centric

    You’ll undoubtedly encounter the term unstructured used to describe the type of data that’s handled by a search engine. We think unstructured is a little ambiguous because any text document based on human language has implicit structure. You can think of unstructured as being from the perspective of a computer, which sees text as a stream of characters. The character stream must be parsed using language-specific rules to extract the structure and make it searchable, which is exactly what search engines do.
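    To make that parsing idea concrete, the following toy class (our own sketch, not Lucene’s actual implementation, with hypothetical names throughout) breaks a character stream into terms and records which documents contain each term, which is the essence of an inverted index:

```java
import java.util.*;

// Toy inverted index: maps each term to the IDs of the documents containing
// it. This is a teaching sketch only; Lucene's real index also stores term
// frequencies, positions, and other statistics used for relevancy scoring.
public class TinyInvertedIndex {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Lowercase and split on non-letters: a crude stand-in for text analysis.
    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("[^a-z]+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // A single-term query matches the documents listed under that term.
    public Set<Integer> search(String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.add(1, "Solr is a search engine");
        idx.add(2, "A database is not a search engine");
        System.out.println(idx.search("search"));   // [1, 2]
        System.out.println(idx.search("database")); // [2]
    }
}
```

    Chapter 3 explains how the real inverted index works, and chapter 6 covers Solr’s actual text analysis pipeline.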

    We think text-centric is more appropriate for describing the type of data Solr handles, because a search engine is specifically designed to extract the implicit structure of text into its index to improve searching. Text-centric data implies that the text of a document contains information that users are interested in finding. Of course, a search engine also supports nontext data such as dates and numbers, but its primary strength is handling text data based on natural language.

    The centric part is important because if users aren’t interested in the information in the text, a search engine may not be the best solution for your problem. Consider an application in which employees create travel expense reports. Each report contains a number of structured data fields such as date, expense type, currency, and amount. In addition, each expense may include a notes field in which employees can provide a brief description of the expense. This would be an example of data that contains text but isn’t text-centric, in that it’s unlikely that the accounting department needs to search the notes field when generating monthly expense reports. Just because data contains text fields doesn’t mean that data is a natural fit for a search engine.

    Think about whether your data is text-centric. The main consideration is whether or not the text fields in your data contain information that users will want to query. If yes, then a search engine is probably a good choice. You’ll see how to unlock the structure in text by using Solr’s text analysis capabilities in chapters 5 and 6.

    Read-dominant

    Another key aspect of data that search engines handle effectively is that data is read-dominant and therefore intended to be accessed efficiently, as opposed to updated frequently. Let’s be clear that Solr does allow you to update existing documents in your index. Think of read-dominant as meaning that documents are read far more often than they’re created or updated. But don’t take this to mean that you can’t write a lot of data or that you have limits on how frequently you can write new data. In fact, one of the key features in Solr 4 is near real-time (NRT) search, which allows you to index thousands of documents per second and have them be searchable almost immediately.

    The key point behind read-dominant data is that when you write data to Solr, it’s intended to be read and reread myriad times over its lifetime. Think of a search engine as being optimized for executing queries (a read operation), for example, as opposed to storing data (a write operation). Also, if you must update existing data in a search engine often, that could be an indication that a search engine might not be the best solution for your needs. Another NoSQL technology, like Cassandra, might be a better choice when you need fast random writes to existing data.

    Document-oriented

    Until now, we’ve talked about data, but in reality, search engines work with documents. In a search engine, a document is a self-contained collection of fields, in which each field only holds data and doesn’t contain nested fields. In other words, a document in a search engine like Solr has a flat structure and doesn’t depend on other documents. The flat concept is slightly relaxed in Solr, in that a field can have multiple values, but fields don’t contain subfields. You can store multiple values in a single field, but you can’t nest fields inside of other fields.

    The flat, document-oriented approach in Solr works well with data that’s already in document format, such as a web page, blog, or PDF document, but what about modeling normalized data stored in a relational database? In this case, you need to denormalize data spread across multiple tables into a flat, self-contained document structure. We’ll learn how to approach problems like this in chapter 3.
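    As a rough illustration, here is a minimal sketch of such denormalization. It is our own example, not a Solr API; the field names are hypothetical, and it simply collapses a one-to-many relationship into a multivalued field on a flat document:

```java
import java.util.*;

// A sketch of denormalizing relational rows into flat, self-contained
// documents. A field may hold multiple values (e.g. "feature"), but fields
// never nest inside other fields.
public class Denormalize {

    public static List<Map<String, Object>> flatten(
            Map<Integer, String> homes,             // homeId -> address
            Map<Integer, List<String>> features) {  // homeId -> child rows
        List<Map<String, Object>> docs = new ArrayList<>();
        for (Map.Entry<Integer, String> home : homes.entrySet()) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("id", home.getKey());
            doc.put("address", home.getValue());
            // Child rows collapse into a multivalued field on the parent doc.
            doc.put("feature", features.getOrDefault(home.getKey(), List.of()));
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        Map<Integer, String> homes = Map.of(1, "12 Oak St");
        Map<Integer, List<String>> features = Map.of(1, List.of("garage", "pool"));
        System.out.println(flatten(homes, features));
    }
}
```

    Each resulting document stands on its own, which is exactly what a flat, document-oriented index expects.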

    You also want to consider which fields in your documents must be stored in Solr and which should be stored in another system, such as a database. A search engine isn’t the place to store data unless it’s useful for search or displaying results; for example, if you have a search index for online videos, you don’t want to store the binary video files in Solr. Rather, large binary fields should be stored in another system, such as a content-distribution network (CDN). In general, you should store the minimal set of information for each document needed to satisfy search requirements. This is a clear example of not treating Solr as a general data-storage technology; Solr’s job is to find videos of interest, not to manage large binary files.

    Flexible schema

    The last main characteristic of search-engine data is that it has a flexible schema. This means that documents in a search index don’t need to have a uniform structure. In a relational database, every row in a table has the same structure. In Solr, documents can have different fields. Of course, there should be some overlap between the fields in documents in the same index, but they don’t have to be identical.

    Imagine a search application for finding homes for rent or sale. Listings will obviously share fields like location, number of bedrooms, and number of bathrooms, but they’ll also have different fields based on the listing type. A home for sale would have fields for listing price and annual property taxes, whereas a home for rent would have a field for monthly rent and pet policy.

    To summarize, search engines in general and Solr in particular are optimized to handle data having four specific characteristics: text-centric, read-dominant, document-oriented, and flexible schema. Overall, this implies that Solr is not a general-purpose data-storage and processing technology.

    The whole point of having such a variety of options for storing and processing data is that you don’t have to find a one-size-fits-all technology. Search engines are good at certain things and quite horrible at others. This means, in most cases, you’re going to find that Solr complements relational and NoSQL databases more than it replaces them.

    Now that we’ve talked about the type of data Solr is optimized to handle, let’s think about the primary use cases a search engine like Solr is designed for. These use cases are intended to help you understand how a search engine is different than other data-processing technologies.

    1.1.2. Common search-engine use cases

    In this section, we look at things you can do with a search engine like Solr. As with our discussion of the types of data in section 1.1.1, use these as guidelines, not as strict rules. Before we get into specifics, we should remind you to keep in mind that the bar for excellence in search is high. Modern users are accustomed to web search engines like Google and Bing being fast and effective at serving modern web-information needs. Moreover, most popular websites have powerful search solutions to help people find information quickly. When you’re evaluating a search engine like Solr and designing your search solution, make sure you put user experience as a high priority.

    Basic keyword search

    It’s almost too obvious to point out that a search engine supports keyword search, as that’s its main purpose, but it’s worth mentioning, because keyword search is the most typical way users will begin working with your search solution. It would be rare for a user to want to fill out a complex search form initially. Given that basic keyword search will be the most common way users will interact with your search engine, it stands to reason that this feature must provide a great user experience.

    In general, users want to type in a few simple keywords and get back great results. This may sound like a simple task of matching query terms to documents, but consider a few of the issues that must be addressed to provide a great user experience:

    Relevant results must be returned quickly, within a second or less in most cases.

    Spelling correction is needed in case the user misspells some of the query terms.

    Autosuggestions save keystrokes, particularly for mobile applications.

    Synonyms of query terms must be recognized.

    Documents containing linguistic variations of query terms must be matched.

    Phrase handling is needed; that is, does the user want documents matching all of the words or any of the words in a phrase?

    Queries with common words like a, an, of, and the must be handled properly.

    The user must have a way to see more results if the top results aren’t satisfactory.

    As you can see, a number of issues exist that make a seemingly basic feature hard to implement without a specialized approach. But with a search engine like Solr, these features come out of the box and are easy to implement. Once you give users a powerful tool to execute keyword searches, you need to consider how to display the results. This brings us to our next use case: ranking results based on their relevance to the user’s query.

    Ranked retrieval

    A search engine stands alone as a way to return top documents for a query. In an SQL query to a relational database, a row either matches a query or it doesn’t, and results are sorted based on one or more of the columns. A search engine returns documents sorted in descending order by a score that indicates the strength of the match of the document to the query. How the strength of the match is calculated depends on a number of factors, but in general a higher score means the document is more relevant to the query.

    Ranking documents by relevancy is important for a couple of reasons:

    Modern search engines typically store a large volume of documents, often millions or billions of documents. Without ranking documents by relevance to the query, users can become overloaded with results with no clear way to navigate them.

    Users are more comfortable with and accustomed to getting results from other search engines using only a few keywords. Users are impatient and expect the search engine to do what I mean, not what I say. This is true of search solutions backing mobile applications in which users on the go will enter short queries with potential misspellings and expect it to simply work.

    To influence ranking, you can assign more weight to, or boost, certain documents, fields, or specific terms. You can boost results by their age to help push newer documents toward the top of search results. You’ll learn about ranking documents in chapter 3.
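    To illustrate just these two ideas (a descending sort by score, plus a per-document boost), here is a toy ranker. It is our own sketch with hypothetical names and is nothing like Lucene’s actual scoring model, which chapter 3 introduces:

```java
import java.util.*;

// Toy relevance ranking: score = term frequency * document boost, with
// results sorted by score descending. Real Lucene scoring (TF-IDF/BM25,
// field norms, and more) is far more sophisticated than this sketch.
public class ToyRanker {

    public static double score(String docText, String term, double boost) {
        int tf = 0;
        for (String word : docText.toLowerCase().split("[^a-z]+")) {
            if (word.equals(term)) tf++;
        }
        return tf * boost;
    }

    // Returns document texts ordered most-relevant-first for a single term.
    public static List<String> rank(Map<String, Double> docBoosts, String term) {
        List<String> docs = new ArrayList<>(docBoosts.keySet());
        docs.sort(Comparator.comparingDouble(
                (String d) -> score(d, term, docBoosts.get(d))).reversed());
        return docs;
    }

    public static void main(String[] args) {
        Map<String, Double> docs = new LinkedHashMap<>();
        docs.put("solr search", 1.0);
        docs.put("search search search", 1.0);
        docs.put("search", 3.0);  // boosted document rises despite low tf
        System.out.println(rank(docs, "search"));
    }
}
```

    Note how the boost lets an otherwise weaker match compete with documents that mention the term more often.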

    Beyond keyword search

    With a search engine like Solr, users can type in a few keywords and get back results. For many users, though, this is only the first step in a more interactive session in which the search results give them the ability to keep exploring. One of the primary use cases of a search engine is to drive an information-discovery session. Frequently, your users won’t know exactly what they’re looking for and typically don’t have any idea what information is contained in your system. A good search engine helps users narrow in on their information needs.

    The central idea here is to return documents from an initial query, as well as tools to help users refine their search. In other words, in addition to returning matching documents, you also return tools that give your users an idea of what to do next. You can, for example, categorize search results using document features to allow users to narrow down their results. This is known as faceted search, and it’s one of the main strengths of Solr. You’ll see an example of a faceted search for real estate in section 1.2. Facets are covered in depth in chapter 8.

    Don’t use a search engine to ...

    Let’s consider a few use cases in which a search engine wouldn’t be useful. First, search engines are designed to return a small set of documents per query, usually 10 to 100. More documents for the same query can be retrieved using Solr’s built-in paging support. Consider a query that matches a million documents: if you request all of them at once, you should be prepared to wait a long time. The query itself will likely execute quickly, but reconstructing a million documents from the underlying index will be extremely slow, because engines like Solr store fields on disk in a format that makes it easy to materialize a few documents per request but expensive to reconstruct a large number of them.
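    Paging in Solr is expressed with the standard `start` and `rows` parameters. The helper below is a minimal sketch of how a client might compute them for a given page number; the query text is only an example.

    ```python
    from urllib.parse import urlencode

    def page_params(query, page, rows=10):
        """Build Solr paging parameters for a given (1-based) page number."""
        return urlencode({
            "q": query,
            "start": (page - 1) * rows,  # offset of the first result to return
            "rows": rows,                # number of results per page
        })

    print(page_params("fireplace", page=3))
    ```

    Note that offset-based paging itself becomes costly very deep into a result set; Solr 4.7 added a cursor-based mechanism (`cursorMark`) for efficiently walking large result sets.
    
    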

    Another use case in which you shouldn’t use a search engine is deep analytic tasks that require access to a large subset of the index (unless you have a lot of memory). Even if you avoid the previous issue by paging through results, the underlying data structure of a search index isn’t designed for retrieving large portions of the index at once.

    We’ve touched on this previously, but we’ll reiterate that search engines aren’t the place for querying across relationships between documents. Solr does support querying using a parent-child relationship, but doesn’t provide support for navigating complex relational structures as is possible with SQL. In chapter 3, you’ll learn techniques to adapt relational data to work with Solr’s flat document structure.

    Also, most search engines offer no direct support for document-level security, and Solr is no exception. If you need fine-grained permissions on documents, you’ll have to handle that outside of the search engine.

    Now that we’ve seen the types of data and use cases for which a search engine is the right (or wrong) solution, it’s time to dig into what Solr does and how it does it on a high level. In the next section, you’ll learn what capabilities Solr provides and how it approaches important software-design principles such as integration with external systems, scalability, and high availability.

    1.2. What is Solr?

    In this section, we introduce the key components of Solr by designing a search application from the ground up. This will help you understand what specific features Solr provides and the motivation for their existence. But before we get into the specifics of what Solr is, let’s make sure you know what Solr isn’t.

    Solr isn’t a web search engine like Google or Bing.

    Solr has nothing to do with search engine optimization (SEO) for a website.

    Now imagine we need to design a real estate search web application for potential homebuyers. The central use case for this application will be searching for homes for sale using a web browser. Figure 1.1 depicts a screenshot from this fictitious web application. Don’t focus too much on the layout or design of the UI; it’s only a mock-up to give visual context. What’s important is the type of experience that Solr can support.

    Figure 1.1. Mock-up screenshot of a fictitious search application to depict Solr features

    Let’s tour the screenshot in figure 1.1 to illustrate some of Solr’s key features. Starting at the top-left corner and working clockwise, Solr provides powerful features to support a keyword search box. As we discussed in section 1.1.2, providing a great user experience with basic keyword search requires complex infrastructure that Solr provides out of the box. Specifically, Solr provides spell-checking ("did you mean ...?"), auto-suggestions as the user types, synonym handling, phrase queries, and text-analysis tools to deal with linguistic variations in query terms, such as buying a house versus purchasing a home.
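    For a flavor of how spell-checking is requested, the sketch below builds parameters that enable Solr's spell-check component on a deliberately misspelled query. This assumes the request handler has spell-checking configured, which is a setup detail not shown here.

    ```python
    from urllib.parse import urlencode

    # Sketch: asking Solr's spell-check component for corrections.
    # Assumes the /select handler is configured with a spellcheck component.
    params = {
        "q": "purchse a home",         # misspelled on purpose
        "spellcheck": "true",
        "spellcheck.collate": "true",  # request a corrected, re-runnable query
    }
    spellcheck_qs = urlencode(params)
    print(spellcheck_qs)
    ```

    With `spellcheck.collate=true`, Solr's response includes a corrected whole-query suggestion (for example, "purchase a home") that the application can offer as a "did you mean" link.
    
    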

    Solr also provides a powerful solution for implementing geospatial queries. In figure 1.1, matching home listings are displayed on a map based on their distance from the latitude/longitude of the center of our fictitious neighborhood. With Solr’s geospatial support, you can sort documents by geo distance, limit documents to those within a particular geo distance, or even return the geo distance per document from any location. It’s also important that geospatial searches are fast and efficient, to support a UI that allows users to zoom in and out and move around on a map.
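    A geospatial query of this kind can be expressed with Solr's built-in `geofilt` query parser and `geodist` function. The sketch below builds the parameters for "homes within 5 km of a point, nearest first"; the field name `location` and the coordinates are assumptions for illustration.

    ```python
    from urllib.parse import urlencode

    # Sketch: geospatial filter and sort (assumed "location" field; pt is lat,lon).
    params = {
        "q": "*:*",
        "fq": "{!geofilt sfield=location pt=35.78,-78.64 d=5}",  # within 5 km
        "sort": "geodist(location,35.78,-78.64) asc",            # nearest first
        "fl": "id,address",
    }
    geo_qs = urlencode(params)
    print(geo_qs)
    ```

    The same `geodist` function can also be returned per document in the field list, which is how a UI like figure 1.1 can display each listing's distance from the neighborhood center.
    
    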

    Once the user performs a query, the results can be further categorized using Solr’s faceting support to show features of the documents in the result set. Facets are a way to categorize the documents in a result set in order to drive discovery and query refinement. In figure 1.1, search results are categorized into facets for features, home style, and listing type.
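    Faceting is driven by a handful of request parameters. The sketch below shows how the three facets from figure 1.1 might be requested; the field names (`features`, `home_style`, `listing_type`) are assumptions matching the mock-up, not a schema from the book.

    ```python
    from urllib.parse import urlencode

    # Sketch: requesting facets for the real-estate example (assumed field names).
    # A list of pairs is used because facet.field repeats once per facet.
    params = [
        ("q", "victorian"),
        ("facet", "true"),
        ("facet.field", "features"),
        ("facet.field", "home_style"),
        ("facet.field", "listing_type"),
        ("facet.mincount", "1"),  # hide facet values with zero matching docs
    ]
    facet_qs = urlencode(params)
    print(facet_qs)
    ```

    Solr returns each facet field with its distinct values and counts alongside the search results, which the UI renders as the clickable refinement links in figure 1.1.
    
    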

    Now that we have
