Episode 104. It's all about Apache Tika, the project that lets you index EVERYTHING.

From Java Pub House



Length:
76 minutes
Released:
Apr 19, 2024
Format:
Podcast episode

Description

We continue to have guests on our show to talk to us about interesting things... This time it's about Apache Tika, an incredible tool for search-oriented file processing and metadata extraction. Say you have tons of unstructured files, like emails or documents, and you want to extract, index, and then search them. That is Tika's purpose. And who better to walk us through how it does its magic than its Project Management Committee (PMC) Chair, Tim Allison! So take a listen as we go deeper on ingesting tons of content (which is fundamental for things like training LLMs).

http://www.javapubhouse.com/datadog
We thank DataDogHQ for sponsoring this podcast episode.

Don't forget to SUBSCRIBE to our cool NewsCast OffHeap! http://www.javaoffheap.com/

Apache Tika
* https://tika.apache.org/

OpenSearch Project and OpenSearch Neural Plugin Tutorials
* https://opensearch.org/
* https://opensearch.org/docs/latest/search-plugins/neural-search/
* https://opster.com/guides/opensearch/opensearch-machine-learning/how-to-set-up-vector-search-in-opensearch/
* https://opster.com/guides/opensearch/opensearch-machine-learning/opensearch-hybrid-search/
* https://sease.io/2024/01/opensearch-knn-plugin-tutorial.html
* https://sease.io/2024/04/opensearch-neural-search-tutorial-hybrid-search.html

Selected Advanced File Processing toolkits/services
* https://unstructured.io/
* https://aws.amazon.com/textract/
* https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence

Selected Hybrid Search/RAG toolkits (there are _MANY_ others!)
* Haystack: https://haystack.deepset.ai/
* LangChain: https://www.langchain.com/
* LangStream: https://langstream.ai/

Search/Relevance Conferences
* https://haystackconf.com/
* https://2024.berlinbuzzwords.de/
* https://mices.co/

Tim's personal project
* JavaFX (ahem) tika-config writer UI: https://github.com/tballison/tika-gui-v2

Do you like the episodes? Want more? Help us out! Buy us a beer! https://www.javapubhouse.com/beer
And follow us! https://www.twitter.com/javapubhouse

Titles in the series (100)

This podcast talks about how to program in Java; not your typical System.out.println("Hello world"), but real issues such as O/R setups, threading, getting certain components on the screen, and troubleshooting tips and tricks in general. It is published as a podcast so that you can subscribe to it, take it with you, listen on your way to work (or on your way home), and learn a little bit more (or reinforce what you already knew).