File Converters
Use File Converters to extract text from files in different formats and cast it into the unified Document format.
| Position in a Pipeline | At the very beginning of an indexing Pipeline |
| Input | Filename |
| Output | Documents |
| Classes | PDFToTextConverter, DocxToTextConverter, AzureConverter, ImageToTextConverter, MarkdownConverter |
Tutorial: To see an example of file converters in a pipeline, see our advanced indexing tutorial.
Usage
Haystack also has a convert_files_to_dicts() utility function that converts all txt or pdf files in a given directory.
from haystack.utils import convert_files_to_dicts
docs = convert_files_to_dicts(dir_path=doc_dir)  # doc_dir: path to the directory holding the files
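To make the idea of a directory-level conversion concrete, here is a minimal standard-library sketch of what such a utility might do for plain-text files. It is not Haystack's implementation: the function name text_files_to_dicts is hypothetical, and the dictionary layout (a "content" key plus a "meta" dictionary holding the file name) is an assumption made for illustration. PDF handling is omitted because it needs an external parser.

```python
from pathlib import Path
from typing import Dict, List


def text_files_to_dicts(dir_path: str) -> List[Dict]:
    """Hypothetical helper: read every .txt file under dir_path into a dict.

    The {"content": ..., "meta": {"name": ...}} layout is an assumed shape,
    chosen for illustration only.
    """
    docs = []
    # Walk the directory recursively; sorting keeps the output order stable.
    for path in sorted(Path(dir_path).glob("**/*.txt")):
        text = path.read_text(encoding="utf-8")
        docs.append({"content": text, "meta": {"name": path.name}})
    return docs
```

Calling text_files_to_dicts("data/") would return one dictionary per .txt file found under that directory and skip files with any other extension.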