When you extract text from a PDF with Tika, the result includes additional
metadata fields. This is the document I get after executing the example from
the reference guide at
https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-with-tika.html#trying-out-solr-cell
{
"responseHeader":{
"status":0,
"QTime":0,
"params":{
"q":"id:doc1"
}
},
"response":{
"numFound":1,
"start":0,
"numFoundExact":true,
"docs":[{
"meta":["date","2008-11-13T13:35:51Z","pdf:docinfo:custom:AAPL:Keywords","solr,
word,
pdf","pdf:PDFVersion","1.3","pdf:docinfo:title","solr-word","xmp:CreatorTool","Microsoft
Word","stream_content_type","application/pdf","pdf:hasXFA","false","access_permission:can_print_degraded","true","subject","solr
word","dc:format","application/pdf;
version=1.3","pdf:docinfo:creator_tool","Microsoft
Word","access_permission:fill_in_form","true","stream_name","myfile","pdf:encrypted","false","dc:title","solr-word","modified","2008-11-13T13:35:51Z","cp:subject","solr
word","pdf:docinfo:subject","solr
word","pdf:hasMarkedContent","false","pdf:docinfo:creator","Grant
Ingersoll","meta:author","Grant
Ingersoll","meta:creation-date","2008-11-13T13:35:51Z","stream_source_info","solr-word.pdf","created","2008-11-13T13:35:51Z","access_permission:extract_for_accessibility","true","Creation-Date","2008-11-13T13:35:51Z","Author","Grant
Ingersoll","producer","Mac OS X 10.5.5 Quartz
PDFContext","pdf:docinfo:producer","Mac OS X 10.5.5 Quartz
PDFContext","Keywords","solr, word,
pdf","access_permission:modify_annotations","true","AAPL:Keywords","solr,
word, pdf","dc:creator","Grant
Ingersoll","dcterms:created","2008-11-13T13:35:51Z","Last-Modified","2008-11-13T13:35:51Z","dcterms:modified","2008-11-13T13:35:51Z","Last-Save-Date","2008-11-13T13:35:51Z","pdf:docinfo:keywords","solr,
word,
pdf","pdf:docinfo:modified","2008-11-13T13:35:51Z","meta:save-date","2008-11-13T13:35:51Z","Content-Type","application/pdf","stream_size","21052","X-Parsed-By","org.apache.tika.parser.DefaultParser","X-Parsed-By","org.apache.tika.parser.pdf.PDFParser","creator","Grant
Ingersoll","dc:subject","solr, word,
pdf","access_permission:assemble_document","true","xmpTPg:NPages","1","pdf:hasXMP","false","access_permission:extract_content","true","access_permission:can_print","true","meta:keyword","solr,
word,
pdf","access_permission:can_modify","true","pdf:docinfo:created","2008-11-13T13:35:51Z"],
"div":["page"],
"id":"doc1",
"date":["2008-11-13T13:35:51Z"],
"pdf_docinfo_custom_aapl_keywords":["solr, word, pdf"],
"pdf_pdfversion":[1.3],
"pdf_docinfo_title":["solr-word"],
"xmp_creatortool":["Microsoft Word"],
"stream_content_type":["application/pdf"],
"pdf_hasxfa":[false],
"access_permission_can_print_degraded":[true],
"subject":["solr word"],
"dc_format":["application/pdf; version=1.3"],
"pdf_docinfo_creator_tool":["Microsoft Word"],
"access_permission_fill_in_form":[true],
"stream_name":["myfile"],
"pdf_encrypted":[false],
"dc_title":["solr-word"],
"modified":["2008-11-13T13:35:51Z"],
"cp_subject":["solr word"],
"pdf_docinfo_subject":["solr word"],
"pdf_hasmarkedcontent":[false],
"pdf_docinfo_creator":["Grant Ingersoll"],
"meta_author":["Grant Ingersoll"],
"meta_creation_date":["2008-11-13T13:35:51Z"],
"stream_source_info":["solr-word.pdf"],
"created":["2008-11-13T13:35:51Z"],
"access_permission_extract_for_accessibility":[true],
"creation_date":["2008-11-13T13:35:51Z"],
"author":["Grant Ingersoll"],
"producer":["Mac OS X 10.5.5 Quartz PDFContext"],
"pdf_docinfo_producer":["Mac OS X 10.5.5 Quartz PDFContext"],
"pdf_unmappedunicodecharsperpage":[0],
"keywords":["solr, word, pdf"],
"access_permission_modify_annotations":[true],
"aapl_keywords":["solr, word, pdf"],
"dc_creator":["Grant Ingersoll"],
"dcterms_created":["2008-11-13T13:35:51Z"],
"last_modified":["2008-11-13T13:35:51Z"],
"dcterms_modified":["2008-11-13T13:35:51Z"],
"title":["solr-word"],
"last_save_date":["2008-11-13T13:35:51Z"],
"pdf_docinfo_keywords":["solr, word, pdf"],
"pdf_docinfo_modified":["2008-11-13T13:35:51Z"],
"meta_save_date":["2008-11-13T13:35:51Z"],
"content_type":["application/pdf"],
"stream_size":[21052],
"x_parsed_by":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"],
"creator":["Grant Ingersoll"],
"dc_subject":["solr, word, pdf"],
"access_permission_assemble_document":[true],
"xmptpg_npages":[1],
"pdf_hasxmp":[false],
"pdf_charsperpage":[85],
"access_permission_extract_content":[true],
"access_permission_can_print":[true],
"meta_keyword":["solr, word, pdf"],
"access_permission_can_modify":[true],
"pdf_docinfo_created":["2008-11-13T13:35:51Z"],
"content":[" \n \n \n \n \n \n \n \n \n \n \n \n \n \n
\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
\n \n \n \n solr-word \n \n \n This is a test of PDF and Word
extraction in Solr, it is only a test. Do not panic. \n \n \n "],
"_version_":1800949864184414208
}]
}
}
Some of those fields are read from metadata embedded in the PDF file.
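
One way to keep those metadata fields out of the index is the Solr Cell
request parameter uprefix, which prepends a prefix to every extracted field
that is not defined in the schema. The following is only a minimal sketch of
how such a request URL could be built. It assumes that the schema has a
dynamic field ignored_* that is neither indexed nor stored, and that the
extracted body text should go to a field named _text_. The base URL, the
collection name "mycollection" and the helper build_extract_url are
hypothetical names used for illustration only.

```python
from urllib.parse import urlencode


def build_extract_url(base_url: str, collection: str, doc_id: str,
                      text_field: str = "_text_") -> str:
    """Build a Solr Cell /update/extract URL that diverts unknown fields.

    Assumptions (not from the thread): the schema defines a dynamic field
    ``ignored_*`` that is neither indexed nor stored, and ``text_field``
    is an existing field meant to hold the extracted body text.
    """
    params = {
        "literal.id": doc_id,        # value for the unique key field
        "uprefix": "ignored_",       # prefix for fields missing from the schema
        "fmap.content": text_field,  # map Tika's "content" to a known field
        "commit": "true",            # commit right away (convenient for testing)
    }
    return f"{base_url}/{collection}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr instance and collection name.
    print(build_extract_url("http://localhost:8983/solr", "mycollection", "doc1"))
```

The PDF itself would then be posted to that URL, for example as a multipart
upload with curl as in the reference guide example. Fields that do exist in
the schema are still populated, so this only hides the fields that are
unknown to the schema.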
On Tue, 4 Jun 2024 at 18:15, Walter Underwood <[email protected]> wrote:
> PDFs don’t have fields. PDFs are instructions for a monkey with rubber
> stamps to make a printed page. They have instructions to move to a location
> and put a character there.
>
> As an XML developer friend said, turning a PDF document into structured
> text is like turning hamburger back into a cow.
>
> I dealt with PDF documents in search for over twenty years. You are lucky
> to get searchable text out of them.
>
> wunder
> Walter Underwood
> [email protected]
> http://observer.wunderwood.org/ (my blog)
>
> > On Jun 4, 2024, at 8:28 AM, Uwe Amberger <[email protected]> wrote:
> >
> > Hello!
> >
> > Problem description:
> > I want to index a wide variety of PDFs whose content I have no knowledge
> > of. So I cannot define any fields in advance. Users should be able to
> > search for terms, and every PDF containing these terms should be found.
> >
> > I think that a schemaless schema (which adds unknown fields) is not the
> > way to go:
> > 1. The Apache Solr documentation warns against using a schemaless schema
> > in a production environment.
> > 2. As can be read here:
> > https://solr.apache.org/guide/solr/9_2/indexing-guide/schemaless-mode.html
> > "Once a field has been added to the schema, its field type is fixed."
> > And it cannot be added again with a different field type.
> >
> > Question:
> > When indexing a PDF, is there a way to ignore its unknown fields and
> > still index the PDF?
> >
> > Possible solution:
> > I found the IgnoreFieldUpdateProcessorFactory class, which seems to
> > offer this possibility, but how do I configure it in the solrconfig.xml?
> >
> > Thanks for any help!
>
>