bq: Is this expected behavior where it returns only a subset of the
documents it has found?

No. But there is _so_ much you're leaving out here that it's totally
impossible to say much.

bq: I've indexed a lot of documents (*.docx & *.vsd).

How? Tika? ExtractingRequestHandler? Some custom code? Which fields
from these docs are mapped to which fields in Solr? How are those fields
analyzed?

bq:   "q":"NS Finance 9.2",

This parses as default_search_field:NS Finance 9.2, or perhaps it goes
against edismax and is searched across multiple fields. Or.... Add
&debug=query to see how it is actually parsed. And the text won't be
found at all if it is a title that is mapped to a different field (and
not put into a "bag of words" by a copyField directive).
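
To make the &debug=query suggestion concrete, here's a minimal Python
sketch that just builds the request URL. The host, port and collection
name ("mycollection") are placeholders I made up, substitute your own.
Fetch the URL it prints and look at "parsedquery" in the "debug" section
of the response.

```python
from urllib.parse import urlencode

# Hypothetical host and collection name; substitute your own.
SOLR_SELECT = "http://localhost:8983/solr/mycollection/select"


def debug_query_url(q, base=SOLR_SELECT):
    """Build a select URL that asks Solr to report how q was parsed."""
    params = {
        "q": q,
        "fl": "id,date",
        "debug": "query",  # adds a "debug" section to the response
        "wt": "json",
    }
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    # Paste the printed URL into a browser or curl.
    print(debug_query_url("NS Finance 9.2"))
```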

If much of that is gibberish, you have a sense of how impossible it is
to say much without knowing a lot about your setup.

My point is that you cannot say "I know the text is in there" and
expect anything, really. You have to be able to say "I know the text is
going into field X. Field X is defined as fieldType Y. My query is
parsed as Z" to know whether these docs should be found.

And that presupposes you're even able to predict that the text you
"know" is in the document is being extracted. PDF extraction, for
instance (I know you're not indexing PDFs, just sayin'), can be tuned to
consider how much space is between letters to try to squash them
together, so depending on the settings 'e r i c k' could either be 5
individual letters or one 5-letter word. And it would change if there
were a little more space between the letters......
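
A toy illustration of that spacing effect, in Python. This is NOT how
Tika or any PDF library actually does it; the function name, the glyph
coordinates and the max_gap setting are all invented for the example. It
only shows how one threshold decides whether the same letters come out
as one token or five.

```python
def group_glyphs(glyphs, max_gap):
    """Join glyphs into words; start a new word when the gap exceeds max_gap.

    glyphs is a list of (char, x_start, x_end) tuples sorted left to right.
    This is a made-up model of letter spacing, not a real extractor.
    """
    words = []
    current = ""
    prev_end = None
    for ch, start, end in glyphs:
        if prev_end is not None and start - prev_end > max_gap:
            words.append(current)
            current = ""
        current += ch
        prev_end = end
    if current:
        words.append(current)
    return words


# Five letters, each separated by a gap of 2 units (invented numbers).
GLYPHS = [("e", 0, 5), ("r", 7, 12), ("i", 14, 16), ("c", 18, 23), ("k", 25, 30)]

if __name__ == "__main__":
    print(group_glyphs(GLYPHS, max_gap=3))  # gaps tolerated: one word
    print(group_glyphs(GLYPHS, max_gap=1))  # gaps too wide: five letters
```

A search for erick matches the first result and not the second, even
though the document on screen looks the same either way.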

Here's a sample Solr program that uses Tika to extract text from
documents. It might help you figure out what's actually happening if
you're using ExtractingRequestHandler to ingest data.

Best,
Erick

On Tue, Oct 17, 2017 at 4:53 PM, Phillip Wu <phillip...@unsw.edu.au> wrote:
> Hi,
> I've indexed a lot of documents (*.docx & *.vsd).
>
> When I run a query from the website it returns only a small proportion of the 
> data in the index:
> {
> "responseHeader":{
> "status":0,
> "QTime":66,
> "params":{
>    "q":"NS Finance 9.2",
>    "fl":"id,date",
>    "start":"0",
>    "_":"1508193512223"}},
> "response":{"numFound":2053,"start":0,"docs":[
> ..here it returns only 9 documents of type *.doc
> ]
>
> I know the search text occurs in some of the *.vsd files so I re-run:
> {
> "responseHeader":{
> "status":0,
> "QTime":754,
> "params":{
> "q":"\"NS Finance 9.2\" id:*FIN*.vsd",
> "fl":"id,date", "_":"1508193512223"}},
> "response":{"numFound":9,"start":0,"docs":[
> ..here it returns only 9 documents of *.vsd
> ]
>
> Is this expected behavior where it returns only a subset of the documents it 
> has found?
>
> I want all the documents that contain the query string.
> How do I tell Solr to return ALL documents containing the string?
