Re: Recap on derived objects in Solr Index, 'schema in a can'

Erick Erickson Wed, 22 Dec 2010 10:44:54 -0800

No, one cannot ignore the schema. If you try to add a field not in the
schema you get an error. One could, however, use any arbitrary subset
of the fields defined in the schema for any particular #document# in the
index. Say your schema had fields f1, f2, f3...f10. You could have fields
f1-f5 in one doc, fields f6-f10 in another, f1, f4, f9 in another, and so on.


The only field(s) that #must# be in a document are the required="true"
fields.
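
A minimal sketch of the two rules above, in plain Python rather than any
Solr API. The schema is modeled as a set of field names plus a set of
required ones; `validate_doc`, the `id` field, and both constants are
hypothetical names chosen for illustration. The sketch also ignores
dynamicField patterns, which a real schema may declare.

```python
# Toy model of the rules above; not Solr code. All names are hypothetical.
SCHEMA_FIELDS = {"id"} | {f"f{i}" for i in range(1, 11)}  # id, f1..f10
REQUIRED_FIELDS = {"id"}  # stands in for required="true"


def validate_doc(doc):
    """Raise ValueError if doc uses an undefined field or lacks a required one."""
    unknown = set(doc) - SCHEMA_FIELDS
    if unknown:
        raise ValueError(f"unknown field(s): {sorted(unknown)}")
    missing = REQUIRED_FIELDS - set(doc)
    if missing:
        raise ValueError(f"missing required field(s): {sorted(missing)}")
    return True


# Different documents may use different subsets of the schema.
validate_doc({"id": "1", "f1": "a", "f5": "b"})
validate_doc({"id": "2", "f6": "c", "f10": "d"})
```

A document with a field such as f11, or one without `id`, would be
rejected by this toy check.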

There's no real penalty for omitting fields from particular documents. This
allows you to store "special" documents that aren't part of normal searches.

You could, for instance, use a document to store meta-information about your
index that had whatever meaning you wanted in a field(s) that *no* other
document had. Your app could then read that "special" document and make use
of that info. Searches on "normal" documents wouldn't return that doc, etc.
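
To illustrate why the "special" document stays out of normal results, here
is a small in-memory stand-in for an index, written in plain Python and not
using Solr itself. The field names (`title`, `index_meta`), the sample
values, and the `search` helper are all made up for this sketch.

```python
# Toy in-memory stand-in for an index; not Solr code. Field names are made up.
docs = [
    {"id": "1", "title": "LCD monitor"},
    {"id": "2", "title": "CRT monitor"},
    {"id": "meta", "index_meta": "schema_version=3"},  # the "special" doc
]


def search(field, term):
    """Return docs that have `field` and whose value contains `term`."""
    term = term.lower()
    return [d for d in docs if field in d and term in d[field].lower()]


normal_hits = search("title", "monitor")       # never sees the meta doc
meta_hits = search("index_meta", "version")    # only the app asks for this
```

A search on `title` cannot match the meta document because that document
has no `title` field at all; the app reaches it only by querying the field
that no other document carries.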

You could effectively have N indexes contained in one index where a document
in each logical sub-index had fields disjoint from the other logical
sub-indexes. Why you'd do something like that rather than use cores is a
very good question, but you #could# do it that way...
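
As a rough sketch of the "N logical sub-indexes" idea, the plain-Python
snippet below assigns each document to the one field family it uses. The
family names, their fields, the shared `id` field, and `sub_index_of` are
hypothetical; nothing here is Solr API.

```python
# Toy illustration; not Solr code. Field families are hypothetical.
FAMILIES = {
    "products": {"sku", "price"},
    "people": {"name", "email"},
}


def sub_index_of(doc):
    """Return the single family whose fields the doc uses, ignoring 'id'."""
    used = set(doc) - {"id"}
    matches = [name for name, fields in FAMILIES.items()
               if used and used <= fields]
    if len(matches) != 1:
        raise ValueError(f"doc {doc.get('id')} does not fit exactly one sub-index")
    return matches[0]


sub_index_of({"id": "1", "sku": "A1", "price": "9.99"})  # "products"
sub_index_of({"id": "2", "name": "Ann"})                 # "people"
```

Because the families are disjoint, a query on `sku` can only ever touch
"products" documents, which is what makes the sub-indexes logically
separate.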

All this is much different from a database where there are penalties for
defining a large number of unused fields.

Whether doing this is wise or not given the particular problem you're trying
to solve is another discussion <G>..

Best
Erick

On Mon, Dec 20, 2010 at 11:03 PM, Dennis Gearon <gear...@sbcglobal.net> wrote:

> Based on more searches and manual consolidation, I've put together some of
> the ideas for this already suggested in a summary below. The last item in
> the summary seems to be an interesting, low-technical-cost way of doing it.
>
> Basically, it treats the index like a 'BigTable', a la "No SQL".
>
> Erick Erickson pointed out:
> "...but there's absolutely no requirement
> that all documents in SOLR have the same fields..."
>
> I guess I don't have the right understanding of what goes into a Document
> in Solr. Is it just a set of fields, each with its own independent field
> type declaration/id, its name, and its content?
>
> So even though there's a schema for an index, one could ignore it and
> just throw any other named fields and types and content at document
> addition time?
>
> So if I wanted to search on a base set, all documents having it, I could
> then additionally filter based on the (might be wrong use of this) dynamic
> fields?
>
> Original thread that I started:
> ----------------------------------------
>
> http://lucene.472066.n3.nabble.com/A-schema-inside-a-Solr-Schema-Schema-in-a-can-tt2103260.html
>
>
> -----------------------------------------------------------------------------------------------------
>
> Repeat of the problem, (not actual ratios, numbers, i.e. could be WORSE!):
>
> -----------------------------------------------------------------------------------------------------
>
>
> 1/ Base object of some kind, x number of fields
> 2/ Derived objects representing a division in the company, different
>    customer bases, etc., each having 2 additional, unique fields.
> 3/ Assume 1000 such derived object types
> 4/ A 'flattened' Index would have the x base object fields,
>    ****and 2000**** additional fields
>
>
> ================================================
> Solutions Posited
> -----------------------
>
> A/ First thought: multi-valued columns as key pairs.
>      1/ Difficult to access individual items of more than one 'word' length
>             for querying in multivalued fields.
>      2/ All sorts of statistical stuff probably wouldn't apply?
>      3/ (James Dayer said:) There's also one "gotcha" we've experienced
>            when searching across multi-valued fields: SOLR will match
>            across field occurrences. In the example below, if you were to
>            search q=contrib_name:(james AND smith), you will get this
>            record back. It matches one name from one contributor and
>            another name from a different contributor. This is not what
>            our users want.
>
>            As a work-around, I am converting these to phrase queries with
>            slop: "james smith"~50 ... Just use a slop # smaller than your
>            positionIncrementGap and bigger than the # of terms entered.
>            This will prevent the cross-field matches yet allow the words
>            to occur in any order.
>
>            The problem with this approach is that Lucene doesn't support
>            wildcards in phrases.
> B/ Dynamic fields were suggested, but I am not sure exactly how they
>        work, and the person who suggested them was not sure they would
>        work, either.
> C/ Different field naming conventions were suggested if field types were
>        similar. I can't predict that.
> D/ Found this old thread, and it had other suggestions:
>       1/ Use multiple cores, one for each record type/schema, and aggregate
>           them during the query.
>       2/ Use a fixed number of additional fields X 2. Each additional
>           field is actually a pair of fields. The first of the pair gives
>           the column name, the second gives the data.
>
>            a) Although I like this, I wonder how many extra fields to use,
>            b) it was pointed out that relevancy and other statistical
>               criteria for queries might suffer.
>       3/ Index the different objects exactly as they are, i.e. as Erick
>           Erickson said:
>           "I'm not entirely sure this is germane, but there's absolutely no
>           requirement that all documents in SOLR have the same fields. So
>           it's possible for you to index the "wildly different content" in
>           "wildly different fields" <G>. Then searching for screen:LCD
>           would be straightforward."...
> Dennis Gearon
>
>
> Signature Warning
> ----------------
> It is always a good idea to learn from your own mistakes. It is usually a
> better idea to learn from others’ mistakes, so you do not have to make them
> yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>
>
