I would really avoid schemaless in _any_ situation where I know the
schema ahead of time.

bq: But in my case, I am planning to use solrj (so, no spelling mistakes)

Oh, I'm quite sure there'll be some kind of mistake sometime ;) I know
of at least one situation where a programming mistake in SolrJ
caused over 20K unique dynamic fields to be created. Admittedly, not a
spelling mistake.
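To illustrate (this is just a made-up sketch using the 5.x-style
HttpSolrClient constructor, not the actual code from that incident),
something as small as a field _name_ built from the wrong variable
will do it:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class FieldExplosion {
  public static void main(String[] args) throws Exception {
    SolrClient client =
        new HttpSolrClient("http://localhost:8983/solr/mycollection");
    for (int i = 0; i < 20000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", Integer.toString(i));
      // Bug: the record id leaks into the field *name* instead of the value,
      // so schemaless guessing (or a catch-all dynamic field) quietly creates
      // a brand-new field for every document indexed.
      doc.addField("price_" + i, 9.99);
      client.add(doc);
    }
    client.commit();
    client.close();
  }
}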

But ranting aside, let's draw a clear distinction between schemaless
and managed schema on the one hand and classic on the other.

Both schemaless and managed schema use the same underlying mechanism
to change your schema file, specifically the Schema REST API. The
difference is that in managed-schema mode _you_ have to issue the REST
API commands yourself (or script them or whatever). Schemaless mode
issues those REST API commands for you whenever the update processor
sees a field it doesn't recognize, after guessing what kind of field
it is.
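For example, adding a field through the Schema API from SolrJ looks
something like this (a minimal sketch using SchemaRequest from recent
5.x SolrJ; the field name, attributes and URL are made up):

import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;

public class AddFieldExample {
  public static void main(String[] args) throws Exception {
    SolrClient client =
        new HttpSolrClient("http://localhost:8983/solr/mycollection");

    // Describe the new field, then send the add-field command to the Schema API.
    Map<String, Object> fieldAttrs = new LinkedHashMap<>();
    fieldAttrs.put("name", "manufacturer");
    fieldAttrs.put("type", "string");
    fieldAttrs.put("indexed", true);
    fieldAttrs.put("stored", true);

    new SchemaRequest.AddField(fieldAttrs).process(client);

    client.close();
  }
}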

Classic, of course, requires you to hand-edit a text file, upload it
to ZooKeeper (in SolrCloud), and reload your collections for the
changes to take effect.
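If you want to script the reload step from SolrJ, it's roughly this
(a sketch against the 5.x-era API; the collection name and URL are
made up, and the config upload itself is usually done with zkcli.sh
or similar):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class ReloadAfterSchemaEdit {
  public static void main(String[] args) throws Exception {
    // After editing schema.xml locally and pushing the config dir back up
    // to ZooKeeper, reload the collection so the new schema is picked up.
    SolrClient client = new HttpSolrClient("http://localhost:8983/solr");

    CollectionAdminRequest.Reload reload = new CollectionAdminRequest.Reload();
    reload.setCollectionName("mycollection");
    reload.process(client);

    client.close();
  }
}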

I tend to prefer classic when I know up-front exactly what my schema
should be. In fact, I tend to strip everything I know I don't need out
of the schema.xml file, including dynamic field definitions,
copyFields and the like. Like Shawn, I want docs that don't conform to
my schema to fail ASAP.

Managed is ideal for situations where you have some UI front-end that
allows end users (or administrators) to define a schema and don't want
them to muck around with hand-editing files.

Schemaless is very cool, but IMO not something I'd go to production
with, especially at scale. It's way cool for starting out, but as the
scale grows you want to squeeze out all the unessential bits of the
index you can, and schemaless doesn't have the "meta-knowledge" you
have (or at least should have) about the problem space.


bq: Another thing to keep in mind is, I am pushing documents to solr
from some random/unknown source and they are not getting stored on
separate disc

This is pretty scary. How are you controlling what fields get indexed?
You mentioned SolrJ, so I'm presuming you have a mechanism to map all
the information (meta-data included) you get from those random/unknown
sources into your known schema?
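If not, the kind of thing I have in mind is an explicit
whitelist/mapping from whatever metadata keys show up to fields you
know exist in the schema, along these lines (the field names here are
just made up for illustration):

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.common.SolrInputDocument;

public class FieldMapper {
  // Hypothetical mapping from source metadata keys to fields defined in the schema.
  private static final Map<String, String> FIELD_MAP = new HashMap<>();
  static {
    FIELD_MAP.put("Author", "author_s");
    FIELD_MAP.put("Last-Modified", "last_modified_dt");
    FIELD_MAP.put("Content-Type", "content_type_s");
  }

  /** Copy only the metadata we recognize into the document; drop everything else. */
  public static SolrInputDocument toDocument(String id, Map<String, Object> rawMetadata) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", id);
    for (Map.Entry<String, Object> entry : rawMetadata.entrySet()) {
      String solrField = FIELD_MAP.get(entry.getKey());
      if (solrField != null) {
        doc.addField(solrField, entry.getValue());
      }
      // Unknown keys are silently dropped so they can never create surprise fields.
    }
    return doc;
  }
}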

FWIW,
Erick

On Wed, Jan 20, 2016 at 10:03 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 1/20/2016 10:17 AM, Prateek Jain J wrote:
>>
>> What all I could gather from various blogs is, defining schema stops
>> developers from accidentally adding fields to solr. But in my case, I am
>> planning to use solrj (so, no spelling mistakes). My point is:
>>
>>
>> 1.       Is there any advantage like performance or anything else while
>> reading or writing or querying, if we go schema way?
>>
>> 2.       What impact can it have on the maintainability of the project?
>>
>> Another thing to keep in mind is, I am pushing documents to solr from some
>> random/unknown source and they are not getting stored on separate disc
>> (using solr for indexing and storing). By this what I mean is, re-indexing
>> is not an option for me.  Starting schemaless might give me a quick start
>> for project but, is there a fine print that is getting missed? Any
>> inputs/experiences/pointers are welcome.
>
>
> There is no performance difference.  With a managed schema, there is still a
> schema file in the config, it just has a different filename and can be
> changed remotely.  Internally, I am pretty sure that the java objects are
> identical.
>
> I personally would not want to have a managed schema or run in schemaless
> mode in production.  I do not want it to be possible for anybody else to
> change the config.
>
> Thanks,
> Shawn
>
