Hi Shawn,

> You must send indexing requests to Solr,

Are you referring to posting <add>...</add> requests to SOLR, or to
something else?

> If you can set up multiple threads or processes...

How do you do that?

> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory

Can you update the stopwords.txt file, and then re-index the documents?
How?

Many thanks.

Philippe

----- Original Message -----
From: "Shawn Heisey" <apa...@elyograg.org>
To: solr-user@lucene.apache.org
Sent: Friday, 27 March 2015 14:38:20
Subject: Re: Tweaking SOLR memory and cull facet words

On 3/27/2015 4:14 AM, phi...@free.fr wrote:
> Hi,
>
> my SOLR 5 solrconfig.xml file contains the following lines:
>
> <!-- Faceting defaults -->
> <str name="facet">on</str>
> <str name="facet.field">text</str>
> <str name="facet.mincount">100</str>
>
> where the 'text' field contains thousands of words.
>
> When I start SOLR, the search engine takes several minutes to index
> the words in the 'text' field (although loading the browse template
> later only takes a few seconds, because the 'text' field has already
> been indexed).
>
> Here are my questions:
>
> - Should I increase SOLR's JVM memory to make initial indexing faster?
>   e.g., SOLR_JAVA_MEM="-Xms1024m -Xmx204800m" in solr.in.sh
>
> - How can I cull facet words according to certain criteria (length,
>   case, etc.)? For instance, my facets are the following:
>
>   application (22427)
>   inytapdf0 (22427)
>   pdf (22427)
>   the (22334)
>   new (22131)
>   herald (21983)
>   york (21975)
>   paris (21780)
>   a (21692)
>   and (21298)
>   of (21288)
>   i (21247)
>   in (21062)
>   to (20918)
>   on (20899)
>   m (20857)
>   by (20733)
>   de (20664)
>   for (20580)
>   at (20417)
>   with (20371)
>   ...
>
> Obviously, words such as "the", "i", "to", "m", etc. should not be
> indexed. Furthermore, I don't care about "nouns". I am only interested
> in people and location names.

Starting Solr does not index anything, unless you are talking about one
of the sidecar indexes for spelling correction or suggestions.
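To make the <add> question above concrete: an indexing request is an update message POSTed to Solr's update handler. A minimal sketch of building such a message, assuming a hypothetical core named "core1" on localhost:8983:

```python
from xml.sax.saxutils import escape

def build_add_xml(docs):
    """Build a Solr <add> update message from a list of {field: value} dicts."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            # Escape field names and values so the XML stays well-formed
            parts.append('<field name="%s">%s</field>'
                         % (escape(name), escape(str(value))))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)

payload = build_add_xml([{"id": "doc1", "text": "New York Herald, Paris edition"}])
print(payload)
# The payload is then POSTed to the update handler, e.g.:
#   curl 'http://localhost:8983/solr/core1/update?commit=true' \
#        -H 'Content-Type: text/xml' --data-binary "$payload"
```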
You must send indexing requests to Solr, and if you are experiencing
slow indexing, chances are that it's because of slowness in obtaining
data from the source, not Solr ... or that you are indexing with a
single thread. If you can set up multiple threads or processes that are
indexing in parallel, it should go faster.

Thousands of terms are not hard for Solr to handle at all. When the
number of terms gets into the millions or billions, then it starts
becoming a hard problem.

If you use the stopword filter on the index analysis chain for the
field that you are using for facets, then all the stopwords will be
removed from the facets. That would change how searches work on the
field, so you will probably want to use copyField to create a new field
that you use for faceting. There are other filters that can do things
you have mentioned, like LengthFilterFactory:

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory

As far as Java heap sizing goes, trial and error is about the only way
to find the right size:

http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap

Thanks,
Shawn
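And on culling facet words with a stop list and a length filter: a sketch of the schema side, assuming a source field named "text" and a hypothetical facet field "text_facet" (the filter class names are real Solr classes; the field and type names are just for illustration). Note that after editing stopwords.txt or the analysis chain, the documents must be re-indexed before the facet counts change.

```xml
<!-- A separate analysis chain for faceting, so searches on "text" are unaffected -->
<fieldType name="facet_text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- drop words listed in stopwords.txt ("the", "of", "i", ...) -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- drop very short (and implausibly long) tokens -->
    <filter class="solr.LengthFilterFactory" min="3" max="50"/>
  </analyzer>
</fieldType>

<field name="text_facet" type="facet_text" indexed="true" stored="false"/>
<copyField source="text" dest="text_facet"/>
```

Faceting would then use facet.field=text_facet instead of text, leaving the original field's search behavior untouched.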
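On the multi-threading question: the usual approach is to split the documents into batches and have a pool of workers send each batch as its own update request. A sketch using Python's concurrent.futures, where post_batch is a placeholder for the real HTTP POST to Solr:

```python
from concurrent.futures import ThreadPoolExecutor

def post_batch(batch):
    # Placeholder: in real use this would POST the batch to Solr's
    # update handler (e.g. /solr/core1/update) and check the response.
    return len(batch)

def index_in_parallel(docs, batch_size=1000, workers=4):
    # Split the documents into fixed-size batches ...
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    # ... and send the batches concurrently from a thread pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(post_batch, batches))

indexed = index_in_parallel([{"id": str(n)} for n in range(2500)])
print(indexed)  # 2500 documents sent as 3 batches
```

Because each batch is an independent request, Solr can service them on separate connections, which is what makes the parallelism pay off.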