Hello Mr. Hostetter,

Thank you for patiently reading through my post. I apologize for being cryptic in my previous messages.

>>when you cut/pasted the facet output, you excluded the field names. based
>>on the schema & solrconfig.xml snippets you posted later, i'm assuming
>>they are usstate, and keyword, but you have to be explicit so that people can help correlate the
>>results you are getting with the schema you posted

I had to be brief because my facets are on the order of 100K over 800K documents, and I was afraid nobody would read my long message if I posted the complete schema.xml :-) Hence I showed only the relevant pieces of the result, which show different fields having the same problem.

>>i'm assuming they are usstate, and keyword, but you have to be explicit so that people can help correlate the
>>results you are getting with the schema you posted -- for example, you haven't posted anything that would verify that the usstate
>>field actually uses your keywordText field

Yes, you are right. Here is the complete relevant snippet regarding keywordText and the associated fields. keyword, keywordlower and keywordformatted are all aggregations of the other fields, such as person, personformatted, organization and location. location is itself an aggregation of usstate and country.
The aggregation is done separately in custom code, before indexing into Solr.

<fieldType name="keywordText" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<field name="person" type="keywordText" indexed="true" stored="true" multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/>
<field name="organization" type="keywordText" indexed="true" stored="true" multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/>
<field name="location" type="keywordText" indexed="true" stored="true" multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/>
<field name="country" type="keywordText" indexed="true" stored="true" multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/>
<field name="usstate" type="keywordText" indexed="true" stored="true" multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/>
<field name="subject" type="keywordText" indexed="true" stored="true" multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/>
<field name="keyword" type="keywordText" indexed="true" stored="true" multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/>
<field name="keywordlower" type="keywordText" indexed="true" stored="true" multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/>
<field name="personformatted" type="keywordText" indexed="true" stored="true" multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/>
<field name="keywordformatted" type="keywordText" indexed="true" stored="true" multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/>

>>A huge gap is in what your synonym files contain ... something weird in
>>there could easily explain superfluous terms getting added to your data.

Here are my synonym entries:

-------------------------------------------------------
#Persons
barack obama, barak obama, barack h. obama, barack hussein obama, barak hussein obama
hillary clinton, hillary r. clinton, hillary rodham clinton
timothy geithner, tim geithner, timothy f. geithner, geithner, timothy franz geithner
vladimir putin, putin
#Organizations
U.N, U.N., u.n, un, UN, United Nations => U.N
DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security => D.H.S
USCIS, United States Citizenship and Immigration Services, U.S.C.I.S. => United States Citizenship and Immigration Services, U.S.C.I.S
SEC, Securities and Exchange Commission, S.E.C, S.E.C, SEC. => Securities and Exchange Commission, S.E.C
FCC, Federal Communications Commission, F.C.C, F.C.C. => Federal Communications Commission, F.C.C
GSA, General Services Administration, G.S.A, G.S.A. => General Services Administration, G.S.A
SBA, Small Business Administration, S.B.A, S.B.A. => Small Business Administration, S.B.A.
FEMA, Federal Emergency Management Agency, FEMA. => FEMA
AT&T, ATT, ATT., AT&T., AT&T Wireless => AT&T
BBC, British Broadcasting Corporation, B.B.C, B.B.C. => B.B.C,BBC
Bank of America, BOA, B.O.A, Bank of America Corp, Bank of America Corp. => B.O.A
General Motors, G.M., G.M, GM, General Motors Corp., General Motors Corp => General Motors, G.M
NFL, National Football League, N.F.L, N.F.L. => N.F.L
Exxon Mobil, Exxon Mobil Corp => Exxon Mobil
Google, Google Inc, Google Inc. => Google
AIG, A.I.G, A.I.G., American International Group => American International Group, A.I.G
Goldman Sachs, Goldman Sachs Inc., Goldman Sachs Group Inc, Goldman Sachs Group Inc. => Goldman Sachs
GE, General Electric Co., General Electric Co, G.E, G.E., General Electric => G.E, General Electric
General Dynamics, General Dynamics Corp,General Dynamics Corp., General Dynamics Information Technology, General Dynamics Advanced Information Systems => General Dynamics
HP, Hewlett Packard Co,Hewlett Packard Co., Hewlett Packard, Hewlett-Packard, Hewlett-Packard Corp,H.P, H.P. => Hewlett Packard, H.P
IBM, International Business Machines, I.B.M, International Business Machines Corp => I.B.M
Johns Hopkins University, Johns Hopkins, JHU, J.H.U, J.H.U. => Johns Hopkins University, JHU, J.H.U
J.C. Penney, J.C. Penney Co. => J.C. Penney
JPMorgan Chase, JPMorgan Chase & Co., JPMorgan Chase & Co, JPMorgan => JPMorgan Chase & Co.
Lockheed Martin, Lockheed Martin Corp, Lockheed Martin Corp., Lockheed, Lockheed VH => Lockheed Martin
Merrill Lynch, Merrill Lynch & Co., Merrill, Merrill. => Merrill Lynch
Microsoft, Microsoft Corp., Microsoft Corp, Microsoft. => Microsoft
Northrop Grumman, Northrop Grumman Corp., Northrop Grumman Corp, Northrop, Northrop Corp. => Northrop Grumman
Smyth Co., Smyth Co
Sony, Sony Corp., Sony Corp => Sony Corp.
TJX Companies, TJX, TJX Cos. => TJX Companies
Target Corp., Target Corp, Target Corp stores => Target Corp.
Walmart, WalMart Inc, WalMart Stores, WalMart Stores Inc, WalMart Stores Inc. => WalMart Inc.
Yahoo, Yahoo Inc co, Yahoo Inc. => Yahoo Inc.
AP, AP., A.P, A.P., Associated Press => Associated Press
#Countries
USA,USA.,U.S.A.,u.s.a,u.s.a.,U.S,U.S.,US,US.,u.s, u.s.,United States,United States of America,United States Of America,united states,united states of america,united states of america => U.S.A
UAE,U.A.E.,United Arab Emirates,united arab emirates,uae,u.a.e, u.a.e. => United Arab Emirates,U.A.E
UK,U.K.,u.k,u.k.,United Kingdom,united kingdom => United Kingdom,U.K
USSR,U.S.S.R,U.S.S.R.,ussr,u.s.s.r,u.s.s.r.,Soviet Union,soviet union,Russia,russia => U.S.S.R,Soviet Union,Russia
#usa states
Alabama, Ala., AL => Alabama
Alaska, AK => Alabama
Arizona, Ariz., Ariz, AZ => Arizona
Arkansas, Ark., AR => Arkansas
California, Calif., CA => California
Colorado, Colo., CO => Colorado
Connecticut, Conn., CT => Connecticut
Delaware, Del., DE => Delaware
Florida, Fla., FL => Florida
Georgia, Ga., GA => Georgia
Hawaii, Hawaii, HI => Hawaii
Idaho, Idaho, ID => Idaho
Illinois, Ill., IL => Illinois
Indiana, Ind., IN => Indiana
Iowa, IA => Iowa
Kansas, Kans., KS => Kansas
Kentucky, Ky., KY => Kentucky
Louisiana, La., LA => Louisiana
Maine, ME => Maine
Maryland, Md., MD => Maryland
Massachusetts, Mass., MA => Massachusetts
Michigan, Mich., MI => Michigan
Minnesota, Minn., MN => Minnesota
Mississippi, Miss., MS => Mississippi
Missouri, Mo., MO => Missouri
Montana, Mont., MT => Montana
Nebraska, Nebr., NE => Nebraska
Nevada, Nev., NV => Nevada
New Hampshire, N.H., NH => New Hampshire
New Jersey, N.J., NJ => New Jersey
New Mexico, N.M., NM => New Mexico
New York, N.Y., NY => New York
North Carolina, N.C., NC => North Carolina
North Dakota, N.D., ND => North Dakota
Ohio, OH => Ohio
Oklahoma, Okla., OK => Oklahoma
Oregon, Ore., OR => Oregon
Pennsylvania, Pa., PA => Pennsylvania
Rhode Island, R.I., RI => Rhode Island
South Carolina, S.C., SC => South Carolina
South Dakota, S.D., SD => South Dakota
Tennessee, Tenn., TN => Tennessee
Texas, Tex., TX => Texas
Utah, UT => Utah
Vermont, Vt., VT => Vermont
Virginia, Va., VA => Virginia
Washington, Wash., WA => Washington
West Virginia, W.Va., WV => West Virginia
Wisconsin, Wis., WI => Wisconsin
Wyoming, Wyo., WY => Wyoming
#US TERRITORIES
American Samoa, AS => American Samoa
DC,D.C.,D.C,dc,d.c,d.c.,District of Columbia,District Of Columbia,district of columbia,Washington D.C.,Washington DC.,Washington DC,washington d.c,washington dc,washington d.c.,washington,Washington => D.C,Washington D.C
Federated States of Micronesia, FSM, FM => Micronesia
Guam, GU => Guam
Marshall Islands, MH => Marshall Islands
Northern Mariana Islands, MP => Northern Mariana Islands
Palau, PW => Palau
Puerto Rico, P.R., PR => Puerto Rico
Virgin Islands, V.I., VI => Virgin Islands, VI
-------------------------------------------------------

>>all that said: my best guess is that you have old data in your index from
>>an older version of your schema when you had differnet analyzers
>>configured.

I wiped the index and reindexed all 800K articles, twice, with the same result.

>>if a term is showing up in the facet counts, you can search on it -- find
>>the first doc that matches, verify that the term isn't actually in the
>>data, and then reindex that one doc -- if it stops matching your search
>>(and the facet count drops by one) then i'm right, just reindex
>>everything

That was the first thing I did. I ran the analyzer on the field types and fields, and found no problem there. Then I queried keyword:"New" via the Solr admin console, and it gave me docs that had N.Y, New York, New Mexico, etc. (because of the synonyms), but no docs that had just "New". Yet I could see "New" in the facets, as I mentioned in my previous posts. That is what was baffling me.

On Tue, Oct 6, 2009 at 7:58 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:

>
> A few comments about the info you've provided...
>
> when you cut/pasted the facet output, you excluded the field names. based
> on the schema & solrconfig.xml snippets you posted later, i'm assuming
> they are usstate, and keyword, but you have to be explicit so that people
> can help correlate the
> results you are getting with the schema you posted -- for example, you
> haven't posted anything that would verify that the usstate field actually
> uses your keywordText field, for ll we know it has a different field type
> by mistake (which would explain your problem). ... you have to post
> everything that would let us connect the dots from input to output in
> order to see where things might be going wrong.
>
> A huge gap is in what your synonym files contain ... something weird in
> there could easily explain superfluous terms getting added to your data.
>
> all that said: my best guess is that you have old data in your index from
> an older version of your schema when you had differnet analyzers
> configured.
>
> if a term is showing up in the facet counts, you can search on it -- find
> the first doc that matches, verify that the term isn't actually in the
> data, and then reindex that one doc -- if it stops matching your search
> (and the facet count drops by one) then i'm right, just reindex
> everything.
>
> (this is where a timestamp field recording exactly when each doc was added
> to the index comes in handy, you can compare it with the file modification
> time on your schema.xml and be certain which docs where indexed prior to
> you changes)
>
>
> -Hoss
>
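
For reference, the timestamp field suggested in the quoted reply could be declared roughly as follows. This is only a sketch: it assumes a field type named "date" (class solr.DateField) is already defined in schema.xml, as in the example schema shipped with Solr, and the field name "timestamp" is illustrative rather than something taken from my schema.

```xml
<!-- Assumption: a "date" fieldType (solr.DateField) is already defined in this schema.xml. -->
<!-- default="NOW" stamps each document with its index time when no value is supplied. -->
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
```

With such a field in place, a range query like timestamp:[* TO 2009-10-01T00:00:00Z] would list the documents indexed before a given schema change (the date here is a placeholder).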