Re: Large number of collections in SolrCloud

2015-08-03 Thread Mukesh Jha
We have similar date- and language-based collections. We also ran into the issue of a huge clusterstate.json file that took an eternity to load. In our case the searches were language-specific, so we moved to multiple Solr clusters, each with a different ZK namespace per language...

Re: Collapsing Query Parser returns one record per shard...was not expecting this...

2015-08-03 Thread Joel Bernstein
One of the things to keep in mind with Grouping is that if you are relying on an accurate group count (ngroups), then you also have to collocate documents based on the grouping field. The main advantage of the Collapsing qparser plugin is that it provides fast field collapsing on high-cardinality fields...

Re: Collapsing Query Parser returns one record per shard...was not expecting this...

2015-08-03 Thread Joel Bernstein
Your findings are the expected behavior for the Collapsing qparser. The Collapsing qparser requires records with the same collapse field value to be located on the same shard. The typical approach is to use composite ID routing to ensure that documents with the same collapse field land on the same shard...
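
A minimal SolrJ sketch of that composite ID routing, using a 5.x-era client (the collection, field, and key names here are made up): prefixing the unique id with the collapse key and "!" makes the compositeId router hash the prefix, so every document sharing that key lands on the same shard.

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class CompositeRoutingSketch {
        public static void main(String[] args) throws Exception {
            CloudSolrClient client = new CloudSolrClient("localhost:2181");
            client.setDefaultCollection("products");
            SolrInputDocument doc = new SolrInputDocument();
            // "collapseKey!uniqueId": the compositeId router hashes the part
            // before "!", co-locating all docs that share the collapse key.
            doc.addField("id", "sku42!doc-1");
            doc.addField("collapse_field", "sku42");
            client.add(doc);
            client.commit();
            client.close();
        }
    }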

Re: Documentation for: solr.EnglishPossessiveFilterFactory

2015-08-03 Thread Alexandre Rafalovitch
Seems simple enough that the source answers all the questions: https://github.com/apache/lucene-solr/blob/lucene_solr_4_9/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishPossessiveFilter.java#L66 It just looks for a couple of variants of the apostrophe followed by s or S. Regards

DateRangeField Query throws NPE

2015-08-03 Thread Stephen Weiss
Hi everyone, I'm running into trouble building a query with DateRangeField. Web-based queries work fine, but this code throws an NPE: dateRangeQuery = dateRangeField.getRangeQuery(null, SidxS.getSchema().getField("sku_history.date_range"), start_date_str, end_date_str, true, true); ERROR...

Re: Closing the IndexSearcher/IndexWriter for a core

2015-08-03 Thread Brian Hurt
So unloading a core doesn't delete the data? That is good to know. On Mon, Aug 3, 2015 at 6:22 PM, Erick Erickson wrote: > This doesn't work in SolrCloud, but it really sounds like "lots of cores" which is designed to keep the most recent N cores loaded and auto-unload older ones, see: > http://wiki.apache.org/solr/LotsOfCores...

Re: Closing the IndexSearcher/IndexWriter for a core

2015-08-03 Thread Brian Hurt
Some further information: the main things using memory that I see from my heap dump are: 1. Arrays of org.apache.lucene.util.fst.FST$Arc classes, which mainly seem to hold nulls. The ones of these I've investigated have been held by org.apache.lucene.util.fst.FST objects; I have 38 cores open and...

Re: Closing the IndexSearcher/IndexWriter for a core

2015-08-03 Thread Erick Erickson
This doesn't work in SolrCloud, but it really sounds like "lots of cores" which is designed to keep the most recent N cores loaded and auto-unload older ones, see: http://wiki.apache.org/solr/LotsOfCores Best, Erick On Mon, Aug 3, 2015 at 4:57 PM, Brian Hurt wrote: > Is there an easy way for...

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Konstantin Gribov
Upayavira, manual commit isn't good advice, especially with small batches or single documents, is it? I mostly see recommendations to use autoCommit+autoSoftCommit instead of manual commits. Tue, Aug 4, 2015 at 1:00, Upayavira: > SolrJ is just a "SolrClient". In pseudocode, you say: > SolrClient...
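
One way to follow that advice from the client side, sketched with a 5.x-era SolrJ (the URL and interval are placeholders): instead of an explicit commit() per batch, attach a commitWithin to each add and let the server-side commit settings do the rest.

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitWithinSketch {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/whatever");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "abc123");
            // Ask Solr to make the doc visible within 10 seconds rather
            // than issuing an explicit commit() from the client.
            client.add(doc, 10_000);
            client.close();
        }
    }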

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Upayavira
SolrJ is just a "SolrClient". In pseudocode, you say: SolrClient client = new SolrClient("http://localhost:8983/solr/whatever"); List docs = new ArrayList<>(); SolrInputDocument doc = new SolrInputDocument(); doc.addField("id", "abc123"); doc.addField("some-text-field", "I like it when the sun s...
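
Filling in the truncated pseudocode above, a minimal runnable version with a 5.x-era SolrJ might look like this (the URL and field names are placeholders carried over from the sketch):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class SimpleIndexerSketch {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/whatever");
            List<SolrInputDocument> docs = new ArrayList<>();
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "abc123");
            doc.addField("some-text-field", "I like it when the sun shines");
            docs.add(doc);
            client.add(docs);  // send the whole batch in one request
            client.commit();   // or rely on autoCommit, per the reply above
            client.close();
        }
    }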

Documentation for: solr.EnglishPossessiveFilterFactory

2015-08-03 Thread Steven White
Hi Everyone, does anyone know where I can find docs on solr.EnglishPossessiveFilterFactory? The only one I found is the API doc: http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/en/EnglishPossessiveFilterFactory.html but that's not what I'm looking for; I'm looking for one that describes in detail how...

RE: Do not match on high frequency terms

2015-08-03 Thread Swedish, Steve
Thanks for your response. With TermsComponent I am able to get a list of all terms in a field that have a document frequency under a certain threshold, but I was wondering if I could instead pass in a list of terms and get back only the terms from that list that have a document frequency under a certain threshold...
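
If your Solr version supports the TermsComponent terms.list parameter (it fetches document frequencies for an explicit, comma-delimited list of terms), a hedged SolrJ sketch could look like the following; the core name, field, terms, and threshold are all made up, and the threshold is applied client-side since I'm not certain terms.mincount/maxcount apply to an explicit list:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.client.solrj.response.TermsResponse;

    public class TermFreqCheck {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");
            SolrQuery q = new SolrQuery();
            q.setRequestHandler("/terms");
            q.setTerms(true);
            q.addTermsField("body");
            q.set("terms.list", "foo,bar,baz");  // explicit candidate terms
            QueryResponse rsp = client.query(q);
            TermsResponse terms = rsp.getTermsResponse();
            for (TermsResponse.Term t : terms.getTerms("body")) {
                if (t.getFrequency() < 1000) {   // client-side threshold
                    System.out.println(t.getTerm() + " -> " + t.getFrequency());
                }
            }
            client.close();
        }
    }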

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Alexandre Rafalovitch
Well, if it is just file names, I'd probably use the SolrJ client, maybe with Java 8. Read the file names, split each name into parts with regular expressions, stuff the parts into different fields, and send them to Solr; see the sketch below. Java 8 has FileSystem walkers, etc., to make it easier. You could do it with DIH, but it would...
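
A sketch of that approach under stated assumptions: the directory, the .pdf pattern, and the part_N_s dynamic string fields are all invented (the filename shape follows the ARIA_SSN10_0007_LOCATION_129.pdf example mentioned later in this thread).

    import java.nio.file.*;
    import java.util.stream.Stream;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class FilenameIndexer {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/files");
            try (Stream<Path> paths = Files.walk(Paths.get("/data/pdfs"))) {
                paths.filter(Files::isRegularFile).forEach(p -> {
                    // e.g. ARIA_SSN10_0007_LOCATION_129.pdf
                    String name = p.getFileName().toString().replaceFirst("\\.pdf$", "");
                    String[] parts = name.split("_");
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", name);
                    for (int i = 0; i < parts.length; i++) {
                        doc.addField("part_" + i + "_s", parts[i]);
                    }
                    try {
                        client.add(doc, 60_000);  // commitWithin 60s
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                });
            }
            client.close();
        }
    }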

Closing the IndexSearcher/IndexWriter for a core

2015-08-03 Thread Brian Hurt
Is there an easy way for a client to tell Solr to close or release the IndexSearcher and/or IndexWriter for a core? I have a use case where we're creating a lot of cores with not that many documents per zone (a few hundred to maybe tens of thousands). Writes come in batches, and reads also...

Re: solr multicore vs sharding vs 1 big collection

2015-08-03 Thread Bill Bell
Yeah, a separate collection by month or year is good and can really help in this case. Bill Bell Sent from mobile > On Aug 2, 2015, at 5:29 PM, Jay Potharaju wrote: > Shawn, thanks for the feedback. I agree that increasing the timeout might alleviate the timeout issue. The main problem with increasing the...

Collapsing Query Parser returns one record per shard...was not expecting this...

2015-08-03 Thread Peter Lee
From my reading of the solr docs (e.g. https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results and https://cwiki.apache.org/confluence/display/solr/Result+Grouping), I've been under the impression that these two methods (result grouping and collapsing query parser) can...

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Mugeesh Husain
@Alexandre No, I don't need the content of the files. To repeat my requirement: I have 40 million files stored in a file system, with filenames such as ARIA_SSN10_0007_LOCATION_129.pdf. I just split all the values from the filename; these values I have to index. I am interested to...

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Mugeesh Husain
@Erik Hatcher You mean I have to use SolrJ for the indexing, right? Can SolrJ handle the large amount of data I mentioned in my previous post? If I use DIH, how will I split the values from the filename, etc.? I want to start my development in the right direction; that's why I am a little confused about...

Re: HTTP Error 500 on "/admin/ping" request

2015-08-03 Thread Steven White
I found the issue. With GET, the legacy code I'm calling into was written like so: clientResponse = resource.contentType("application/atom+xml").accept("application/atom+xml").get(); This is a bug, and should have been: clientResponse = resource.accept("application/atom+xml").get(); Go...

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Alexandre Rafalovitch
Just to reconfirm: are you indexing file content? Because if you are, you need to be aware that most PDFs do not extract well, as they do not have text flow preserved. If you are indexing PDF files, I would run a sample through Tika directly (that's what Solr uses under the covers anyway) and see...
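
For that kind of sanity check, the Tika facade API is enough; a minimal sketch (tika-app or tika-parsers on the classpath; the file path is hypothetical):

    import java.io.File;
    import org.apache.tika.Tika;

    public class TikaSanityCheck {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            // If the extracted text looks scrambled here, it will look
            // scrambled in the Solr index too.
            String text = tika.parseToString(new File("sample.pdf"));
            System.out.println(text);
        }
    }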

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Erick Erickson
Ahhh, listen to Hatcher if you're not indexing the _contents_ of the files, just the filenames. Erick On Mon, Aug 3, 2015 at 2:22 PM, Erik Hatcher wrote: > Most definitely yes given your criteria below. If you don't care for the text to be parsed and indexed within the files, a simple file system crawler...

Re: Large number of collections in SolrCloud

2015-08-03 Thread Erick Erickson
Hmmm, one thing that will certainly help is the new per-collection state.json that will replace clusterstate.json. That'll reduce a lot of chatter. You might also get a lot of mileage out of breaking the collections into distinct sub-groups, thus reducing the number of collections on each...

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Erik Hatcher
Most definitely yes, given your criteria below. If you don't need the text within the files to be parsed and indexed, it sounds like a simple file system crawler that just gets the directory listings and posts the file names, split as you'd like, to Solr would suffice. — Erik Hatcher, Senior S...

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Erick Erickson
I'd go with SolrJ personally. For a terabyte of data that (I'm inferring) is PDF files and the like (aka "semi-structured documents"), you'll need Tika to parse out the data you need to index. Doing that through posting or DIH puts all the analysis on the Solr servers, which will work, but...

Re: Why is /query needed for Json Facet?

2015-08-03 Thread William Bell
OK, I figured it out. The documentation is not updated. The default components are as follows: FacetModule.COMPONENT_NAME = "facet_module". Thus the following is the default component list with the new facet_module; we need someone to update solrconfig.xml and the docs: query, facet, face...

Re: posting html files

2015-08-03 Thread Erik Hatcher
My recommendation: start with the default configset (data_driven_schema_configs) like this: # grab an HTML page to use: curl http://lucene.apache.org/solr/index.html > index.html, then: bin/solr start; bin/solr create -c html_test; bin/post -c html_test index.html; $ curl "http://localhost:8983/...

Re: HTTP Error 500 on "/admin/ping" request

2015-08-03 Thread Steven White
Yes, my application is in Java; no, I cannot switch to SolrJ, because I'm working off legacy code that I don't have the luxury to refactor. If my application is sending the wrong Content-Type HTTP header, which part is it, and why is the same header working for the other query paths such as: "/...

Re: HTTP Error 500 on "/admin/ping" request

2015-08-03 Thread Shawn Heisey
On 8/3/2015 11:34 AM, Steven White wrote: > Hi Everyone, I cannot figure out why I'm getting HTTP Error 500 off the following code: Ping query caused exception: Bad contentType for search handler: application/atom+xml Your application is sending an incorrect Content-Type HTTP header that...

Re: posting html files

2015-08-03 Thread Huiying Ma
Thanks Erik, I'm trying to index some HTML files in the same format, and I need to index them according to classes and tags. I've tried data_driven_schema_configs, but I can only get the title and id, not the other tags and classes I wanted. So now I want to edit the schema in basic_configs, but...

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Mugeesh Husain
Hi Alexandre, I have 40 million files stored in a file system, with filenames such as ARIA_SSN10_0007_LOCATION_129.pdf. 1.) I have to split all the underscore-separated values out of each filename, and these values have to be indexed into Solr. 2.) I do not need the file contents (text) indexed. You told me "...
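
For requirement 1, the split itself is one line of Java; a tiny check against the example filename (class name invented):

    public class SplitDemo {
        public static void main(String[] args) {
            String name = "ARIA_SSN10_0007_LOCATION_129.pdf";
            // drop the extension, then split on underscores
            String[] parts = name.replaceFirst("\\.[^.]+$", "").split("_");
            // prints: ARIA, SSN10, 0007, LOCATION, 129
            System.out.println(String.join(", ", parts));
        }
    }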

Re: posting html files

2015-08-03 Thread Erik Hatcher
My hunch is that basic_configs is *too* basic for your needs here. basic_configs does not include /update/extract - it's very basic, stripped of all the "extra" components. Try using the default, data_driven_schema_configs, instead. If you're still having issues, please provide full detail...

posting html files

2015-08-03 Thread Huiying Ma
Hi everyone, I created a core with the basic configset and schema. When I use bin/post to post one HTML file, I get the error: SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException. HTTP ERROR 404. When I go to localhost:8983/solr/core/update, I get:...

HTTP Error 500 on "/admin/ping" request

2015-08-03 Thread Steven White
Hi Everyone, I cannot figure out why I'm getting HTTP Error 500 off the following code: // Using: org.apache.wink.client String contentType = "application/atom+xml"; URI uri = new URI("http://localhost:8983" + "/solr/db/admin/ping?wt=xml"); Resource resource = client.resource(uri...

Why is /query needed for Json Facet?

2015-08-03 Thread William Bell
I tried using /select and this query does not work, and I cannot understand why. "Passing Parameters via JSON": we can also pass normal request parameters in the JSON body within the params block: $ curl "http://localhost:8983/solr/query?fl=title,author" -d '{ params: { q: "title:hero", rows: 1 }...

Re: How to use BitDocSet within a PostFilter

2015-08-03 Thread Stephen Weiss
Yes, that was it. Had no idea this was an issue! On Monday, August 3, 2015, Roman Chyla wrote: > Hi, inStockSkusBitSet.get(currentChildDocNumber) Is that child a Lucene id? If yes, does it include the offset? Every index segment starts at a different point, but docs are numbered from zero...

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Alexandre Rafalovitch
That's still a VERY open question. The answer is yes, but the details depend on the shape and source of your data, and the searches you are anticipating. Is this a lot of entries with a small number of fields, or a (relatively) small number of entries with huge field counts? Do you need to store/retrieve...

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Mugeesh Husain
Hi, I am new to Solr development and have the same requirement, and with the help of googling I have already gained some knowledge, such as how many shards have to be created for that amount of data. I want some suggestions: there are so many methods to do the indexing, such as DIH, solr, and SolrJ. Please suggest...

Re: Large number of collections in SolrCloud

2015-08-03 Thread Olivier
Hi, thanks a lot Erick and Shawn for your answers. I am aware that it is a very particular issue and not a common use of Solr; I just wondered if people had a similar business case. For information, we need a very large number of collections with the same configuration because of legal reasons...

Trouble getting "langid.map.individual" setting to work in Solr 5.0.x

2015-08-03 Thread David Smith
I am trying to use the “langid.map.individual” setting to allow field “a” to detect as, say, English, and be mapped to “a_en”, while in the same document, field “b” detects as, say, German and is mapped to “b_de”. What happens in my tests is that the global language is detected (for example, German...

Re: Collection APIs to create collection and custom cores naming

2015-08-03 Thread Erick Erickson
See: https://issues.apache.org/jira/browse/SOLR-6719 It's not clear that we'll support this, so this may just be a doc change. How would you properly support having more than one replica? Or, for that matter, more than one shard? property.name would have to do something to make the core nam...

Re: How to use BitDocSet within a PostFilter

2015-08-03 Thread Roman Chyla
Hi, inStockSkusBitSet.get(currentChildDocNumber) Is that child doc number a Lucene id? If yes, does it include the segment offset? Every index segment starts at a different point, but docs are numbered from zero within each segment. So to check them against the full-index bitset, I'd be doing Bitset.exists(indexBase + docid). Just one thing...
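
To make that point concrete, here is a skeletal Solr 5.x-style DelegatingCollector for a PostFilter (class and field names invented) that applies the segment's docBase before consulting a whole-index bitset:

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.solr.search.DelegatingCollector;

    public class InStockCollector extends DelegatingCollector {
        private final BitSet inStockGlobal;  // keyed by global (whole-index) docid
        private int docBase;

        public InStockCollector(BitSet inStockGlobal) {
            this.inStockGlobal = inStockGlobal;
        }

        @Override
        public void doSetNextReader(LeafReaderContext context) throws IOException {
            super.doSetNextReader(context);
            this.docBase = context.docBase;  // this segment's offset
        }

        @Override
        public void collect(int doc) throws IOException {
            // 'doc' is segment-relative; add docBase to get the global id
            if (inStockGlobal.get(docBase + doc)) {
                super.collect(doc);
            }
        }
    }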

Duplicate Documents

2015-08-03 Thread Tarala, Magesh
I'm using Solr 4.10.2 with "id" as the unique key field; it is passed in with the document when ingesting documents into Solr. When querying, I get duplicate documents with different "_version_" values. Out of approx. 25K unique documents ingested into Solr, I see approx. 300 duplicates. It...

[JOB] Financial search engine company AlphaSense is looking for Search Engineers

2015-08-03 Thread Dmitry Kan
Hi fellow Solr devs/users, I decided to resend the info on this opening, assuming most of you may have been on vacation in July. I don't intend to send it again :) Company: AlphaSense https://www.alpha-sense.com/ Position: Search Engineer AlphaSense is a one-stop financial search engine...

reload collections timeout

2015-08-03 Thread olivier
Hi everybody, I have about 1300 collections with 3 shards, replicationFactor=3, and maxShardsPerNode=3, on 3 boxes with 64 GB each (32 GB JVM heap). When I want to reload all my collections I get a timeout error. Is there a way to make the reload async, as when creating collections (async=requestid)? I saw on this...
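
The Collections API RELOAD action does accept async on recent Solr versions, though I'm not certain it did in every 5.x release, so check your version's reference guide. With a 6.x/7.x-era SolrJ, a sketch might look like this (ZK hosts and collection names are placeholders):

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class AsyncReloadSketch {
        public static void main(String[] args) throws Exception {
            CloudSolrClient client = new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181,zk2:2181,zk3:2181")
                    .build();
            for (String coll : new String[]{"coll_fr", "coll_en"}) {
                // processAsync returns immediately with a request id
                // that can be polled via REQUESTSTATUS
                String requestId = CollectionAdminRequest.reloadCollection(coll)
                        .processAsync(client);
                System.out.println(coll + " -> " + requestId);
            }
            client.close();
        }
    }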

Re: solr multicore vs sharding vs 1 big collection

2015-08-03 Thread Upayavira
There are two things likely to be causing the timeouts you are seeing, I'd say. Firstly, your server is overloaded - that can be handled by adding additional replicas. However, it doesn't seem like this is the case, because the second query works fine. Secondly, you are hitting garbage collection...

Indexing issues after cluster restart.

2015-08-03 Thread Fadi Mohsen
Hi, using Solr 5.2: after restarting the cluster I get the exceptions below: org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes published as DOWN in our cluster state. followed by: org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: No registered leader was...

Re: Multiple boost queries on a specific field

2015-08-03 Thread bengates
Hello Chris, this totally does the trick. I drastically improved relevancy. Thank you very much for your advice! - Ben