Numerous problems with SolrCloud

2015-12-21 Thread John Smith
This is my first experience with SolrCloud, so please bear with me.

I've inherited a setup with 5 servers, 2 of which run Zookeeper only and
the 3 others SolrCloud + Zookeeper. Versions are 5.4.0 and 3.4.7
respectively. There's around 80 GB of index data; some collections are
rather big (20 GB) and some very small. All of them have only one shard.
The bigger ones are almost constantly being updated (and of course queried
at the same time).

I've had a huge number of errors, of many different kinds. At some point the
system seemed rather stable, but when I tried to add a few new collections
things went wrong again. The usual symptom is that some cores stop
synchronizing; sometimes an entire server is shown as "gone" (although
it's still alive and well). When I add a core on a server, another core (or
several others) on that server often goes down. Even when the system is
rather stable, some cores are shown as recovering. When restarting a
server, it takes a very long time (at least 30 minutes) to fully recover.

Some of the many errors I've got (I've skipped the warnings):
- org.apache.solr.common.SolrException: Error trying to proxy request
for url
- org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
up to try to start recovery on replica
- org.apache.solr.common.SolrException; Error while trying to recover.
core=[...]:org.apache.solr.common.SolrException: No registered leader
was found after waiting
- update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
tlog=null}
- org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
after succesful recovery
- org.apache.solr.common.SolrException; Could not find core to call recovery
- org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
Unable to create core
- org.apache.solr.request.SolrRequestInfo; prev == info : false
- org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
not closed!
- org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
- org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
- org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
- org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
for connection from pool
- and so on...

Any advice on where I should start? I've checked disk space, memory
usage, and the max number of open files; everything seems fine there. My
guess is that the configuration is largely unaltered from the defaults.
I've already extended the timeouts in Zookeeper.

Thanks,
John



Re: Numerous problems with SolrCloud

2015-12-21 Thread John Smith
Thanks, I'll give it a try. Can the load on the Solr servers impair the zk
response time in the current situation, and thus cause the desync? Is
this the reason for the change?

John.


On 21/12/15 16:45, Erik Hatcher wrote:
> John - the first recommendation that pops out is to run (only) 3 zookeepers, 
> entirely separate from Solr servers, and then as many Solr servers from there 
> that you need to scale indexing and querying to your needs.  Sounds like 3 
> ZKs + 2 Solr’s is a good start, given you have 5 servers at your disposal.
>
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com <http://www.lucidworks.com/>
>
>
>
>> On Dec 21, 2015, at 10:37 AM, John Smith  wrote:
>>
>> This is my first experience with SolrCloud, so please bear with me.
>>
>> I've inherited a setup with 5 servers, 2 of which are Zookeeper only and
>> the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 &
>> 3.4.7. There's around 80 Gb of index, some collections are rather big
>> (20Gb) and some very small. All of them have only one shard. The bigger
>> ones are almost constantly being updated (and of course queried at the
>> same time).
>>
>> I've had a huge number of errors, many different ones. At some point the
>> system seemed rather stable, but I've tried to add a few new collections
>> and things went wrong again. The usual symptom is that some cores stop
>> synchronizing; sometimes an entire server is shown as "gone" (although
>> it's still alive and well). When I add a core on a server, another (or
>> several others) often goes down on that server. Even when the system is
>> rather stable some cores are shown as recovering. When restarting a
>> server it takes a very long time (30 min at least) to fully recover.
>>
>> Some of the many errors I've got (I've skipped the warnings):
>> - org.apache.solr.common.SolrException: Error trying to proxy request
>> for url
>> - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
>> up to try to start recovery on replica
>> - org.apache.solr.common.SolrException; Error while trying to recover.
>> core=[...]:org.apache.solr.common.SolrException: No registered leader
>> was found after waiting
>> - update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
>> tlog=null}
>> - org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
>> after succesful recovery
>> - org.apache.solr.common.SolrException; Could not find core to call recovery
>> - org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
>> Unable to create core
>> - org.apache.solr.request.SolrRequestInfo; prev == info : false
>> - org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
>> not closed!
>> - org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
>> - org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
>> prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
>> - org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
>> - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
>> for connection from pool
>> - and so on...
>>
>> Any advice on where I should start? I've checked disk space, memory
>> usage, max number of open files, everything seems fine there. My guess
>> is that the configuration is rather unaltered from the defaults. I've
>> extended timeouts in Zookeeper already.
>>
>> Thanks,
>> John
>>
>



Re: Numerous problems with SolrCloud

2015-12-21 Thread John Smith
OK, great. I've eliminated OOM errors after increasing the memory
allocated to Solr: 12 GB out of 20 GB. It's probably not an optimal
setting, but this is all I can spare right now on the Solr machines. I'll
look into GC logging too.
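
For the record, the GC flags I plan to enable (via GC_LOG_OPTS in
solr.in.sh, if I'm reading the start script right; the log path is just an
example) are something like:

GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/solr/logs/solr_gc.log"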

Turning to the Solr logs, a quick sweep revealed a lot of "Caused by:
java.net.SocketException: Connection reset" lines, but that isn't very
explicit. I suppose I'll have to cross-check on the affected server(s).

Anyway, I'll give the updated setup a try and get back to the list.

Thanks,
John.


On 21/12/15 17:21, Erick Erickson wrote:
> ZK isn't pushed all that heavily, although all things are possible. Still,
> for maintenance putting Zk on separate machines is a good idea. They
> don't have to be very beefy machines.
>
> Look in your logs for LeaderInitiatedRecovery messages. If you find them
> then _probably_ you have some issues with timeouts, often due to
> excessive GC pauses, turning on GC logging can help you get
> a handle on that.
>
> Another "popular" reason for nodes going into recovery is Out Of Memory
> errors, which is easy to do in a system that gets set up and
> then more and more docs get added to it. You either have to move
> some collections to other Solr instances, get more memory to the JVM
> (but watch out for GC pauses and starving the OS's memory) etc.
>
> But the Solr logs are the place I'd look first for any help in understanding
> the root cause of nodes going into recovery.
>
> Best,
> Erick
>
> On Mon, Dec 21, 2015 at 8:04 AM, John Smith  wrote:
>> Thanks, I'll have a try. Can the load on the Solr servers impair the zk
>> response time in the current situation, which would cause the desync? Is
>> this the reason for the change?
>>
>> John.
>>
>>
>> On 21/12/15 16:45, Erik Hatcher wrote:
>>> John - the first recommendation that pops out is to run (only) 3 
>>> zookeepers, entirely separate from Solr servers, and then as many Solr 
>>> servers from there that you need to scale indexing and querying to your 
>>> needs.  Sounds like 3 ZKs + 2 Solr’s is a good start, given you have 5 
>>> servers at your disposal.
>>>
>>>
>>> —
>>> Erik Hatcher, Senior Solutions Architect
>>> http://www.lucidworks.com <http://www.lucidworks.com/>
>>>
>>>
>>>
>>>> On Dec 21, 2015, at 10:37 AM, John Smith  wrote:
>>>>
>>>> This is my first experience with SolrCloud, so please bear with me.
>>>>
>>>> I've inherited a setup with 5 servers, 2 of which are Zookeeper only and
>>>> the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 &
>>>> 3.4.7. There's around 80 Gb of index, some collections are rather big
>>>> (20Gb) and some very small. All of them have only one shard. The bigger
>>>> ones are almost constantly being updated (and of course queried at the
>>>> same time).
>>>>
>>>> I've had a huge number of errors, many different ones. At some point the
>>>> system seemed rather stable, but I've tried to add a few new collections
>>>> and things went wrong again. The usual symptom is that some cores stop
>>>> synchronizing; sometimes an entire server is shown as "gone" (although
>>>> it's still alive and well). When I add a core on a server, another (or
>>>> several others) often goes down on that server. Even when the system is
>>>> rather stable some cores are shown as recovering. When restarting a
>>>> server it takes a very long time (30 min at least) to fully recover.
>>>>
>>>> Some of the many errors I've got (I've skipped the warnings):
>>>> - org.apache.solr.common.SolrException: Error trying to proxy request
>>>> for url
>>>> - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
>>>> up to try to start recovery on replica
>>>> - org.apache.solr.common.SolrException; Error while trying to recover.
>>>> core=[...]:org.apache.solr.common.SolrException: No registered leader
>>>> was found after waiting
>>>> - update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
>>>> tlog=null}
>>>> - org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
>>>> after succesful recovery
>>>> - org.apache.solr.common.SolrException; Could not find core to call 
>>>> recovery
>>>> - org.apache.solr.common.SolrException: Error CREATEing SolrCore 

Re: Numerous problems with SolrCloud

2015-12-22 Thread John Smith
Hi,

Thanks Erick for your input. I've added GC logging, but GC activity looked
normal when the error came up again this morning. I was adding a large
collection (27 GB): on the first server all went well. When I created the
core on a second server, that server was almost immediately disconnected
from the cloud. This time I could pin down what seems to be the root cause
in the logs:

ERROR - 2015-12-22 09:39:29.029; [   ] org.apache.solr.common.SolrException;
OverseerAutoReplicaFailoverThread had an error in its thread work
loop.:org.apache.solr.common.SolrException: Error reading cluster properties
    at org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:738)
    at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.doWork(OverseerAutoReplicaFailoverThread.java:153)
    at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.run(OverseerAutoReplicaFailoverThread.java:132)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryDelay(ZkCmdExecutor.java:108)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:76)
    at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:308)
    at org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:731)
    ... 3 more

WARN  - 2015-12-22 09:39:29.890; [   ]
org.apache.solr.handler.admin.ZookeeperInfoHandler$ZKPrinter; Keeper Exception
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /live_nodes
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
    at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
    at org.apache.solr.handler.admin.ZookeeperInfoHandler$ZKPrinter.printTree(ZookeeperInfoHandler.java:581)
    at org.apache.solr.handler.admin.ZookeeperInfoHandler$ZKPrinter.print(ZookeeperInfoHandler.java:527)
    at org.apache.solr.handler.admin.ZookeeperInfoHandler.handleRequestBody(ZookeeperInfoHandler.java:406)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)
    at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:664)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:438)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:222)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:181)
    ...

After that the server was marked as "gone" in the cloud graph and it
took a long time to register itself again and recover.
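
For what it's worth, the only ZK-related knob I can find on the Solr side is
the session timeout in solr.xml; mine still looks like the stock 5.x file,
so take the exact values below as an assumption on my part:

<solrcloud>
  <str name="host">${host:}</str>
  <int name="hostPort">${jetty.port:8983}</int>
  <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
</solrcloud>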

I haven't changed the ZK config yet as per your suggestion below. Could
this fix the problem? Do you have any other suggestion?

Thanks,
John


On 21/12/15 17:39, Erick Erickson wrote:
> right, do note that when you _do_ hit an OOM, you really
> should restart the JVM as nothing is _really_ certain after
> that.
>
> You're right, just bumping the memory is a band-aid, but
> whatever gets you by. Lucene makes heavy use of
> MMapDirectory which uses OS memory rather than JVM
> memory, so you're robbing Peter to pay Paul when you
> allocate high percentages of the physical memory to the JVM.
> See Uwe's excellent blog here:
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> And yeah, your "connection reset" errors may well be GC-related
> if you're getting a lot of stop-the-world GC pauses.
>
> Sounds like you inherited a system that's getting more and more
> docs added to it over time and outgrew its host, but that's a guess.
>
> And you get to deal with it over the holidays too ;)
>
> Best,
> Erick
>
> On Mon, Dec 21, 2015 at 8:33 AM, John Smith  wrote:
>> OK, great. I've eliminated OOM errors after increasing the memory
>> allocated to Solr: 12Gb out of 20Gb. It's probably not an optimal
>> setting but this is all I can have right now on the Solr machines. I'll
>> look into GC logging too.
>>
>> Turning to the Solr logs, a quick sweep revealed a lot of "Caused by:
>> java.net.SocketException: Connection reset" lines, but this isn't very
>> explicit. I suppose I'll have to cross-check on the concerned server(s).
>>
>> Anyway, I'll have a 

More problems (now jetty errorrs) with SolrCloud

2016-01-22 Thread John Smith
Hi,

This morning one of the 2 nodes of our SolrCloud went down. I've tried
many ways to recover it, but to no avail. I've tried unloading all cores
on the failed node and re-adding them after emptying the data directory,
hoping they would sync from scratch. The cores are still marked as down and
no data is downloaded.

I get a lot of messages like the following ones in the log:

WARN  - 2016-01-22 14:14:28.535; [   ]
org.eclipse.jetty.http.HttpParser; badMessage: 400 Unknown Version for
HttpChannelOverHttp@e795880{r=0,c=false,a=IDLE,uri=-}
WARN  - 2016-01-22 14:15:02.559; [   ]
org.eclipse.jetty.http.HttpParser; badMessage: 400 Unknown Version for
HttpChannelOverHttp@727c5f10{r=0,c=false,a=IDLE,uri=-}
WARN  - 2016-01-22 14:15:02.580; [   ]
org.eclipse.jetty.http.HttpParser; badMessage:
java.lang.IllegalStateException: too much data after closed for
HttpChannelOverHttp@727c5f10{r=0,c=true,a=COMPLETED,uri=null}
ERROR - 2016-01-22 14:15:03.496; [   ]
org.apache.solr.common.SolrException;
null:org.apache.solr.common.SolrException: Unexpected method type: [...
truncated json data, presumably from a document being updated ...]POST
etc.

I've restarted all nodes in zk and SolrCloud, and reloaded all cores on
the main node (that is, the one that seems to work).


After some research, I saw that the zk timeout is set to 60 sec. in the
solr config, but at least one entry in the gc logs mentions a 134 sec.
pause. However, I noticed that the zk log states a negotiated timeout of 30 sec...
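
As far as I understand it, the relevant settings are the client-side timeout
that Solr requests and the server-side cap in zoo.cfg; the values below are
what I think they should be, not necessarily what is deployed right now:

# solr.in.sh (client side: what Solr asks ZK for)
ZK_CLIENT_TIMEOUT="60000"

# zoo.cfg (server side: the negotiated timeout is capped at
# maxSessionTimeout, which defaults to 20 * tickTime)
tickTime=2000
maxSessionTimeout=60000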

Here are my questions:
- Are the log entries shown above related to the zk session timeout, or
should I look elsewhere?
- How can I make sure the timeout negotiated with zk matches the value from
the solr config?
- What parameter(s) would allow me to reduce the GC pause time,
presumably at the expense of more frequent GC runs?

Thanks!



Re: Boosts for relevancy (shopping products)

2016-03-19 Thread John Smith
Hi,

For once I might be of some help: I've had a similar configuration
(large set of products from various sources). It's very difficult to
find the right balance between all parameters and requires a lot of
tweaking, most often in the dark unfortunately.

What I've found is that omitNorms=true is a real breakthrough: without
it, results tend to favor short texts, which is not what's wanted for
product names. I also added a RemoveDuplicatesTokenFilterFactory for the
name field, as it's a common practice for spammers to repeat some keywords
in order to be placed higher in the results. Stemming and custom stop words
(e.g. "cheap", "sale", ...) are other potential ideas.

I've also ended up removing the description field as it's often too
broad, and name is now the only field left: brand, category and merchant
(as well as other fields) are offered as additional filters using
facets. Note that you'd have to re-index them as plain strings.

It's more difficult to achieve, but a popularity boost can also be useful:
you can measure it by sales or by number of clicks. I use a combination
of both, and store those values using partial updates.
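
On the query side this translates into a multiplicative boost with edismax,
something along these lines (the Clicks field name is mine, adjust to your
schema):

q=ipod&defType=edismax&qf=name&boost=log(sum(Clicks,1))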

Hope it helps,
John


On 17/03/16 09:36, Robert Brown wrote:
> Hi,
>
> I currently have an index of ~50m docs representing shopping products:
> name, description, brand, category, etc.
>
> Our "qf" is currently setup as:
>
> name^5
> brand^2
> category^3
> merchant^2
> description^1
>
> mm: 100%
> ps: 5
>
> I'm getting complaints from the business concerning relevancy, and was
> hoping to get some constructive ideas/thoughts on whether these boosts
> look semi-sensible or not, I think they were put in place pretty much
> at random.
>
> I know it's going to be a case of rounds upon rounds of testing, but
> maybe there's a good starting point that will save me some time?
>
> My initial thoughts right now are to actually just search on the name
> field, and maybe the brand (for things like "Apple Ipod").
>
> Has anyone got a similar setup that could share some direction?
>
> Many Thanks,
> Rob
>



Actual (specific) RT Search?

2016-03-20 Thread John Smith
Hi,

The purpose of the project is actual RT search, not NRT, with one
specific condition: when an updated document meets a fixed criterion, it
should be filtered out of future results (no reuse of the document).
This criterion is present in the search query, but of course it doesn't
work for uncommitted documents.

What I wrote is a combination of the following:
- an UpdateRequestProcessor in the update chain storing the document
unique key in a local cache when the condition is met
- a postCommit listener clearing the cache
- a PostFilter collecting documents that aren't found in the cache,
activated in the search query as a fq parameter

Functionally it does the job; however, for large indexes the filter takes
a hit. The index that poses a problem has 18 million documents in 13 GB, and
queries return an average of 25,000 docs in results. The VM has 8 cores
and 20 GB RAM, and uses Nimble storage (a combination of SSD & HD). Without
this code Solr works like a charm. My guess so far is that the filter has
to fetch the unique key for every document in the results, which consumes a
lot of resources.

What would be your advice?
- Could I use the internal document id instead of a field value? This id
would have to be available both in the UpdateRequestProcessor and the
PostFilter: is that the case, and how can I access it? I suppose the
SolrInputDocument in the request processor doesn't have it yet anyway.
- If I reduce the autoSoftCommit maxDocs value (and if so, how far?), would
it be wise (and feasible) to convert the PostFilter into a plain filter
query such as "*:* NOT (id:1 OR id:2)" or something similar (see the sketch
after this list)? How could I implement this, and how should I estimate the
filter cost so that Solr executes it at the right position?
- Maybe I took the wrong path altogether?
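
Regarding the second-to-last question, the syntax I have in mind is
something like the following; as I understand the docs, cache=false plus a
cost only reorders plain filters, and a cost of 100 or more turns the filter
into a post filter only for query types that actually implement PostFilter:

fq={!cache=false cost=50}*:* -id:(1 OR 2)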

Thanks in advance
John



Unexpected delayed document deletion with atomic updates

2015-10-07 Thread John Smith
Hi,

I'm bumping into the following problem with update XML messages. The idea
is to record the number of clicks for a document: each time, a message
is sent to .../update such as this one:



<add>
  <doc>
    <field name="id">abc</field>
    <field name="Clicks" update="set">1</field>
    <field name="Boost" update="set">1.05</field>
  </doc>
</add>



(Clicks is an int field; Boost is a float field that is updated to reflect
the change in popularity, using a formula based on the number of clicks.)

At the moment in the dev environment, changes are committed immediately.


When a document is updated, the changes are indeed reflected in the
search results. If I click on the same document again, all goes well.
But when I click on another document, the latter gets updated as
expected but the former is simply deleted. It can no longer be found,
and the admin core Overview page counts one document less. If I click on a
3rd document, the 2nd one disappears in turn.


The schema is the default one amended to remove unneeded fields and add
new ones, nothing fancy. All fields are stored="true" and there's no
. I've tried versions 5.2.1 & 5.3.1 in standalone mode, with
the same outcome. It looks like a bug to me but I might have overlooked
something? This is my first attempt at atomic updates.

Thanks,
John.



Re: Unexpected delayed document deletion with atomic updates

2015-10-07 Thread John Smith
The ids are all different: they're unique numbers followed by a couple
of keywords. I've made a test with a small collection of 10 documents to
make sure I can manage them manually: all ids are confirmed as different.

I also dumped the exact command, here's one example:

<add><doc><field name="id">101084385_Sebago_ sebago shoes</field><field name="Clicks" update="set">1</field><field name="Boost" update="set">1.8701925463775</field></doc></add>

It's sent as the body of a POST request to
http://127.0.0.1:8080/solr/ato_test/update?wt=json&commit=true, with a
Content-Type: text/xml header. I still noted the consistent loss of
another document with the update above.
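
In curl terms the request is roughly this, with update.xml holding the
snippet above:

curl 'http://127.0.0.1:8080/solr/ato_test/update?wt=json&commit=true' \
  -H 'Content-Type: text/xml' --data-binary @update.xml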

John


On 08/10/15 00:38, Upayavira wrote:
> What ID are you using? Are you possibly using the same ID field for
> both, so the second document you visit causes the first to be
> overwritten?
>
> Upayavira
>
> On Wed, Oct 7, 2015, at 06:38 PM, Erick Erickson wrote:
>> This certainly should not be happening. I'd
>> take a careful look at what you actually send.
>> My _guess_ is that you're not sending the update
>> command you think you are
>>
>> As a test you could just curl (or use post.jar) to
>> send these types of commands up individually.
>>
>> Perhaps looking at the solr log would help too...
>>
>> Best,
>> Erick
>>
>> On Wed, Oct 7, 2015 at 6:32 AM, John Smith 
>> wrote:
>>> Hi,
>>>
>>> I'm bumping on the following problem with update XML messages. The idea
>>> is to record the number of clicks for a document: each time, a message
>>> is sent to .../update such as this one:
>>>
>>> 
>>> 
>>> abc
>>> 1
>>> 1.05
>>> 
>>> 
>>>
>>> (Clicks is an int field; Boost is a float field, it's updated to reflect
>>> the change in popularity using a formula based on the number of clicks).
>>>
>>> At the moment in the dev environment, changes are committed immediately.
>>>
>>>
>>> When a document is updated, the changes are indeed reflected in the
>>> search results. If I click on the same document again, all goes well.
>>> But  when I click on an other document, the latter gets updated as
>>> expected but the former is plainly deleted. It can no longer be found
>>> and the admin core Overview page counts 1 document less. If I click on a
>>> 3rd document, so goes the 2nd one.
>>>
>>>
>>> The schema is the default one amended to remove unneeded fields and add
>>> new ones, nothing fancy. All fields are stored="true" and there's no
>>> . I've tried versions 5.2.1 & 5.3.1 in standalone mode, with
>>> the same outcome. It looks like a bug to me but I might have overlooked
>>> something? This is my first attempt at atomic updates.
>>>
>>> Thanks,
>>> John.
>>>



Re: Unexpected delayed document deletion with atomic updates

2015-10-07 Thread John Smith
Oh, I forgot Erick's mention of the logs: there's nothing unusual at
INFO level, the update request just gets mentioned. No exception. I
reran it at DEBUG level, but most of the log was related to Jetty.
Here's a line I noticed though:

org.apache.solr.servlet.HttpSolrCall; Closing out SolrRequest:
{wt=json&commit=true&update.chain=dedupe}

The update.chain parameter wasn't part of the original request, and
"dedupe" looks suspicious to me. Perhaps I should investigate further there?

Thanks,
John.


On 08/10/15 08:25, John Smith wrote:
> The ids are all different: they're unique numbers followed by a couple
> of keywords. I've made a test with a small collection of 10 documents to
> make sure I can manage them manually: all ids are confirmed as different.
>
> I also dumped the exact command, here's one example:
>
> 101084385_Sebago_ sebago shoes name="Clicks" update="set">1 update="set">1.8701925463775
>
> It's sent as the body of a POST request to
> http://127.0.0.1:8080/solr/ato_test/update?wt=json&commit=true, with a
> Content-Type: text/xml header. I still noted the consistent loss of
> another document with the update above.
>
> John
>
>
> On 08/10/15 00:38, Upayavira wrote:
>> What ID are you using? Are you possibly using the same ID field for
>> both, so the second document you visit causes the first to be
>> overwritten?
>>
>> Upayavira
>>
>> On Wed, Oct 7, 2015, at 06:38 PM, Erick Erickson wrote:
>>> This certainly should not be happening. I'd
>>> take a careful look at what you actually send.
>>> My _guess_ is that you're not sending the update
>>> command you think you are....
>>>
>>> As a test you could just curl (or use post.jar) to
>>> send these types of commands up individually.
>>>
>>> Perhaps looking at the solr log would help too...
>>>
>>> Best,
>>> Erick
>>>
>>> On Wed, Oct 7, 2015 at 6:32 AM, John Smith 
>>> wrote:
>>>> Hi,
>>>>
>>>> I'm bumping on the following problem with update XML messages. The idea
>>>> is to record the number of clicks for a document: each time, a message
>>>> is sent to .../update such as this one:
>>>>
>>>> 
>>>> 
>>>> abc
>>>> 1
>>>> 1.05
>>>> 
>>>> 
>>>>
>>>> (Clicks is an int field; Boost is a float field, it's updated to reflect
>>>> the change in popularity using a formula based on the number of clicks).
>>>>
>>>> At the moment in the dev environment, changes are committed immediately.
>>>>
>>>>
>>>> When a document is updated, the changes are indeed reflected in the
>>>> search results. If I click on the same document again, all goes well.
>>>> But  when I click on an other document, the latter gets updated as
>>>> expected but the former is plainly deleted. It can no longer be found
>>>> and the admin core Overview page counts 1 document less. If I click on a
>>>> 3rd document, so goes the 2nd one.
>>>>
>>>>
>>>> The schema is the default one amended to remove unneeded fields and add
>>>> new ones, nothing fancy. All fields are stored="true" and there's no
>>>> . I've tried versions 5.2.1 & 5.3.1 in standalone mode, with
>>>> the same outcome. It looks like a bug to me but I might have overlooked
>>>> something? This is my first attempt at atomic updates.
>>>>
>>>> Thanks,
>>>> John.
>>>>
>



Re: Unexpected delayed document deletion with atomic updates

2015-10-08 Thread John Smith
Yes indeed, the update chain had been activated... I commented it out
again and the problem vanished.

Good job, thanks Erick and Upayavira!
John


On 08/10/15 08:58, Upayavira wrote:
> Look for the DedupUpdateProcessor in an update chain.
>
> that is there, but commented out IIRC in the techproducts sample
> configs.
>
> Perhaps you uncommented it to use your own update processors, but didn't
> remove that component?
>
> On Thu, Oct 8, 2015, at 07:38 AM, John Smith wrote:
>> Oh, I forgot Erick's mention of the logs: there's nothing unusual in
>> INFO level, the update request just gets mentioned. No exception. I
>> reran it with the DEBUG level, but most of the log was related to jetty.
>> Here's a line I noticed though:
>>
>> org.apache.solr.servlet.HttpSolrCall; Closing out SolrRequest:
>> {wt=json&commit=true&update.chain=dedupe}
>>
>> The update.chain parameter wasn't part of the original request, and
>> "dedupe" looks suspicious to me. Perhaps should I investigate further
>> there?
>>
>> Thanks,
>> John.
>>
>>
>> On 08/10/15 08:25, John Smith wrote:
>>> The ids are all different: they're unique numbers followed by a couple
>>> of keywords. I've made a test with a small collection of 10 documents to
>>> make sure I can manage them manually: all ids are confirmed as different.
>>>
>>> I also dumped the exact command, here's one example:
>>>
>>> 101084385_Sebago_ sebago shoes>> name="Clicks" update="set">1>> update="set">1.8701925463775
>>>
>>> It's sent as the body of a POST request to
>>> http://127.0.0.1:8080/solr/ato_test/update?wt=json&commit=true, with a
>>> Content-Type: text/xml header. I still noted the consistent loss of
>>> another document with the update above.
>>>
>>> John
>>>
>>>
>>> On 08/10/15 00:38, Upayavira wrote:
>>>> What ID are you using? Are you possibly using the same ID field for
>>>> both, so the second document you visit causes the first to be
>>>> overwritten?
>>>>
>>>> Upayavira
>>>>
>>>> On Wed, Oct 7, 2015, at 06:38 PM, Erick Erickson wrote:
>>>>> This certainly should not be happening. I'd
>>>>> take a careful look at what you actually send.
>>>>> My _guess_ is that you're not sending the update
>>>>> command you think you are
>>>>>
>>>>> As a test you could just curl (or use post.jar) to
>>>>> send these types of commands up individually.
>>>>>
>>>>> Perhaps looking at the solr log would help too...
>>>>>
>>>>> Best,
>>>>> Erick
>>>>>
>>>>> On Wed, Oct 7, 2015 at 6:32 AM, John Smith 
>>>>> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm bumping on the following problem with update XML messages. The idea
>>>>>> is to record the number of clicks for a document: each time, a message
>>>>>> is sent to .../update such as this one:
>>>>>>
>>>>>> 
>>>>>> 
>>>>>> abc
>>>>>> 1
>>>>>> 1.05
>>>>>> 
>>>>>> 
>>>>>>
>>>>>> (Clicks is an int field; Boost is a float field, it's updated to reflect
>>>>>> the change in popularity using a formula based on the number of clicks).
>>>>>>
>>>>>> At the moment in the dev environment, changes are committed immediately.
>>>>>>
>>>>>>
>>>>>> When a document is updated, the changes are indeed reflected in the
>>>>>> search results. If I click on the same document again, all goes well.
>>>>>> But  when I click on an other document, the latter gets updated as
>>>>>> expected but the former is plainly deleted. It can no longer be found
>>>>>> and the admin core Overview page counts 1 document less. If I click on a
>>>>>> 3rd document, so goes the 2nd one.
>>>>>>
>>>>>>
>>>>>> The schema is the default one amended to remove unneeded fields and add
>>>>>> new ones, nothing fancy. All fields are stored="true" and there's no
>>>>>> . I've tried versions 5.2.1 & 5.3.1 in standalone mode, with
>>>>>> the same outcome. It looks like a bug to me but I might have overlooked
>>>>>> something? This is my first attempt at atomic updates.
>>>>>>
>>>>>> Thanks,
>>>>>> John.
>>>>>>



Re: Unexpected delayed document deletion with atomic updates

2015-10-08 Thread John Smith
After some further investigation, for those interested: the
SignatureUpdateProcessorFactory fields were somehow mis-configured (I
guess copied over from another collection). The initial import had been
made using a data import handler: I suppose the update chain isn't
called in this process and no signature field is created - am I right?

The first time a document was updated, a signature field with value
"" was added. The next time, the same signature was
generated for the new update, which triggered the deletion of all
documents with the same signature (i.e. the first one), as overwriteDupes
was set to true. Correct behavior, but quite tricky...

So my conclusion here (please correct me if I'm wrong) is of course to
fix the signature configuration problem, but also to arrange for the
update chain (or maybe a simplified one, e.g. one that skips logging) to be
called by the data import handler. Is there an easy way to do this?
Conceptually, shouldn't the update chain be callable from the data import
process - maybe it is?

John


On 08/10/15 09:43, Upayavira wrote:
> Yay!
>
> On Thu, Oct 8, 2015, at 08:38 AM, John Smith wrote:
>> Yes indeed, the update chain had been activated... I commented it out
>> again and the problem vanished.
>>
>> Good job, thanks Erick and Upayavira!
>> John
>>
>>
>> On 08/10/15 08:58, Upayavira wrote:
>>> Look for the DedupUpdateProcessor in an update chain.
>>>
>>> that is there, but commented out IIRC in the techproducts sample
>>> configs.
>>>
>>> Perhaps you uncommented it to use your own update processors, but didn't
>>> remove that component?
>>>
>>> On Thu, Oct 8, 2015, at 07:38 AM, John Smith wrote:
>>>> Oh, I forgot Erick's mention of the logs: there's nothing unusual in
>>>> INFO level, the update request just gets mentioned. No exception. I
>>>> reran it with the DEBUG level, but most of the log was related to jetty.
>>>> Here's a line I noticed though:
>>>>
>>>> org.apache.solr.servlet.HttpSolrCall; Closing out SolrRequest:
>>>> {wt=json&commit=true&update.chain=dedupe}
>>>>
>>>> The update.chain parameter wasn't part of the original request, and
>>>> "dedupe" looks suspicious to me. Perhaps should I investigate further
>>>> there?
>>>>
>>>> Thanks,
>>>> John.
>>>>
>>>>
>>>> On 08/10/15 08:25, John Smith wrote:
>>>>> The ids are all different: they're unique numbers followed by a couple
>>>>> of keywords. I've made a test with a small collection of 10 documents to
>>>>> make sure I can manage them manually: all ids are confirmed as different.
>>>>>
>>>>> I also dumped the exact command, here's one example:
>>>>>
>>>>> 101084385_Sebago_ sebago shoes>>>> name="Clicks" update="set">1>>>> update="set">1.8701925463775
>>>>>
>>>>> It's sent as the body of a POST request to
>>>>> http://127.0.0.1:8080/solr/ato_test/update?wt=json&commit=true, with a
>>>>> Content-Type: text/xml header. I still noted the consistent loss of
>>>>> another document with the update above.
>>>>>
>>>>> John
>>>>>
>>>>>
>>>>> On 08/10/15 00:38, Upayavira wrote:
>>>>>> What ID are you using? Are you possibly using the same ID field for
>>>>>> both, so the second document you visit causes the first to be
>>>>>> overwritten?
>>>>>>
>>>>>> Upayavira
>>>>>>
>>>>>> On Wed, Oct 7, 2015, at 06:38 PM, Erick Erickson wrote:
>>>>>>> This certainly should not be happening. I'd
>>>>>>> take a careful look at what you actually send.
>>>>>>> My _guess_ is that you're not sending the update
>>>>>>> command you think you are
>>>>>>>
>>>>>>> As a test you could just curl (or use post.jar) to
>>>>>>> send these types of commands up individually.
>>>>>>>
>>>>>>> Perhaps looking at the solr log would help too...
>>>>>>>
>>>>>>> Best,
>>>>>>> Erick
>>>>>>>
>>>>>>> On Wed, Oct 7, 2015 at 6:32 AM, John Smith 
>>>>>>> wrote:
>>>&g

Re: Unexpected delayed document deletion with atomic updates

2015-10-08 Thread John Smith
Well, every day we update a lot of documents (usually several million),
so the DIH is a good fit.

Calling the update chain would make sense there: after all, a data import
is just a batch update. Otherwise, the same operations would have to be
performed upfront, possibly in another environment and/or language. That's
probably what I'm going to do anyway.
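
If it turns out that the DIH honors update.chain (I haven't verified this,
so treat it purely as an assumption), pointing the handler at the chain
would presumably just be a default parameter in solrconfig.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">dedupe</str>
  </lst>
</requestHandler>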

Thanks for your help!
John


On 08/10/15 13:39, Upayavira wrote:
> You can either specify the update chain via an update.chain request
> parameter, or you can configure a new request parameter with its own URL
> and separate update.chain value. 
>
> I have no idea how you would then reference that in the DIH - I've never
> really used it.
>
> Upayavira
>
> On Thu, Oct 8, 2015, at 09:25 AM, John Smith wrote:
>> After some further investigation, for those interested: the
>> SignatureUpdateProcessorFactory fields were somehow mis-configured (I
>> guess copied over from another collection). The initial import had been
>> made using a data import handler: I suppose the update chain isn't
>> called in this process and no signature field is created - am I right?.
>>
>> The first time a document was updated, a signature field with value
>> "" was added. The next time, the same signature was
>> generated for the new udpate, which triggered the deletion of all
>> documents with the same signature (i.e. the first one) as overwriteDupes
>> was set to true. Correct behavior but quite tricky...
>>
>> So my conclusion here (please correct me if I'm wrong) is of course to
>> fix the signature configuration problem, but also to manage calling the
>> update chain (or maybe a simplified one, e.g. by skipping logging) in
>> the data import handler. Is there an easy way to do this? Conceptually,
>> shouldn't the update chain be callable from the data import process -
>> maybe it is?
>>
>> John
>>
>>
>> On 08/10/15 09:43, Upayavira wrote:
>>> Yay!
>>>
>>> On Thu, Oct 8, 2015, at 08:38 AM, John Smith wrote:
>>>> Yes indeed, the update chain had been activated... I commented it out
>>>> again and the problem vanished.
>>>>
>>>> Good job, thanks Erick and Upayavira!
>>>> John
>>>>
>>>>
>>>> On 08/10/15 08:58, Upayavira wrote:
>>>>> Look for the DedupUpdateProcessor in an update chain.
>>>>>
>>>>> that is there, but commented out IIRC in the techproducts sample
>>>>> configs.
>>>>>
>>>>> Perhaps you uncommented it to use your own update processors, but didn't
>>>>> remove that component?
>>>>>
>>>>> On Thu, Oct 8, 2015, at 07:38 AM, John Smith wrote:
>>>>>> Oh, I forgot Erick's mention of the logs: there's nothing unusual in
>>>>>> INFO level, the update request just gets mentioned. No exception. I
>>>>>> reran it with the DEBUG level, but most of the log was related to jetty.
>>>>>> Here's a line I noticed though:
>>>>>>
>>>>>> org.apache.solr.servlet.HttpSolrCall; Closing out SolrRequest:
>>>>>> {wt=json&commit=true&update.chain=dedupe}
>>>>>>
>>>>>> The update.chain parameter wasn't part of the original request, and
>>>>>> "dedupe" looks suspicious to me. Perhaps should I investigate further
>>>>>> there?
>>>>>>
>>>>>> Thanks,
>>>>>> John.
>>>>>>
>>>>>>
>>>>>> On 08/10/15 08:25, John Smith wrote:
>>>>>>> The ids are all different: they're unique numbers followed by a couple
>>>>>>> of keywords. I've made a test with a small collection of 10 documents to
>>>>>>> make sure I can manage them manually: all ids are confirmed as 
>>>>>>> different.
>>>>>>>
>>>>>>> I also dumped the exact command, here's one example:
>>>>>>>
>>>>>>> 101084385_Sebago_ sebago shoes>>>>>> name="Clicks" update="set">1>>>>>> update="set">1.8701925463775
>>>>>>>
>>>>>>> It's sent as the body of a POST request to
>>>>>>> http://127.0.0.1:8080/solr/ato_test/update?wt=json&commit=true, with a
>>>>>>> Content-Type: text/xml header. I still noted the c

Re: Unexpected delayed document deletion with atomic updates

2015-10-11 Thread John Smith
Hi Alessandro,

In the example I set the value to 1, but it's actually incremented in
the code, so with time it should go up. You're right though, I could use
an inc update instead.

John


On 08/10/15 16:45, Alessandro Benedetti wrote:
> Not related to the deletion problem, only as a curiosity for your use case :
>
> 1
>
> Have i misunderstood your use case, or you should use :
>
> inc
>
> Increments a numeric value by a specific amount.
>
> Must be specified as a single numeric value.
>
> Basically overtime you click, you always set the value for that field to
> "1" .
> So a document with 1 click will be considered equal to one with 1000 clicks.
> My 2 cents
>
> Cheers
>
> On 8 October 2015 at 14:10, John Smith  wrote:
>
>> Well, every day we update a lot of documents (usually several millions)
>> so the DIH is a good fit.
>>
>> Calling the update chain would make sense there: after all a data import
>> is just a batch update. Otherwise, the same operations would have to be
>> made upfront, possibly in another environment and/or language. That's
>> probably what I'm gonna do anyway.
>>
>> Thanks for your help!
>> John
>>
>>
>> On 08/10/15 13:39, Upayavira wrote:
>>> You can either specify the update chain via an update.chain request
>>> parameter, or you can configure a new request parameter with its own URL
>>> and separate update.chain value.
>>>
>>> I have no idea how you would then reference that in the DIH - I've never
>>> really used it.
>>>
>>> Upayavira
>>>
>>> On Thu, Oct 8, 2015, at 09:25 AM, John Smith wrote:
>>>> After some further investigation, for those interested: the
>>>> SignatureUpdateProcessorFactory fields were somehow mis-configured (I
>>>> guess copied over from another collection). The initial import had been
>>>> made using a data import handler: I suppose the update chain isn't
>>>> called in this process and no signature field is created - am I right?.
>>>>
>>>> The first time a document was updated, a signature field with value
>>>> "" was added. The next time, the same signature was
>>>> generated for the new udpate, which triggered the deletion of all
>>>> documents with the same signature (i.e. the first one) as overwriteDupes
>>>> was set to true. Correct behavior but quite tricky...
>>>>
>>>> So my conclusion here (please correct me if I'm wrong) is of course to
>>>> fix the signature configuration problem, but also to manage calling the
>>>> update chain (or maybe a simplified one, e.g. by skipping logging) in
>>>> the data import handler. Is there an easy way to do this? Conceptually,
>>>> shouldn't the update chain be callable from the data import process -
>>>> maybe it is?
>>>>
>>>> John
>>>>
>>>>
>>>> On 08/10/15 09:43, Upayavira wrote:
>>>>> Yay!
>>>>>
>>>>> On Thu, Oct 8, 2015, at 08:38 AM, John Smith wrote:
>>>>>> Yes indeed, the update chain had been activated... I commented it out
>>>>>> again and the problem vanished.
>>>>>>
>>>>>> Good job, thanks Erick and Upayavira!
>>>>>> John
>>>>>>
>>>>>>
>>>>>> On 08/10/15 08:58, Upayavira wrote:
>>>>>>> Look for the DedupUpdateProcessor in an update chain.
>>>>>>>
>>>>>>> that is there, but commented out IIRC in the techproducts sample
>>>>>>> configs.
>>>>>>>
>>>>>>> Perhaps you uncommented it to use your own update processors, but
>> didn't
>>>>>>> remove that component?
>>>>>>>
>>>>>>> On Thu, Oct 8, 2015, at 07:38 AM, John Smith wrote:
>>>>>>>> Oh, I forgot Erick's mention of the logs: there's nothing unusual in
>>>>>>>> INFO level, the update request just gets mentioned. No exception. I
>>>>>>>> reran it with the DEBUG level, but most of the log was related to
>> jetty.
>>>>>>>> Here's a line I noticed though:
>>>>>>>>
>>>>>>>> org.apache.solr.servlet.HttpSolrCall; Closing out SolrRequest:
>>>>>>>> {wt=json&commit=true&update.chain=dedupe}
>>>>>>

Best way to backup and restore an index for a cloud setup in 4.6.1?

2015-05-08 Thread John Smith
All,

With a cloud setup for a collection in 4.6.1, what is the most elegant way
to back up and restore an index?

We are specifically looking into the application of when doing a full
reindex, with the idea of building an index on one set of servers, backing
up the index, and then restoring that backup on another set of servers. Is
there a better way to rebuild indexes on another set of servers?

We are not sharding if that makes any difference.

Thanks,
g10vstmoney


Creating a collection with 1 shard gives a weird range

2016-05-17 Thread John Smith
I'm trying to create a collection starting with only one shard
(numShards=1) using a compositeId router. The purpose is to start small
and begin splitting shards when the index grows larger. The shard
created gets a weird range value: 8000-7fff, which doesn't look
right. Indeed, if I try to import some documents using a DIH, none
gets added.

If I create the same collection with 2 shards, the ranges seem more
logical (0-7fff & 8000-). In this case documents are
indexed correctly.

Is this behavior by design, i.e. is a minimum of 2 shards required? If
not, how can I create a working collection with a single shard?

This is Solr-6.0.0 in cloud mode with zookeeper-3.4.8.
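
For reference, the collection was created with something like this (names
changed):

http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=1&replicationFactor=2&router.name=compositeId&collection.configName=myconfig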

Thanks,
John


Re: Creating a collection with 1 shard gives a weird range

2016-05-18 Thread John Smith
On 17/05/16 11:56, Tom Evans wrote:
> On Tue, May 17, 2016 at 9:40 AM, John Smith  wrote:
>> I'm trying to create a collection starting with only one shard
>> (numShards=1) using a compositeID router. The purpose is to start small
>> and begin splitting shards when the index grows larger. The shard
>> created gets a weird range value: 8000-7fff, which doesn't look
>> effective. Indeed, if a try to import some documents using a DIH, none
>> gets added.
>>
>> If I create the same collection with 2 shards, the ranges seem more
>> logical (0-7fff & 8000-). In this case documents are
>> indexed correctly.
>>
>> Is this behavior by design, i.e. is a minimum of 2 shards required? If
>> not, how can I create a working collection with a single shard?
>>
>> This is Solr-6.0.0 in cloud mode with zookeeper-3.4.8.
>>
> I believe this is as designed, see this email from Shawn:
>
> https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201604.mbox/%3c570d0a03.5010...@elyograg.org%3E
>
> Cheers
>
> Tom

Thanks Tom, signed integers make sense here, I overlooked that -
classical mistake. I still have a problem with DIH though, I'll
investigate further.
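
Spelling it out for anyone else who trips over this: the hash range is a
signed 32-bit value, so

0x80000000 = -2147483648   (lowest possible hash)
0x7fffffff =  2147483647   (highest possible hash)

and a single shard with range 80000000-7fffffff therefore covers the whole
hash space; it only looks backwards if you read the bounds as unsigned.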

John


parent/child rows in solr

2018-09-07 Thread John Smith
Hi, I have a document structure like this (this is a made up schema, my
data has nothing to do with departments and employees, but the structure
holds true to my real data):

department 1
employee 11
employee 12
employee 13
room 11
room 12
room 13

department 2
employee 21
employee 22
room 21

... etc

I'm trying to figure out the best way to index this, and perform queries.
Due to the sheer volume of data, I cannot do a simple "flat file" approach,
repeating the header data for each child entry.

So that leaves me with "graph traversal" or "block joins". I've played with
both of those, but I'm running into various issues with each approach.

I need to be able to run filters on any or all of the header + child rows
in the same query, at once (can't seem to get that working in either graph
or block join). One problem I had with graph is that I can't force solr to
return the header, then all the children for that header, then the next
header + all its children, it just spits them out without keeping them
together. block join seems to return the children nested under the parents,
which is great, but then I can't seem to filter on parent + children in the
same query: I get the dreaded error message "Parent query must not match
any docs besides parent filter"
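
For reference, what I've been trying looks roughly like this (field names
are from the toy schema above, so purely illustrative):

q={!parent which="doc_type:department"}+doc_type:employee +emp_name:smith
fq=dept_region:east
fl=*,[child parentFilter=doc_type:department childFilter=doc_type:employee]

My understanding is that the error shows up as soon as the query inside
{!parent} matches documents that the "which" filter also treats as parents,
i.e. when the doc_type-style field doesn't cleanly separate parents from
children.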

Kinda lost here, any tips/suggestions?


Re: parent/child rows in solr

2018-09-07 Thread John Smith
Thanks Shawn for your comments. The reason why I don't want to go with a
flat file structure is all the wasted/duplicated data. If a department
has 100 employees, then it's very wasteful in terms of disk space to repeat
the header data over and over again, 100 times. In this example there are
only a few doc types, but my real-life data is much larger, and the problem
is one of scaling: with just a little bit of data, duplicating header fields
is no problem, but with massive amounts of data it becomes a large one.

My understanding of both graph traversal and block joins, is that the
header data would only be present once, so that's why I'm gravitating
towards those solutions. I just can't seem to line up the "fq" and queries
correctly such that I am able to join 3+ document types together, filter on
them, and return my requested columns.

On Fri, Sep 7, 2018 at 9:32 PM Shawn Heisey  wrote:

> On 9/7/2018 3:06 PM, John Smith wrote:
> > Hi, I have a document structure like this (this is a made up schema, my
> > data has nothing to do with departments and employees, but the structure
> > holds true to my real data):
> >
> > department 1
> >  employee 11
> >  employee 12
> >  employee 13
> >  room 11
> >  room 12
> >  room 13
> >
> > department 2
> >  employee 21
> >  employee 22
> >  room 21
> >
> > ... etc
> >
> > I'm trying to figure out the best way to index this, and perform queries.
> > Due to the sheer volume of data, I cannot do a simple "flat file"
> approach,
> > repeating the header data for each child entry.
>
> Why not?
>
> For the precise use case you have outlined, Solr will work better if you
> only have the child documents and simply have every document contain a
> "department" field which contains an identifier for the department.
> Since this precise structure is not what you are doing, you'll need to
> adapt what I'm saying to your actual data.
>
> The volume of data should be irrelevant to this decision. Solr will
> always work best with a flat document structure.
>
> I have never used the parent/child document feature in Solr, so I cannot
> offer any advice on it.  Somebody else will need to help you if you
> choose to use that feature.
>
> Thanks,
> Shawn
>
>


Re: parent/child rows in solr

2018-09-11 Thread John Smith
>
> On 9/7/2018 7:44 PM, John Smith wrote:
> > Thanks Shawn, for your comments. The reason why I don't want to go flat
> > file structure, is due to all the wasted/duplicated data. If a department
> > has 100 employees, then it's very wasteful in terms of disk space to
> repeat
> > the header data over and over again, 100 times. In this example there is
> > only a few doc types, but my real-life data is much larger, and the
> problem
> > is a "scaling" problem; with just a little bit of data, no problem in
> > duplicating header fields, but with massive amounts of data it's a large
> > problem.
>
> If your goal is data storage, then you are completely correct.  All that
> data duplication is something to avoid for a data storage situation.
> Normalizing your data so it's relational makes perfect sense, because
> most database software is designed to efficiently deal with those
> relationships.
>
> Solr is not designed as a data storage platform, and does not handle
> those relationships efficiently.  Solr's design goals are all about
> *search*.  It often gets touted as filling a NoSQL role ... but it's not
> something I would personally use as a primary data repository.  Search
> is a space where data duplication is expected and completely normal.
> This is something that people often have a hard time accepting.
>
>
I'm not actually trying to use solr as a data storage platform; all our
data is stored in an sql database, we are using solr strictly for the
search features, not storage features.

Here is a good example from a test I ran today. I have a header table, and
8 child tables which link directly to the header table. The children link
only to 1 header row, and they do not link to other children. So a 1:many
between header and each child. Some row counts:

header:    223,580

child1:    124,978
child2:    254,045
child3:    127,917
child4:  1,009,030
child5:    225,311
child6:    381,561
child7:    438,315
child8:     18,850


Trying to index that into Solr with a flat-file schema blows up into
5,475,316,072 rows. Yes, 5.5 billion rows. I calculated that by running a
left outer join between header and each child and getting a row count in
the database. That's not going to scale, at all, considering the small size
of the source input tables. Some of our indexes would require 50 million
header rows alone, never mind the child tables.

So Solr has no way of indexing something like this? I can't believe I would
be the first person to run into this issue; I have a feeling I'm missing
something obvious somewhere.


Re: parent/child rows in solr

2018-09-11 Thread John Smith
On Tue, Sep 11, 2018 at 9:32 PM Shawn Heisey  wrote:

> On 9/11/2018 7:07 PM, John Smith wrote:
> > header:  223,580
> >
> > child1:  124,978
> > child2:  254,045
> > child3:  127,917
> > child4:1,009,030
> > child5:  225,311
> > child6:  381,561
> > child7:  438,315
> > child8:   18,850
> >
> >
> > Trying to index that into solr with a flatfile schema, blows up into
> > 5,475,316,072 rows. Yes, 5.5 billion rows. I calculated that by running a
>
> I think you're not getting what I'm suggesting.  Or maybe there's an
> aspect of your data that I'm not understanding.
>
> If we add up all those numbers for the child docs, there are 2.5 million
> of them.  So you would have 2.5 million docs in Solr.  I have created
> Solr indexes far larger than this, and I do not consider my work to be
> "big data".  Solr can handle 2.5 million docs easily, as long as the
> hardware resources are sufficient.
>
> Where the data duplication will come in is in additional fields in those
> 2.5 million docs.  Each one will contain some (or maybe all) of the data
> that WOULD have been in the parent document.  The amount of data
> balloons, but the number of documents (rows) doesn't.
>
> That kind of arrangement is usually enough to accomplish whatever is
> needed.  I cannot assume that it will work for your use case, but it
> does work for most.
>
> Thanks,
> Shawn
>
>
The problem is that the math isn't a simple case of adding up all the row
counts. These are "left outer join"s. In sql, it would be this query:

select * from header h
left outer join child1 c1 on c1.hid = h.id
left outer join child2 c2 on c2.hid = h.id
...
left outer join child8 c8 on c8.hid = h.id


If there are 10 rows in child1 linked to 1 header with id "abc", and 10
rows in child2 linked to that same header, then we end up with 10 * 10 rows
in solr, not 20. Considering there are 8 child tables in this example,
there is simply an explosion of data.

I can't describe it much better than that in the abstract, though perhaps I
could put together a simple example with live data. Suffice it to say that
the row counts above are all from live data in a relatively small database
of ours, and the final count of 5.5 billion was calculated inside sql by
wrapping the join above in a count:

select count(*) from (
select id from header h
left outer join child1 c1 on c1.hid = h.id
left outer join child2 c2 on c2.hid = h.id
...
left outer join child8 c8 on c8.hid = h.id
) tmp;


Re: parent/child rows in solr

2018-09-11 Thread John Smith
On Tue, Sep 11, 2018 at 11:00 PM Shawn Heisey  wrote:

> On 9/11/2018 8:35 PM, John Smith wrote:
> > The problem is that the math isn't a simple case of adding up all the row
> > counts. These are "left outer join"s. In sql, it would be this query:
>
> I think we'll just have to conclude that I do not understand what you
> are doing.  I have no idea what "left outer join" even means, how it's
> different than a join that's NOT "left outer".
>
> I will say this:  Solr is not very efficient at joins, and there are a
> bunch of caveats involved.  It's usually better to go with a flat
> document space for a search engine.
>
> Thanks,
> Shawn
>
>
A "left outer join" in sql is a join such that if there is no match in the
child table for a given header id, then the child cells are returned as
"null" values, instead of the header row being removed from the result set
(which is what happens in "inner join" or standard sql join).

A good rundown on the various sql joins:
https://stackoverflow.com/questions/38549/what-is-the-difference-between-inner-join-and-outer-join


Re: parent/child rows in solr

2018-09-11 Thread John Smith
On Tue, Sep 11, 2018 at 11:05 PM Walter Underwood 
wrote:

> Have you tried modeling it with multivalued fields?
>
>
That's an interesting idea, but I don't think it would work: we would
lose the concept of "rows". Say child1 has col "a" and col "b", and
both are turned into multivalued fields in the Solr index. Normally in SQL
we can query for a specific value in col "a" and then see what the
associated value in col "b" is, but we can't do that once the column
values are stuffed into multivalued fields; we can no longer tell which value
from col "a" corresponds to which value in col "b". I'm probably explaining
that poorly, but I just don't see how it would work.
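As a made-up illustration of what I mean: if the child1 rows for one header
are (a=5, b=100) and (a=7, b=200), the flattened document would just have
a = [5, 7] and b = [100, 200] as multivalued fields, and nothing in the
index records whether 5 pairs with 100 or with 200; a query like a:5 can no
longer tell me that the matching b value was 100.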


statistics in hitlist

2018-02-23 Thread John Smith
I'm using Solr, and enabling stats as per this page:
https://lucene.apache.org/solr/guide/6_6/the-stats-component.html

I want to get more stat values, though. Specifically I'm looking for
r-squared (the coefficient of determination). This value is not present in
Solr; however, some of the pieces used to calculate r^2 are in the stats
element, for example:

<double name="min">0.0</double>
<double name="max">10.0</double>
<long name="count">15</long>
<long name="missing">17</long>
<double name="sum">85.0</double>
<double name="sumOfSquares">603.0</double>
<double name="mean">5.667</double>
<double name="stddev">2.943920288775949</double>


So I have the sumOfSquares available (SST), and using this calculation, I
can get R^2:

R^2 = 1 - SSE/SST

All I need then is SSE. Is there any way I can get SSE from those other
stats in Solr?
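(For what it's worth, my understanding is that the sumOfSquares reported by
the stats component is the raw sum of x^2, not the sum of squares about the
mean, so SST would have to be derived from it, e.g.

SST = sumOfSquares - count * mean^2 = 603.0 - 15 * 5.667^2 ~ 121.3

using the sample output above. Please correct me if I have that wrong.)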

Thanks in advance!


Re: statistics in hitlist

2018-02-23 Thread John Smith
Hi Joel, thanks for the answer. I'm not really a stats guy, but the end
result of all this is supposed to be obtaining R^2. Is there no way of
obtaining this value, then (short of iterating over all the results in the
hitlist and calculating it myself)?

On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein  wrote:

> Typically SSE is the sum of the squared errors of the prediction in a
> regression analysis. The stats component doesn't perform regression,
> although it might be a nice feature.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Feb 23, 2018 at 12:17 PM, John Smith  wrote:
>
> > I'm using solr, and enabling stats as per this page:
> > https://lucene.apache.org/solr/guide/6_6/the-stats-component.html
> >
> > I want to get more stat values though. Specifically I'm looking for
> > r-squared (coefficient of determination). This value is not present in
> > solr, however some of the pieces used to calculate r^2 are in the stats
> > element, for example:
> >
> > <double name="min">0.0</double>
> > <double name="max">10.0</double>
> > <long name="count">15</long>
> > <long name="missing">17</long>
> > <double name="sum">85.0</double>
> > <double name="sumOfSquares">603.0</double>
> > <double name="mean">5.667</double>
> > <double name="stddev">2.943920288775949</double>
> >
> >
> > So I have the sumOfSquares available (SST), and using this calculation, I
> > can get R^2:
> >
> > R^2 = 1 - SSE/SST
> >
> > All I need then is SSE. Is there anyway I can get SSE from those other
> > stats in solr?
> >
> > Thanks in advance!
> >
>


Re: statistics in hitlist

2018-03-01 Thread John Smith
Joel, thanks for the pointers to the streaming feature. I had no idea Solr
had that (and I also just discovered the very interesting SQL feature! I will
be sure to investigate that in more detail in the future).

However, I'm having some trouble getting basic streaming functions working.
I've already figured out that I had to move to "solr cloud" instead of
"solr standalone", because I was getting errors about "cannot find zk
instance" (or similar) which went away when using "solr start -c" instead.

But now I'm trying to use the random function, since that was one of the
functions used in your example:

random(tx_header, q="*:*", rows="100", fl="countyname")

I posted that directly in the "stream" section of the Solr admin UI. This
is all on Linux, with Solr 7.1.0 and 7.2.1 (I tried several versions in case
it was a bug in one).
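(For reference, the command-line equivalent of the admin UI's stream section
should be roughly the following, posting the expression to the collection's
/stream handler -- the host and collection name here are just my local setup:

curl --data-urlencode 'expr=random(tx_header, q="*:*", rows="100", fl="countyname")' \
  http://localhost:8983/solr/tx_header/stream

)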

I get back an error message:
*sort param could not be parsed as a query, and is not a field that exists
in the index: random_-255009774*

I'm not passing in any sort field anywhere, but the Solr logs show these
three log entries:

2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header s:shard1
r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
[tx_header_shard1_replica_n1]  webapp=/solr path=/select
params={q=*:*&_stateVer_=tx_header:6&fl=countyname
*&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2} status=400
QTime=19

2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient
Request to collection [tx_header] failed due to (400)
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://192.168.13.31:8983/solr/tx_header: sort param could
not be parsed as a query, and is not a field that exists in the index:
random_-255009774, retry? 0

2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header s:shard1
r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.s.ExceptionStream
java.io.IOException:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://192.168.13.31:8983/solr/tx_header: sort param could
not be parsed as a query, and is not a field that exists in the index:
random_-255009774


So basically it looks like Solr is injecting the "sort=random_..." parameter
into my query, and of course the search fails because that
field/column doesn't exist in my schema. Every time I run the random
function, it injects a slightly different field name, but they
all start with "random_".

I have tried adding my own sort field instead, hoping Solr wouldn't inject
one for me, but it still injected a random sort field name:

random(tx_header, q="*:*", rows="100", fl="countyname", sort="countyname
asc")


Assuming I can fix that whole problem, my second question is: can I add
multiple "fq=" parameters to the random function? I build a pretty
complicated query using many fq= filters and then want to run some stats on
the resulting hitlist, so somehow I have to pass the query that produced the
exact hitlist into these various functions. When I used multiple "fq=" values,
it only seemed to use the last one I specified and ignored all the
previous fq's.

Thanks in advance for any comments/suggestions...!




On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein  wrote:

> This is going to be a complex answer because Solr actually now has multiple
> ways of doing regression analysis as part of the Streaming Expression
> statistical programming library. The basic documentation is here:
>
> https://lucene.apache.org/solr/guide/7_2/statistical-programming.html
>
> Here is a sample expression that performs a simple linear regression in
> Solr 7.2:
>
> let(a=random(collection1, q="any query", rows="15000", fl="fieldA,
> fieldB"),
> b=col(a, fieldA),
> c=col(a, fieldB),
> d=regress(b, c))
>
>
> The expression above takes a random sample of 15000 results from
> collection1. The result set will include fieldA and fieldB in each record.
> The result set is stored in variable "a".
>
> Then the "col" function creates arrays of numbers from the results stored
> in variable a. The values in fieldA are stored in the variable "b". The
> values in fieldB are stored in variable "c".
>
> Then the regress function performs a simple linear regression on arrays
> stored in variables "b" and "c".
>
> The output of the regress function is a map containing the regression
> result. This result includes RSquared and other attributes of the
> regression model such as R (correlation), slope, y intercept etc...
>
>
>
>

Re: statistics in hitlist

2018-03-05 Thread John Smith
Thanks Joel for your help on this.

What I've done so far:
- unzipped the downloaded solr-7.2
- modified the _default "managed-schema" to add the random field type and the
dynamic random field (sketched just below)
- started Solr 7 using "solr start -c"
- indexed my data using pint/pdouble/boolean field types etc.
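(The two schema additions mentioned above are, from memory, along these
lines -- the standard RandomSortField setup; the exact attributes may differ:)

<fieldType name="random" class="solr.RandomSortField" indexed="true"/>
<dynamicField name="random_*" type="random"/>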

I can now run the random function all by itself, and it returns random
results as expected. So far so good!

However... now trying to get the regression stuff working:

let(a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15000",
fl="oil_first_90_days_production,oil_last_30_days_production"),
b=col(a, oil_first_90_days_production),
c=col(a, oil_last_30_days_production),
d=regress(b, c))

I posted that directly into the Solr admin UI. When I run the streaming
expression, I get this error message:
"EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
expected but found type java.lang.String for value
oil_first_90_days_production"

It thinks my numeric field is defined as a string? But when I view the
schema, those 2 fields are defined as ints:


When I run a normal query and choose XML as the output format, it also
puts "int" elements into the hitlist, so the schema appears to be correct;
it's only when using this regress function that something goes wrong and
Solr thinks the field is a string.

Any suggestions?
Thanks!


On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein  wrote:

> The field type will also need to be in the schema:
>
>  
>
> 
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein  wrote:
>
> > You'll need to have this field in your schema:
> >
> > 
> >
> > I'll check to see if the default schema used with solr start -c has this
> > field, if not I'll add it. Thanks for pointing this out.
> >
> > I checked and right now the random expression is only accepting one fq,
> > but I consider this a bug. It should accept multiple. I'll create ticket
> > for getting this fixed.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Mar 1, 2018 at 4:55 PM, John Smith  wrote:
> >
> >> Joel, thanks for the pointers to the streaming feature. I had no idea
> solr
> >> had that (and also just discovered the very intersting sql feature! I
> will
> >> be sure to investigate that in more detail in the future).
> >>
> >> However I'm having some trouble getting basic streaming functions
> working.
> >> I've already figured out that I had to move to "solr cloud" instead of
> >> "solr standalone" because I was getting errors about "cannot find zk
> >> instance" or whatever which went away when using "solr start -c"
> instead.
> >>
> >> But now I'm trying to use the random function since that was one of the
> >> functions used in your example.
> >>
> >> random(tx_header, q="*:*", rows="100", fl="countyname")
> >>
> >> I posted that directly in the "stream" section of the solr admin UI.
> This
> >> is all on linux, with solr 7.1.0 and 7.2.1 (tried several versions in
> case
> >> it was a bug in one)
> >>
> >> I get back an error message:
> >> *sort param could not be parsed as a query, and is not a field that
> exists
> >> in the index: random_-255009774*
> >>
> >> I'm not passing in any sort field anywhere. But the solr logs show these
> >> three log entries:
> >>
> >> 2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header s:shard1
> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
> >> [tx_header_shard1_replica_n1]  webapp=/solr path=/select
> >> params={q=*:*&_stateVer_=tx_header:6&fl=countyname
> >> *&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2} status=400
> >> QTime=19
> >>
> >> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient
> >> Request to collection [tx_header] failed due to (400)
> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> >> Error
> >> from server at http://192.168.13.31:8983/solr/tx_header: sort param
> could
> >> not be parsed as a query, and is not a field that exists in the index:
> >> random_-255009774, retry? 0
> >>
> >> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_head

Re: statistics in hitlist

2018-03-15 Thread John Smith
Hi Joel, I did some more work on this statistics stuff today. Yes, we do
have nulls in our data; the documents contain many fields and we don't always
have values for each field, but we can't set the nulls to 0 (or any
other value, really) either, as that would mess up other calculations (such
as calculating the average); we would normally just ignore fields with null
values when calculating stats manually ourselves.

Adding a check in the "q" parameter to ensure that the fields used in the
calculations are > 0 does work now. Thanks for the tip (and sorry, I should
have caught that myself). But I am unable to use "fq" for these checks;
they have to be added to the q instead. Adding fq's doesn't have any effect.


Anyway, I'm trying to change this up a little. This is what I'm currently
using (switched from "random" to "search", since I actually need the full
hitlist, not just a random subset):

let(a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
fq="isParent:true", rows="150",
fl="id,oil_first_90_days_production,oil_last_30_days_production", sort="id
asc"),
 b=col(a, oil_first_90_days_production),
 c=col(a, oil_last_30_days_production),
 d=regress(b, c))

So I have 2 fields defined there, and that works great (in terms of a test
and running the query); but I need to replace the second field,
"oil_last_30_days_production", with the average value of
oil_first_90_days_production.

I can get the average with this expression:

stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
fq="isParent:true", rows="150", avg(oil_first_90_days_production))

But I don't know how to push that average value into the first streaming
expression; I'm guessing I have to set "c=" to it, but that is where I'm
getting lost, since avg returns only 1 value while the first parameter, "b",
returns a list of sorts. Somehow I have to get the average value stuffed
inside a "col", where it is the same value for every row in the hitlist...?

Thanks for your help!


On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein  wrote:

> I suspect you've got nulls in your data. I just tested with null values and
> got the same error. For testing purposes try loading the data with default
> values of zero.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein 
> wrote:
>
> > Let's break the expression down and build it up slowly. Let's start with:
> >
> > let(echo="true",
> >  a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15",
> > fl="oil_first_90_days_production,oil_last_30_days_production"),
> >  b=col(a, oil_first_90_days_production))
> >
> >
> > This should return variables a and b. Let's see what the data looks like.
> > I changed the rows from 15 to 15000. If it all looks good we can expand
> the
> > rows and continue adding functions.
> >
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Mon, Mar 5, 2018 at 4:11 PM, John Smith  wrote:
> >
> >> Thanks Joel for your help on this.
> >>
> >> What I've done so far:
> >> - unzip downloaded solr-7.2
> >> - modify the _default "managed-schema" to add the random field type and
> >> the dynamic random field
> >> - start solr7 using "solr start -c"
> >> - indexed my data using pint/pdouble/boolean field types etc
> >>
> >> I can now run the random function all by itself, it returns random
> >> results as expected. So far so good!
> >>
> >> However... now trying to get the regression stuff working:
> >>
> >> let(a=random(tx_prod_production, q="*:*", fq="isParent:true",
> >> rows="15000", fl="oil_first_90_days_producti
> >> on,oil_last_30_days_production"),
> >> b=col(a, oil_first_90_days_production),
> >> c=col(a, oil_last_30_days_production),
> >> d=regress(b, c))
> >>
> >> Posted directly into solr admin UI. Run the streaming expression and I
> >> get this error message:
> >> "EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
> >> expected but found type java.lang.String for value
> >> oil_first_90_days_production"
> >>
> >> It thinks my numeric field is defined as a string? But when I view the
> >> schema, those 2 fields are defined as ints:
> >>
> >>
> >> When I

Re: statistics in hitlist

2018-03-16 Thread John Smith
Thanks for the link to the documentation; that will probably come in useful.

I didn't see a way, though, to get my avg function working. So instead of
doing a linear regression on two fields, X and Y, in a hitlist, we need to
do a linear regression on field X against the average value of X. Is that
possible? Can I pass a function into the regress function instead of a field?





On Thu, Mar 15, 2018 at 10:41 PM, Joel Bernstein  wrote:

> I've been working on the user guide for the math expressions. Here is the
> page on regression:
>
> https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_
> documentation/solr/solr-ref-guide/src/regression.adoc
>
> This page is part of the larger math expression documentation. The TOC is
> here:
>
> https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_
> documentation/solr/solr-ref-guide/src/math-expressions.adoc
>
> The docs are still very rough but you can get an idea of the coverage.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Mar 15, 2018 at 10:26 PM, Joel Bernstein 
> wrote:
>
> > If you want to get everything in query you can do this:
> >
> > let(echo="d,e",
> >  a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO
> > *]",
> > fq="isParent:true", rows="150",
> > fl="id,oil_first_90_days_production,oil_last_30_days_production",
> sort="id
> > asc"),
> >  b=col(a, oil_first_90_days_production),
> >  c=col(a, oil_last_30_days_production),
> >  d=regress(b, c),
> >  e=someExpression())
> >
> > The echo parameter tells the let expression which variables to output.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Mar 15, 2018 at 3:13 PM, Erick Erickson  >
> > wrote:
> >
> >> What does the fq clause look like?
> >>
> >> On Thu, Mar 15, 2018 at 11:51 AM, John Smith 
> >> wrote:
> >> > Hi Joel, I did some more work on this statistics stuff today. Yes, we
> do
> >> > have nulls in our data; the document contains many fields, we don't
> >> always
> >> > have values for each field, but we can't set the nulls to 0 either (or
> >> any
> >> > other value, really) as that will mess up other calculations (such as
> >> when
> >> > calculating average etc); we would normally just ignore fields with
> null
> >> > values when calculating stats manually ourselves.
> >> >
> >> > Adding a check in the "q" parameter to ensure that the fields used in
> >> the
> >> > calculations are > 0 does work now. Thanks for the tip (and sorry,
> >> should
> >> > have caught that myself). But I am unable to use "fq" for these
> checks,
> >> > they have to be added to the q instead. Adding fq's doesn't have any
> >> effect.
> >> >
> >> >
> >> > Anyway, I'm trying to change this up a little. This is what I'm
> >> currently
> >> > using (switched from "random" to "search" since I actually need the
> full
> >> > hitlist not just a random subset):
> >> >
> >> > let(a=search(tx_prod_production, q="oil_first_90_days_production:[1
> TO
> >> *]",
> >> > fq="isParent:true", rows="150",
> >> > fl="id,oil_first_90_days_production,oil_last_30_days_production",
> >> sort="id
> >> > asc"),
> >> >  b=col(a, oil_first_90_days_production),
> >> >  c=col(a, oil_last_30_days_production),
> >> >  d=regress(b, c))
> >> >
> >> > So I have 2 fields there defined, that works great (in terms of a test
> >> and
> >> > running the query); but I need to replace the second field,
> >> > "oil_last_30_days_production" with the avg value in
> >> > oil_first_90_days_production.
> >> >
> >> > I can get the avg with this expression:
> >> > stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> >> > fq="isParent:true", rows="150", avg(oil_first_90_days_
> production))
> >> >
> >> > But I don't know how to push that avg value into the first streaming
> >> > expression; guessing I have to set "c=" but that is where I'm
> >> getting
> >> > los