I'm looking to use Solr to search over the bytecode in classes and JARs.
Does anyone know of, or have experience with, Analyzers, Tokenizers, and Token
Filters suited to such a task?
Regards
Mark
ging.LogRecord);
> > public java.lang.String _format(java.util.logging.LogRecord);
> > public java.lang.String getHead(java.util.logging.Handler);
> > public java.lang.String getTail(java.util.logging.Handler);
> > public java.lang.String formatMessage(java.util.logging.LogRec
https://searchcode.com/
looks really interesting; however, I want to extract as many searchable
aspects as possible from jars sitting on a classpath or under a project structure...
It's really early days, so I'm open to any suggestions.
On 8 May 2015 at 22:09, Mark wrote:
> To answer why bytecode -
Erik,
Thanks for the pretty much OOTB approach.
I think I'm going to just try a range of approaches, and see how far I get.
The "IDE does this suggestion" would be worth looking into as well.
On 8 May 2015 at 22:14, Mark wrote:
>
> https://searchcode.com/
>
&g
Hi Alexandre,
Solr & ASM is the extact poblem I'm looking to hack about with so I'm keen
to consider any code no matter how ugly or broken
Regards
Mark
On 9 May 2015 at 10:21, Alexandre Rafalovitch wrote:
> If you only have classes/jars, use ASM. I have done this before,
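For reference, a rough sketch of the ASM route: a visitor that pulls the class and
method names out of a .class stream and drops them into a SolrInputDocument. This
assumes ASM 5's visitor API, and the field names (id, class_name, super_class,
methods) are made up for illustration.

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.common.SolrInputDocument;
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

public class ClassIndexer {

    // Builds a Solr document from a single .class stream using an ASM visitor.
    // Field names (id, class_name, super_class, methods) are illustrative only.
    public static SolrInputDocument toDoc(InputStream classBytes) throws IOException {
        final SolrInputDocument doc = new SolrInputDocument();
        final List<String> methods = new ArrayList<String>();

        ClassReader reader = new ClassReader(classBytes);
        reader.accept(new ClassVisitor(Opcodes.ASM5) {
            @Override
            public void visit(int version, int access, String name, String signature,
                              String superName, String[] interfaces) {
                doc.addField("id", name);                        // internal name, e.g. java/util/logging/Formatter
                doc.addField("class_name", name.replace('/', '.'));
                doc.addField("super_class", superName);
            }

            @Override
            public MethodVisitor visitMethod(int access, String name, String descriptor,
                                             String signature, String[] exceptions) {
                methods.add(name + descriptor);                  // e.g. getHead(Ljava/util/logging/Handler;)Ljava/lang/String;
                return null;                                     // no need to visit instruction bodies
            }
        }, ClassReader.SKIP_CODE);

        doc.addField("methods", methods);                        // multi-valued field
        return doc;
    }
}

The resulting document can then be sent with SolrJ's add(); how much more to pull
out (fields, annotations, string constants) depends on how much searchable surface
you want.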
Can you configure the number of shards per collection, or is this a system-wide
setting affecting all collections/indexes?
Thanks
If I create my collection via the ZkCLI
(https://cwiki.apache.org/confluence/display/solr/Command+Line+Utilities) how
do I configure the number of shards and replicas?
Thanks
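For what it's worth, as far as I know ZkCLI only uploads the configset; the shard
count and replication factor are per-collection parameters on the Collections API
CREATE call. A minimal sketch of invoking it from Java, with the host, collection
name, and config name assumed:

import java.io.InputStream;
import java.net.URL;

public class CreateCollection {
    public static void main(String[] args) throws Exception {
        // Host, collection name and config name are assumptions; numShards and
        // replicationFactor are set per collection on the CREATE call.
        String url = "http://localhost:8983/solr/admin/collections?action=CREATE"
                + "&name=mycollection"
                + "&numShards=2"
                + "&replicationFactor=2"
                + "&collection.configName=myconf";
        InputStream in = new URL(url).openStream();
        try {
            while (in.read() != -1) {
                // drain the status response; Solr replies with an XML/JSON body
            }
        } finally {
            in.close();
        }
    }
}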
I'm looking to index some extracted Outlook messages (*.msg).
I notice that .msg isn't one of the default file types, so I tried the following:
java -classpath dist/solr-core-4.10.3.jar -Dtype=application/vnd.ms-outlook
org.apache.solr.util.SimplePostTool C:/temp/samplemsg/*.msg
That didn't work.
However:
-F
"myfile=@6252671B765A1748992DF1A6403BDF81A4A22C00.msg"
Regards
Mark
On 26 January 2015 at 21:47, Alexandre Rafalovitch
wrote:
> Seems like an apples-to-oranges comparison here.
>
> I would try giving an explicit end point (.../extract), a single
> message, and a literal id for the SimplePostTool and
I think I may just extend SimplePostTool or look to use Solr Cell perhaps?
On 26 January 2015 at 22:14, Alexandre Rafalovitch
wrote:
> Well, you are NOT posting to the same URL.
>
>
> On 26 January 2015 at 17:00, Mark wrote:
> > http://localhost:8983/solr/update
>
>
>
> --
rse a folder means that it requires an ID strategy - which I believe
is lacking.
Regards
Mark
On 27 January 2015 at 10:57, Erik Hatcher wrote:
> Try adding -Dauto=true and take away setting url. The type probably isn't
> needed then either.
>
> With the new Solr 5 bin/
ested.
Thanks for everyone's suggestions.
Regards
Mark
On 27 January 2015 at 18:01, Alexandre Rafalovitch
wrote:
> Your IDs seem to be the file names, which you are probably also getting
> from your parsing the file. Can't you just set (or copyField) that as an ID
> on the Solr side
mimeMap.put("msg", "application/vnd.ms-outlook");
Regards
Mark
On 27 January 2015 at 18:39, Mark wrote:
> Hi Alex,
>
> On an individual file basis that would work, since you could set the ID on
> an individual basis.
>
> However recursing a folder it doesn
Is it possible to use curl to upload a document (for extract & indexing)
and specify some fields on the fly?
Sort of:
1) index this document
2) by the way, here are some important facets while you're at it
Regards
Mark
ushed
> to solr. Create the SID from the existing doc, add any additional fields,
> then add to solr.
>
> On Wed, Jan 28, 2015 at 11:56 AM, Mark wrote:
>
> > Is it possible to use curl to upload a document (for extract & indexing)
> > and specify some fields on the fly?
On second thoughts, SID is purely i/p as its name suggests :)
I think a better approach would be
1) curl to upload/extract passing docID
2) curl to update additional fields for that docID
On 28 January 2015 at 17:30, Mark wrote:
>
> "Create the SID from the existing doc" implies
I'm looking to
1) upload a binary document using curl
2) add some additional facets
Specifically, my question is: can this be achieved in one curl operation, or
does it need two?
On 28 January 2015 at 17:43, Mark wrote:
>
> Second thoughts SID is purely i/p as its name suggests :)
&g
The use case is:
use curl to upload/extract/index a document, passing in additional facets not
present in the document, e.g. literal.source="old system"
In this way some fields come from the uploaded, extracted content and some
fields are specified in the curl URL.
Hope that's clearer?
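For what it's worth, the single-request version of that use case might look roughly
like this in SolrJ (assuming the /update/extract handler and SolrJ 4.x; the id and
source values are illustrative). The same literal.* parameters can be appended to
the curl URL instead.

import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractWithLiterals {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("help.pdf"), "application/pdf");
        req.setParam("literal.id", "doc-1");            // id supplied by the caller
        req.setParam("literal.source", "old system");   // extra facet not present in the document
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        solr.request(req);                              // extract + index + extra fields in one call
    }
}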
Regards
field
'stuff'","code":400}}
..getting closer..
On 28 January 2015 at 18:03, Mark wrote:
>
> Use case is
>
> use curl to upload/extract/index document passing in additional facets not
> present in the document e.g. literal.source="old system"
>
&
ture=div&fmap.div=foo_txt&boost.foo_txt=3&literal.blah_s=Bah";
-F "tutorial=@"help.pdf
and so I learned that you can't update a field that isn't in the original
document, which is what I was trying to do before.
Regards
Mark
On 28 January 2015 at 18:38, Alexandr
How would I go about doing something like this? I'm not sure if this is something
that can be accomplished on the index side or if it's something that should be done
in our application.
Say we are an online store for shoes and we are selling Product A in red, blue
and green. Is there a way when we sea
e on it outside of
> solr.
>
>
> On Thu, Jul 25, 2013 at 10:12 PM, Mark wrote:
>
>> How would I go about doing something like this. Not sure if this is
>> something that can be accomplished on the index side or its something that
>> should be done in our application.
Can someone explain how one would go about providing alternative searches for a
query… similar to Amazon.
For example say I search for "Red Dump Truck"
- 0 results for "Red Dump Truck"
- 500 results for "Red Truck"
- 350 results for "Dump Truck"
Does this require multiple searches?
Thanks
We have a set number of known terms we want to match against.
In Index:
"term one"
"term two"
"term three"
I know how to match all terms of a user query against the index, but we would
like to know how (or if) we can match a user's query against all of the terms in the
index.
Search Queries:
"my search
That was it… thanks
On Aug 2, 2013, at 3:27 PM, Shawn Heisey wrote:
> On 8/2/2013 4:16 PM, Robert Zotter wrote:
>> The problem is the query gets expanded to "1 Foo" not ("1" OR "Foo")
>>
>> 1Foo
>> 1Foo
>> +DisjunctionMaxQuery((name_textsv:"1 foo")) ()
>> +(name_textsv:"1 foo") ()
>>
>> DisM
;t match against indexed documents.
>
> Solr does support Lucene's "min should match" feature so that you can
> specify, say, four query terms and return if at least two match. This is the
> "mm" parameter.
>
> See:
> http://wiki.apache.org/solr/ExtendedDisMax#mm
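For reference, a minimal sketch of the mm approach from SolrJ, with a made-up query
and field name; mm can also be given as a percentage such as 75%.

import org.apache.solr.client.solrj.SolrQuery;

public class MinShouldMatchExample {
    // Hypothetical query and field name; the point is the edismax "mm" parameter.
    public static SolrQuery build() {
        SolrQuery q = new SolrQuery("red dump truck");
        q.set("defType", "edismax");
        q.set("qf", "title");
        q.set("mm", "2");   // require at least two of the query terms to match
        return q;
    }
}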
y" wrote:
> Fine, then write the query that way: +foo +bar baz
>
> But it still doesn't sound as if any of this relates to prospective
> search/percolate.
>
> -- Jack Krupansky
>
> -Original Message- From: Mark
> Sent: Monday, August 05, 2013 2:1
Ok forget the mention of percolate.
We have a large list of known keywords we would like to match against.
Product keyword: "Sony"
Product keyword: "Samsung Galaxy"
We would like to be able to detect, given a product title, whether or not it
matches any known keywords. For a keyword to be mat
idworks.com
>
> On Fri, Aug 9, 2013 at 8:19 AM, Erick Erickson
> wrote:
>> This _looks_ like simple phrase matching (no slop) and highlighting...
>>
>> But whenever I think the answer is really simple, it usually means
>> that I'm missing something..
I'll look into this. Thanks for the concrete example as I don't even know which
classes to start to look at to implement such a feature.
On Aug 9, 2013, at 9:49 AM, Roman Chyla wrote:
> On Fri, Aug 9, 2013 at 11:29 AM, Mark wrote:
>
>>> *All* of the terms in the fie
> So to reiterate your examples from before, but change the "labels" a
> bit and add some more converse examples (and ignore the "highlighting"
> aspect for a moment...
>
> doc1 = "Sony"
> doc2 = "Samsung Galaxy"
> doc3 = "Sony Playstation"
>
> queryA = "Sony Experia" ... matches only do
e since you literally
> do mean "if I index this document, will it match any of these queries" (but
> doesn't score a hit on your direct check for whether it is a clean keyword
> match.)
>
> In your previous examples you only gave clean product titles, not examples o
Any ideas?
On Aug 10, 2013, at 6:28 PM, Mark wrote:
> Our schema is pretty basic.. nothing fancy going on here
>
>
>
>
>
> protected="protected.txt"/>
> generateNumberParts="1" catenateWords="0" c
Is Jetty sufficient for running Solr, or should I go with something a little
more enterprise-grade like Tomcat?
Any others?
Are there any links describing best practices for interacting with SolrJ? I've
checked the wiki and it seems woefully incomplete:
(http://wiki.apache.org/solr/Solrj)
Some specific questions:
- When working with HttpSolrServer, should we keep instances around forever or
should we create a single
We are in the process of upgrading our Solr cluster to the latest and greatest
Solr Cloud. I have some questions regarding full indexing though. We're
currently running a long job (~30 hours) using DIH to do a full index on over
10M products. This process consumes a lot of memory and while updat
Thanks for the clarification.
In Solr Cloud just use 1 connection. In non-cloud environments you will need
one per core.
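For reference, the "create it once and reuse it" pattern is roughly the following,
assuming SolrJ 4.x class names and a made-up core URL:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public final class SolrHolder {
    // One thread-safe instance per endpoint, created once and reused everywhere.
    private static final SolrServer SOLR =
            new HttpSolrServer("http://localhost:8983/solr/collection1");

    private SolrHolder() {}

    public static SolrServer get() {
        return SOLR;
    }
}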
On Oct 8, 2013, at 5:58 PM, Shawn Heisey wrote:
> On 10/7/2013 3:08 PM, Mark wrote:
>> Some specific questions:
>> - When working with HttpSolrServer
If using one static SolrCloudServer, how can I add a bean to a certain
collection? Do I need to update setDefaultCollection() each time? I doubt that's
thread-safe.
Thanks
Thanks, I'll give that a try
On 8/26/11 9:54 AM, simon wrote:
It sounds as though you are optimizing the index after the delta import. If
you don't do that, then only new segments will be replicated and syncing
will be much faster.
On Fri, Aug 26, 2011 at 12:08 PM, Mark wrote:
W
I have a use case where I would like to search across two fields but I
do not want to weight a document that has a match in both fields higher
than a document that has a match in only 1 field.
For example.
Document 1
- Field A: "Foo Bar"
- Field B: "Foo Baz"
Document 2
- Field A: "Foo Blar
I thought that a similarity class only affects the scoring of a
single field, not scoring across multiple fields? Can anyone else chime in with
some input? Thanks.
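One thing that might be worth trying before a custom similarity: the dismax "tie"
parameter. With tie=0 the document score is the maximum over the queried fields
rather than their sum, so a document is not rewarded simply for matching in both
fields. A sketch with hypothetical field names:

import org.apache.solr.client.solrj.SolrQuery;

public class TwoFieldQuery {
    // Hypothetical field names. With tie=0 (the default) the score is the
    // maximum of the per-field scores, so a document is not boosted just
    // for matching in both fields.
    public static SolrQuery build(String userQuery) {
        SolrQuery q = new SolrQuery(userQuery);
        q.set("defType", "dismax");
        q.set("qf", "fieldA fieldB");
        q.set("tie", "0");
        return q;
    }
}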
On 9/26/11 9:02 PM, Otis Gospodnetic wrote:
Hi Mark,
Eh, I don't have Lucene/Solr source code handy, but I *think* for that
Has anyone had any success/experience with building an HBase datasource
for DIH? Are there any solutions available on the web?
Thanks.
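I'm not aware of one out of the box, but DIH lets you plug in a custom DataSource.
A skeleton of what that might look like is below; the HBase access itself is only
indicated in comments, and the class name is made up. It would then be referenced
from data-config.xml like any other dataSource.

import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import java.util.Properties;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.DataSource;

// Hypothetical skeleton; the HBase scanning itself is only sketched in comments.
public class HBaseDataSource extends DataSource<Iterator<Map<String, Object>>> {

    @Override
    public void init(Context context, Properties initProps) {
        // Read connection settings (e.g. zookeeper quorum, table name) from
        // initProps and open the HBase connection here.
    }

    @Override
    public Iterator<Map<String, Object>> getData(String query) {
        // Open a scanner for `query`, turn each row into a Map of column -> value,
        // and return an iterator over those maps for the entity processor to consume.
        return Collections.<Map<String, Object>>emptyList().iterator();
    }

    @Override
    public void close() {
        // Release the scanner/connection here.
    }
}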
I am trying to use the CachedSqlEntityProcessor with Solr 1.4.2; however,
I am not seeing any performance gains. I've read some other posts that
reference cacheKey and cacheLookup, but I don't see any reference to
them in the wiki:
http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityPr
FYI my sub-entity looks like the following
On 11/15/11 10:42 AM, Mark wrote:
I am trying to use the CachedSqlEntityProcessor with Solr 1.4.2
however I am not seeing any performance gains. I've read some other
posts that reference cacheKey and cacheLookup however I don't see any
re
I'm trying to use multiple threads with DIH but I keep receiving the
following error.. "Operation not allowed after ResultSet closed"
Is there any way I can fix this?
Dec 1, 2011 4:38:47 PM org.apache.solr.common.SolrException log
SEVERE: Full Import failed:java.lang.RuntimeException: Error in
ser/201110.mbox/browser
I plan (but only plan, sorry) to address it at 4.0 where SOLR-2382
refactoring has been applied recently.
Regards
On Fri, Dec 2, 2011 at 4:57 AM, Mark wrote:
I'm trying to use multiple threads with DIH but I keep receiving the
following error.. "Operation not allow
*pk*: The primary key for the entity. It is *optional* and only needed
when using delta-imports. It has no relation to the uniqueKey defined in
schema.xml but they both can be the same.
When used in a nested entity, is the PK the primary key column of the
join table or the key used for joining?
Anyone?
On 12/5/11 11:04 AM, Mark wrote:
*pk*: The primary key for the entity. It is *optional* and only needed
when using delta-imports. It has no relation to the uniqueKey defined
in schema.xml but they both can be the same.
When using in a nested entity is the PK the primary key column of
We are thinking about using Cassandra to store our search logs. Can
someone point me in the right direction/lend some guidance on design? I
am new to Cassandra and I am having trouble wrapping my head around some
of these new concepts. My brain keeps wanting to go back to an RDBMS design.
We wi
On 7/26/10 4:43 PM, Mark wrote:
We are thinking about using Cassandra to store our search logs. Can
someone point me in the right direction/lend some guidance on design?
I am new to Cassandra and I am having trouble wrapping my head around
some of these new concepts. My brain keeps wanting to
We have an index around 25-30G w/ 1 master and 5 slaves. We perform
replication every 30 mins. During replication the disk I/O obviously
shoots up on the slaves to the point where all requests routed to that
slave take a really long time... sometimes to the point of timing out.
Is there any lo
Is it possible to use DIH with Cassandra either out of the box or with
something more custom? Thanks
Is there any way or forthcoming patch that would allow configuration
of how much network bandwidth (and ultimately disk I/O) a slave is
allowed during replication? We currently have the problem that, while
replicating, our disk I/O goes through the roof. I would much rather have
the replication take
On 8/6/10 5:03 PM, Chris Hostetter wrote:
: We have an index around 25-30G w/ 1 master and 5 slaves. We perform
: replication every 30 mins. During replication the disk I/O obviously shoots up
: on the slaves to the point where all requests routed to that slave take a
: really long time... somet
On 9/2/10 8:27 AM, Noble Paul നോബിള് नोब्ळ् wrote:
There is no way to currently throttle replication. It consumes the
whole bandwidth available. It is a nice to have feature
On Thu, Sep 2, 2010 at 8:11 PM, Mark wrote:
Is there any way or forthcoming patch that would allow configuration of
bandwidth
would be nice.
-brandon
On 9/2/10 7:41 AM, Mark wrote:
Is there any way or forthcoming patch that would allow configuration of
how much network bandwidth (and ultimately disk I/O) a slave is allowed
during replication? We have the current problem of while replicating our
disk I/O goes through
.
From: Shawn Heisey [s...@elyograg.org]
Sent: Friday, September 03, 2010 1:46 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr crawls during replication
On 9/2/2010 9:31 AM, Mark wrote:
Thanks for the suggestions. Our slaves have 12G with 10G dedicated to
the JVM.. too much
We're currently running Solr 3.5 and our indexing process works as follows:
We have a master that has a cron job to run a delta import via DIH every 5
minutes. The delta-import takes around 75 minutes to fully complete; most of
that is due to optimization after each delta, and then the slaves s
x%2B5%29%2F38%29*42%2B105%2C+x%3D50..175
Regards,
Kent Fitch
On Tue, Feb 14, 2012 at 12:29 PM, Mark wrote:
I need some help with one of my boost functions. I would like the
function to look something like the following mockup below. Starts
construct using sums of basic
sigmoidal functions. The logistic and probit functions are commonly used for
this.
Sent from my iPhone
On Feb 14, 2012, at 10:05, Mark wrote:
Thanks, I'll have a look at this. I should have mentioned that the actual values
on the graph aren't important ra
Or better yet, an example in Solr would be best :)
Thanks!
On 2/14/12 11:05 AM, Mark wrote:
Would you mind throwing out an example of these types of functions?
Looking at Wikipedia (http://en.wikipedia.org/wiki/Probit) it seems
like the probit function is very similar to what I want.
Thanks
After I perform a delta-import on my master, the slave replicates the
whole index, which can be quite time-consuming. Is there any way for the
slave to replicate only the parts that have changed? Do I need to change
some setting on the master not to commit/optimize to get this to work?
Thanks
Is there any way to use DIH to import from Cassandra? Thanks
es or indexes to know what has changed.
There is also the Lucandra project, not exactly what you're after but
may be of interest anyway https://github.com/tjake/Lucandra
Hope that helps.
Aaron
On 30 Nov, 2010,at 05:04 AM, Mark wrote:
Is there anyway to use DIH to import from Cassandra? Thanks
Is there a way to limit the number of characters returned from a stored field?
For example:
Say I have a document (~2K words) and I search for a word that's
somewhere in the middle. I would like the document to match the search
query but the stored field should only return the first 200 characte
Correct me if I am wrong, but I would like to return highlighted excerpts
from the document, so I would still need to index and store the whole
document, right (i.e. highlighting only works on stored fields)?
On 12/3/10 3:51 AM, Ahmet Arslan wrote:
--- On Fri, 12/3/10, Mark wrote:
From: Mark
When returning results, is there a way I can say to return all fields
except a certain one?
So say I have stored fields foo, bar and baz but I only want to return
foo and bar. Is it possible to do this without specifically listing out
the fields I do want?
ld be your own response writer, but unless and
until your
index gets cumbersome, I'd avoid that. Plus, storing the copied contents
only shouldn't
impact search much, since this doesn't add any terms...
Best
Erick
On Fri, Dec 3, 2010 at 10:32 AM, Mark wrote:
Correct me if I am
OK, simple enough. I just created a SearchComponent that removes values
from the fl param.
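For reference, a rough sketch of a component along those lines; the class name and
excluded field are made up, and depending on the Solr version one or two more
SolrInfoMBean methods (e.g. getVersion()) may need implementing. It would be
registered in solrconfig.xml and added to the handler's component list.

import java.io.IOException;

import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class ExcludeFieldComponent extends SearchComponent {

    private static final String EXCLUDED = "baz";   // field to strip; hard-coded for the sketch

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        String fl = rb.req.getParams().get(CommonParams.FL);
        if (fl == null) {
            return; // no explicit fl; a real component would expand it to all fields minus the exclusion
        }
        StringBuilder kept = new StringBuilder();
        for (String f : fl.split(",")) {
            if (!f.trim().equals(EXCLUDED)) {
                if (kept.length() > 0) {
                    kept.append(',');
                }
                kept.append(f.trim());
            }
        }
        ModifiableSolrParams params = new ModifiableSolrParams(rb.req.getParams());
        params.set(CommonParams.FL, kept.toString());
        rb.req.setParams(params);
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        // nothing to do at process time
    }

    @Override
    public String getDescription() {
        return "Removes a configured field from the fl parameter";
    }

    public String getSource() {
        return null;   // required by SolrInfoMBean in some Solr versions
    }
}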
On 12/3/10 9:32 AM, Ahmet Arslan wrote:
When returning results is there a way
I can say to return all fields except a certain one?
So say I have stored fields foo, bar and baz but I only
want to return fo
Is there a way I can specify separate configuration for two different fields?
For field 1 I want to display only 100 chars, for field 2, 200 chars.
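If this is about the highlighted snippets, per-field overrides of the highlighting
parameters might cover it; a sketch with made-up field names (the same
f.<field>.hl.fragsize parameters can go straight on a request URL):

import org.apache.solr.client.solrj.SolrQuery;

public class PerFieldFragsize {
    // Hypothetical field names; per-field highlighting overrides use the
    // f.<fieldname>.<param> form and work the same way on a raw request URL.
    public static SolrQuery build(String userQuery) {
        SolrQuery q = new SolrQuery(userQuery);
        q.setHighlight(true);
        q.set("hl.fl", "field1,field2");
        q.set("f.field1.hl.fragsize", "100");
        q.set("f.field2.hl.fragsize", "200");
        return q;
    }
}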
e solution and this has
confused me.
Basically, if you guys could point me in the right direction for resources
(even as much as saying, you need X, it's over there) that would be a huge
help.
Cheers
Mark
Thanks to everyone who responded; no wonder I was getting confused - I was
completely focusing on the wrong half of the equation.
I had a cursory look through some of the Nutch documentation available and
it is looking promising.
Thanks everyone.
Mark
On Tue, Dec 7, 2010 at 10:19 PM, webdev1977
Is there any plugin or easy way to auto-warm/cache a new searcher with a
bunch of searches read from a file? I know this can be accomplished
using the EventListeners (newSearcher, firstSearcher), but I'd rather not
add 100+ queries to my solrconfig.xml.
If there is no hook/listener available, is
y, but Xinclude looks like what
you're after, see: http://wiki.apache.org/solr/SolrConfigXml#XInclude
Best
Erick
On Tue, Dec 7, 2010 at 6:33 PM, Mark wrote:
Is there any plugin or easy way to auto-warm/cache a new searcher with a
bunch of searches read from a file? I know this can be ac
plete statement of your setup is in order, since
we seem to be talking past each other.
Best
Erick
On Tue, Dec 7, 2010 at 10:24 PM, Mark wrote:
Maybe I should explain my problem a little more in detail.
The problem we are experiencing is that after a delta-import we notice an
extremely high load time o
otherwise block on CPU with lots of new indexes being warmed at once.
Solr is not very good at providing 'real time indexing' for this reason,
although I believe there are some features in post-1.4 trunk meant to support
'near real time search' better.
_
our
call.
Best
Erick
On Wed, Dec 8, 2010 at 12:25 PM, Mark wrote:
We only replicate twice an hour, so we are far from real-time indexing. Our
application never writes to the master; rather, we just pick up all changes using
updated_at timestamps when delta-importing using DIH.
We don't have any wa
Our machines have around 8GB of RAM and our index is 25GB. What are some
good values for those cache settings? It looks like we have the defaults in
place...
size="16384"
initialSize="4096"
autowarmCount="1024"
You are correct, I am just removing the health-check file and our
loadbalancer preve
After replicating an index of around 20g my slaves experience very high
load (50+!!)
Is there anything I can do to alleviate this problem? Would solr cloud
be of any help?
thanks
Markus,
My configuration is as follows...
...
false
2
...
false
64
10
false
true
No cache warming queries, and our machines have 8GB of memory in them with
about 5120MB of RAM dedicated to Solr. When our index is around 10-11GB
in size everything runs smoothly. At around 20GB+ it just fall
Changing the subject, as it's not related to replication after all. It only
appeared after indexing an extra field, which increased our index size
from 12GB to 20GB+.
On 12/13/10 7:57 AM, Mark wrote:
Markus,
My configuration is as follows...
...
false
2
...
false
64
10
false
true
No cache warming
Can anyone offer some advice on what some good settings would be for an
index of around 6 million documents totaling around 20-25GB? It seems
like when our index gets to this size our CPU load spikes tremendously.
What would be some appropriate settings for ramBufferSize and
mergeFactor? We cu
Excellent reply.
You mentioned: "I've been experimenting with FastLRUCache versus
LRUCache, because I read that below a certain hitratio, the latter is
better."
Do you happen to remember what that threshold is? Thanks
On 12/14/10 7:59 AM, Shawn Heisey wrote:
On 12/14/201
Seems like I am missing some configuration when trying to use DIH to
import documents with Chinese characters. All the documents save crazy
nonsense like "这是测试" instead of actual Chinese characters.
I think it's at the JDBC level, because if I hardcode one of the fields
within data-confi
05 PM, Mark wrote:
Seems like I am missing some configuration when trying to use DIH to import
documents with chinese characters. All the documents save crazy nonsense
like "这是测试" instead of actual chinese characters.
I think its at the JDBC level because if I hardcode on
Glen
http://zzzoot.blogspot.com/
On Mon, Dec 27, 2010 at 5:15 PM, Mark wrote:
Solr: 1.4.1
JDBC driver: Connector/J 5.1.14
Looks like it's the JDBC driver, because it doesn't even work with a simple
Java program. I know this is a little off-subject now, but do you have any
clues? Thanks again
Just like the user of that thread... I have my database, table, columns
and system variables all set but it still doesn't work as expected.
Server version: 5.0.67 Source distribution
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> SHOW VARIABLES LIKE 'collation%';
+
besides your browser?
Yes, I am running out of ideas! :-)
-Glen
On Mon, Dec 27, 2010 at 7:22 PM, Mark wrote:
Just like the user of that thread... i have my database, table, columns and
system variables all set but it still doesnt work as expected.
Server version: 5.0.67 Source distribution
Is there a way to create dynamic column names using the values returned
from the query?
For example:
.
However, when using the mysql client, all the characters would show up
mangled or as ''. This was resolved by running the following
query: "set names utf8;".
On 12/28/10 10:17 PM, Glen Newton wrote:
Hi Mark,
Could you offer a more technical explanation of the Ra
Is it possible to query across multiple cores and combine the results?
If not available out of the box, could this be accomplished using some
sort of custom request handler?
Thanks for any suggestions.
On Dec 29, 2010, at 3:24 PM, Mark wrote:
Is it possible to query across multiple cores and combine the results?
If not available out-of-the-box could this be accomplished using some
sort of custom request handler?
Thanks for any suggestions.
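For reference, one out-of-the-box option is distributed search: treat each core as a
shard via the shards parameter. This assumes the cores have compatible schemas and
non-overlapping unique keys; host and core names below are made up.

import org.apache.solr.client.solrj.SolrQuery;

public class MultiCoreQuery {
    // Host and core names are made up; this assumes the cores share a
    // compatible schema and have no overlapping unique keys.
    public static SolrQuery build(String userQuery) {
        SolrQuery q = new SolrQuery(userQuery);
        q.set("shards", "localhost:8983/solr/core0,localhost:8983/solr/core1");
        return q;
    }
}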
When using DIH, my delta imports appear to finish quickly; i.e. it says
"Indexing completed. Added/Updated: 95491 documents. Deleted 11148
documents." in a relatively short amount of time (~30 mins).
However, the importMessage says "A command is still running..." for a
really long time (~60 mins).
I have recently been receiving the following errors during my DIH
importing. Has anyone run into this issue before? Does anyone know how to resolve it?
Thanks!
Jan 1, 2011 4:51:06 PM org.apache.solr.handler.dataimport.JdbcDataSource
closeConnection
SEVERE: Ignoring Error when closing connection
com.mysql
I'm receiving the following exception when trying to perform a
full-import (~30 hours). Any ideas on how I could fix this?
Is there an easy way to use DIH to break a full-import apart into
multiple pieces? I.e. 3 mini-imports instead of 1 large import?
Thanks.
Feb 7, 2011 5:52:33 AM org.apa
Typo in subject
On 2/7/11 7:59 AM, Mark wrote:
I'm receiving the following exception when trying to perform a
full-import (~30 hours). Any idea on ways I could fix this?
Is there an easy way to use DIH to break apart a full-import into
multiple pieces? IE 3 mini-imports instead of 1
Mon, Feb 7, 2011 at 9:29 PM, Mark wrote:
I'm receiving the following exception when trying to perform a full-import
(~30 hours). Any idea on ways I could fix this?
Is there an easy way to use DIH to break apart a full-import into multiple
pieces? IE 3 mini-imports instead of 1 large i
Has anyone applied the DIH threads patch on 1.4.1
(https://issues.apache.org/jira/browse/SOLR-1352)?
Does anyone know if this works and/or does it improve performance?
Thanks
I know that I can use the SignatureUpdateProcessorFactory to remove
duplicates, but I would like to keep the duplicates in the index and remove them
conditionally at query time.
Is there any easy way I could accomplish this?
Is there a seamless field collapsing patch for 1.4.1?
I see it has been merged into trunk, but when I tried downloading it to give
it a whirl it appears that many things have changed, and our
application would need some considerable work to get it up and running.
Thanks
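For what it's worth, on 3.3+ (and trunk/4.x) result grouping on the signature field
gives the query-time dedup behaviour asked about above without removing the
duplicates from the index; a sketch with a made-up signature field name:

import org.apache.solr.client.solrj.SolrQuery;

public class QueryTimeDedup {
    // Assumes the signature is kept in an indexed field (name made up) and that
    // result grouping is available (Solr 3.3+); returns one document per signature.
    public static SolrQuery build(String userQuery) {
        SolrQuery q = new SolrQuery(userQuery);
        q.set("group", "true");
        q.set("group.field", "signature_s");
        q.set("group.limit", "1");
        return q;
    }
}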