Hi,
I am a freelancer based in New Delhi, India. I have just completed a project
in Apache Solr for a bioinformatics company. The project involved, among
other things, importing 46 million records from a MySQL database to create
Solr indexes and developing a user interface for doing searches (with
au
: When I did heap analysis, the culprit always seems to
: be the TimeLimitedCollector thread. Because of this, a considerable number
: of classes are not getting unloaded.
...
: > > There are a couple of JIRAs related to this:
: > > https://issues.apache.org/jira/browse/LUCENE-2237,
: > > https:/
Thank you - I found it.
-Original Message-
From: rajini maski [mailto:rajinima...@gmail.com]
Sent: Thursday, March 03, 2011 12:03 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr under Tomcat
Sai,
The index directory will be in your Solr_home/conf/data
directory.
The path f
Sai,
The index directory will be in your Solr_home/conf/data directory.
The path for this directory needs to be given wherever you want it
by changing the dataDir path in the config XML that is present in the same
conf folder. You need to stop the Tomcat service to delete this directory and
t
Have you looked at the 'qf' parameter?
Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com
_
http://www.cosugi.org/
> -Original Message-
> From: mrw [mailto:mikerobertsw...@gmail.com]
> Sent: Wednesday, March
(11/03/03 2:54), Mark wrote:
High level overview. We have items and we have sellers. The scoring of our
documents is such that
our boost functions outweigh the pure lucene term/query scoring. Our boost
functions basically take
into account how "good" the seller is.
Now for MLT searches we wou
Thanks Jonathan, this will be useful -- in the meantime, I have
implemented the query rewriting, using the QueryParsing.toString()
utility as an example.
On Wed, Mar 2, 2011 at 5:40 PM, Jonathan Rochkind wrote:
> Not per clause, no. But you can use the "nested queries" feature to set
> local para
On Wed, Mar 2, 2011 at 4:19 PM, Scott K wrote:
> On Wed, Mar 2, 2011 at 12:21, Chris Hostetter
> wrote:
: >> historically it has been because of a fundamental limitation in how the
: >> Lucene FieldCache has historically worked where the array-backed FieldCaches
: >> use the default numeric value (ie: 0)
: even in this WIP-State? if so .. i'll try one tomorrow evening after work
When in doubt, remember Yonik's Law Of Patches...
http://wiki.apache.org/solr/HowToContribute?highlight=law+of+patches#Contributing_Code_.28Features.2C_Big_Fixes.2C_Tests.2C_etc29
A half-baked patch in Jira, with no documentation, no tests and no backwards
compatibility is better than no patch at all.
On Wed, Mar 2, 2011 at 5:34 PM, Stefan Matheis
wrote:
> Robert,
>
> even in this WIP-State? if so .. i'll try one tomorrow evening after work
>
It's totally up to you, sometimes it can be useful to upload a partial
or WIP solution to an issue: as Hoss mentioned its a good way to get
feedback and a
mrw,
you mean a field like here
(http://files.mathe.is/solr-admin/02_query.png) on the right side,
between meta-navigation and plain solr-xml response?
actually it's just to display the computed url, but if so .. we could
use a larger field for that, of course :)
Regards
Stefan
On 02.03.2
Robert,
even in this WIP-State? if so .. i'll try one tomorrow evening after work
Regards
Stefan
On 02.03.2011 22:02, Robert Muir wrote:
On Wed, Mar 2, 2011 at 3:47 PM, Stefan Matheis
wrote:
Any Questions, Ideas, Thoughts outta there? Please, let me know :)
My only question would be:
: given the fact that my java-knowledge is sort of non-existing .. my idea was
: to rework the Solr Admin Interface.
Contributions of all kinds are welcome!
: Actually it's completely work-in-progress .. but i'm interested in what you
: guys think. Right direction? Completely wrong, just drop it?
We have two banks of Solr nodes with identical schemas. The data I'm
searching for is in both banks.
One has defaultSearchField set to field1, the other has defaultSearchField
set to field2.
We need to support both user queries and facet queries that have no user
content. For the latter, it app
Looks nice.
Might also be worth it to create a page with a large query field for pasting
in complete URL-encoded queries that cross cores, etc. I did that at work
(via ASP.net) so we could paste in queries from logs and debug them. We
tend to use that quite a bit.
Cheers
Stefan Matheis wrote:
Thanks for the suggestions. Tomcat does release permgen memory with
appropriate jvm options and configuration settings (
clearReferencesStopTimerThreads, clearReferencesThreadLocals).
When I did heap analysis, the culprit always seems to
be the TimeLimitedCollector thread. Because of this, a considerable
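For reference, here is roughly how those settings look in a Tomcat context
descriptor (a sketch assuming Tomcat 6.0.24+, where these are attributes of
the Context element; the path and docBase values are invented, so check your
own deployment):

  <Context path="/solr" docBase="solr.war"
           clearReferencesStopTimerThreads="true"
           clearReferencesThreadLocals="true"/>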
On Wed, Mar 2, 2011 at 12:21, Chris Hostetter wrote:
> historically it has been because of a fundamental limitation in how the
> Lucene FieldCache has historically worked where the array-backed FieldCaches
> use the default numeric value (ie: 0) when docs have no value (but in the
> case of Strings, t
On Wed, Mar 2, 2011 at 3:47 PM, Stefan Matheis
wrote:
> Any Questions, Ideas, Thoughts outta there? Please, let me know :)
>
My only question would be: would you mind creating a JIRA issue with
your modifications?
I was just yesterday looking at this admin stuff and thinking, man
this could rea
Hey Markus,
actually it's CC BY 3.0 - Yusuke Kamiyamane created the "Fugue Icons"
(http://p.yusukekamiyamane.com/)
Regards
Stefan
On 02.03.2011 21:46, Markus Jelsma wrote:
Nice! It makes multi core navigation a lot easier. What license do the icons
have?
Hi List,
given the fact that my
: I wonder if what doesn't work is trying to set an explicit relative path
: there, instead of using the baked in default "data". If you set an explicit
: relative path, is it relative to the current core solr.home, or to the main
: solr.home?
it's relative to the current working dir of the process
Nice! It makes multi core navigation a lot easier. What license do the icons
have?
> Hi List,
>
> given the fact that my java-knowledge is sort of non-existing .. my
> idea was to rework the Solr Admin Interface.
>
> Compared to CouchDB's Futon or the MongoDB Admin-Utils .. not that fancy,
> bu
Hi List,
given the fact that my java-knowledge is sort of non-existing .. my
idea was to rework the Solr Admin Interface.
Compared to CouchDB's Futon or the MongoDB Admin-Utils .. not that fancy,
but it was an idea a few weeks ago - and i would like to contribute
something, a thing which has to b
Hi,
I remember reading somewhere that undeploying an application in Tomcat won't
release memory, thus repeating the cycle will indeed exhaust the permgen. You
could enable garbage collection of the permgen.
HotSpot can do this for you but it depends on using CMS which you might not
want to us
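For what it's worth, the HotSpot flags usually involved are something like
the following (a sketch; exact flag support depends on your JVM version, so
verify against its documentation):

  -XX:+UseConcMarkSweepGC
  -XX:+CMSClassUnloadingEnabled
  -XX:+CMSPermGenSweepingEnabled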
I wonder if what doesn't work is trying to set an explicit relative path
there, instead of using the baked in default "data". If you set an
explicit relative path, is it relative to the current core solr.home, or
to the main solr.home?
Let's try it to see... Yep, THAT's what doesn't work, an
That's great, just what I needed, I was debugging and was expecting to
see something like this.
i'll look through the SVN history to see in which version it was added.
Thanks
On Wednesday, March 2, 2011, Yonik Seeley wrote:
> On Wed, Mar 2, 2011 at 2:43 PM, Ofer Fort wrote:
>> I didn't see this
Meanwhile, I'm having trouble getting the expected behavior at all. I'll
try to give the right details (without overwhelming with too many), if
anyone can see what's going on.
Solr 1.4.1. Multi-core. 'Main' solr home with solr.xml at
/opt/solr/solr_indexer/solr.xml
The solr.xml includes actu
Ah...so I need to be doing
&q.alt=*:*
&fq=<field>:<value>
Of course, now that you showed me what I look for, I also see the
explanation in the Packt book. Sheesh.
Thanks for the tip!
Chris Hostetter-3 wrote:
>
> : For standard query handler fq-only queries, we used q=*:*. However,
> with
> : dismax, t
Hi
I get the same problem on tomcat with other applications, so this does not
appear to be limited to SOLR. I got the error on tomcat 6 and 7. The only
solution I found was to kill tomcat and start it again.
François
On Mar 2, 2011, at 2:28 PM, Search Learn wrote:
> Hello,
> We currently depl
: When I sort by price ascending, documents with no price are listed
: first. I would like them listed last. I tried adding the
: sortMissingLast flag, even though it says it is only for strings, but
it works for any field type *backed* by a string, including the
SortableIntField (and its breat
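As an illustration, a schema.xml sketch along those lines (the field and
type names are invented; syntax follows the Solr 1.4 example schema):

  <fieldType name="sint" class="solr.SortableIntField"
             sortMissingLast="true" omitNorms="true"/>
  <field name="price" type="sint" indexed="true" stored="true"/>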
I didn't see this behavior, running solr 1.4.1, was that implemented
after this release?
On Wednesday, March 2, 2011, Yonik Seeley wrote:
> On Wed, Mar 2, 2011 at 1:58 PM, Ofer Fort wrote:
>> Thanks,
>> But each query tries to see if there is something new since the last result
>> that was found
Yes - I commented out the <dataDir> element in solrconfig.xml and then
got the expected behavior: the core used a data subdirectory in the core
subdirectory.
It seems like the problem arises from using the solrconfig.xml that's
distributed as example/solr/conf/solrconfig.xml
The solrconfig.xml's in
This could probably be done using a custom QParser plugin?
Define the pattern like this:
String queryTemplate = "title:%Q%^2.0 body:%Q%";
then replace the %Q% with the value of the Q param, send it through
QueryParser.parse() and return the query.
-sujit
On Wed, 2011-03-02 at 11:28 -0800, mrw
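A rough sketch of such a plugin against the Solr 1.4 API (the class name and
template string are invented, and the "body" default field is a placeholder;
treat the API details as approximate and verify against your Solr version):

  import org.apache.lucene.queryParser.ParseException;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Query;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.common.util.NamedList;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.search.QParser;
  import org.apache.solr.search.QParserPlugin;

  public class TemplateQParserPlugin extends QParserPlugin {
    public void init(NamedList args) {}

    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
      return new QParser(qstr, localParams, params, req) {
        public Query parse() throws ParseException {
          // Expand the user's q into the fixed template, then hand the
          // result to the stock Lucene query parser.
          String template = "title:%Q%^2.0 body:%Q%";
          String expanded = template.replace("%Q%", getString());
          QueryParser qp = new QueryParser("body",  // placeholder default field
              req.getSchema().getQueryAnalyzer());
          return qp.parse(expanded);
        }
      };
    }
  }

It would then be registered in solrconfig.xml with a <queryParser> element
and invoked via local params, e.g. {!templated}something.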
I'm guessing what i was describing is a short-circuit evaluation and i see
that lucene doesn't have it:
http://lucene.472066.n3.nabble.com/Short-circuit-in-query-td738551.html
Still would love to hear any suggestions for my type of query
ofer
On Wed, Mar 2, 2011 at 8:58 PM, Ofer Fort wrote:
>
Hello,
We currently deploy and undeploy the solr web application potentially hundreds
of times during a typical day. When solr is undeployed, its classes are
not getting unloaded and eventually we are running into a permgen error.
There are a couple of JIRAs related to this:
https://issues.apache.org
Thanks,
But each query tries to see if there is something new since the last result
that was found, so rounding things will return the same documents over and
over again, until we reach the next rounded point.
Could I use the document id somehow? Or something else that's bigger than
my last se
If you're comfortable with XSL you can create a transformer and use Solr's
XSLTResponseWriter to do the job.
http://wiki.apache.org/solr/XsltResponseWriter
> Hi all,
>
> This list has proven itself quite useful since I got started with Solr. I'm
> wondering if it is possible to dictate the XML t
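For example (a sketch, with an invented stylesheet name): drop a stylesheet
into conf/xslt/example.xsl and request it through the xslt response writer:

  http://localhost:8983/solr/select?q=*:*&wt=xslt&tr=example.xsl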
When I sort by price ascending, documents with no price are listed
first. I would like them listed last. I tried adding the
sortMissingLast flag, even though it says it is only for strings, but
it did not help. Why doesn't sortMissingLast work on non-strings? This
seems like a very common issue, bu
Hi all,
This list has proven itself quite useful since I got started with Solr. I'm
wondering if it is possible to dictate the XML that is returned by a search?
Right now it seems very inefficient in that it is formatted like:
Val
Val
Etc.
I would like to change it so that it reads something li
: For standard query handler fq-only queries, we used q=*:*. However, with
: dismax, that returns 0 results. Are fq-only queries possible with dismax?
they are if you use the q.alt param.
http://wiki.apache.org/solr/DisMaxQParserPlugin#q.alt
-Hoss
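For illustration, an fq-only dismax request using q.alt might look like this
(field and value are invented):

  /select?defType=dismax&q.alt=*:*&fq=category:books&rows=10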
On Wed, Mar 2, 2011 at 2:43 PM, Ofer Fort wrote:
> I didn't see this behavior, running solr 1.4.1, was that implemented
> after this release?
I think so.
It's implemented now in BooleanWeight.scorer()
for (Weight w : weights) {
BooleanClause c = cIter.next();
Scorer subSc
Anyone understand how to do boolean logic across multiple fields?
Dismax is nice for searching multiple fields, but doesn't necessarily
support our syntax requirements. eDismax appears not to be available until
Solr 3.1.
In the meantime, it looks like we need to support applying the user's q
On Wed, Mar 2, 2011 at 1:58 PM, Ofer Fort wrote:
> Thanks,
> But each query tries to see if there is something new since the last result
> that was found, so rounding things will return the same documents over and
> over again, until we reach the next rounded point.
>
> Could i use the document
Hi,
I wondered if anyone knew if there are capabilities in Solr to
'register' a query much like Elasticsearch's 'percolation'
functionality.
i.e., instruct Solr that you are interested in documents that match a
given query and then have Solr notify you (through whatever callback
mechanism is speci
timestamp is of type:
On Wed, Mar 2, 2011 at 8:11 PM, Ofer Fort wrote:
> you are correct that my query is a range one, probably should have
> mentioned it in the first post.
> this is the debug data:
> status: 0
> QTime: 4173
> debugQuery: on
> indent: on
> start: 0
> q: timestamp:[2011-02-01T00:00:00Z TO NO
you are correct that my query is a range one, probably should have mentioned
it in the first post.
this is the debug data:
status: 0
QTime: 4173
debugQuery: on
indent: on
start: 0
q: timestamp:[2011-02-01T00:00:00Z TO NOW] AND oferiko
version: 2.2
rows: 10
rawquerystring: timestamp:[2011-02-01T00:00:00Z TO NOW] AND oferiko
querystring: timestamp:[2011-02-0
For standard query handler fq-only queries, we used q=*:*. However, with
dismax, that returns 0 results. Are fq-only queries possible with dismax?
Thanks!
--
View this message in context:
http://lucene.472066.n3.nabble.com/dismax-query-with-no-empty-q-parameter-tp2619170p2619170.html
Sen
mlt.boost - [true/false] set if the query will be boosted by the
interesting term relevance.
This is not the same as boost functions:
http://wiki.apache.org/solr/DisMaxQParserPlugin#bf_.28Boost_Functions.29
On 3/2/11 7:45 AM, Markus Jelsma wrote:
There is a mlt.boost parameter.
On Wednesda
High level overview. We have items and we have sellers. The scoring of
our documents is such that our boost functions outweigh the pure lucene
term/query scoring. Our boost functions basically take into account how
"good" the seller is.
Now for MLT searches we would like to incorporate this s
On Wed, Mar 2, 2011 at 12:11 PM, Ofer Fort wrote:
> Hey all,
> I have an index with a lot of documents with the term X and no documents
> with the term Y.
> If I query for X it takes a few seconds and returns the results.
> If I query for Y it takes a millisecond and returns an empty set.
> If I qu
Yes, I knew that the ticket is still open. This is why I am looking for
solutions now.
2011/3/2 Tomás Fernández Löbbe
> Hi Jae, this is the Jira created for the problem of IDF on distributed
> search:
>
> https://issues.apache.org/jira/browse/SOLR-1632
>
> It's still open
>
> On Wed, Mar 2,
Thanks,
I tried it in the past and found out that my hit ratio was pretty low, so it
doesn't help most of my queries
ofer
On Wed, Mar 2, 2011 at 7:16 PM, Geert-Jan Brits wrote:
> If you often query X as part of several other queries (e.g: X | X AND Y |
> X AND Z)
> you might consider putting
If you often query X as part of several other queries (e.g: X | X AND Y |
X AND Z)
you might consider putting X in a filter query (
http://wiki.apache.org/solr/CommonQueryParameters#fq)
leading to:
q=*:*&fq=X
q=Y&fq=X
q=Z&fq=X
Filter queries are cached separately which means that after the firs
You might want to hire a consultant.
Tika can deal with Word documents. Ids need to be unique. One index might
work, not sure based on your info below.
For the database you need to use a Java thin DB connector for SQL Server. Throw the
jar in the lib directory and restart. Then set up DIH settings t
On Wed, Mar 2, 2011 at 11:34 AM, Gastone Penzo wrote:
> Hi,
> for search I use the dismax query
> and I want to boost a field with the bf parameter like:
> ...&bf=boost_has_img^5&
> the boost_has_img field of my document is 3:
> 3
> if i see the results in debug query mode i can see:
> 0.0 = (MATC
Hey all,
I have an index with a lot of documents with the term X and no documents
with the term Y.
If I query for X it takes a few seconds and returns the results.
If I query for Y it takes a millisecond and returns an empty set.
If I query for Y AND X it takes a few seconds and returns an empty set
Hi Jae, this is the Jira created for the problem of IDF on distributed
search:
https://issues.apache.org/jira/browse/SOLR-1632
It's still open
On Wed, Mar 2, 2011 at 1:48 PM, Upayavira wrote:
> As I understand it there is, and the best you can do is keep the same
> number of docs per shard, an
As I understand it there is, and the best you can do is keep the same
number of docs per shard, and keep your documents randomised across
shards. That way you'll minimise the chances of suffering from
distributed IDF issues.
Upayavira
On Wed, 02 Mar 2011 10:10 -0500, "Jae Joo" wrote:
> Is there
Stefan,
Thank you very much! It works perfectly...
Any idea for the other question? Someone?
Matias.
2011/3/2 Stefan Matheis
> Matias,
>
> for indexing constant/static values .. try
> http://wiki.apache.org/solr/DataImportHandler#TemplateTransformer
>
> Regards
> Stefan
>
> On Wed, Mar 2, 201
Not per clause, no. But you can use the "nested queries" feature to set
local params for each nested query instead. Which is in fact one of the
most common use cases for local params.
&q=_query_:"{type=x q.field=z}something" AND
_query_:"{!type=database}something"
URL encode that whole thin
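For illustration, the fully encoded parameter might look roughly like this
(hand-encoded, so double-check it before use):

  q=_query_%3A%22%7B!type%3Dx+q.field%3Dz%7Dsomething%22+AND+_query_%3A%22%7B!type%3Ddatabase%7Dsomething%22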
HI,
for search i use disquery max
and a i want to boost a field with bf parameter like:
...&bf=boost_has_img^5&
the boost_has_img field of my document is 3:
3
if i see the results in debug query mode i can see:
0.0 = (MATCH) FunctionQuery(int(boost_has_img)), product of:
0.0 = int(bo
So here's something interesting. I did a delta import this morning and it
looks like I can do a global search across those fields.
I'll do another full import and see if that fixed the problem. I had done a
full-import after making this change but it seems like another reindex is in
order.
On Wed,
Thank you very much for this info, that helps a lot!
On Wed, Mar 2, 2011 at 10:05 AM, Jayendra Patil
wrote:
> Hi Mike,
>
> There was an issue with the Snappuller wherein it fails to clean up
> the old index directories on the slave side.
> https://issues.apache.org/jira/browse/SOLR-2156
>
> Th
There is a mlt.boost parameter.
On Wednesday 02 March 2011 16:28:35 dar...@ontrenet.com wrote:
> I think what's being asked is obvious, in that, they want to modify the
> sorted relevancy of the results of MLT. Where, instead of (or in addition
> to) sorting by the mlt score, some modified functio
Hi everyone!
I've been following this thread and I realized we've constructed something
similar to "Crawl Anywhere". The main difference is that our project is
oriented to the digital libraries and digital repositories context.
Specifically related to metadata collection from multiple sources,
info
I think what's being asked is obvious, in that they want to modify the
sorted relevancy of the results of MLT. Where, instead of (or in addition
to) sorting by the mlt score, some modified function/subquery can be used
to further sort the results.
One example. You run a MLT query against a docum
Please also provide the analysis part of the fieldType text. You can also use Luke to
inspect the index.
http://localhost:8983/solr/admin/luke?fl=globalField&numTerms=100
On Wednesday 02 March 2011 16:09:33 Brian Lamb wrote:
> Here are the relevant parts of schema.xml:
>
> multiValued="true"/>
> glob
Hi,
Is it possible to set local arguments for each query clause?
example:
{!type=x q.field=z}something AND {!type=database}something
I am pulling together result sets coming from two sources, Solr index
and DB engine - however I realized that local parameters apply only to
the whole query - so
Hi Rok,
If I understood the use case rightly, grouping of the results is
possible in Solr: http://wiki.apache.org/solr/FieldCollapsing
Probably, you can create new fields with the combination for the
groups and use the field collapsing feature to group the results.
Id  Type1  Type2  Title  Grou
Is there still an issue regarding distributed IDF in a sharded environment
in Solr 1.4 or 4.0?
If yes, any suggestions to resolve it?
Thanks,
Jae
Here are the relevant parts of schema.xml:
globalField
This is what is returned when I search (the XML tags were stripped; the
recoverable values):
status: 0
QTime: 1
q: Mammal
debugQuery: true
rawquerystring: Mammal
querystring: Mammal
parsedquery: globalField:mammal
parsedquery_toString: globalField:mammal
QParser: LuceneQParser
explain scores: 1.0, 1.0, 1.0, then 0.0 for the remaining documents
Hi Mike,
There was an issue with the Snappuller wherein it fails to clean up
the old index directories on the slave side.
https://issues.apache.org/jira/browse/SOLR-2156
The patch can be applied to fix the issue.
You can also delete the old index directories, except for the current
one which is m
Hi Sai,
You can find your index files at:
{%TOMCAT_HOME}\solr\data\index
If you want to clear the index just delete the whole index directory.
Regards,
- Savvas
On 2 March 2011 14:09, Thumuluri, Sai wrote:
> Good Morning,
> We have deployed Solr 1.4.1 under Tomcat and it works great, however I
Hi,
No, it doesn't. It looks like an Apache HttpClient 3.x limitation.
https://issues.apache.org/jira/browse/HTTPCLIENT-579
Dominique
On 02/03/11 15:04, Thumuluri, Sai wrote:
Dominique, Does your crawler support NTLM2 authentication? We have content
under SiteMinder which uses NTLM2 a
Yes. But keep in mind that Solr may actually be using an index.<timestamp>
directory for its live search. See either the replication.properties file or
consult the replication page to see what index directory it uses.
If it uses an index.<timestamp> directory you can safely move it to index and
remove or modify repl
Matias,
for indexing constant/static values .. try
http://wiki.apache.org/solr/DataImportHandler#TemplateTransformer
Regards
Stefan
On Wed, Mar 2, 2011 at 2:46 PM, Matias Alonso wrote:
> Good Morning,
>
>
> First, sorry for my poor english.
>
>
> I trying to index “blogs” (rss) to my solr, so I
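As an illustration of the TemplateTransformer suggestion above, a DIH sketch
that stamps a constant value on every document (entity, URL, and field names
are invented):

  <entity name="blog" transformer="TemplateTransformer"
          processor="XPathEntityProcessor"
          url="http://example.com/feed.rss"
          forEach="/rss/channel/item">
    <field column="title" xpath="/rss/channel/item/title"/>
    <field column="source" template="blog-rss"/>
  </entity>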
Good Morning,
We have deployed Solr 1.4.1 under Tomcat and it works great, however I
cannot find where the index (directory) is created. I set solr home in
web.xml under /webapps/solr/WEB-INF/, but not sure where the data
directory is. I have a need where I need to completely index the site
and it
Hi Jonathan:
Did you try setting the <dataDir> element?
This should create the indexes under some_core/data or you can make
dataDir relative to the some_core dir.
Regards,
- NN
http://solr-ra.tgels.com
http://rankingalgorithm.tgels.com
On 3/1/2011 7:21 AM, Jonathan Rochkind wrote:
I did try that, yes. I tried that fi
Dominique, does your crawler support NTLM2 authentication? We have content
under SiteMinder which uses NTLM2, and that is posing challenges with Nutch.
-Original Message-
From: Dominique Bejean [mailto:dominique.bej...@eolya.fr]
Sent: Wednesday, March 02, 2011 6:22 AM
To: solr-user@lucene
Is it ok if I just delete the old copies manually? Or maybe run a
script that does it?
On Tue, Mar 1, 2011 at 7:47 PM, Markus Jelsma
wrote:
> Indeed, the slave should not have useless copies but it does, at least in
> 1.4.0, i haven't seen it in 3.x, but that was just a small test that did not
>
Right now I have the slave polling every 10 seconds, because we want
to make sure they stay in sync. I have users who will post
directly from a web application. But I do notice it syncs very quickly,
because usually the update is only one or two records at a time.
I am thinking maybe 10 seconds
(11/03/02 0:23), Mark wrote:
Is it possible to add function queries/boosts to the results that are returned by MLT?
If not out of the box
how would one go about achieving this functionality?
Thanks
Beside the point, why do you need such a function?
If you give us more information/background of your need
Good Morning,
First, sorry for my poor english.
I'm trying to index “blogs” (rss) into my solr, so I'm using a dataImportHandler
for this.
I can't index the date and I don't know how to index static values (constants)
in a Field.
When I make a “Full Import” it doesn't index the docs; if I delete the
There is an updateRequestProcessorChain you can use to execute some
processors. Check the page for deduplication; it already has methods for
creating signatures but you can easily create your own if you have to.
Use copyField to copy the value to a non-analyzed field (string) and obtain the
orig
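A solrconfig.xml sketch of such a chain, using the signature processor from
the Solr deduplication page (field names are invented):

  <updateRequestProcessorChain name="signature">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signature</str>
      <str name="fields">body</str>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

and in schema.xml, a copyField to keep the original value alongside:

  <copyField source="body" dest="body_raw"/>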
Hi All,
I have a requirement to analyze a field with a series of filters,
calculate a 'signature' then concatenate with the original input
e.g.
input => 'this is the input'
tokenized and filtered, input becomes say 'this input' =>
12ef5e (signature)
so the final output indexed is:
I have an index with a number of documents. For example (this example is
representative and contains many other fields):
Id  Type1  Type2  Title
1   a      b      xfg
2   a      c      abd
3   a      d      thm
4   b      a      efd
5   b      b      ikj
6   b      c      azd
...
I want to query an index on
Viewing the indexing result, which is a part of what you are describing I
think, is a nice job for such an indexing framework.
Do you guys know whether such a feature is already out there?
paul
On 2 March 2011 at 12:20, Geert-Jan Brits wrote:
> Hi Dominique,
>
> This looks nice.
> In the past
Hi,
The crawler comes with an extensible document processing pipeline. If you
know Java libraries or web services for 'wrapper induction' processing,
it is possible to implement a dedicated stage in the pipeline.
Dominique
On 02/03/11 12:20, Geert-Jan Brits wrote:
Hi Dominique,
This look
Aditya,
The crawler is not open source and won't be in the near future. Anyway,
I have to change the license because it can be used for any personal or
commercial project.
Sincerely,
Dominique
On 02/03/11 10:02, findbestopensource wrote:
Hello Dominique Bejean,
Good job.
We identified
Lukas,
I am thinking about it but no decision yet.
Anyway, in next release, I will provide source code of pipeline stages
and connectors as samples.
Dominique
On 02/03/11 10:01, Lukáš Vlček wrote:
Hi,
is there any plan to open source it?
Regards,
Lukas
[OT] I tried HuriSearch, input "
Hi Dominique,
This looks nice.
In the past, I've been interested in (semi-)automatically inducing a
scheme/wrapper from a set of example webpages (often called 'wrapper
induction' in the scientific field).
This would allow for fast scheme-creation which could be used as a basis for
extraction.
L
Rosa,
In the pipeline, there is a stage that extract the text from the
original document (PDF, HTML, ...).
It is possible to plug scripts (Java 6 compliant) in order to keep only
relevant parts of the document.
See
http://www.wiizio.com/confluence/display/CRAWLUSERS/DocTextExtractor+stage
Do
David,
The UI was not the only reason that made me choose to write a totally new
crawler. After eliminating candidate crawlers due to various reasons
(inactive project, ...), Nutch and Heritrix were the 2 crawlers in my
short list of possible candidates to be used.
In my mind, the crawler and
I have read the solr book and the other book is on its way for me to read.
I need some help in the meantime.
a) Using the example solr system, how do I send a word document using curl
into the system? I want to have the ID as the full path of the document.
I have tried various commands
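For (a), a sketch using the extracting request handler from the example
solrconfig (the paths are invented; note the literal.id value is URL-encoded):

  curl "http://localhost:8983/solr/update/extract?literal.id=%2Fdocs%2Freport.doc&commit=true" \
       -F "myfile=@/docs/report.doc"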
Hello Dominique Bejean,
Good job.
We identified almost 8 open source web crawlers
http://www.findbestopensource.com/tagged/webcrawler I don't know how far
yours would be different from the rest.
Your license states that it is not open source but it is free for personal
use.
Regards
Aditya
ww
Hi,
is there any plan to open source it?
Regards,
Lukas
[OT] I tried HuriSearch, input "Java" into search field, it returned a lot
of references to coldfusion error pages. Maybe a recrawl would help?
On Wed, Mar 2, 2011 at 1:25 AM, Dominique Bejean
wrote:
> Hi,
>
> I would like to announce Cr
Nice job!
It would be good to be able to extract specific data from a given page
via XPath though.
Regards,
On 02/03/2011 01:25, Dominique Bejean wrote:
Hi,
I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java Web
Crawler. It includes:
* a crawler
* a document pro
It turns out you don't need to use DateFormatTransformer at all. The reason
why the timestamp mysql column fails to be inserted into solr is because in
schema.xml I mistakenly set indexed="false", stored="false". Of course that
won't make it into the index at all. No wonder the schema browser always shows no
Hey,
normally .. if i have problems with DIH:
* i start having a look at the mysql query log, to check which queries
are executed.
* re-run the query myself, verify the returned data
* Activate http://wiki.apache.org/solr/DataImportHandler#LogTransformer
and log the important data, check console ou
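A sketch of the LogTransformer wiring (entity, template, and query are
invented):

  <entity name="item" transformer="LogTransformer"
          logTemplate="imported ${item.title}" logLevel="info"
          query="select * from item">
    ...
  </entity>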