Hello,
We are trying to use the data import handler (DIH) on a collection which contains many files (one XML file per document). Our configuration works for a small number of files, but the import fails with an OutOfMemoryError when run on 10M files spread over several directories.

This is the relevant part of our config.xml:

    <entity name="noticebib"
            datasource="null"
            processor="FileListEntityProcessor"
            fileName="^.*\.xml$"
            recursive="true"
            baseDir="${noticesBIB.basedir}"
            rootEntity="false">
      <entity name="processorDocument"
              processor="XPathEntityProcessor"
              url="${noticebib.fileAbsolutePath}"
              xsl="xslt/mnb/IXM_MNb.xsl"
              forEach="/record"
              transformer="fr.bnf.solr.BnfDateTransformer">
        <all my mapping>
      </entity>
    </entity>

When we run it on a directory containing 10 subdirectories, each of which contains 1000 subdirectories, each of which in turn contains 1000 XML files (so 10M files in total), the indexing process no longer works: we get a java.lang.OutOfMemoryError, even with 512 MB or 1 GB of heap:

    ERROR 2013-05-24 15:26:25,733 http-9145-2 org.apache.solr.handler.dataimport.DataImporter (96) - Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassCastException: java.lang.OutOfMemoryError cannot be cast to java.lang.Exception
            at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:266)
            at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
            at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
            at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
            at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
            at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)

Monitoring the JVM with VisualVM, I saw that most of the time is spent in FileListEntityProcessor.accept (called from getFolderFiles), so I assume the error occurs while the list of files to be indexed is being built. Indeed, the complete list of files to index is built by getFolderFiles, which is called on the first call to nextRow(); the indexing itself only starts after that:

    org/apache/solr/handler/dataimport/FileListEntityProcessor.java
        private void getFolderFiles(File dir, final List<Map<String, Object>> fileDetails)

I located the fileDetails variable, which holds the list of my XML files. At the time of the crash it contained 611,345 entries for approximately 500 MB of memory, while I have roughly 10M XML files, which is why I think the listing was not yet finished. At about 0.8 KB per entry (500 MB / 611,345), holding the entire list would need somewhere between 5 and 10 GB of heap for my process.

So I have several questions:

- Is it possible to attach several FileListEntityProcessor entities to a single XPathEntityProcessor in data-config.xml? That way I could run the import in ten passes, one per first-level directory (see PS 1 below for a sketch of what I mean).
- Is there a roadmap for optimizing this method, for example by not building the complete file list up front, but fetching it in batches of, say, 1000 documents (see PS 2 below)?
- Or by storing the file list in a temporary file in order to save memory?

Regards,
Jérôme Dupont
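PS 1: To make the first question concrete, here is an untested sketch of the kind of workaround I have in mind. It uses DIH request parameters (${dataimporter.request.*}) to parameterize baseDir, so the same entity can be run once per first-level directory; the parameter name "subdir" is my own invention:

    <entity name="noticebib"
            datasource="null"
            processor="FileListEntityProcessor"
            fileName="^.*\.xml$"
            recursive="true"
            baseDir="${noticesBIB.basedir}/${dataimporter.request.subdir}"
            rootEntity="false">
      ...
    </entity>

The import would then be launched ten times, once per first-level directory, e.g. (where "00" stands for one of my ten first-level directories, whatever they are actually named):

    http://localhost:9145/solr/dataimport?command=full-import&subdir=00&clean=false

With clean=false on all but the first run, each pass would add to the index instead of wiping it, and each pass would only need to hold about 1M file entries in memory, i.e. roughly 1 GB by the estimate above.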
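PS 2: For the second question, here is a rough, self-contained illustration (plain Java, not actual Solr code) of what I mean by not building the whole list up front: a lazy walker that expands one directory at a time and hands files out through an Iterator, so memory is bounded by the size of the pending stack rather than by the total number of files.

    import java.io.File;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.Iterator;
    import java.util.NoSuchElementException;
    import java.util.regex.Pattern;

    /**
     * Rough illustration only (not Solr code): walks a directory tree
     * lazily, expanding one directory at a time, so memory is bounded
     * by the pending stack instead of the total number of files.
     */
    public class LazyFileWalker implements Iterator<File> {
        private final Pattern fileName;                // same role as DIH's fileName regex
        private final Deque<File> pending = new ArrayDeque<File>();
        private File next;

        public LazyFileWalker(File baseDir, String fileNameRegex) {
            this.fileName = Pattern.compile(fileNameRegex);
            pending.push(baseDir);
            advance();
        }

        /** Pops entries until the next matching file is found. */
        private void advance() {
            next = null;
            while (!pending.isEmpty()) {
                File f = pending.pop();
                if (f.isDirectory()) {
                    File[] children = f.listFiles();
                    if (children != null) {
                        for (File child : children) {
                            pending.push(child);       // expand one directory at a time
                        }
                    }
                } else if (fileName.matcher(f.getName()).matches()) {
                    next = f;                          // next file to index
                    return;
                }
            }
        }

        public boolean hasNext() { return next != null; }

        public File next() {
            if (next == null) throw new NoSuchElementException();
            File current = next;
            advance();
            return current;
        }

        public void remove() { throw new UnsupportedOperationException(); }

        // Example: iterate without ever materializing the full list.
        public static void main(String[] args) {
            Iterator<File> it = new LazyFileWalker(new File(args[0]), "^.*\\.xml$");
            while (it.hasNext()) {
                System.out.println(it.next().getAbsolutePath());
            }
        }
    }

With my layout (10 x 1000 x 1000), the pending deque would peak at a couple of thousand entries instead of 10M. Something along these lines inside FileListEntityProcessor, feeding nextRow() directly, would presumably remove the need for the fileDetails list entirely.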