Bugs in GeoNetwork 3.12 #9206

@paul-hoeller

Description

My company maintains a closed-source fork of GeoNetwork for a client. While we intend to upgrade to version 4 eventually, that will require more work, so for the time being we are still on version 3. As part of this maintenance work, I have spotted and fixed some bugs pertaining to the Lucene search index. These fixes were not carried over to version 4, because version 4 no longer uses Lucene. However, I am told that other institutions are likewise still running GeoNetwork 3. I don't know whether you care to merge these changes into the 3.12.x branch and make a new release, but I am posting them here in the hope that at least the other users of version 3 will be able to find them and apply them to their projects.

These patches are intended to be applied on top of c3c3339, the current head of the 3.12.x branch.

Deleted documents are being retrieved from the index

Document numbers that aren't live have to be filtered out; they can't simply be read. This patch should eliminate the inexplicable "Document with no _id field skipped! Document is Document<>" messages from your log.
diff --git a/core/src/main/java/org/fao/geonet/kernel/search/SearchManager.java b/core/src/main/java/org/fao/geonet/kernel/search/SearchManager.java
index ecf33b4533..eed1a3596b 100644
--- a/core/src/main/java/org/fao/geonet/kernel/search/SearchManager.java
+++ b/core/src/main/java/org/fao/geonet/kernel/search/SearchManager.java
@@ -51,6 +51,8 @@ import org.apache.lucene.facet.taxonomy.CategoryPath;
 import org.apache.lucene.index.AtomicReader;
 import org.apache.lucene.index.AtomicReaderContext;
 import org.apache.lucene.index.FieldInfo.IndexOptions;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.MultiFields;
 import org.apache.lucene.index.SlowCompositeReaderWrapper;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.index.Terms;
@@ -60,6 +62,7 @@ import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.TermQuery;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;
 import org.fao.geonet.ApplicationContextHolder;
 import org.fao.geonet.Util;
@@ -120,10 +123,12 @@ import java.util.Collections;
 import java.util.Comparator;
 import java.util.HashMap;
 import java.util.HashSet;
+import java.util.Iterator;
 import java.util.LinkedHashSet;
 import java.util.List;
 import java.util.Map;
 import java.util.Map.Entry;
+import java.util.NoSuchElementException;
 import java.util.Set;
 import java.util.SortedSet;
 import java.util.TreeSet;
@@ -823,6 +828,52 @@ public class SearchManager implements ISearchManager {
         _spatial.writer().delete(txts);
     }
 
+    private static class ReaderDocIterator implements Iterator<Integer> {
+
+        //private IndexReader _reader;
+        private Integer _next;
+        private Bits _live;
+        private int _max;
+
+        public ReaderDocIterator(IndexReader reader) {
+            //_reader = reader;
+            _next = -1;
+            _live = MultiFields.getLiveDocs(reader);
+            _max = reader.maxDoc();
+            if (_live == null) {
+                _live = new Bits.MatchAllBits(_max);
+            }
+            prepareNext();
+        }
+
+        private void prepareNext() {
+            do {
+                if (++_next >= _max) {
+                    _next = null;
+                    break;
+                }
+            } while (!_live.get(_next)); // document is deleted
+            // According to the documentation for Lucene 4.9.0, which is what we are using, getLiveDocs is slow and one should use something else instead:
+            // https://lucene.apache.org/core/4_9_0/core/org/apache/lucene/index/MultiFields.html#getLiveDocs(org.apache.lucene.index.IndexReader)
+            // However, that information is apparently outdated and getLiveDocs is now the only way to accomplish this:
+            // https://stackoverflow.com/questions/2311845/is-it-possible-to-iterate-through-documents-stored-in-lucene-index#comment132504958_2321781
+            // https://stackoverflow.com/a/18499503
+            // https://lucene.apache.org/core/4_0_0/MIGRATE.html
+        }
+
+        @Override
+        public boolean hasNext() {
+            return _next != null;
+        }
+
+        @Override
+        public Integer next() {
+            if (_next == null) throw new NoSuchElementException();
+            Integer result = _next;
+            prepareNext();
+            return result;
+        }
+    }
 
     public Set<Integer> getDocsWithXLinks() throws Exception {
         IndexAndTaxonomy indexAndTaxonomy = getNewIndexReader(null);
@@ -831,11 +882,14 @@ public class SearchManager implements ISearchManager {
             GeonetworkMultiReader reader = indexAndTaxonomy.indexReader;
 
             Set<Integer> docs = new LinkedHashSet<Integer>();
-            for (int i = 0; i < reader.maxDoc(); i++) {
+            Iterator<Integer> it = new ReaderDocIterator(reader);
+            while (it.hasNext()) {
+                int i = it.next();
                 // Commented this out for lucene 4.0 and NRT indexing.  It shouldn't be needed I would guess but leave it here
                 // for a bit longer:  Commented out since: Dec 10 2012
                 // FIXME: strange lucene hack: sometimes it tries to load a deleted document
                 // if (reader.isDeleted(i)) continue;
+                // -> UPDATE: See comment in ReaderDocIterator.prepareNext()
 
                 DocumentStoredFieldVisitor idXLinkSelector = new DocumentStoredFieldVisitor(Geonet.IndexFieldNames.ID, "_hasxlinks");
                 reader.document(i, idXLinkSelector);
@@ -893,11 +947,14 @@ public class SearchManager implements ISearchManager {
 
             int capacity = (int) (reader.maxDoc() / 0.75) + 1;
             Map<String, String> docs = new HashMap<String, String>(capacity);
-            for (int i = 0; i < reader.maxDoc(); i++) {
+            Iterator<Integer> it = new ReaderDocIterator(reader);
+            while (it.hasNext()) {
+                int i = it.next();
                 // Commented this out for lucene 4.0 and NRT indexing.  It shouldn't be needed I would guess but leave it here
                 // for a bit longer:  Commented out since: Dec 10 2012
                 // FIXME: strange lucene hack: sometimes it tries to load a deleted document
                 // if (reader.isDeleted(i)) continue;
+                // -> UPDATE: See comment in ReaderDocIterator.prepareNext()
 
                 DocumentStoredFieldVisitor idChangeDateSelector = new DocumentStoredFieldVisitor(Geonet.IndexFieldNames.ID, "_changeDate");
                 reader.document(i, idChangeDateSelector);
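The skip-deleted iteration in the patch can be modeled without Lucene. The sketch below is a standalone illustration only: LiveBits stands in for Lucene's org.apache.lucene.util.Bits, and LiveDocIterator mirrors the logic of the ReaderDocIterator above; all names here are invented for the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class LiveDocIteratorDemo {
    // Stand-in for org.apache.lucene.util.Bits: bit i is true if doc i is live.
    interface LiveBits { boolean get(int index); }

    // Mirrors the patch's ReaderDocIterator: yields only live doc numbers.
    static class LiveDocIterator implements Iterator<Integer> {
        private final LiveBits live;
        private final int max;
        private Integer next = -1;

        LiveDocIterator(LiveBits live, int max) {
            this.live = live;
            this.max = max;
            advance();
        }

        private void advance() {
            do {
                if (++next >= max) { next = null; return; }
            } while (!live.get(next)); // skip deleted documents
        }

        @Override public boolean hasNext() { return next != null; }

        @Override public Integer next() {
            if (next == null) throw new NoSuchElementException();
            Integer result = next;
            advance();
            return result;
        }
    }

    static List<Integer> liveDocs(boolean[] bits) {
        List<Integer> out = new ArrayList<>();
        Iterator<Integer> it = new LiveDocIterator(i -> bits[i], bits.length);
        while (it.hasNext()) out.add(it.next());
        return out;
    }

    public static void main(String[] args) {
        // Docs 1 and 3 are deleted; only 0, 2 and 4 are returned.
        System.out.println(liveDocs(new boolean[]{true, false, true, false, true}));
    }
}
```

This prints [0, 2, 4]: the deleted document numbers are never handed to the caller, which is exactly what the plain loop over reader.maxDoc() failed to guarantee.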
Minor bugs relating to GeoNetworkAnalyzer

In GeoNetworkAnalyzer, wrapReader() is not called on the initial reader passed into createComponents(), only on readers passed to setReader() later.

In CharToSpaceReader, if read() is called with an offset (second argument) greater than 0, the substitution is not performed on the correct buffer range.
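To illustrate the off-by-offset bug (a minimal standalone sketch; the class and method names below are invented, not GeoNetwork code): the original loop used the count returned by read() as its upper index, while the correct upper index is offset + count.

```java
public class OffsetBugDemo {
    // Buggy: treats 'read' (a count of chars) as the end index.
    static void buggySubstitute(char[] buf, int off, int read) {
        for (int i = off; i < read; i++) {
            if (buf[i] == ',') buf[i] = ' ';
        }
    }

    // Fixed: the end index is off + read.
    static void fixedSubstitute(char[] buf, int off, int read) {
        for (int i = off; i < off + read; i++) {
            if (buf[i] == ',') buf[i] = ' ';
        }
    }

    public static void main(String[] args) {
        // Pretend 4 chars were read into the buffer at offset 2.
        char[] a = "xx,a,b".toCharArray();
        buggySubstitute(a, 2, 4);
        // The buggy loop only scans indices 2..3 and misses the comma at index 4.
        System.out.println(new String(a)); // prints "xx a,b"

        char[] b = "xx,a,b".toCharArray();
        fixedSubstitute(b, 2, 4);
        System.out.println(new String(b)); // prints "xx a b"
    }
}
```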

diff --git a/core/src/main/java/org/fao/geonet/kernel/search/CharToSpaceReader.java b/core/src/main/java/org/fao/geonet/kernel/search/CharToSpaceReader.java
index d744b9aa1c..da27676885 100644
--- a/core/src/main/java/org/fao/geonet/kernel/search/CharToSpaceReader.java
+++ b/core/src/main/java/org/fao/geonet/kernel/search/CharToSpaceReader.java
@@ -23,7 +23,6 @@
 
 package org.fao.geonet.kernel.search;
 
-import java.io.FilterReader;
 import java.io.IOException;
 import java.io.Reader;
 
@@ -38,36 +37,34 @@ import bak.pcj.set.CharOpenHashSet;
  *
  * @author jeichar
  */
-class CharToSpaceReader extends FilterReader {
+class CharToSpaceReader extends Reader {
 
-    char space = ' ';
-    private CharOpenHashSet set;
+    private Reader _reader;
+    private CharOpenHashSet _set;
 
     /**
      * @param charsToSetAsSpaces characters to convert to spaces
      */
     public CharToSpaceReader(Reader reader, char[] charsToSetAsSpaces) {
-        super(reader);
-        this.set = new CharOpenHashSet(charsToSetAsSpaces);
+        super();
+        _reader = reader;
+        _set = new CharOpenHashSet(charsToSetAsSpaces);
     }
 
     @Override
-    public int read(char[] in, int start, int end) throws IOException {
-        int read = super.read(in, start, end);
-        for (int i = start; i < read; i++) {
-            if (set.contains(in[i])) {
-                in[i] = space;
-            }
+    public int read(char[] cbuf, int off, int len) throws IOException {
+        int r = _reader.read(cbuf, off, len);
+        if (r < 0) return r;
+        int end = off + r;
+        for (int i = off; i < end; ++i) {
+            if (_set.contains(cbuf[i])) cbuf[i] = ' ';
         }
-        return read;
+        return r;
     }
 
     @Override
-    public int read() throws IOException {
-        int read = super.read();
-        char asChar = (char) read;
-        if (asChar == read && set.contains(asChar)) return ' ';
-        else return read;
+    public void close() throws IOException {
+        _reader.close();
     }
 
 }
diff --git a/core/src/main/java/org/fao/geonet/kernel/search/GeoNetworkAnalyzer.java b/core/src/main/java/org/fao/geonet/kernel/search/GeoNetworkAnalyzer.java
index 600450ed96..de66c6923b 100644
--- a/core/src/main/java/org/fao/geonet/kernel/search/GeoNetworkAnalyzer.java
+++ b/core/src/main/java/org/fao/geonet/kernel/search/GeoNetworkAnalyzer.java
@@ -23,6 +23,7 @@
 package org.fao.geonet.kernel.search;
 
 import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.Tokenizer;
 import org.apache.lucene.analysis.core.LowerCaseFilter;
 import org.apache.lucene.analysis.core.StopFilter;
@@ -35,6 +36,7 @@ import org.fao.geonet.utils.Log;
 
 import java.io.IOException;
 import java.io.Reader;
+import java.util.Arrays;
 import java.util.Set;
 
 /**
@@ -49,8 +51,8 @@ import java.util.Set;
  */
 public final class GeoNetworkAnalyzer extends Analyzer {
 
-    private final char[] charsToIgnore;
-    private final CharArraySet stopwords;
+    private final char[] _charsToIgnore;
+    private final CharArraySet _stopwords;
 
     /**
      * Creates this analyzer using no stopwords.
@@ -63,20 +65,15 @@ public final class GeoNetworkAnalyzer extends Analyzer {
      *
      */
     public GeoNetworkAnalyzer(final Set<String> stopwords, char[] charsToIgnore) {
-        if (stopwords == null || stopwords.isEmpty()) {
-            this.stopwords = CharArraySet.EMPTY_SET;
-        } else {
-            this.stopwords = new CharArraySet(Geonet.LUCENE_VERSION, stopwords, true);
-        }
-
-        if (charsToIgnore == null) {
-            this.charsToIgnore = new char[0];
-        } else {
-            this.charsToIgnore = new char[charsToIgnore.length];
-            System.arraycopy(charsToIgnore, 0, this.charsToIgnore, 0, charsToIgnore.length);
-            for (char s : charsToIgnore) {
-                Log.debug(getClass().getName(), "character to ignore: " + s);
-            }
+        super();
+        _stopwords = stopwords == null || stopwords.isEmpty() ?
+            CharArraySet.EMPTY_SET :
+            new CharArraySet(Geonet.LUCENE_VERSION, stopwords, true);
+        _charsToIgnore = charsToIgnore == null ?
+            new char[0] :
+            Arrays.copyOf(charsToIgnore, charsToIgnore.length);
+        for (char s : _charsToIgnore) {
+            Log.debug(getClass().getName(), "character to ignore: '" + s + "'");
         }
     }
 
@@ -89,33 +86,25 @@ public final class GeoNetworkAnalyzer extends Analyzer {
      * @return the {@link TokenStreamComponents} for this analyzer.
      */
     @Override
-    protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
+    protected Analyzer.TokenStreamComponents createComponents(final String fieldName, Reader reader) {
+        reader = wrapReader(reader);
         final Tokenizer source = new StandardTokenizer(Geonet.LUCENE_VERSION, reader);
-        ASCIIFoldingFilter asciiFoldingFilter = new ASCIIFoldingFilter(new LowerCaseFilter(Geonet.LUCENE_VERSION,
-            new StandardFilter(Geonet.LUCENE_VERSION, source)));
-
-        if (this.stopwords != null && !this.stopwords.isEmpty()) {
-            return new TokenStreamComponents(source, new StopFilter(Geonet.LUCENE_VERSION, asciiFoldingFilter, this.stopwords)) {
-                @Override
-                protected void setReader(final Reader reader) throws IOException {
-                    super.setReader(wrapReader(reader));
-                }
-            };
-        } else {
-            return new TokenStreamComponents(source, asciiFoldingFilter) {
-                @Override
-                protected void setReader(final Reader reader) throws IOException {
-                    super.setReader(wrapReader(reader));
-                }
-            };
-        }
+        TokenStream sink = source;
+        sink = new StandardFilter(Geonet.LUCENE_VERSION, sink);
+        sink = new LowerCaseFilter(Geonet.LUCENE_VERSION, sink);
+        sink = new ASCIIFoldingFilter(sink);
+        if (!_stopwords.isEmpty()) sink = new StopFilter(Geonet.LUCENE_VERSION, sink, _stopwords);
+        return new Analyzer.TokenStreamComponents(source, sink) {
+            @Override
+            protected void setReader(Reader reader2) throws IOException {
+                super.setReader(wrapReader(reader2));
+            }
+        };
     }
 
     private Reader wrapReader(final Reader reader) {
-        if (charsToIgnore != null && charsToIgnore.length > 0) {
-            return new CharToSpaceReader(reader, charsToIgnore);
-        } else {
-            return reader;
-        }
+        return _charsToIgnore.length == 0 ?
+            reader :
+            new CharToSpaceReader(reader, _charsToIgnore);
     }
 }
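The effect of the missing wrapReader() call can be demonstrated with a plain java.io.Reader pipeline (a standalone sketch; wrap() is an invented stand-in for wrapReader(), mapping commas to spaces the way CharToSpaceReader maps its configured characters):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class WrapReaderDemo {
    // Stand-in for wrapReader(): substitutes spaces for commas on the fly.
    static Reader wrap(Reader r) {
        return new Reader() {
            @Override public int read(char[] cbuf, int off, int len) throws IOException {
                int n = r.read(cbuf, off, len);
                for (int i = off; i < off + Math.max(n, 0); i++) {
                    if (cbuf[i] == ',') cbuf[i] = ' ';
                }
                return n;
            }
            @Override public void close() throws IOException { r.close(); }
        };
    }

    // Reads a Reader to exhaustion and returns its contents.
    static String drain(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8];
        int n;
        while ((n = r.read(buf, 0, buf.length)) >= 0) sb.append(buf, 0, n);
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(drain(new StringReader("a,b,c")));       // unwrapped: "a,b,c"
        System.out.println(drain(wrap(new StringReader("a,b,c")))); // wrapped:   "a b c"
    }
}
```

Before the fix, the initial reader took the first (unwrapped) path while only later setReader() calls took the second; after the fix, every reader is wrapped.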

Furthermore, there is this bug in 3.12 that is unrelated to Lucene and hence, I think, still exists in version 4.
