Commit 025e4fa9 authored by Igor Kushnir's avatar Igor Kushnir
Browse files

Fix and improve LanguageController's MimeTypeCache

Bugs in the old implementation that are fixed in this commit:
  * The regular expression matching code was incorrect and so it never
matched the only pattern it handled: "CMakeLists.txt". As a result,
LanguageController::languagesForUrl(".../CMakeLists.txt") returned an
empty language list in background threads; resorted to extra work
culminating in a call to LanguageController::languagesForMimetype() in
the main thread.
  * The suffix matching optimization matched patterns case-sensitively
and so didn't match names like "X.CPP". As with the regular expression
matching bug, this resulted in a wrong return value in background
threads and extra work in the main thread.
  * The suffix matching optimization assumed that '*' is the only
wildcard character. While this is actually the case for glob patterns of
the mime types that can currently end up in mimeTypeCache, it might have
led to a surprising bug if more complex glob patterns became supported
in the future.

TestLanguageController::testLanguagesForUrlWithCache() fails on the
following data rows without this commit because of these bugs:
  - CMakeLists
  - cmakelists wrong case
  - upper-case
  - mixed-case

Improvements of the MimeTypeCache reimplementation in this commit:
  * Literal pattern optimization: "CMakeLists.txt" was the only pattern
that required regular expression matching in the old implementation. It
could not be handled by the suffix matching optimization. Now the new
separate pattern category m_literalPatterns handles this case so that
the slower regular expression matching never happens in practice.
  * The suffixes, literal patterns and regular expressions are now
created once and cached rather than constructed in each
languagesForUrl() call.
  * QRegularExpression is now used instead of the deprecated QRegExp.

This is the list of wildcard patterns supported by maintained KDevelop
plugins (collated from X-KDevelop-SupportedMimeTypes plugin entries and
/usr/share/mime/globs2):
kdevclangsupport
    text/x-chdr
        *.h
    text/x-c++hdr
        *.hh
        *.hpp
        *.hp
        *.h++
        *.hxx
    text/x-csrc
        *.c:cs
    text/x-c++src
        *.c++
        *.cc
        *.cxx
        *.C:cs
        *.cpp
    text/x-opencl-src
        *.cl
    text/vnd.nvidia.cuda.csrc
        *.cu
    text/vnd.nvidia.cuda.chdr
        *.cuh
    text/x-objcsrc
        *.m
kdevpatchreview
    text/x-patch
        *.patch
        *.diff
kdevqmljs
    text/x-qml
        *.qml
        *.qmlproject
        *.qmltypes
    application/javascript
        *.jsm
        *.mjs
        *.js
KDevCMakeManager
    text/x-cmake
        cmakelists.txt
        *.cmake
KDevCssSupport
    text/css
        *.css
    text/html
        *.html
        *.htm
KDevPhpSupport
    application/x-php
        *.phps
        *.php
        *.php3
        *.php4
        *.php5
kdevpythonsupport
    text/x-python
        *.wsgi
        *.py
        *.pyx
    text/x-python3
        *.py3x
        *.py3
        *.py
KDevRubySupport
    application/x-ruby
        *.rb

Only *.c and *.C out of all supported patterns should be matched
case-sensitively. But both of these patterns belong to the same plugin -
kdevclangsupport. So LanguageController can safely match all patterns
case-insensitively. See also
https://specifications.freedesktop.org/shared-mime-info-spec/shared-mime-info-spec-latest.html

Average BenchLanguageController results before and at this commit in
milliseconds per iteration:
        Data row                    Before      At
1. benchLanguagesForUrlNoCache()
    CMakeLists                      0.029       0.00046
    cmakelists wrong case           0.029       0.00046
    lower-case                      0.0023      0.00058
    upper-case                      0.029       0.00058
    mixed-case                      0.029       0.00058
    .C                              0.0023      0.00050
    .cl                             0.0023      0.00070
    existent C with extension       0.0022      0.00053
    .cc                             0.0023      0.00058
    .cmake                          0.0016      0.00039
    .diff                           0.00094     0.00037
    .qml                            0.0012      0.00036
    existent C w/o extension        0.16        0.16
    existent patch w/o extension    0.20        0.20
2. benchLanguagesForUrlFilledCache()
    CMakeLists                      0.032       0.0011
    cmakelists wrong case           0.031       0.0011
    lower-case                      0.0039      0.00083
    upper-case                      0.030       0.00083
    mixed-case                      0.030       0.00085
    .C                              0.0039      0.00072
    .cl                             0.0039      0.00091
    existent C with extension       0.0038      0.00064
    .cc                             0.0039      0.00080
    .cmake                          0.0039      0.00090
    .diff                           0.0039      0.00083
    .qml                            0.0039      0.00093
    existent C w/o extension        0.16        0.16
    existent patch w/o extension    0.20        0.20
3. benchLanguagesForUrlNoMatchNoCache()
    empty                           0.0021      0.0016
    archive                         0.024       0.023
    OpenDocument Text               0.024       0.023
    existent archive with extension 0.030       0.029
    existent archive w/o extension  0.15        0.15
4. benchLanguagesForUrlNoMatchFilledCache()
    empty                           0.0054      0.0018
    archive                         0.029       0.024
    OpenDocument Text               0.029       0.024
    existent archive with extension 0.035       0.030
    existent archive w/o extension  0.16        0.15

Almost every benchmark runs faster now. Many run more than ten times
faster thanks to this commit.
parent 1fedf941
......@@ -7,9 +7,14 @@
#include "languagecontroller.h"
#include <algorithm>
#include <utility>
#include <vector>
#include <QHash>
#include <QMimeDatabase>
#include <QMutexLocker>
#include <QRegularExpression>
#include <QThread>
#include <interfaces/idocument.h>
......@@ -31,8 +36,96 @@
namespace {
QString KEY_SupportedMimeTypes() { return QStringLiteral("X-KDevelop-SupportedMimeTypes"); }
QString KEY_ILanguageSupport() { return QStringLiteral("ILanguageSupport"); }
bool containsWildcardCharacter(QString::const_iterator first, QString::const_iterator last)
{
const auto isWildcardCharacter = [](QChar character) {
const auto u = character.unicode();
return u == '*' || u == '?' || u == '[';
};
return std::any_of(first, last, isWildcardCharacter);
}
using KDevelop::ILanguageSupport;
class MimeTypeCache
{
public:
void addMimeType(const QString& mimeTypeName, ILanguageSupport* language);
QList<ILanguageSupport*> languagesForFileName(const QString& fileName) const;
private:
void addGlobPattern(const QString& pattern, ILanguageSupport* language);
std::vector<std::pair<QString, ILanguageSupport*>> m_suffixes; ///< contains e.g. ".cpp"
std::vector<std::pair<QString, ILanguageSupport*>> m_literalPatterns; ///< contains e.g. "CMakeLists.txt"
/// fallback, hopefully empty in practice
std::vector<std::pair<QRegularExpression, ILanguageSupport*>> m_regularExpressions;
};
void MimeTypeCache::addMimeType(const QString& mimeTypeName, ILanguageSupport* language)
{
const QMimeType mime = QMimeDatabase().mimeTypeForName(mimeTypeName);
if (mime.isValid()) {
const auto globPatterns = mime.globPatterns();
for (const QString& pattern : globPatterns) {
addGlobPattern(pattern, language);
}
} else {
qCWarning(SHELL) << "could not create mime-type" << mimeTypeName;
}
}
QList<ILanguageSupport*> MimeTypeCache::languagesForFileName(const QString& fileName) const
{
QList<ILanguageSupport*> languages;
// lastLanguageEquals() helps to improve performance by skipping checks for an already
// added language. It also prevents duplicate elements of languages when there are
// multiple equivalent glob patterns for a given language (for example, clangsupport
// plugin supports *.c from text/x-csrc and *.C from text/x-c++src).
const auto lastLanguageEquals = [&languages](const ILanguageSupport* lang) {
return !languages.empty() && languages.constLast() == lang;
};
for (const auto& p : m_suffixes) {
if (!lastLanguageEquals(p.second) && fileName.endsWith(p.first, Qt::CaseInsensitive)) {
languages.push_back(p.second);
}
}
for (const auto& p : m_literalPatterns) {
if (fileName.compare(p.first, Qt::CaseInsensitive) == 0 && !lastLanguageEquals(p.second)) {
languages.push_back(p.second);
}
}
for (const auto& p : m_regularExpressions) {
if (!lastLanguageEquals(p.second) && p.first.match(fileName).hasMatch()) {
languages.push_back(p.second);
}
}
return languages;
}
void MimeTypeCache::addGlobPattern(const QString& pattern, ILanguageSupport* language)
{
if (pattern.startsWith(QLatin1Char{'*'})) {
if (!containsWildcardCharacter(pattern.cbegin() + 1, pattern.cend())) {
m_suffixes.emplace_back(pattern.mid(1), language);
return;
}
} else if (!containsWildcardCharacter(pattern.cbegin(), pattern.cend())) {
m_literalPatterns.emplace_back(pattern, language);
return;
}
QRegularExpression regularExpression(QRegularExpression::wildcardToRegularExpression(pattern),
QRegularExpression::CaseInsensitiveOption);
m_regularExpressions.emplace_back(std::move(regularExpression), language);
}
} // namespace
namespace KDevelop {
......@@ -66,8 +159,7 @@ public:
LanguageHash languages; //Maps language-names to languages
LanguageCache languageCache; //Maps mimetype-names to languages
using MimeTypeCache = QMultiHash<QMimeType, ILanguageSupport*>;
MimeTypeCache mimeTypeCache; //Maps mimetypes to languages
MimeTypeCache mimeTypeCache; //Maps mimetype-glob-patterns to languages
BackgroundParser* const backgroundParser;
StaticAssistantsManager* staticAssistantsManager;
......@@ -91,12 +183,7 @@ void LanguageControllerPrivate::addLanguageSupport(ILanguageSupport* languageSup
for (const QString& mimeTypeName : mimetypes) {
qCDebug(SHELL) << "adding supported mimetype:" << mimeTypeName << "language:" << languageSupport->name();
languageCache[mimeTypeName] << languageSupport;
QMimeType mime = QMimeDatabase().mimeTypeForName(mimeTypeName);
if (mime.isValid()) {
mimeTypeCache.insert(mime, languageSupport);
} else {
qCWarning(SHELL) << "could not create mime-type" << mimeTypeName;
}
mimeTypeCache.addMimeType(mimeTypeName, languageSupport);
}
}
......@@ -248,40 +335,11 @@ QList<ILanguageSupport*> LanguageController::languagesForUrl(const QUrl &url)
QMutexLocker lock(&d->dataMutex);
QList<ILanguageSupport*> languages;
if(d->m_cleanedUp)
return languages;
const QString fileName = url.fileName();
return {};
///TODO: cache regexp or simple string pattern for endsWith matching
QRegExp exp(QString(), Qt::CaseInsensitive, QRegExp::Wildcard);
///non-crashy part: Use the mime-types of known languages
for(LanguageControllerPrivate::MimeTypeCache::const_iterator it = d->mimeTypeCache.constBegin();
it != d->mimeTypeCache.constEnd(); ++it)
{
const auto globPatterns = it.key().globPatterns();
for (const QString& pattern : globPatterns) {
if(pattern.startsWith(QLatin1Char('*'))) {
const QStringRef subPattern = pattern.midRef(1);
if (!subPattern.contains(QLatin1Char('*'))) {
//optimize: we can skip the expensive QRegExp in this case
//and do a simple string compare (much faster)
if (fileName.endsWith(subPattern)) {
languages << *it;
}
continue;
}
}
exp.setPattern(pattern);
if(int position = exp.indexIn(fileName)) {
if(position != -1 && exp.matchedLength() + position == fileName.length())
languages << *it;
}
}
}
auto languages = d->mimeTypeCache.languagesForFileName(url.fileName());
//Never use findByUrl from within a background thread, and never load a language support
//from within the backgruond thread. Both is unsafe, and can lead to crashes
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment