Commit e6c5be3a authored by Nicolas Fella

Use QStringTokenizer for string splitting

QStringTokenizer allows for allocation-free string splitting, which is
very beneficial for performance.

The official class is only available in Qt 6, but KDToolBox (https://github.com/KDAB/KDToolBox/) offers a drop-in backport.
parent 8e434b41
......@@ -25,6 +25,7 @@ set(kerfuffle_SRCS
pluginsettingspage.cpp
archiveentry.cpp
options.cpp
qstringtokenizer.cpp
)
kconfig_add_kcfg_files(kerfuffle_SRCS settings.kcfgc GENERATE_MOC)
......
/****************************************************************************
** MIT License
**
** Copyright (C) 2020-2021 Klarälvdalens Datakonsult AB, a KDAB Group company, info@kdab.com, author Marc Mutz <marc.mutz@kdab.com>
**
** This file is part of KDToolBox (https://github.com/KDAB/KDToolBox).
**
** Permission is hereby granted, free of charge, to any person obtaining a copy
** of this software and associated documentation files (the "Software"), to deal
** in the Software without restriction, including without limitation the rights
** to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
** copies of the Software, and to permit persons to whom the Software is
** furnished to do so, subject to the following conditions:
**
** The above copyright notice and this permission notice (including the next paragraph)
** shall be included in all copies or substantial portions of the Software.
**
** THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
** IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
** FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
** AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
** LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
** ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
** DEALINGS IN THE SOFTWARE.
****************************************************************************/
#include "qstringtokenizer.h"
#include "qstringalgorithms.h"
/*!
\class QStringTokenizer
\brief The QStringTokenizer class splits strings into tokens along given separators
\reentrant
Splits a string into substrings wherever a given separator occurs,
and returns a (lazy) list of those strings. If the separator does
not match anywhere in the string, produces a single element
containing this string. If the separator is empty,
QStringTokenizer produces an empty string, followed by each of the
string's characters, followed by another empty string. The two
enumerations Qt::SplitBehavior and Qt::CaseSensitivity further
control the output.
QStringTokenizer drives QStringView::tokenize(), but, at least with a
recent compiler, you can use it directly, too:
\code
for (auto token : QStringTokenizer{string, separator})
    use(token);
\endcode
\note You should never, ever, name the template arguments of a
QStringTokenizer explicitly. If you can use C++17 Class Template
Argument Deduction (CTAD), you may write
\c{QStringTokenizer{string, separator}} (without template
arguments). If you can't use C++17 CTAD, you must use the
qTokenize() factory function (or, in Qt 6, the QStringView::tokenize()
or QLatin1String::tokenize() member functions) and store the return
value only in \c{auto} variables:
\code
auto result = qTokenize(string, sep);
\endcode
This is because the template arguments of QStringTokenizer have a
very subtle dependency on the specific string and separator types
from which they are constructed, and they don't usually
correspond to the actual types passed.
\section Lazy Sequences
QStringTokenizer acts as a so-called lazy sequence, that is, each
next element is only computed once you ask for it. Lazy sequences
have the advantage that they only require O(1) memory. They have
the disadvantage that, at least for QStringTokenizer, they only
allow forward, not random-access, iteration.
The intended use case is that you just plug it into a range-based for loop:
\code
for (auto token : QStringTokenizer{string, separator})
    use(token);
\endcode
or a C++20 ranges algorithm:
\code
std::ranges::for_each(QStringTokenizer{string, separator},
                      [] (auto token) { use(token); });
\endcode
\section End Sentinel
The QStringTokenizer iterators cannot be used with classical STL
algorithms, because those require iterator/iterator pairs, while
QStringTokenizer uses sentinels, that is, it uses a different
type, QStringTokenizer::sentinel, to mark the end of the
range. This improves performance, because the sentinel is an empty
type. Sentinels are supported from C++17 (for ranged for)
and C++20 (for algorithms using the new ranges library).
QStringTokenizer falls back to a non-sentinel end iterator
implementation if the compiler doesn't support separate types for
begin and end iterators in ranged for loops
(\link{https://wg21.link/P0184}{P0184}), in which case traditional
STL algorithms will \em appear to be supported, but as you migrate
to a compiler that supports P0184, such code will break. We
recommend using only the C++20 \c{std::ranges} algorithms, or, if
you're stuck on C++14/17 for the time being,
\link{https://github.com/ericniebler/range-v3}{Eric Niebler's
Ranges v3 library}, which has the same semantics as the C++20
\c{std::ranges} library.
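As a rough illustration of the lazy-sequence and sentinel ideas described
above, the following standard-library sketch splits a \c{std::string_view}
on a single character. \c{LazySplit}, its nested types and \c{collect()} are
hypothetical names invented for this example. This is not how
QStringTokenizer is implemented, and it assumes a C++17 compiler (needed
for the differently typed begin/end in the range-based for loop).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a lazy tokenizer; not the Qt implementation.
class LazySplit {
public:
    struct sentinel {};  // empty end marker, a different type than iterator

    class iterator {
    public:
        iterator(std::string_view haystack, char sep)
            : rest_(haystack), sep_(sep) { advance(); }

        std::string_view operator*() const { return token_; }
        iterator &operator++() { advance(); return *this; }

        // The loop ends when the iterator has run past the last token.
        friend bool operator!=(const iterator &it, sentinel) { return !it.done_; }

    private:
        // Computes the next token only when asked: O(1) memory, forward only.
        void advance() {
            if (exhausted_) { done_ = true; return; }
            const std::size_t pos = rest_.find(sep_);
            token_ = rest_.substr(0, pos);
            if (pos == std::string_view::npos)
                exhausted_ = true;          // token_ is the last one
            else
                rest_.remove_prefix(pos + 1);
        }

        std::string_view rest_;
        std::string_view token_;
        char sep_;
        bool exhausted_ = false;
        bool done_ = false;
    };

    LazySplit(std::string_view haystack, char sep)
        : haystack_(haystack), sep_(sep) {}

    iterator begin() const { return iterator{haystack_, sep_}; }
    sentinel end() const { return {}; }

private:
    std::string_view haystack_;
    char sep_;
};

// Copies the tokens into owning strings so they can be inspected later.
std::vector<std::string> collect(std::string_view text, char sep) {
    std::vector<std::string> out;
    for (std::string_view token : LazySplit{text, sep})
        out.emplace_back(token);
    return out;
}
```

Like the tokenizer documented here, the sketch keeps empty tokens and
yields the whole input as a single token when the separator never matches.
Because begin() and end() return different types, it cannot be passed to a
classical iterator-pair algorithm such as \c{std::find}.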
\section Temporaries
QStringTokenizer is very carefully designed to avoid dangling
references. If you construct a tokenizer from a temporary string
(an rvalue), that argument is stored internally, so the referenced
data isn't deleted before it is tokenized:
\code
auto tok = QStringTokenizer{widget.text(), u','};
// return value of `widget.text()` is destroyed, but content was moved into `tok`
for (auto e : tok)
    use(e);
\endcode
If you pass named objects (lvalues), then QStringTokenizer does
not store a copy. You are responsible for keeping the named object's
data around for longer than the tokenizer operates on it:
\code
auto text = widget.text();
auto tok = QStringTokenizer{text, u','};
text.clear(); // destroy content of `text`
for (auto e : tok) // ERROR: `tok` references deleted data!
    use(e);
\endcode
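The rvalue/lvalue distinction above can be sketched with the standard
library alone. \c{PinnedText} and its \c{owns} member are hypothetical
names, and the deduction guides are only one possible way to get this
behaviour, not necessarily the mechanism QStringTokenizer uses. The sketch
also shows why the deduced template argument need not match the type that
was passed in.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <type_traits>
#include <utility>

// Hypothetical holder: owns a std::string when built from an rvalue,
// otherwise it only stores a view onto the caller's string.
template <typename Storage>
class PinnedText {
public:
    template <typename Arg>
    explicit PinnedText(Arg &&arg) : storage_(std::forward<Arg>(arg)) {}

    // Both storage kinds convert to a non-owning view.
    std::string_view view() const { return storage_; }

    static constexpr bool owns = std::is_same<Storage, std::string>::value;

private:
    Storage storage_;
};

// Deduction guides: lvalues are merely viewed, rvalues are moved in.
PinnedText(std::string &) -> PinnedText<std::string_view>;
PinnedText(const std::string &) -> PinnedText<std::string_view>;
PinnedText(std::string &&) -> PinnedText<std::string>;
```

With these guides, \c{PinnedText{text}} for a named \c{std::string text}
deduces \c{PinnedText<std::string_view>} and aliases \c{text}, so clearing
or destroying \c{text} invalidates the view. \c{PinnedText{std::string{"c,d"}}}
deduces \c{PinnedText<std::string>} and keeps its own data alive.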
\sa QStringView::split(), QLatin1String::split(), Qt::SplitBehavior, Qt::CaseSensitivity
*/
/*!
\typedef QStringTokenizer::value_type
Alias for \c{const QStringView} or \c{const QLatin1String},
depending on the tokenizer's \c Haystack template argument.
*/
/*!
\typedef QStringTokenizer::difference_type
Alias for qsizetype.
*/
/*!
\typedef QStringTokenizer::size_type
Alias for \c{std::size_t}.
*/
/*!
\typedef QStringTokenizer::reference
Alias for \c{value_type &}.
QStringTokenizer does not support mutable references, so this is
the same as const_reference.
*/
/*!
\typedef QStringTokenizer::const_reference
Alias for \c{value_type &}.
*/
/*!
\typedef QStringTokenizer::pointer
Alias for \c{value_type *}.
QStringTokenizer does not support mutable iterators, so this is
the same as const_pointer.
*/
/*!
\typedef QStringTokenizer::const_pointer
Alias for \c{value_type *}.
*/
/*!
\typedef QStringTokenizer::iterator
This typedef provides an STL-style const iterator for
QStringTokenizer.
QStringTokenizer does not support mutable iterators, so this is
the same as const_iterator.
\sa const_iterator
*/
/*!
\typedef QStringTokenizer::const_iterator
This typedef provides an STL-style const iterator for
QStringTokenizer.
\sa iterator
*/
/*!
\typedef QStringTokenizer::sentinel
This typedef provides an STL-style sentinel for
QStringTokenizer::iterator and QStringTokenizer::const_iterator.
\sa const_iterator
*/
/*!
\fn QStringTokenizer(Haystack haystack, String needle, Qt::CaseSensitivity cs, Qt::SplitBehavior sb)
\fn QStringTokenizer(Haystack haystack, String needle, Qt::SplitBehavior sb, Qt::CaseSensitivity cs)
Constructs a string tokenizer that splits the string \a haystack
into substrings wherever \a needle occurs, and allows iteration
over those strings as they are found. If \a needle does not match
anywhere in \a haystack, a single element containing \a haystack
is produced.
\a cs specifies whether \a needle should be matched case
sensitively or case insensitively.
If \a sb is Qt::SkipEmptyParts, empty entries don't
appear in the result. By default, empty entries are included.
\sa QStringView::split(), QLatin1String::split(), Qt::CaseSensitivity, Qt::SplitBehavior
*/
/*!
\fn QStringTokenizer::const_iterator QStringTokenizer::begin() const
Returns a const \l{STL-style iterators}{STL-style iterator}
pointing to the first token in the list.
\sa end(), cbegin()
*/
/*!
\fn QStringTokenizer::const_iterator QStringTokenizer::cbegin() const
Same as begin().
\sa cend(), begin()
*/
/*!
\fn QStringTokenizer::sentinel QStringTokenizer::end() const
Returns a const \l{STL-style iterators}{STL-style sentinel}
pointing to the imaginary token after the last token in the list.
\sa begin(), cend()
*/
/*!
\fn QStringTokenizer::sentinel QStringTokenizer::cend() const
Same as end().
\sa cbegin(), end()
*/
/*!
\fn QStringTokenizer::toContainer(Container &&c) const &
Convenience method to convert the lazy sequence into a
(typically) random-access container.
This function is only available if \c Container has a \c value_type
matching this tokenizer's value_type.
If you pass in a named container (an lvalue), then that container
is filled, and a reference to it is returned.
If you pass in a temporary container (an rvalue, incl. the default
argument), then that container is filled, and returned by value.
\code
// assuming tok's value_type is QStringView, then...
auto tok = QStringTokenizer{~~~};
// ... rac1 is a QVector:
auto rac1 = tok.toContainer();
// ... rac2 is std::pmr::vector<QStringView>:
auto rac2 = tok.toContainer<std::pmr::vector<QStringView>>();
auto rac3 = QVarLengthArray<QStringView, 12>{};
// appends the token sequence produced by tok to rac3
// and returns a reference to rac3 (which we ignore here):
tok.toContainer(rac3);
\endcode
This gives you maximum flexibility in how you want the sequence to
be stored.
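The "named container is filled and returned by reference, temporary
container is returned by value" behaviour can be obtained with a single
forwarding-reference parameter. The following standard-library sketch is an
assumption about how such an interface can be written, not a copy of the
toContainer() implementation; \c{collectTokens} is a hypothetical free
function.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical analogue of toContainer(): Container is deduced as T& for
// named containers (reference returned) and defaults to a vector otherwise
// (returned by value).
template <typename Container = std::vector<std::string_view>>
Container collectTokens(std::string_view text, char sep, Container &&c = {})
{
    for (;;) {
        const std::size_t pos = text.find(sep);
        c.push_back(text.substr(0, pos));   // appends, keeps existing items
        if (pos == std::string_view::npos)
            break;
        text.remove_prefix(pos + 1);
    }
    return std::forward<Container>(c);
}
```

Calling \c{collectTokens(text, sep)} yields a fresh vector, while
\c{collectTokens(text, sep, named)} appends to \c{named} and returns a
reference to it. In both cases the stored views refer to \c{text}, which
must outlive the container.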
*/
/*!
\fn QStringTokenizer::toContainer(Container &&c) const &&
\overload
In addition to the constraints on the lvalue-this overload, this
rvalue-this overload is only available when this QStringTokenizer
does not store the haystack internally, because otherwise the call
could create a container full of dangling references:
\code
auto tokens = QStringTokenizer{widget.text(), u','}.toContainer();
// ERROR: cannot call toContainer() on rvalue
// 'tokens' references the data of the copy of widget.text()
// stored inside the QStringTokenizer, which has since been deleted
\endcode
To fix, store the QStringTokenizer in a named variable:
\code
auto tokenizer = QStringTokenizer{widget.text(), u','};
auto tokens = tokenizer.toContainer();
// OK: the copy of widget.text() stored in 'tokenizer' keeps the data
// referenced by 'tokens' alive.
\endcode
You can force this function into existence by passing a view instead:
\code
func(QStringTokenizer{QStringView{widget.text()}, u','}.toContainer());
// OK: compiler keeps widget.text() around until after func() has executed
\endcode
*/
/*!
\fn qTokenize(Haystack &&haystack, Needle &&needle, Flags...flags)
\relates QStringTokenizer
Factory function for QStringTokenizer. You can use this function
if your compiler doesn't yet support C++17 Class Template
Argument Deduction (CTAD), but we recommend direct use of
QStringTokenizer with CTAD instead.
*/
/****************************************************************************
** MIT License
**
** Copyright (C) 2020-2021 Klarälvdalens Datakonsult AB, a KDAB Group company, info@kdab.com, author Marc Mutz <marc.mutz@kdab.com>
**
** This file is part of KDToolBox (https://github.com/KDAB/KDToolBox).
**
** Permission is hereby granted, free of charge, to any person obtaining a copy
** of this software and associated documentation files (the "Software"), to deal
** in the Software without restriction, including without limitation the rights
** to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
** copies of the Software, and to permit persons to whom the Software is
** furnished to do so, subject to the following conditions:
**
** The above copyright notice and this permission notice (including the next paragraph)
** shall be included in all copies or substantial portions of the Software.
**
** THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
** IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
** FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
** AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
** LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
** ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
** DEALINGS IN THE SOFTWARE.
****************************************************************************/
#ifndef QSTRINGTOKENIZER_H
#define QSTRINGTOKENIZER_H
#include <QtCore/qnamespace.h>
QT_BEGIN_NAMESPACE
template <typename, typename> class QStringBuilder;
template <typename> class QVector;
QT_END_NAMESPACE
#if defined(Q_QDOC) || (defined(__cpp_range_based_for) && __cpp_range_based_for >= 201603)
# define Q_STRINGTOKENIZER_USE_SENTINEL
#endif
class QStringTokenizerBaseBase
{
protected:
    ~QStringTokenizerBaseBase() = default;
    Q_DECL_CONSTEXPR QStringTokenizerBaseBase(Qt::SplitBehavior sb, Qt::CaseSensitivity cs) noexcept
        : m_sb{sb}, m_cs{cs} {}
    struct tokenizer_state {
        qsizetype start, end, extra;
        friend constexpr bool operator==(tokenizer_state lhs, tokenizer_state rhs) noexcept
        { return lhs.start == rhs.start && lhs.end == rhs.end && lhs.extra == rhs.extra; }
        friend constexpr bool operator!=(tokenizer_state lhs, tokenizer_state rhs) noexcept
        { return !operator==(lhs, rhs); }
    };
    Qt::SplitBehavior m_sb;
    Qt::CaseSensitivity m_cs;
};
template <typename Haystack, typename Needle>
class QStringTokenizerBase : protected QStringTokenizerBaseBase
{
    struct next_result {
        Haystack value;
        bool ok;
        tokenizer_state state;
    };
    inline next_result next(tokenizer_state state) const noexcept;
    inline next_result toFront() const noexcept { return next({}); }
public:
    constexpr explicit QStringTokenizerBase(Haystack haystack, Needle needle, Qt::SplitBehavior sb, Qt::CaseSensitivity cs) noexcept
        : QStringTokenizerBaseBase{sb, cs}, m_haystack{haystack}, m_needle{needle} {}
    class iterator;
    friend class iterator;
#ifdef Q_STRINGTOKENIZER_USE_SENTINEL
    class sentinel {
        friend constexpr bool operator==(sentinel, sentinel) noexcept { return true; }
        friend constexpr bool operator!=(sentinel, sentinel) noexcept { return false; }
    };
#else
    using sentinel = iterator;
#endif
    class iterator {
        const QStringTokenizerBase *tokenizer;
        next_result current;
        friend class QStringTokenizerBase;
        explicit iterator(const QStringTokenizerBase &t) noexcept
            : tokenizer{&t}, current{t.toFront()} {}
    public:
        using difference_type = qsizetype;
        using value_type = Haystack;
        using pointer = const value_type*;
        using reference = const value_type&;
        using iterator_category = std::forward_iterator_tag;
        iterator() noexcept = default;
        // violates std::forward_iterator (returns a reference into the iterator)
        Q_REQUIRED_RESULT constexpr const Haystack* operator->() const { return Q_ASSERT(current.ok), &current.value; }
        Q_REQUIRED_RESULT constexpr const Haystack& operator*() const { return *operator->(); }
        iterator& operator++() { advance(); return *this; }
        iterator operator++(int) { auto tmp = *this; advance(); return tmp; }
        friend constexpr bool operator==(const iterator &lhs, const iterator &rhs) noexcept
        { return lhs.current.ok == rhs.current.ok && (!lhs.current.ok || (Q_ASSERT(lhs.tokenizer == rhs.tokenizer), lhs.current.state == rhs.current.state)); }
        friend constexpr bool operator!=(const iterator &lhs, const iterator &rhs) noexcept
        { return !operator==(lhs, rhs); }
#ifdef Q_STRINGTOKENIZER_USE_SENTINEL
        friend constexpr bool operator==(const iterator &lhs, sentinel) noexcept
        { return !lhs.current.ok; }
        friend constexpr bool operator!=(const iterator &lhs, sentinel) noexcept
        { return !operator==(lhs, sentinel{}); }
        friend constexpr bool operator==(sentinel, const iterator &rhs) noexcept
        { return !rhs.current.ok; }
        friend constexpr bool operator!=(sentinel, const iterator &rhs) noexcept
        { return !operator==(sentinel{}, rhs); }
#endif
    private:
        void advance() {
            Q_ASSERT(current.ok);
            current = tokenizer->next(current.state);
        }
    };
    using const_iterator = iterator;
    using size_type = std::size_t;
    using difference_type = typename iterator::difference_type;
    using value_type = typename iterator::value_type;
    using pointer = typename iterator::pointer;
    using const_pointer = pointer;
    using reference = typename iterator::reference;
    using const_reference = reference;
    Q_REQUIRED_RESULT iterator begin() const noexcept { return iterator{*this}; }
    Q_REQUIRED_RESULT iterator cbegin() const noexcept { return begin(); }
    template <bool = std::is_same<iterator, sentinel>::value> // ODR protection
    Q_REQUIRED_RESULT constexpr sentinel end() const noexcept { return {}; }
    template <bool = std::is_same<iterator, sentinel>::value> // ODR protection
    Q_REQUIRED_RESULT constexpr sentinel cend() const noexcept { return {}; }
private:
    Haystack m_haystack;
    Needle m_needle;
};
#include <QtCore/qstringview.h>
namespace QtPrivate {
namespace Tok {
Q_DECL_CONSTEXPR qsizetype size(QChar) noexcept { return 1; }
template <typename String>
constexpr qsizetype size(const String &s) noexcept { return static_cast<qsizetype>(s.size()); }
template <typename String> struct ViewForImpl {};
template <> struct ViewForImpl<QStringView> { using type = QStringView; };
template <> struct ViewForImpl<QLatin1String> { using type = QLatin1String; };
template <> struct ViewForImpl<QChar> { using type = QChar; };
template <> struct ViewForImpl<QString> : ViewForImpl<QStringView> {};
template <> struct ViewForImpl<QStringRef> : ViewForImpl<QStringView> {};
template <> struct ViewForImpl<QLatin1Char> : ViewForImpl<QChar> {};
template <> struct ViewForImpl<char16_t> : ViewForImpl<QChar> {};
template <> struct ViewForImpl<char16_t*> : ViewForImpl<QStringView> {};
template <> struct ViewForImpl<const char16_t*> : ViewForImpl<QStringView> {};
#if QT_VERSION >= QT_VERSION_CHECK(5, 15, 0)
template <typename LHS, typename RHS>
struct ViewForImpl<QStringBuilder<LHS, RHS>> : ViewForImpl<typename QStringBuilder<LHS,RHS>::ConvertTo> {};
#endif
template <typename Char, typename...Args>
struct ViewForImpl<std::basic_string<Char, Args...>> : ViewForImpl<Char*> {};
#ifdef __cpp_lib_string_view
template <typename Char, typename...Args>
struct ViewForImpl<std::basic_string_view<Char, Args...>> : ViewForImpl<Char*> {};
#endif
// This metafunction maps a StringLike to a View (currently, QChar,
// QStringView, QLatin1String). This is what QStringTokenizerBase
// operates on. QStringTokenizer adds pinning to keep rvalues alive
// for the duration of the algorithm.
template <typename String>
using ViewFor = typename ViewForImpl<typename std::decay<String>::type>::type;
// Pinning:
// rvalues of owning string types need to be moved into QStringTokenizer
// to keep them alive for the lifetime of the tokenizer. For lvalues, we
// assume the user takes care of that.
// default: don't pin anything (characters are pinned implicitly)
template <typename String>
struct PinForImpl { using type = ViewFor<String>; };
// rvalue QString -> QString
template <>
struct PinForImpl<QString> { using type = QString; };
// rvalue std::basic_string -> basic_string
template <typename Char, typename...Args>
struct PinForImpl<std::basic_string<Char, Args...>>
{ using type = std::basic_string<Char, Args...>; };
// rvalue QStringBuilder -> pin as the nested ConvertTo type
template <typename LHS, typename RHS>
struct PinForImpl<QStringBuilder<LHS, RHS>>
: PinForImpl<typename QStringBuilder<LHS, RHS>::ConvertTo> {};
template <typename StringLike>
using PinFor = typename PinForImpl<typename std::remove_cv<StringLike>::type>::type;
template <typename T> struct is_owning_string_type : std::false_type {};
template <> struct is_owning_string_type<QString> : std::true_type {};
template <typename...Args> struct is_owning_string_type<std::basic_string<Args...>> : std::true_type {};
// unpinned
template <typename T, bool pinned = is_owning_string_type<T>::value>
struct Pinning
{
    // this is the storage for non-pinned types - no storage
    constexpr Pinning(const T&) noexcept {}
    // Since we don't store anything, the view() method needs to be
    // given something it can return.
    constexpr T view(T t) const noexcept { return t; }
};
// pinned
template <typename T>
struct Pinning<T, true>
{
    T m_string;
    // specialisation for owning string types (QString, std::u16string):
    // stores the string:
    constexpr Pinning(T &&s) noexcept : m_string{std::move(s)} {}
    // ... and thus view() uses that instead of the argument passed in:
    constexpr QStringView view(const T&) const noexcept { return m_string; }
};
// NeedlePinning and HaystackPinning are there to distinguish them as
// base classes of QStringTokenizer. We use inheritance to reap the
// empty base class optimization.
template <typename T>
struct NeedlePinning : Pinning<T>
{
    using Pinning<T>::Pinning;
    template <typename Arg>
    constexpr auto needleView(Arg &&a) const noexcept
        -> decltype(this->view(std::forward<Arg>(a)))
    { return this->view(std::forward<Arg>(a)); }
};
template <typename T>
struct HaystackPinning : Pinning<T>
{
    using Pinning<T>::Pinning;
    template <typename Arg>
    constexpr auto haystackView(Arg &&a) const noexcept
        -> decltype(this->view(std::forward<Arg>(a)))
    { return this->view(std::forward<Arg>(a)); }
};
// The Base of a QStringTokenizer is QStringTokenizerBase for the views
// corresponding to the Haystack and Needle template arguments
//