Remove excessive whitespace from string [on hold]Remove useless whitespace from styled stringRemove last word...

Are there any modern advantages of a fire piston?

Why are the books in the Game of Thrones citadel library shelved spine inwards?

Can an insurance company drop you after receiving a bill and refusing to pay?

Why zero tolerance on nudity in space?

Is it a fallacy if someone claims they need an explanation for every word of your argument to the point where they don't understand common terms?

Caruana vs Carlsen game 10 (WCC) why not 18...Nxb6?

How can I install sudo without using su?

Does static make a difference for a const local variable?

Citing paywalled articles accessed via illegal web sharing

Early credit roll before the end of the film

Why do I have multiple (unassociated) temporal history tables?

Is there some relative to Dutch word "kijken" in German?

Lick explanation

Every character has a name

What is the most triangles you can make from a capital "H" and 3 straight lines?

Why did other German political parties disband so fast when Hitler was appointed chancellor?

Word or phrase for showing great skill at something without formal training in it

Why would the Pakistan airspace closure cancel flights not headed to Pakistan itself?

How would an AI self awareness kill switch work?

Program that converts a number to a letter of the alphabet

We are very unlucky in my court

Explain the objections to these measures against human trafficking

Injecting creativity into a cookbook

Strange blocking on readable secondary after reboot



Remove excessive whitespace from string [on hold]


Remove useless whitespace from styled stringRemove last word from stringEffective way to remove white spaces from stringRemove (not extract) timestamp from HTML stringRemove unwanted characters from a stringRemove duplicates from string without additional bufferRemove a query string parameter from a URLRemove duplications from a Java StringRemove duplicate chars from StringC++ program to remove duplicate characters from string













3












$begingroup$


Is that standards-compliant?



#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>

std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };
auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type const &ch) {
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
}
)
};
result.erase(end, std::end(result));
return result;
}

int main()
{
char const *foo{ "Hello World! " };
std::cout << '"' << remove_excessive_ws(foo) << ""n";
}


I'm unsure whether I should be accessing (&ch)[1]. I think it should be legal, but I am not sure. After all, std::remove_if() could copy the character and pass it to the function, so the pointer (&ch) + 1 might not be valid.










share|improve this question











$endgroup$



put on hold as off-topic by user673679, papagaga, t3chb0t, ferada, Zeta yesterday


This question appears to be off-topic. The users who voted to close gave this specific reason:


  • "Code not implemented or not working as intended: Code Review is a community where programmers peer-review your working code to address issues such as security, maintainability, performance, and scalability. We require that the code be working correctly, to the best of the author's knowledge, before proceeding with a review." – user673679, papagaga, t3chb0t, ferada

If this question can be reworded to fit the rules in the help center, please edit the question.
















  • $begingroup$
    It's not standard (but you should have asked this question on stack overflow instead, code review is for working code). std:string is not required to end with a null character and dereferencing the end iterator is undefined behavior; if your input ends with a space, it's precisely what you'll end up doing.
    $endgroup$
    – papagaga
    Feb 26 at 8:25










  • $begingroup$
    I don't think std::remove_if() is allowed to make a copy of each element - if it did, it couldn't (sensibly) work with containers of move-only values.
    $endgroup$
    – Toby Speight
    2 days ago












  • $begingroup$
    @TobySpeight Yay, that's reasonable! :)
    $endgroup$
    – Swordfish
    yesterday






  • 1




    $begingroup$
    Please clarify, does or doesn't the current code work as intended?
    $endgroup$
    – Mast
    yesterday
















3












$begingroup$


Is that standards-compliant?



#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>

std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };
auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type const &ch) {
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
}
)
};
result.erase(end, std::end(result));
return result;
}

int main()
{
char const *foo{ "Hello World! " };
std::cout << '"' << remove_excessive_ws(foo) << ""n";
}


I'm unsure whether I should be accessing (&ch)[1]. I think it should be legal, but I am not sure. After all, std::remove_if() could copy the character and pass it to the function, so the pointer (&ch) + 1 might not be valid.










share|improve this question











$endgroup$



put on hold as off-topic by user673679, papagaga, t3chb0t, ferada, Zeta yesterday


This question appears to be off-topic. The users who voted to close gave this specific reason:


  • "Code not implemented or not working as intended: Code Review is a community where programmers peer-review your working code to address issues such as security, maintainability, performance, and scalability. We require that the code be working correctly, to the best of the author's knowledge, before proceeding with a review." – user673679, papagaga, t3chb0t, ferada

If this question can be reworded to fit the rules in the help center, please edit the question.
















  • $begingroup$
    It's not standard (but you should have asked this question on stack overflow instead, code review is for working code). std:string is not required to end with a null character and dereferencing the end iterator is undefined behavior; if your input ends with a space, it's precisely what you'll end up doing.
    $endgroup$
    – papagaga
    Feb 26 at 8:25










  • $begingroup$
    I don't think std::remove_if() is allowed to make a copy of each element - if it did, it couldn't (sensibly) work with containers of move-only values.
    $endgroup$
    – Toby Speight
    2 days ago












  • $begingroup$
    @TobySpeight Yay, that's reasonable! :)
    $endgroup$
    – Swordfish
    yesterday






  • 1




    $begingroup$
    Please clarify, does or doesn't the current code work as intended?
    $endgroup$
    – Mast
    yesterday














3












3








3





$begingroup$


Is that standards-compliant?



#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>

std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };
auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type const &ch) {
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
}
)
};
result.erase(end, std::end(result));
return result;
}

int main()
{
char const *foo{ "Hello World! " };
std::cout << '"' << remove_excessive_ws(foo) << ""n";
}


I'm unsure whether I should be accessing (&ch)[1]. I think it should be legal, but I am not sure. After all, std::remove_if() could copy the character and pass it to the function, so the pointer (&ch) + 1 might not be valid.










share|improve this question











$endgroup$




Is that standards-compliant?



#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>

std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };
auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type const &ch) {
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
}
)
};
result.erase(end, std::end(result));
return result;
}

int main()
{
char const *foo{ "Hello World! " };
std::cout << '"' << remove_excessive_ws(foo) << ""n";
}


I'm unsure whether I should be accessing (&ch)[1]. I think it should be legal, but I am not sure. After all, std::remove_if() could copy the character and pass it to the function, so the pointer (&ch) + 1 might not be valid.







c++ strings






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited 2 days ago









Toby Speight

25.2k741116




25.2k741116










asked Feb 26 at 7:39









SwordfishSwordfish

1669




1669




put on hold as off-topic by user673679, papagaga, t3chb0t, ferada, Zeta yesterday


This question appears to be off-topic. The users who voted to close gave this specific reason:


  • "Code not implemented or not working as intended: Code Review is a community where programmers peer-review your working code to address issues such as security, maintainability, performance, and scalability. We require that the code be working correctly, to the best of the author's knowledge, before proceeding with a review." – user673679, papagaga, t3chb0t, ferada

If this question can be reworded to fit the rules in the help center, please edit the question.







put on hold as off-topic by user673679, papagaga, t3chb0t, ferada, Zeta yesterday


This question appears to be off-topic. The users who voted to close gave this specific reason:


  • "Code not implemented or not working as intended: Code Review is a community where programmers peer-review your working code to address issues such as security, maintainability, performance, and scalability. We require that the code be working correctly, to the best of the author's knowledge, before proceeding with a review." – user673679, papagaga, t3chb0t, ferada

If this question can be reworded to fit the rules in the help center, please edit the question.












  • $begingroup$
    It's not standard (but you should have asked this question on stack overflow instead, code review is for working code). std:string is not required to end with a null character and dereferencing the end iterator is undefined behavior; if your input ends with a space, it's precisely what you'll end up doing.
    $endgroup$
    – papagaga
    Feb 26 at 8:25










  • $begingroup$
    I don't think std::remove_if() is allowed to make a copy of each element - if it did, it couldn't (sensibly) work with containers of move-only values.
    $endgroup$
    – Toby Speight
    2 days ago












  • $begingroup$
    @TobySpeight Yay, that's reasonable! :)
    $endgroup$
    – Swordfish
    yesterday






  • 1




    $begingroup$
    Please clarify, does or doesn't the current code work as intended?
    $endgroup$
    – Mast
    yesterday


















  • $begingroup$
    It's not standard (but you should have asked this question on stack overflow instead, code review is for working code). std:string is not required to end with a null character and dereferencing the end iterator is undefined behavior; if your input ends with a space, it's precisely what you'll end up doing.
    $endgroup$
    – papagaga
    Feb 26 at 8:25










  • $begingroup$
    I don't think std::remove_if() is allowed to make a copy of each element - if it did, it couldn't (sensibly) work with containers of move-only values.
    $endgroup$
    – Toby Speight
    2 days ago












  • $begingroup$
    @TobySpeight Yay, that's reasonable! :)
    $endgroup$
    – Swordfish
    yesterday






  • 1




    $begingroup$
    Please clarify, does or doesn't the current code work as intended?
    $endgroup$
    – Mast
    yesterday
















$begingroup$
It's not standard (but you should have asked this question on stack overflow instead, code review is for working code). std:string is not required to end with a null character and dereferencing the end iterator is undefined behavior; if your input ends with a space, it's precisely what you'll end up doing.
$endgroup$
– papagaga
Feb 26 at 8:25




$begingroup$
It's not standard (but you should have asked this question on stack overflow instead, code review is for working code). std:string is not required to end with a null character and dereferencing the end iterator is undefined behavior; if your input ends with a space, it's precisely what you'll end up doing.
$endgroup$
– papagaga
Feb 26 at 8:25












$begingroup$
I don't think std::remove_if() is allowed to make a copy of each element - if it did, it couldn't (sensibly) work with containers of move-only values.
$endgroup$
– Toby Speight
2 days ago






$begingroup$
I don't think std::remove_if() is allowed to make a copy of each element - if it did, it couldn't (sensibly) work with containers of move-only values.
$endgroup$
– Toby Speight
2 days ago














$begingroup$
@TobySpeight Yay, that's reasonable! :)
$endgroup$
– Swordfish
yesterday




$begingroup$
@TobySpeight Yay, that's reasonable! :)
$endgroup$
– Swordfish
yesterday




1




1




$begingroup$
Please clarify, does or doesn't the current code work as intended?
$endgroup$
– Mast
yesterday




$begingroup$
Please clarify, does or doesn't the current code work as intended?
$endgroup$
– Mast
yesterday










3 Answers
3






active

oldest

votes


















8












$begingroup$

Define the problem



It's not clear from the description what's considered "excessive" whitespace. From experimenting, it seems that the idea is to collapse multiple whitespace to a single whitespace character, except at the end of the string, where whitespace is to be completely removed. (Whitespace at the beginning of string seems to treated the same as inner whitespace). It's also not clear which whitespace character should be kept - the example code keeps the last one, but is that a requirement, or just an implementation choice?



Missing include



std::begin() and std::end() are declared in <iterator>. However, there's no reason not to use the begin() and end() member functions of std::string here, as we're not operating on generic values.



When an argument is to be copied, pass by value



We don't need to copy str into result:



std::string remove_excessive_ws(std::string str)


Bug



Like all the <cctype> functions, std::isspace() requires that its argument be either EOF or representable as unsigned char. Converting a (possibly signed) char direct to unsigned int can sign-extend to an out-of-range value. We need to convert char to unsigned char before widening to unsigned int:



static_cast<unsigned char>(ch)


Bug



Prior to C++11, accessing the character after the end of the string is undefined behaviour (C++11 requires an extra null to follow the string data). Thankfully, it's easy to avoid this bug by simply remembering whether the last character seen was a space.



Here's a C++11 version:



#include <algorithm>
#include <cctype>
#include <string>
#include <utility>

std::string remove_excessive_ws(std::string str)
{
bool seen_space = false;
auto end{ std::remove_if(str.begin(), str.end(),
[&seen_space](unsigned char ch) {
bool is_space = std::isspace(ch);
std::swap(seen_space, is_space);
return seen_space && is_space;
})};
// adjust end to remove end whitespace
if (end != str.begin() && std::isspace(static_cast<unsigned char>(end[-1]))) {
--end;
}
str.erase(end, str.end());
return str;
}


We might want to move seen_space into the lambda expression in later C++ versions that allow that.



This also is more readable, as we can perform the widening to unsigned int when calling the lambda, rather than having to write a cast.



Style-wise, I'd normally prefer to name the lambda, and keep the erase-remove call on a single line so that the idiom is obvious:



// Assuming C++17 now
std::string remove_excessive_ws(std::string s)
{
auto is_doubled_space =
[seen_space=false](unsigned char c) mutable {
return std::exchange(seen_space, std::isspace(c))
&& seen_space;
};
s.erase(std::remove_if(s.begin(), s.end(), is_doubled_space), s.end());
// remove trailing whitespace
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}
// convert all whitespace into ordinary space character
std::replace_if(s.begin(), s.end(),
[](unsigned char c) { return std::isspace(c); }, ' ');
return s;
}





share|improve this answer











$endgroup$













  • $begingroup$
    Personally I try really hard to avoid mutating lambdas. Unfortunately for this problem this would preclude using std::remove_if. Instead, some kind of reduction or prefix scan would need to be used, and is probably not as efficient.
    $endgroup$
    – Konrad Rudolph
    Feb 26 at 17:07












  • $begingroup$
    Yes, that's a good point @Konrad - mutable lambdas are harder to reason about. It might be that a plain old for (auto it = str.begin(); it != str.end(); ++it) loop (followed by erase) may actually be the clearest code. But it's good to see a few options to choose from!
    $endgroup$
    – Toby Speight
    2 days ago



















1












$begingroup$

std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };


As Toby mentioned, if you plan on copying and mutating the copy locally, you should pass the parameter str by value. It should also be noted that result will have the same capacity as str and won't shrink to fit automatically (or take advantage of SSO if allocated).





std::isspace(static_cast<unsigned>(ch))


Be aware that std::isspace removes spaces (' '), whitespaces ('n', 'v', 'f', 'r'), and tabs ('t').



You should cast to unsigned char.





(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '')



Is that standards-compliant?




Using the built-in subscript operator on a pointer is standards compliant. From the C++17 standard (n4659), Postfix Expressions § 8.2.1 Subscripting:




A postfix expression followed by an expression in square brackets is a postfix expression. One of the expressions shall be a glvalue of type “array of T” or a prvalue of type “pointer to T” and the other shall be a prvalue of unscoped enumeration or integral type.




When accessing memory out of bounds via the built-in subscript operator, the behavior is undefined. A well-defined approach would be to track the next index and access the next element using std::string::operator[] (element at size() returns CharT{}). std::string is not a null-terminated sequence and considers the null character (CharT{}) to be a valid character within a sequence.



using namespace std::string_literals;
std::string str = "ab"s;
std::cout << str << 'n'; // prints "ab"




For a standard library solution on removing duplicates, I would simply pass the predicate to std::unique. No pointer arithmetic is necessary. Just pass it a binary predicate that checks if both characters are whitespaces:



#include <algorithm>
#include <cctype>
#include <string>

std::string remove_excessive_ws(std::string s)
{
static auto const space_space =
[](unsigned char a, unsigned char b) {
return std::isspace(a) && std::isspace(b);
};

s.erase(std::unique(s.begin(), s.end(), space_space), s.end());

// trim final space
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}

return s;
}


Your function leaves a leading whitespace if one or more exists. Is this intended? Should there be a common character (single space) to merge the different characters std::isspace catches? If the ultimate goal was to trim all outer whitespace and join non-whitespace tokens with single spaces, I would use abseil's absl::StrSplit() and absl::StrJoin(). The resulting string would either take advantage of SSO if small enough or use a more appropriate capacity.



// remove_excess_whitespace
//
// Trims leading and trailing space, whitespace, and tab characters
// such that the resulting string is single space separated.
std::string remove_excess_whitespace(absl::string_view sv) {
return absl::StrJoin(absl::StrSplit(sv, ' ', absl::SkipWhitespace{}), " ");
}





share|improve this answer











$endgroup$













  • $begingroup$
    Be aware that std::isspace removes spaces (' '), whitespaces ('n', 'v', 'f', 'r'), and tabs ('t') – that's intended.
    $endgroup$
    – Swordfish
    2 days ago










  • $begingroup$
    As for std::string, this behavior is well-defined – yes, but what about remove_if(), is it guaranteed that it won't copy?
    $endgroup$
    – Swordfish
    2 days ago










  • $begingroup$
    std::remove_if copies individual characters that do not satisfy the predicate. The well-defined behavior I was talking about is accessing str[str.size()]. So, instead of checking &ch[1] == '', keep an index to the next variable and use str[index] (or better, use std::unique and don't worry about subscript access for adjacent elements).
    $endgroup$
    – Snowhawk
    2 days ago












  • $begingroup$
    I was talking about remove_if() probably copying the element before passing it to the predicate.
    $endgroup$
    – Swordfish
    2 days ago










  • $begingroup$
    Compilers are pretty good at inlining arguments and lambda predicates. I would be concerned with unnecessary heap allocations.
    $endgroup$
    – Snowhawk
    2 days ago





















0












$begingroup$


It's about accessing (&ch)[1]. I think it should be legal, but I am not sure.




Yes, it is legal as long as you have null terminated string, i.e. the size-th character of the string is the null character. The reason is that std::string::end() returns an iterator that is one short of the element that holds the terminating null character when the string object does have a terminating null character. Hence, (&ch)[1] will not access anything beyond the terminating null character. You can verify that by printing the value of &ch as a debugging guide.



Here's a updated version of your posted code that prints additional debugging info.



#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>

std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };

std::for_each(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
});

std::cout << "==========================n";

auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})
};

result.erase(end, std::end(result));
return result;
}

int main()
{
char const *foo{ "Hello World! "};
std::string res = remove_excessive_ws(foo);
std::cout << "n"" << res << ""n";
}


Here's its output.



pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8
==========================
pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8

"Hello World!"




Response to OP's comment



std::string::end() returns std::string::iterator(), which is a LegacyRandomAccessIterator. Please see the Member types of std::string.



LegacyRandomAccessIterator satisfies the requirements of LegacyBidirectionalIterator.



LegacyBidirectionalIterator satisfies the requirments of LegacyForwardIterator.



ForwardIterator satisfies the requirements of LegacyInputIterator.



When a LegacyInputIterator is dereferenced, it evaluates to a reference. Now, you have to go back to std::string to see what that means. As you can expect, it is a reference and not a copy.



You can also use a non-const reference to be doubly sure that you get a reference and not a copy.



auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})





share|improve this answer











$endgroup$









  • 1




    $begingroup$
    You can verify that by [...] – Standard reference? Also, Where is written that find_if() won't hand me a copy of the char instead a reference to it?
    $endgroup$
    – Swordfish
    2 days ago






  • 1




    $begingroup$
    Whether it's legal to dereference str.end() depends on the standards version; only since C++11 has str.data() been required to have a null-terminator.
    $endgroup$
    – Toby Speight
    2 days ago










  • $begingroup$
    @TobySpeight, agree in principle. However, if the string object is a null terminated string, my answer is valid. I update the answer to make that clearer.
    $endgroup$
    – R Sahu
    yesterday


















3 Answers
3






active

oldest

votes








3 Answers
3






active

oldest

votes









active

oldest

votes






active

oldest

votes









8












$begingroup$

Define the problem



It's not clear from the description what's considered "excessive" whitespace. From experimenting, it seems that the idea is to collapse multiple whitespace to a single whitespace character, except at the end of the string, where whitespace is to be completely removed. (Whitespace at the beginning of string seems to treated the same as inner whitespace). It's also not clear which whitespace character should be kept - the example code keeps the last one, but is that a requirement, or just an implementation choice?



Missing include



std::begin() and std::end() are declared in <iterator>. However, there's no reason not to use the begin() and end() member functions of std::string here, as we're not operating on generic values.



When an argument is to be copied, pass by value



We don't need to copy str into result:



std::string remove_excessive_ws(std::string str)


Bug



Like all the <cctype> functions, std::isspace() requires that its argument be either EOF or representable as unsigned char. Converting a (possibly signed) char direct to unsigned int can sign-extend to an out-of-range value. We need to convert char to unsigned char before widening to unsigned int:



static_cast<unsigned char>(ch)


Bug



Prior to C++11, accessing the character after the end of the string is undefined behaviour (C++11 requires an extra null to follow the string data). Thankfully, it's easy to avoid this bug by simply remembering whether the last character seen was a space.



Here's a C++11 version:



#include <algorithm>
#include <cctype>
#include <string>
#include <utility>

std::string remove_excessive_ws(std::string str)
{
bool seen_space = false;
auto end{ std::remove_if(str.begin(), str.end(),
[&seen_space](unsigned char ch) {
bool is_space = std::isspace(ch);
std::swap(seen_space, is_space);
return seen_space && is_space;
})};
// adjust end to remove end whitespace
if (end != str.begin() && std::isspace(static_cast<unsigned char>(end[-1]))) {
--end;
}
str.erase(end, str.end());
return str;
}


We might want to move seen_space into the lambda expression in later C++ versions that allow that.



This also is more readable, as we can perform the widening to unsigned int when calling the lambda, rather than having to write a cast.



Style-wise, I'd normally prefer to name the lambda, and keep the erase-remove call on a single line so that the idiom is obvious:



// Assuming C++17 now
std::string remove_excessive_ws(std::string s)
{
auto is_doubled_space =
[seen_space=false](unsigned char c) mutable {
return std::exchange(seen_space, std::isspace(c))
&& seen_space;
};
s.erase(std::remove_if(s.begin(), s.end(), is_doubled_space), s.end());
// remove trailing whitespace
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}
// convert all whitespace into ordinary space character
std::replace_if(s.begin(), s.end(),
[](unsigned char c) { return std::isspace(c); }, ' ');
return s;
}





share|improve this answer











$endgroup$













  • $begingroup$
    Personally I try really hard to avoid mutating lambdas. Unfortunately for this problem this would preclude using std::remove_if. Instead, some kind of reduction or prefix scan would need to be used, and is probably not as efficient.
    $endgroup$
    – Konrad Rudolph
    Feb 26 at 17:07












  • $begingroup$
    Yes, that's a good point @Konrad - mutable lambdas are harder to reason about. It might be that a plain old for (auto it = str.begin(); it != str.end(); ++it) loop (followed by erase) may actually be the clearest code. But it's good to see a few options to choose from!
    $endgroup$
    – Toby Speight
    2 days ago
















8












$begingroup$

Define the problem



It's not clear from the description what's considered "excessive" whitespace. From experimenting, it seems that the idea is to collapse multiple whitespace to a single whitespace character, except at the end of the string, where whitespace is to be completely removed. (Whitespace at the beginning of string seems to treated the same as inner whitespace). It's also not clear which whitespace character should be kept - the example code keeps the last one, but is that a requirement, or just an implementation choice?



Missing include



std::begin() and std::end() are declared in <iterator>. However, there's no reason not to use the begin() and end() member functions of std::string here, as we're not operating on generic values.



When an argument is to be copied, pass by value



We don't need to copy str into result:



std::string remove_excessive_ws(std::string str)


Bug



Like all the <cctype> functions, std::isspace() requires that its argument be either EOF or representable as unsigned char. Converting a (possibly signed) char direct to unsigned int can sign-extend to an out-of-range value. We need to convert char to unsigned char before widening to unsigned int:



static_cast<unsigned char>(ch)


Bug



Prior to C++11, accessing the character after the end of the string is undefined behaviour (C++11 requires an extra null to follow the string data). Thankfully, it's easy to avoid this bug by simply remembering whether the last character seen was a space.



Here's a C++11 version:



#include <algorithm>
#include <cctype>
#include <string>
#include <utility>

std::string remove_excessive_ws(std::string str)
{
bool seen_space = false;
auto end{ std::remove_if(str.begin(), str.end(),
[&seen_space](unsigned char ch) {
bool is_space = std::isspace(ch);
std::swap(seen_space, is_space);
return seen_space && is_space;
})};
// adjust end to remove end whitespace
if (end != str.begin() && std::isspace(static_cast<unsigned char>(end[-1]))) {
--end;
}
str.erase(end, str.end());
return str;
}


We might want to move seen_space into the lambda expression in later C++ versions that allow that.



This also is more readable, as we can perform the widening to unsigned int when calling the lambda, rather than having to write a cast.



Style-wise, I'd normally prefer to name the lambda, and keep the erase-remove call on a single line so that the idiom is obvious:



// Assuming C++17 now
std::string remove_excessive_ws(std::string s)
{
auto is_doubled_space =
[seen_space=false](unsigned char c) mutable {
return std::exchange(seen_space, std::isspace(c))
&& seen_space;
};
s.erase(std::remove_if(s.begin(), s.end(), is_doubled_space), s.end());
// remove trailing whitespace
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}
// convert all whitespace into ordinary space character
std::replace_if(s.begin(), s.end(),
[](unsigned char c) { return std::isspace(c); }, ' ');
return s;
}





share|improve this answer











$endgroup$













  • $begingroup$
    Personally I try really hard to avoid mutating lambdas. Unfortunately for this problem this would preclude using std::remove_if. Instead, some kind of reduction or prefix scan would need to be used, and is probably not as efficient.
    $endgroup$
    – Konrad Rudolph
    Feb 26 at 17:07












  • $begingroup$
    Yes, that's a good point @Konrad - mutable lambdas are harder to reason about. It might be that a plain old for (auto it = str.begin(); it != str.end(); ++it) loop (followed by erase) may actually be the clearest code. But it's good to see a few options to choose from!
    $endgroup$
    – Toby Speight
    2 days ago














8












8








8





$begingroup$

Define the problem



It's not clear from the description what's considered "excessive" whitespace. From experimenting, it seems that the idea is to collapse multiple whitespace to a single whitespace character, except at the end of the string, where whitespace is to be completely removed. (Whitespace at the beginning of string seems to treated the same as inner whitespace). It's also not clear which whitespace character should be kept - the example code keeps the last one, but is that a requirement, or just an implementation choice?



Missing include



std::begin() and std::end() are declared in <iterator>. However, there's no reason not to use the begin() and end() member functions of std::string here, as we're not operating on generic values.



When an argument is to be copied, pass by value



We don't need to copy str into result:



std::string remove_excessive_ws(std::string str)


Bug



Like all the <cctype> functions, std::isspace() requires that its argument be either EOF or representable as unsigned char. Converting a (possibly signed) char direct to unsigned int can sign-extend to an out-of-range value. We need to convert char to unsigned char before widening to unsigned int:



static_cast<unsigned char>(ch)


Bug



Prior to C++11, accessing the character after the end of the string is undefined behaviour (C++11 requires an extra null to follow the string data). Thankfully, it's easy to avoid this bug by simply remembering whether the last character seen was a space.



Here's a C++11 version:



#include <algorithm>
#include <cctype>
#include <string>
#include <utility>

std::string remove_excessive_ws(std::string str)
{
bool seen_space = false;
auto end{ std::remove_if(str.begin(), str.end(),
[&seen_space](unsigned char ch) {
bool is_space = std::isspace(ch);
std::swap(seen_space, is_space);
return seen_space && is_space;
})};
// adjust end to remove end whitespace
if (end != str.begin() && std::isspace(static_cast<unsigned char>(end[-1]))) {
--end;
}
str.erase(end, str.end());
return str;
}


We might want to move seen_space into the lambda expression in later C++ versions that allow that.



This also is more readable, as we can perform the widening to unsigned int when calling the lambda, rather than having to write a cast.



Style-wise, I'd normally prefer to name the lambda, and keep the erase-remove call on a single line so that the idiom is obvious:



// Assuming C++17 now
std::string remove_excessive_ws(std::string s)
{
auto is_doubled_space =
[seen_space=false](unsigned char c) mutable {
return std::exchange(seen_space, std::isspace(c))
&& seen_space;
};
s.erase(std::remove_if(s.begin(), s.end(), is_doubled_space), s.end());
// remove trailing whitespace
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}
// convert all whitespace into ordinary space character
std::replace_if(s.begin(), s.end(),
[](unsigned char c) { return std::isspace(c); }, ' ');
return s;
}





share|improve this answer











$endgroup$



Define the problem



It's not clear from the description what's considered "excessive" whitespace. From experimenting, it seems that the idea is to collapse multiple whitespace to a single whitespace character, except at the end of the string, where whitespace is to be completely removed. (Whitespace at the beginning of string seems to treated the same as inner whitespace). It's also not clear which whitespace character should be kept - the example code keeps the last one, but is that a requirement, or just an implementation choice?



Missing include



std::begin() and std::end() are declared in <iterator>. However, there's no reason not to use the begin() and end() member functions of std::string here, as we're not operating on generic values.



When an argument is to be copied, pass by value



We don't need to copy str into result:



std::string remove_excessive_ws(std::string str)


Bug



Like all the <cctype> functions, std::isspace() requires that its argument be either EOF or representable as unsigned char. Converting a (possibly signed) char direct to unsigned int can sign-extend to an out-of-range value. We need to convert char to unsigned char before widening to unsigned int:



static_cast<unsigned char>(ch)


Bug



Prior to C++11, accessing the character after the end of the string is undefined behaviour (C++11 requires an extra null to follow the string data). Thankfully, it's easy to avoid this bug by simply remembering whether the last character seen was a space.



Here's a C++11 version:



#include <algorithm>
#include <cctype>
#include <string>
#include <utility>

std::string remove_excessive_ws(std::string str)
{
bool seen_space = false;
auto end{ std::remove_if(str.begin(), str.end(),
[&seen_space](unsigned char ch) {
bool is_space = std::isspace(ch);
std::swap(seen_space, is_space);
return seen_space && is_space;
})};
// adjust end to remove end whitespace
if (end != str.begin() && std::isspace(static_cast<unsigned char>(end[-1]))) {
--end;
}
str.erase(end, str.end());
return str;
}


We might want to move seen_space into the lambda expression in later C++ versions that allow that.



This also is more readable, as we can perform the widening to unsigned int when calling the lambda, rather than having to write a cast.



Style-wise, I'd normally prefer to name the lambda, and keep the erase-remove call on a single line so that the idiom is obvious:



// Assuming C++17 now
std::string remove_excessive_ws(std::string s)
{
auto is_doubled_space =
[seen_space=false](unsigned char c) mutable {
return std::exchange(seen_space, std::isspace(c))
&& seen_space;
};
s.erase(std::remove_if(s.begin(), s.end(), is_doubled_space), s.end());
// remove trailing whitespace
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}
// convert all whitespace into ordinary space character
std::replace_if(s.begin(), s.end(),
[](unsigned char c) { return std::isspace(c); }, ' ');
return s;
}






share|improve this answer














share|improve this answer



share|improve this answer








edited 2 days ago

























answered Feb 26 at 8:46









Toby SpeightToby Speight

25.2k741116




25.2k741116












  • $begingroup$
    Personally I try really hard to avoid mutating lambdas. Unfortunately for this problem this would preclude using std::remove_if. Instead, some kind of reduction or prefix scan would need to be used, and is probably not as efficient.
    $endgroup$
    – Konrad Rudolph
    Feb 26 at 17:07












  • $begingroup$
    Yes, that's a good point @Konrad - mutable lambdas are harder to reason about. It might be that a plain old for (auto it = str.begin(); it != str.end(); ++it) loop (followed by erase) may actually be the clearest code. But it's good to see a few options to choose from!
    $endgroup$
    – Toby Speight
    2 days ago


















  • $begingroup$
    Personally I try really hard to avoid mutating lambdas. Unfortunately for this problem this would preclude using std::remove_if. Instead, some kind of reduction or prefix scan would need to be used, and is probably not as efficient.
    $endgroup$
    – Konrad Rudolph
    Feb 26 at 17:07












  • $begingroup$
    Yes, that's a good point @Konrad - mutable lambdas are harder to reason about. It might be that a plain old for (auto it = str.begin(); it != str.end(); ++it) loop (followed by erase) may actually be the clearest code. But it's good to see a few options to choose from!
    $endgroup$
    – Toby Speight
    2 days ago
















$begingroup$
Personally I try really hard to avoid mutating lambdas. Unfortunately for this problem this would preclude using std::remove_if. Instead, some kind of reduction or prefix scan would need to be used, and is probably not as efficient.
$endgroup$
– Konrad Rudolph
Feb 26 at 17:07






$begingroup$
Personally I try really hard to avoid mutating lambdas. Unfortunately for this problem this would preclude using std::remove_if. Instead, some kind of reduction or prefix scan would need to be used, and is probably not as efficient.
$endgroup$
– Konrad Rudolph
Feb 26 at 17:07














$begingroup$
Yes, that's a good point @Konrad - mutable lambdas are harder to reason about. It might be that a plain old for (auto it = str.begin(); it != str.end(); ++it) loop (followed by erase) may actually be the clearest code. But it's good to see a few options to choose from!
$endgroup$
– Toby Speight
2 days ago




$begingroup$
Yes, that's a good point @Konrad - mutable lambdas are harder to reason about. It might be that a plain old for (auto it = str.begin(); it != str.end(); ++it) loop (followed by erase) may actually be the clearest code. But it's good to see a few options to choose from!
$endgroup$
– Toby Speight
2 days ago













1












$begingroup$

std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };


As Toby mentioned, if you plan on copying and mutating the copy locally, you should pass the parameter str by value. It should also be noted that result will have the same capacity as str and won't shrink to fit automatically (or take advantage of SSO if allocated).





std::isspace(static_cast<unsigned>(ch))


Be aware that std::isspace removes spaces (' '), whitespaces ('n', 'v', 'f', 'r'), and tabs ('t').



You should cast to unsigned char.





(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '')



Is that standards-compliant?




Using the built-in subscript operator on a pointer is standards compliant. From the C++17 standard (n4659), Postfix Expressions § 8.2.1 Subscripting:




A postfix expression followed by an expression in square brackets is a postfix expression. One of the expressions shall be a glvalue of type “array of T” or a prvalue of type “pointer to T” and the other shall be a prvalue of unscoped enumeration or integral type.




When accessing memory out of bounds via the built-in subscript operator, the behavior is undefined. A well-defined approach would be to track the next index and access the next element using std::string::operator[] (element at size() returns CharT{}). std::string is not a null-terminated sequence and considers the null character (CharT{}) to be a valid character within a sequence.



using namespace std::string_literals;
std::string str = "ab"s;
std::cout << str << 'n'; // prints "ab"




For a standard library solution on removing duplicates, I would simply pass the predicate to std::unique. No pointer arithmetic is necessary. Just pass it a binary predicate that checks if both characters are whitespaces:



#include <algorithm>
#include <cctype>
#include <string>

std::string remove_excessive_ws(std::string s)
{
static auto const space_space =
[](unsigned char a, unsigned char b) {
return std::isspace(a) && std::isspace(b);
};

s.erase(std::unique(s.begin(), s.end(), space_space), s.end());

// trim final space
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}

return s;
}


Your function leaves a leading whitespace if one or more exists. Is this intended? Should there be a common character (single space) to merge the different characters std::isspace catches? If the ultimate goal was to trim all outer whitespace and join non-whitespace tokens with single spaces, I would use abseil's absl::StrSplit() and absl::StrJoin(). The resulting string would either take advantage of SSO if small enough or use a more appropriate capacity.



// remove_excess_whitespace
//
// Trims leading and trailing space, whitespace, and tab characters
// such that the resulting string is single space separated.
std::string remove_excess_whitespace(absl::string_view sv) {
return absl::StrJoin(absl::StrSplit(sv, ' ', absl::SkipWhitespace{}), " ");
}





share|improve this answer











$endgroup$













  • $begingroup$
    Be aware that std::isspace removes spaces (' '), whitespaces ('n', 'v', 'f', 'r'), and tabs ('t') – that's intended.
    $endgroup$
    – Swordfish
    2 days ago










  • $begingroup$
    As for std::string, this behavior is well-defined – yes, but what about remove_if(), is it guaranteed that it won't copy?
    $endgroup$
    – Swordfish
    2 days ago










  • $begingroup$
    std::remove_if copies individual characters that do not satisfy the predicate. The well-defined behavior I was talking about is accessing str[str.size()]. So, instead of checking &ch[1] == '', keep an index to the next variable and use str[index] (or better, use std::unique and don't worry about subscript access for adjacent elements).
    $endgroup$
    – Snowhawk
    2 days ago












  • $begingroup$
    I was talking about remove_if() probably copying the element before passing it to the predicate.
    $endgroup$
    – Swordfish
    2 days ago










  • $begingroup$
    Compilers are pretty good at inlining arguments and lambda predicates. I would be concerned with unnecessary heap allocations.
    $endgroup$
    – Snowhawk
    2 days ago


















1












$begingroup$

std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };


As Toby mentioned, if you plan on copying and mutating the copy locally, you should pass the parameter str by value. It should also be noted that result will have the same capacity as str and won't shrink to fit automatically (or take advantage of SSO if allocated).





std::isspace(static_cast<unsigned>(ch))


Be aware that std::isspace removes spaces (' '), whitespaces ('n', 'v', 'f', 'r'), and tabs ('t').



You should cast to unsigned char.





(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '')



Is that standards-compliant?




Using the built-in subscript operator on a pointer is standards compliant. From the C++17 standard (n4659), Postfix Expressions § 8.2.1 Subscripting:




A postfix expression followed by an expression in square brackets is a postfix expression. One of the expressions shall be a glvalue of type “array of T” or a prvalue of type “pointer to T” and the other shall be a prvalue of unscoped enumeration or integral type.




When accessing memory out of bounds via the built-in subscript operator, the behavior is undefined. A well-defined approach would be to track the next index and access the next element using std::string::operator[] (element at size() returns CharT{}). std::string is not a null-terminated sequence and considers the null character (CharT{}) to be a valid character within a sequence.



using namespace std::string_literals;
std::string str = "ab"s;
std::cout << str << 'n'; // prints "ab"




For a standard library solution on removing duplicates, I would simply pass the predicate to std::unique. No pointer arithmetic is necessary. Just pass it a binary predicate that checks if both characters are whitespaces:



#include <algorithm>
#include <cctype>
#include <string>

std::string remove_excessive_ws(std::string s)
{
static auto const space_space =
[](unsigned char a, unsigned char b) {
return std::isspace(a) && std::isspace(b);
};

s.erase(std::unique(s.begin(), s.end(), space_space), s.end());

// trim final space
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}

return s;
}


Your function leaves a leading whitespace if one or more exists. Is this intended? Should there be a common character (single space) to merge the different characters std::isspace catches? If the ultimate goal was to trim all outer whitespace and join non-whitespace tokens with single spaces, I would use abseil's absl::StrSplit() and absl::StrJoin(). The resulting string would either take advantage of SSO if small enough or use a more appropriate capacity.



// remove_excess_whitespace
//
// Trims leading and trailing space, whitespace, and tab characters
// such that the resulting string is single space separated.
std::string remove_excess_whitespace(absl::string_view sv) {
return absl::StrJoin(absl::StrSplit(sv, ' ', absl::SkipWhitespace{}), " ");
}





share|improve this answer











$endgroup$













  • $begingroup$
    Be aware that std::isspace removes spaces (' '), whitespaces ('n', 'v', 'f', 'r'), and tabs ('t') – that's intended.
    $endgroup$
    – Swordfish
    2 days ago










  • $begingroup$
    As for std::string, this behavior is well-defined – yes, but what about remove_if(), is it guaranteed that it won't copy?
    $endgroup$
    – Swordfish
    2 days ago










  • $begingroup$
    std::remove_if copies individual characters that do not satisfy the predicate. The well-defined behavior I was talking about is accessing str[str.size()]. So, instead of checking &ch[1] == '', keep an index to the next variable and use str[index] (or better, use std::unique and don't worry about subscript access for adjacent elements).
    $endgroup$
    – Snowhawk
    2 days ago












  • $begingroup$
    I was talking about remove_if() probably copying the element before passing it to the predicate.
    $endgroup$
    – Swordfish
    2 days ago










  • $begingroup$
    Compilers are pretty good at inlining arguments and lambda predicates. I would be concerned with unnecessary heap allocations.
    $endgroup$
    – Snowhawk
    2 days ago
















1












1








1





$begingroup$

std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };


As Toby mentioned, if you plan on copying and mutating the copy locally, you should pass the parameter str by value. It should also be noted that result will have the same capacity as str and won't shrink to fit automatically (or take advantage of SSO if allocated).





std::isspace(static_cast<unsigned>(ch))


Be aware that std::isspace removes spaces (' '), whitespaces ('n', 'v', 'f', 'r'), and tabs ('t').



You should cast to unsigned char.





(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '')



Is that standards-compliant?




Using the built-in subscript operator on a pointer is standards compliant. From the C++17 standard (n4659), Postfix Expressions § 8.2.1 Subscripting:




A postfix expression followed by an expression in square brackets is a postfix expression. One of the expressions shall be a glvalue of type “array of T” or a prvalue of type “pointer to T” and the other shall be a prvalue of unscoped enumeration or integral type.




When accessing memory out of bounds via the built-in subscript operator, the behavior is undefined. A well-defined approach would be to track the next index and access the next element using std::string::operator[] (element at size() returns CharT{}). std::string is not a null-terminated sequence and considers the null character (CharT{}) to be a valid character within a sequence.



using namespace std::string_literals;
std::string str = "ab"s;
std::cout << str << 'n'; // prints "ab"




For a standard library solution on removing duplicates, I would simply pass the predicate to std::unique. No pointer arithmetic is necessary. Just pass it a binary predicate that checks if both characters are whitespaces:



#include <algorithm>
#include <cctype>
#include <string>

std::string remove_excessive_ws(std::string s)
{
static auto const space_space =
[](unsigned char a, unsigned char b) {
return std::isspace(a) && std::isspace(b);
};

s.erase(std::unique(s.begin(), s.end(), space_space), s.end());

// trim final space
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}

return s;
}


Your function leaves a leading whitespace if one or more exists. Is this intended? Should there be a common character (single space) to merge the different characters std::isspace catches? If the ultimate goal was to trim all outer whitespace and join non-whitespace tokens with single spaces, I would use abseil's absl::StrSplit() and absl::StrJoin(). The resulting string would either take advantage of SSO if small enough or use a more appropriate capacity.



// remove_excess_whitespace
//
// Trims leading and trailing space, whitespace, and tab characters
// such that the resulting string is single space separated.
std::string remove_excess_whitespace(absl::string_view sv) {
return absl::StrJoin(absl::StrSplit(sv, ' ', absl::SkipWhitespace{}), " ");
}





share|improve this answer











$endgroup$



std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };


As Toby mentioned, if you plan on copying and mutating the copy locally, you should pass the parameter str by value. It should also be noted that result will have the same capacity as str and won't shrink to fit automatically (or take advantage of SSO if allocated).





std::isspace(static_cast<unsigned>(ch))


Be aware that std::isspace removes spaces (' '), whitespaces ('n', 'v', 'f', 'r'), and tabs ('t').



You should cast to unsigned char.





(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '')



Is that standards-compliant?




Using the built-in subscript operator on a pointer is standards compliant. From the C++17 standard (n4659), Postfix Expressions § 8.2.1 Subscripting:




A postfix expression followed by an expression in square brackets is a postfix expression. One of the expressions shall be a glvalue of type “array of T” or a prvalue of type “pointer to T” and the other shall be a prvalue of unscoped enumeration or integral type.




When accessing memory out of bounds via the built-in subscript operator, the behavior is undefined. A well-defined approach would be to track the next index and access the next element using std::string::operator[] (element at size() returns CharT{}). std::string is not a null-terminated sequence and considers the null character (CharT{}) to be a valid character within a sequence.



using namespace std::string_literals;
std::string str = "ab"s;
std::cout << str << 'n'; // prints "ab"




For a standard library solution on removing duplicates, I would simply pass the predicate to std::unique. No pointer arithmetic is necessary. Just pass it a binary predicate that checks if both characters are whitespaces:



#include <algorithm>
#include <cctype>
#include <string>

std::string remove_excessive_ws(std::string s)
{
static auto const space_space =
[](unsigned char a, unsigned char b) {
return std::isspace(a) && std::isspace(b);
};

s.erase(std::unique(s.begin(), s.end(), space_space), s.end());

// trim final space
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}

return s;
}


Your function leaves a leading whitespace if one or more exists. Is this intended? Should there be a common character (single space) to merge the different characters std::isspace catches? If the ultimate goal was to trim all outer whitespace and join non-whitespace tokens with single spaces, I would use abseil's absl::StrSplit() and absl::StrJoin(). The resulting string would either take advantage of SSO if small enough or use a more appropriate capacity.



// remove_excess_whitespace
//
// Trims leading and trailing space, whitespace, and tab characters
// such that the resulting string is single space separated.
std::string remove_excess_whitespace(absl::string_view sv) {
return absl::StrJoin(absl::StrSplit(sv, ' ', absl::SkipWhitespace{}), " ");
}






share|improve this answer














share|improve this answer



share|improve this answer








edited 2 days ago









Toby Speight

25.2k741116




25.2k741116










answered 2 days ago









SnowhawkSnowhawk

5,31211128




5,31211128












  • $begingroup$
    Be aware that std::isspace removes spaces (' '), whitespaces ('n', 'v', 'f', 'r'), and tabs ('t') – that's intended.
    $endgroup$
    – Swordfish
    2 days ago










  • $begingroup$
    As for std::string, this behavior is well-defined – yes, but what about remove_if(), is it guaranteed that it won't copy?
    $endgroup$
    – Swordfish
    2 days ago










  • $begingroup$
    std::remove_if copies individual characters that do not satisfy the predicate. The well-defined behavior I was talking about is accessing str[str.size()]. So, instead of checking &ch[1] == '', keep an index to the next variable and use str[index] (or better, use std::unique and don't worry about subscript access for adjacent elements).
    $endgroup$
    – Snowhawk
    2 days ago












  • $begingroup$
    I was talking about remove_if() probably copying the element before passing it to the predicate.
    $endgroup$
    – Swordfish
    2 days ago










  • $begingroup$
    Compilers are pretty good at inlining arguments and lambda predicates. I would be concerned with unnecessary heap allocations.
    $endgroup$
    – Snowhawk
    2 days ago




















  • $begingroup$
    Be aware that std::isspace removes spaces (' '), whitespaces ('n', 'v', 'f', 'r'), and tabs ('t') – that's intended.
    $endgroup$
    – Swordfish
    2 days ago










  • $begingroup$
    As for std::string, this behavior is well-defined – yes, but what about remove_if(), is it guaranteed that it won't copy?
    $endgroup$
    – Swordfish
    2 days ago










  • $begingroup$
    std::remove_if copies individual characters that do not satisfy the predicate. The well-defined behavior I was talking about is accessing str[str.size()]. So, instead of checking &ch[1] == '', keep an index to the next variable and use str[index] (or better, use std::unique and don't worry about subscript access for adjacent elements).
    $endgroup$
    – Snowhawk
    2 days ago












  • $begingroup$
    I was talking about remove_if() probably copying the element before passing it to the predicate.
    $endgroup$
    – Swordfish
    2 days ago










  • $begingroup$
    Compilers are pretty good at inlining arguments and lambda predicates. I would be concerned with unnecessary heap allocations.
    $endgroup$
    – Snowhawk
    2 days ago


















$begingroup$
Be aware that std::isspace removes spaces (' '), whitespaces ('n', 'v', 'f', 'r'), and tabs ('t') – that's intended.
$endgroup$
– Swordfish
2 days ago




$begingroup$
Be aware that std::isspace removes spaces (' '), whitespaces ('n', 'v', 'f', 'r'), and tabs ('t') – that's intended.
$endgroup$
– Swordfish
2 days ago












$begingroup$
As for std::string, this behavior is well-defined – yes, but what about remove_if(), is it guaranteed that it won't copy?
$endgroup$
– Swordfish
2 days ago




$begingroup$
As for std::string, this behavior is well-defined – yes, but what about remove_if(), is it guaranteed that it won't copy?
$endgroup$
– Swordfish
2 days ago












$begingroup$
std::remove_if copies individual characters that do not satisfy the predicate. The well-defined behavior I was talking about is accessing str[str.size()]. So, instead of checking &ch[1] == '', keep an index to the next variable and use str[index] (or better, use std::unique and don't worry about subscript access for adjacent elements).
$endgroup$
– Snowhawk
2 days ago






$begingroup$
std::remove_if copies individual characters that do not satisfy the predicate. The well-defined behavior I was talking about is accessing str[str.size()]. So, instead of checking &ch[1] == '', keep an index to the next variable and use str[index] (or better, use std::unique and don't worry about subscript access for adjacent elements).
$endgroup$
– Snowhawk
2 days ago














$begingroup$
I was talking about remove_if() probably copying the element before passing it to the predicate.
$endgroup$
– Swordfish
2 days ago




$begingroup$
I was talking about remove_if() probably copying the element before passing it to the predicate.
$endgroup$
– Swordfish
2 days ago












$begingroup$
Compilers are pretty good at inlining arguments and lambda predicates. I would be concerned with unnecessary heap allocations.
$endgroup$
– Snowhawk
2 days ago






$begingroup$
Compilers are pretty good at inlining arguments and lambda predicates. I would be concerned with unnecessary heap allocations.
$endgroup$
– Snowhawk
2 days ago













0












$begingroup$


It's about accessing (&ch)[1]. I think it should be legal, but I am not sure.




Yes, it is legal as long as you have null terminated string, i.e. the size-th character of the string is the null character. The reason is that std::string::end() returns an iterator that is one short of the element that holds the terminating null character when the string object does have a terminating null character. Hence, (&ch)[1] will not access anything beyond the terminating null character. You can verify that by printing the value of &ch as a debugging guide.



Here's a updated version of your posted code that prints additional debugging info.



#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>

std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };

std::for_each(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
});

std::cout << "==========================n";

auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})
};

result.erase(end, std::end(result));
return result;
}

int main()
{
char const *foo{ "Hello World! "};
std::string res = remove_excessive_ws(foo);
std::cout << "n"" << res << ""n";
}


Here's its output.



pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8
==========================
pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8

"Hello World!"




Response to OP's comment



std::string::end() returns std::string::iterator(), which is a LegacyRandomAccessIterator. Please see the Member types of std::string.



LegacyRandomAccessIterator satisfies the requirements of LegacyBidirectionalIterator.



LegacyBidirectionalIterator satisfies the requirments of LegacyForwardIterator.



ForwardIterator satisfies the requirements of LegacyInputIterator.



When a LegacyInputIterator is dereferenced, it evaluates to a reference. Now, you have to go back to std::string to see what that means. As you can expect, it is a reference and not a copy.



You can also use a non-const reference to be doubly sure that you get a reference and not a copy.



auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})





share|improve this answer











$endgroup$









  • 1




    $begingroup$
    You can verify that by [...] – Standard reference? Also, Where is written that find_if() won't hand me a copy of the char instead a reference to it?
    $endgroup$
    – Swordfish
    2 days ago






  • 1




    $begingroup$
    Whether it's legal to dereference str.end() depends on the standards version; only since C++11 has str.data() been required to have a null-terminator.
    $endgroup$
    – Toby Speight
    2 days ago










  • $begingroup$
    @TobySpeight, agree in principle. However, if the string object is a null terminated string, my answer is valid. I update the answer to make that clearer.
    $endgroup$
    – R Sahu
    yesterday
















0












$begingroup$


It's about accessing (&ch)[1]. I think it should be legal, but I am not sure.




Yes, it is legal as long as you have null terminated string, i.e. the size-th character of the string is the null character. The reason is that std::string::end() returns an iterator that is one short of the element that holds the terminating null character when the string object does have a terminating null character. Hence, (&ch)[1] will not access anything beyond the terminating null character. You can verify that by printing the value of &ch as a debugging guide.



Here's a updated version of your posted code that prints additional debugging info.



#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>

std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };

std::for_each(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
});

std::cout << "==========================n";

auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})
};

result.erase(end, std::end(result));
return result;
}

int main()
{
char const *foo{ "Hello World! "};
std::string res = remove_excessive_ws(foo);
std::cout << "n"" << res << ""n";
}


Here's its output.



pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8
==========================
pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8

"Hello World!"




Response to OP's comment



std::string::end() returns std::string::iterator(), which is a LegacyRandomAccessIterator. Please see the Member types of std::string.



LegacyRandomAccessIterator satisfies the requirements of LegacyBidirectionalIterator.



LegacyBidirectionalIterator satisfies the requirments of LegacyForwardIterator.



ForwardIterator satisfies the requirements of LegacyInputIterator.



When a LegacyInputIterator is dereferenced, it evaluates to a reference. Now, you have to go back to std::string to see what that means. As you can expect, it is a reference and not a copy.



You can also use a non-const reference to be doubly sure that you get a reference and not a copy.



auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})





share|improve this answer











$endgroup$









  • 1




    $begingroup$
    You can verify that by [...] – Standard reference? Also, Where is written that find_if() won't hand me a copy of the char instead a reference to it?
    $endgroup$
    – Swordfish
    2 days ago






  • 1




    $begingroup$
    Whether it's legal to dereference str.end() depends on the standards version; only since C++11 has str.data() been required to have a null-terminator.
    $endgroup$
    – Toby Speight
    2 days ago










  • $begingroup$
    @TobySpeight, agree in principle. However, if the string object is a null terminated string, my answer is valid. I update the answer to make that clearer.
    $endgroup$
    – R Sahu
    yesterday














0












0








0





$begingroup$


It's about accessing (&ch)[1]. I think it should be legal, but I am not sure.




Yes, it is legal as long as you have null terminated string, i.e. the size-th character of the string is the null character. The reason is that std::string::end() returns an iterator that is one short of the element that holds the terminating null character when the string object does have a terminating null character. Hence, (&ch)[1] will not access anything beyond the terminating null character. You can verify that by printing the value of &ch as a debugging guide.



Here's a updated version of your posted code that prints additional debugging info.



#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>

std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };

std::for_each(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
});

std::cout << "==========================n";

auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})
};

result.erase(end, std::end(result));
return result;
}

int main()
{
char const *foo{ "Hello World! "};
std::string res = remove_excessive_ws(foo);
std::cout << "n"" << res << ""n";
}


Here's its output.



pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8
==========================
pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8

"Hello World!"




Response to OP's comment



std::string::end() returns std::string::iterator(), which is a LegacyRandomAccessIterator. Please see the Member types of std::string.



LegacyRandomAccessIterator satisfies the requirements of LegacyBidirectionalIterator.



LegacyBidirectionalIterator satisfies the requirments of LegacyForwardIterator.



ForwardIterator satisfies the requirements of LegacyInputIterator.



When a LegacyInputIterator is dereferenced, it evaluates to a reference. Now, you have to go back to std::string to see what that means. As you can expect, it is a reference and not a copy.



You can also use a non-const reference to be doubly sure that you get a reference and not a copy.



auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})





share|improve this answer











$endgroup$




It's about accessing (&ch)[1]. I think it should be legal, but I am not sure.




Yes, it is legal as long as you have null terminated string, i.e. the size-th character of the string is the null character. The reason is that std::string::end() returns an iterator that is one short of the element that holds the terminating null character when the string object does have a terminating null character. Hence, (&ch)[1] will not access anything beyond the terminating null character. You can verify that by printing the value of &ch as a debugging guide.



Here's a updated version of your posted code that prints additional debugging info.



#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>

std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };

std::for_each(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
});

std::cout << "==========================n";

auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})
};

result.erase(end, std::end(result));
return result;
}

int main()
{
char const *foo{ "Hello World! "};
std::string res = remove_excessive_ws(foo);
std::cout << "n"" << res << ""n";
}


Here's its output.



pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8
==========================
pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8

"Hello World!"




Response to OP's comment



std::string::end() returns std::string::iterator(), which is a LegacyRandomAccessIterator. Please see the Member types of std::string.



LegacyRandomAccessIterator satisfies the requirements of LegacyBidirectionalIterator.



LegacyBidirectionalIterator satisfies the requirments of LegacyForwardIterator.



ForwardIterator satisfies the requirements of LegacyInputIterator.



When a LegacyInputIterator is dereferenced, it evaluates to a reference. Now, you have to go back to std::string to see what that means. As you can expect, it is a reference and not a copy.



You can also use a non-const reference to be doubly sure that you get a reference and not a copy.



auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})






share|improve this answer














share|improve this answer



share|improve this answer








edited yesterday

























answered 2 days ago









R SahuR Sahu

3,274617




3,274617








  • 1




    $begingroup$
    You can verify that by [...] – Standard reference? Also, Where is written that find_if() won't hand me a copy of the char instead a reference to it?
    $endgroup$
    – Swordfish
    2 days ago






  • 1




    $begingroup$
    Whether it's legal to dereference str.end() depends on the standards version; only since C++11 has str.data() been required to have a null-terminator.
    $endgroup$
    – Toby Speight
    2 days ago










  • $begingroup$
    @TobySpeight, agree in principle. However, if the string object is a null terminated string, my answer is valid. I update the answer to make that clearer.
    $endgroup$
    – R Sahu
    yesterday














  • 1




    $begingroup$
    You can verify that by [...] – Standard reference? Also, Where is written that find_if() won't hand me a copy of the char instead a reference to it?
    $endgroup$
    – Swordfish
    2 days ago






  • 1




    $begingroup$
    Whether it's legal to dereference str.end() depends on the standards version; only since C++11 has str.data() been required to have a null-terminator.
    $endgroup$
    – Toby Speight
    2 days ago










  • $begingroup$
    @TobySpeight, agree in principle. However, if the string object is a null terminated string, my answer is valid. I update the answer to make that clearer.
    $endgroup$
    – R Sahu
    yesterday








1




1




$begingroup$
You can verify that by [...] – Standard reference? Also, Where is written that find_if() won't hand me a copy of the char instead a reference to it?
$endgroup$
– Swordfish
2 days ago




$begingroup$
You can verify that by [...] – Standard reference? Also, Where is written that find_if() won't hand me a copy of the char instead a reference to it?
$endgroup$
– Swordfish
2 days ago




1




1




$begingroup$
Whether it's legal to dereference str.end() depends on the standards version; only since C++11 has str.data() been required to have a null-terminator.
$endgroup$
– Toby Speight
2 days ago




$begingroup$
Whether it's legal to dereference str.end() depends on the standards version; only since C++11 has str.data() been required to have a null-terminator.
$endgroup$
– Toby Speight
2 days ago












$begingroup$
@TobySpeight, agree in principle. However, if the string object is a null terminated string, my answer is valid. I update the answer to make that clearer.
$endgroup$
– R Sahu
yesterday




$begingroup$
@TobySpeight, agree in principle. However, if the string object is a null terminated string, my answer is valid. I update the answer to make that clearer.
$endgroup$
– R Sahu
yesterday



Popular posts from this blog

is 'sed' thread safeWhat should someone know about using Python scripts in the shell?Nexenta bash script uses...

How do i solve the “ No module named 'mlxtend' ” issue on Jupyter?

Pilgersdorf Inhaltsverzeichnis Geografie | Geschichte | Bevölkerungsentwicklung | Politik | Kultur...