Remove excessive whitespace from string [on hold]Remove useless whitespace from styled stringRemove last word...
Are there any modern advantages of a fire piston?
Why are the books in the Game of Thrones citadel library shelved spine inwards?
Can an insurance company drop you after receiving a bill and refusing to pay?
Why zero tolerance on nudity in space?
Is it a fallacy if someone claims they need an explanation for every word of your argument to the point where they don't understand common terms?
Caruana vs Carlsen game 10 (WCC) why not 18...Nxb6?
How can I install sudo without using su?
Does static make a difference for a const local variable?
Citing paywalled articles accessed via illegal web sharing
Early credit roll before the end of the film
Why do I have multiple (unassociated) temporal history tables?
Is there some relative to Dutch word "kijken" in German?
Lick explanation
Every character has a name
What is the most triangles you can make from a capital "H" and 3 straight lines?
Why did other German political parties disband so fast when Hitler was appointed chancellor?
Word or phrase for showing great skill at something without formal training in it
Why would the Pakistan airspace closure cancel flights not headed to Pakistan itself?
How would an AI self awareness kill switch work?
Program that converts a number to a letter of the alphabet
We are very unlucky in my court
Explain the objections to these measures against human trafficking
Injecting creativity into a cookbook
Strange blocking on readable secondary after reboot
Remove excessive whitespace from string [on hold]
Remove useless whitespace from styled stringRemove last word from stringEffective way to remove white spaces from stringRemove (not extract) timestamp from HTML stringRemove unwanted characters from a stringRemove duplicates from string without additional bufferRemove a query string parameter from a URLRemove duplications from a Java StringRemove duplicate chars from StringC++ program to remove duplicate characters from string
$begingroup$
Is that standards-compliant?
#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>
std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };
auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type const &ch) {
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
}
)
};
result.erase(end, std::end(result));
return result;
}
int main()
{
char const *foo{ "Hello World! " };
std::cout << '"' << remove_excessive_ws(foo) << ""n";
}
I'm unsure whether I should be accessing (&ch)[1]
. I think it should be legal, but I am not sure. After all, std::remove_if()
could copy the character and pass it to the function, so the pointer (&ch) + 1
might not be valid.
c++ strings
$endgroup$
put on hold as off-topic by user673679, papagaga, t3chb0t, ferada, Zeta yesterday
This question appears to be off-topic. The users who voted to close gave this specific reason:
- "Code not implemented or not working as intended: Code Review is a community where programmers peer-review your working code to address issues such as security, maintainability, performance, and scalability. We require that the code be working correctly, to the best of the author's knowledge, before proceeding with a review." – user673679, papagaga, t3chb0t, ferada
If this question can be reworded to fit the rules in the help center, please edit the question.
add a comment |
$begingroup$
Is that standards-compliant?
#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>
std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };
auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type const &ch) {
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
}
)
};
result.erase(end, std::end(result));
return result;
}
int main()
{
char const *foo{ "Hello World! " };
std::cout << '"' << remove_excessive_ws(foo) << ""n";
}
I'm unsure whether I should be accessing (&ch)[1]
. I think it should be legal, but I am not sure. After all, std::remove_if()
could copy the character and pass it to the function, so the pointer (&ch) + 1
might not be valid.
c++ strings
$endgroup$
put on hold as off-topic by user673679, papagaga, t3chb0t, ferada, Zeta yesterday
This question appears to be off-topic. The users who voted to close gave this specific reason:
- "Code not implemented or not working as intended: Code Review is a community where programmers peer-review your working code to address issues such as security, maintainability, performance, and scalability. We require that the code be working correctly, to the best of the author's knowledge, before proceeding with a review." – user673679, papagaga, t3chb0t, ferada
If this question can be reworded to fit the rules in the help center, please edit the question.
$begingroup$
It's not standard (but you should have asked this question on stack overflow instead, code review is for working code).std:string
is not required to end with a null character and dereferencing the end iterator is undefined behavior; if your input ends with a space, it's precisely what you'll end up doing.
$endgroup$
– papagaga
Feb 26 at 8:25
$begingroup$
I don't thinkstd::remove_if()
is allowed to make a copy of each element - if it did, it couldn't (sensibly) work with containers of move-only values.
$endgroup$
– Toby Speight
2 days ago
$begingroup$
@TobySpeight Yay, that's reasonable! :)
$endgroup$
– Swordfish
yesterday
1
$begingroup$
Please clarify, does or doesn't the current code work as intended?
$endgroup$
– Mast
yesterday
add a comment |
$begingroup$
Is that standards-compliant?
#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>
std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };
auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type const &ch) {
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
}
)
};
result.erase(end, std::end(result));
return result;
}
int main()
{
char const *foo{ "Hello World! " };
std::cout << '"' << remove_excessive_ws(foo) << ""n";
}
I'm unsure whether I should be accessing (&ch)[1]
. I think it should be legal, but I am not sure. After all, std::remove_if()
could copy the character and pass it to the function, so the pointer (&ch) + 1
might not be valid.
c++ strings
$endgroup$
Is that standards-compliant?
#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>
std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };
auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type const &ch) {
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
}
)
};
result.erase(end, std::end(result));
return result;
}
int main()
{
char const *foo{ "Hello World! " };
std::cout << '"' << remove_excessive_ws(foo) << ""n";
}
I'm unsure whether I should be accessing (&ch)[1]
. I think it should be legal, but I am not sure. After all, std::remove_if()
could copy the character and pass it to the function, so the pointer (&ch) + 1
might not be valid.
c++ strings
c++ strings
edited 2 days ago
Toby Speight
25.2k741116
25.2k741116
asked Feb 26 at 7:39
SwordfishSwordfish
1669
1669
put on hold as off-topic by user673679, papagaga, t3chb0t, ferada, Zeta yesterday
This question appears to be off-topic. The users who voted to close gave this specific reason:
- "Code not implemented or not working as intended: Code Review is a community where programmers peer-review your working code to address issues such as security, maintainability, performance, and scalability. We require that the code be working correctly, to the best of the author's knowledge, before proceeding with a review." – user673679, papagaga, t3chb0t, ferada
If this question can be reworded to fit the rules in the help center, please edit the question.
put on hold as off-topic by user673679, papagaga, t3chb0t, ferada, Zeta yesterday
This question appears to be off-topic. The users who voted to close gave this specific reason:
- "Code not implemented or not working as intended: Code Review is a community where programmers peer-review your working code to address issues such as security, maintainability, performance, and scalability. We require that the code be working correctly, to the best of the author's knowledge, before proceeding with a review." – user673679, papagaga, t3chb0t, ferada
If this question can be reworded to fit the rules in the help center, please edit the question.
$begingroup$
It's not standard (but you should have asked this question on stack overflow instead, code review is for working code).std:string
is not required to end with a null character and dereferencing the end iterator is undefined behavior; if your input ends with a space, it's precisely what you'll end up doing.
$endgroup$
– papagaga
Feb 26 at 8:25
$begingroup$
I don't thinkstd::remove_if()
is allowed to make a copy of each element - if it did, it couldn't (sensibly) work with containers of move-only values.
$endgroup$
– Toby Speight
2 days ago
$begingroup$
@TobySpeight Yay, that's reasonable! :)
$endgroup$
– Swordfish
yesterday
1
$begingroup$
Please clarify, does or doesn't the current code work as intended?
$endgroup$
– Mast
yesterday
add a comment |
$begingroup$
It's not standard (but you should have asked this question on stack overflow instead, code review is for working code).std:string
is not required to end with a null character and dereferencing the end iterator is undefined behavior; if your input ends with a space, it's precisely what you'll end up doing.
$endgroup$
– papagaga
Feb 26 at 8:25
$begingroup$
I don't thinkstd::remove_if()
is allowed to make a copy of each element - if it did, it couldn't (sensibly) work with containers of move-only values.
$endgroup$
– Toby Speight
2 days ago
$begingroup$
@TobySpeight Yay, that's reasonable! :)
$endgroup$
– Swordfish
yesterday
1
$begingroup$
Please clarify, does or doesn't the current code work as intended?
$endgroup$
– Mast
yesterday
$begingroup$
It's not standard (but you should have asked this question on stack overflow instead, code review is for working code).
std:string
is not required to end with a null character and dereferencing the end iterator is undefined behavior; if your input ends with a space, it's precisely what you'll end up doing.$endgroup$
– papagaga
Feb 26 at 8:25
$begingroup$
It's not standard (but you should have asked this question on stack overflow instead, code review is for working code).
std:string
is not required to end with a null character and dereferencing the end iterator is undefined behavior; if your input ends with a space, it's precisely what you'll end up doing.$endgroup$
– papagaga
Feb 26 at 8:25
$begingroup$
I don't think
std::remove_if()
is allowed to make a copy of each element - if it did, it couldn't (sensibly) work with containers of move-only values.$endgroup$
– Toby Speight
2 days ago
$begingroup$
I don't think
std::remove_if()
is allowed to make a copy of each element - if it did, it couldn't (sensibly) work with containers of move-only values.$endgroup$
– Toby Speight
2 days ago
$begingroup$
@TobySpeight Yay, that's reasonable! :)
$endgroup$
– Swordfish
yesterday
$begingroup$
@TobySpeight Yay, that's reasonable! :)
$endgroup$
– Swordfish
yesterday
1
1
$begingroup$
Please clarify, does or doesn't the current code work as intended?
$endgroup$
– Mast
yesterday
$begingroup$
Please clarify, does or doesn't the current code work as intended?
$endgroup$
– Mast
yesterday
add a comment |
3 Answers
3
active
oldest
votes
$begingroup$
Define the problem
It's not clear from the description what's considered "excessive" whitespace. From experimenting, it seems that the idea is to collapse multiple whitespace to a single whitespace character, except at the end of the string, where whitespace is to be completely removed. (Whitespace at the beginning of string seems to treated the same as inner whitespace). It's also not clear which whitespace character should be kept - the example code keeps the last one, but is that a requirement, or just an implementation choice?
Missing include
std::begin()
and std::end()
are declared in <iterator>
. However, there's no reason not to use the begin()
and end()
member functions of std::string
here, as we're not operating on generic values.
When an argument is to be copied, pass by value
We don't need to copy str
into result
:
std::string remove_excessive_ws(std::string str)
Bug
Like all the <cctype>
functions, std::isspace()
requires that its argument be either EOF
or representable as unsigned char
. Converting a (possibly signed) char
direct to unsigned int
can sign-extend to an out-of-range value. We need to convert char
to unsigned char
before widening to unsigned int
:
static_cast<unsigned char>(ch)
Bug
Prior to C++11, accessing the character after the end of the string is undefined behaviour (C++11 requires an extra null to follow the string data). Thankfully, it's easy to avoid this bug by simply remembering whether the last character seen was a space.
Here's a C++11 version:
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
std::string remove_excessive_ws(std::string str)
{
bool seen_space = false;
auto end{ std::remove_if(str.begin(), str.end(),
[&seen_space](unsigned char ch) {
bool is_space = std::isspace(ch);
std::swap(seen_space, is_space);
return seen_space && is_space;
})};
// adjust end to remove end whitespace
if (end != str.begin() && std::isspace(static_cast<unsigned char>(end[-1]))) {
--end;
}
str.erase(end, str.end());
return str;
}
We might want to move seen_space
into the lambda expression in later C++ versions that allow that.
This also is more readable, as we can perform the widening to unsigned int
when calling the lambda, rather than having to write a cast.
Style-wise, I'd normally prefer to name the lambda, and keep the erase-remove call on a single line so that the idiom is obvious:
// Assuming C++17 now
std::string remove_excessive_ws(std::string s)
{
auto is_doubled_space =
[seen_space=false](unsigned char c) mutable {
return std::exchange(seen_space, std::isspace(c))
&& seen_space;
};
s.erase(std::remove_if(s.begin(), s.end(), is_doubled_space), s.end());
// remove trailing whitespace
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}
// convert all whitespace into ordinary space character
std::replace_if(s.begin(), s.end(),
[](unsigned char c) { return std::isspace(c); }, ' ');
return s;
}
$endgroup$
$begingroup$
Personally I try really hard to avoid mutating lambdas. Unfortunately for this problem this would preclude usingstd::remove_if
. Instead, some kind of reduction or prefix scan would need to be used, and is probably not as efficient.
$endgroup$
– Konrad Rudolph
Feb 26 at 17:07
$begingroup$
Yes, that's a good point @Konrad -mutable
lambdas are harder to reason about. It might be that a plain oldfor (auto it = str.begin(); it != str.end(); ++it)
loop (followed byerase
) may actually be the clearest code. But it's good to see a few options to choose from!
$endgroup$
– Toby Speight
2 days ago
add a comment |
$begingroup$
std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };
As Toby mentioned, if you plan on copying and mutating the copy locally, you should pass the parameter str
by value. It should also be noted that result
will have the same capacity as str
and won't shrink to fit automatically (or take advantage of SSO if allocated).
std::isspace(static_cast<unsigned>(ch))
Be aware that std::isspace
removes spaces (' '
), whitespaces ('n', 'v', 'f', 'r'
), and tabs ('t'
).
You should cast to unsigned char
.
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '')
Is that standards-compliant?
Using the built-in subscript operator on a pointer is standards compliant. From the C++17 standard (n4659), Postfix Expressions § 8.2.1 Subscripting:
A postfix expression followed by an expression in square brackets is a postfix expression. One of the expressions shall be a glvalue of type “array of
T
” or a prvalue of type “pointer toT
” and the other shall be a prvalue of unscoped enumeration or integral type.
When accessing memory out of bounds via the built-in subscript operator, the behavior is undefined. A well-defined approach would be to track the next index and access the next element using std::string::operator[]
(element at size()
returns CharT{}
). std::string
is not a null-terminated sequence and considers the null character (CharT{}
) to be a valid character within a sequence.
using namespace std::string_literals;
std::string str = "ab"s;
std::cout << str << 'n'; // prints "ab"
For a standard library solution on removing duplicates, I would simply pass the predicate to std::unique
. No pointer arithmetic is necessary. Just pass it a binary predicate that checks if both characters are whitespaces:
#include <algorithm>
#include <cctype>
#include <string>
std::string remove_excessive_ws(std::string s)
{
static auto const space_space =
[](unsigned char a, unsigned char b) {
return std::isspace(a) && std::isspace(b);
};
s.erase(std::unique(s.begin(), s.end(), space_space), s.end());
// trim final space
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}
return s;
}
Your function leaves a leading whitespace if one or more exists. Is this intended? Should there be a common character (single space) to merge the different characters std::isspace
catches? If the ultimate goal was to trim all outer whitespace and join non-whitespace tokens with single spaces, I would use abseil's absl::StrSplit()
and absl::StrJoin()
. The resulting string would either take advantage of SSO if small enough or use a more appropriate capacity.
// remove_excess_whitespace
//
// Trims leading and trailing space, whitespace, and tab characters
// such that the resulting string is single space separated.
std::string remove_excess_whitespace(absl::string_view sv) {
return absl::StrJoin(absl::StrSplit(sv, ' ', absl::SkipWhitespace{}), " ");
}
$endgroup$
$begingroup$
Be aware that std::isspace removes spaces (' '), whitespaces ('n', 'v', 'f', 'r'), and tabs ('t') – that's intended.
$endgroup$
– Swordfish
2 days ago
$begingroup$
As for std::string, this behavior is well-defined – yes, but what aboutremove_if()
, is it guaranteed that it won't copy?
$endgroup$
– Swordfish
2 days ago
$begingroup$
std::remove_if
copies individual characters that do not satisfy the predicate. The well-defined behavior I was talking about is accessingstr[str.size()]
. So, instead of checking&ch[1] == ''
, keep an index to the next variable and usestr[index]
(or better, usestd::unique
and don't worry about subscript access for adjacent elements).
$endgroup$
– Snowhawk
2 days ago
$begingroup$
I was talking aboutremove_if()
probably copying the element before passing it to the predicate.
$endgroup$
– Swordfish
2 days ago
$begingroup$
Compilers are pretty good at inlining arguments and lambda predicates. I would be concerned with unnecessary heap allocations.
$endgroup$
– Snowhawk
2 days ago
|
show 1 more comment
$begingroup$
It's about accessing
(&ch)[1]
. I think it should be legal, but I am not sure.
Yes, it is legal as long as you have null terminated string, i.e. the size
-th character of the string is the null character. The reason is that std::string::end()
returns an iterator that is one short of the element that holds the terminating null character when the string object does have a terminating null character. Hence, (&ch)[1]
will not access anything beyond the terminating null character. You can verify that by printing the value of &ch
as a debugging guide.
Here's a updated version of your posted code that prints additional debugging info.
#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>
std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };
std::for_each(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
});
std::cout << "==========================n";
auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})
};
result.erase(end, std::end(result));
return result;
}
int main()
{
char const *foo{ "Hello World! "};
std::string res = remove_excessive_ws(foo);
std::cout << "n"" << res << ""n";
}
Here's its output.
pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8
==========================
pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8
"Hello World!"
Response to OP's comment
std::string::end()
returns std::string::iterator()
, which is a LegacyRandomAccessIterator. Please see the Member types of std::string
.
LegacyRandomAccessIterator satisfies the requirements of LegacyBidirectionalIterator.
LegacyBidirectionalIterator satisfies the requirments of LegacyForwardIterator.
ForwardIterator satisfies the requirements of LegacyInputIterator.
When a LegacyInputIterator is dereferenced, it evaluates to a reference
. Now, you have to go back to std::string
to see what that means. As you can expect, it is a reference and not a copy.
You can also use a non-const
reference to be doubly sure that you get a reference and not a copy.
auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})
$endgroup$
1
$begingroup$
You can verify that by [...] – Standard reference? Also, Where is written thatfind_if()
won't hand me a copy of thechar
instead a reference to it?
$endgroup$
– Swordfish
2 days ago
1
$begingroup$
Whether it's legal to dereferencestr.end()
depends on the standards version; only since C++11 hasstr.data()
been required to have a null-terminator.
$endgroup$
– Toby Speight
2 days ago
$begingroup$
@TobySpeight, agree in principle. However, if the string object is a null terminated string, my answer is valid. I update the answer to make that clearer.
$endgroup$
– R Sahu
yesterday
add a comment |
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
Define the problem
It's not clear from the description what's considered "excessive" whitespace. From experimenting, it seems that the idea is to collapse multiple whitespace to a single whitespace character, except at the end of the string, where whitespace is to be completely removed. (Whitespace at the beginning of string seems to treated the same as inner whitespace). It's also not clear which whitespace character should be kept - the example code keeps the last one, but is that a requirement, or just an implementation choice?
Missing include
std::begin()
and std::end()
are declared in <iterator>
. However, there's no reason not to use the begin()
and end()
member functions of std::string
here, as we're not operating on generic values.
When an argument is to be copied, pass by value
We don't need to copy str
into result
:
std::string remove_excessive_ws(std::string str)
Bug
Like all the <cctype>
functions, std::isspace()
requires that its argument be either EOF
or representable as unsigned char
. Converting a (possibly signed) char
direct to unsigned int
can sign-extend to an out-of-range value. We need to convert char
to unsigned char
before widening to unsigned int
:
static_cast<unsigned char>(ch)
Bug
Prior to C++11, accessing the character after the end of the string is undefined behaviour (C++11 requires an extra null to follow the string data). Thankfully, it's easy to avoid this bug by simply remembering whether the last character seen was a space.
Here's a C++11 version:
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
std::string remove_excessive_ws(std::string str)
{
bool seen_space = false;
auto end{ std::remove_if(str.begin(), str.end(),
[&seen_space](unsigned char ch) {
bool is_space = std::isspace(ch);
std::swap(seen_space, is_space);
return seen_space && is_space;
})};
// adjust end to remove end whitespace
if (end != str.begin() && std::isspace(static_cast<unsigned char>(end[-1]))) {
--end;
}
str.erase(end, str.end());
return str;
}
We might want to move seen_space
into the lambda expression in later C++ versions that allow that.
This also is more readable, as we can perform the widening to unsigned int
when calling the lambda, rather than having to write a cast.
Style-wise, I'd normally prefer to name the lambda, and keep the erase-remove call on a single line so that the idiom is obvious:
// Assuming C++17 now
std::string remove_excessive_ws(std::string s)
{
auto is_doubled_space =
[seen_space=false](unsigned char c) mutable {
return std::exchange(seen_space, std::isspace(c))
&& seen_space;
};
s.erase(std::remove_if(s.begin(), s.end(), is_doubled_space), s.end());
// remove trailing whitespace
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}
// convert all whitespace into ordinary space character
std::replace_if(s.begin(), s.end(),
[](unsigned char c) { return std::isspace(c); }, ' ');
return s;
}
$endgroup$
$begingroup$
Personally I try really hard to avoid mutating lambdas. Unfortunately for this problem this would preclude usingstd::remove_if
. Instead, some kind of reduction or prefix scan would need to be used, and is probably not as efficient.
$endgroup$
– Konrad Rudolph
Feb 26 at 17:07
$begingroup$
Yes, that's a good point @Konrad -mutable
lambdas are harder to reason about. It might be that a plain oldfor (auto it = str.begin(); it != str.end(); ++it)
loop (followed byerase
) may actually be the clearest code. But it's good to see a few options to choose from!
$endgroup$
– Toby Speight
2 days ago
add a comment |
$begingroup$
Define the problem
It's not clear from the description what's considered "excessive" whitespace. From experimenting, it seems that the idea is to collapse multiple whitespace to a single whitespace character, except at the end of the string, where whitespace is to be completely removed. (Whitespace at the beginning of string seems to treated the same as inner whitespace). It's also not clear which whitespace character should be kept - the example code keeps the last one, but is that a requirement, or just an implementation choice?
Missing include
std::begin()
and std::end()
are declared in <iterator>
. However, there's no reason not to use the begin()
and end()
member functions of std::string
here, as we're not operating on generic values.
When an argument is to be copied, pass by value
We don't need to copy str
into result
:
std::string remove_excessive_ws(std::string str)
Bug
Like all the <cctype>
functions, std::isspace()
requires that its argument be either EOF
or representable as unsigned char
. Converting a (possibly signed) char
direct to unsigned int
can sign-extend to an out-of-range value. We need to convert char
to unsigned char
before widening to unsigned int
:
static_cast<unsigned char>(ch)
Bug
Prior to C++11, accessing the character after the end of the string is undefined behaviour (C++11 requires an extra null to follow the string data). Thankfully, it's easy to avoid this bug by simply remembering whether the last character seen was a space.
Here's a C++11 version:
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
std::string remove_excessive_ws(std::string str)
{
bool seen_space = false;
auto end{ std::remove_if(str.begin(), str.end(),
[&seen_space](unsigned char ch) {
bool is_space = std::isspace(ch);
std::swap(seen_space, is_space);
return seen_space && is_space;
})};
// adjust end to remove end whitespace
if (end != str.begin() && std::isspace(static_cast<unsigned char>(end[-1]))) {
--end;
}
str.erase(end, str.end());
return str;
}
We might want to move seen_space
into the lambda expression in later C++ versions that allow that.
This also is more readable, as we can perform the widening to unsigned int
when calling the lambda, rather than having to write a cast.
Style-wise, I'd normally prefer to name the lambda, and keep the erase-remove call on a single line so that the idiom is obvious:
// Assuming C++17 now
std::string remove_excessive_ws(std::string s)
{
auto is_doubled_space =
[seen_space=false](unsigned char c) mutable {
return std::exchange(seen_space, std::isspace(c))
&& seen_space;
};
s.erase(std::remove_if(s.begin(), s.end(), is_doubled_space), s.end());
// remove trailing whitespace
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}
// convert all whitespace into ordinary space character
std::replace_if(s.begin(), s.end(),
[](unsigned char c) { return std::isspace(c); }, ' ');
return s;
}
$endgroup$
$begingroup$
Personally I try really hard to avoid mutating lambdas. Unfortunately for this problem this would preclude usingstd::remove_if
. Instead, some kind of reduction or prefix scan would need to be used, and is probably not as efficient.
$endgroup$
– Konrad Rudolph
Feb 26 at 17:07
$begingroup$
Yes, that's a good point @Konrad -mutable
lambdas are harder to reason about. It might be that a plain oldfor (auto it = str.begin(); it != str.end(); ++it)
loop (followed byerase
) may actually be the clearest code. But it's good to see a few options to choose from!
$endgroup$
– Toby Speight
2 days ago
add a comment |
$begingroup$
Define the problem
It's not clear from the description what's considered "excessive" whitespace. From experimenting, it seems that the idea is to collapse multiple whitespace to a single whitespace character, except at the end of the string, where whitespace is to be completely removed. (Whitespace at the beginning of string seems to treated the same as inner whitespace). It's also not clear which whitespace character should be kept - the example code keeps the last one, but is that a requirement, or just an implementation choice?
Missing include
std::begin()
and std::end()
are declared in <iterator>
. However, there's no reason not to use the begin()
and end()
member functions of std::string
here, as we're not operating on generic values.
When an argument is to be copied, pass by value
We don't need to copy str
into result
:
std::string remove_excessive_ws(std::string str)
Bug
Like all the <cctype>
functions, std::isspace()
requires that its argument be either EOF
or representable as unsigned char
. Converting a (possibly signed) char
direct to unsigned int
can sign-extend to an out-of-range value. We need to convert char
to unsigned char
before widening to unsigned int
:
static_cast<unsigned char>(ch)
Bug
Prior to C++11, accessing the character after the end of the string is undefined behaviour (C++11 requires an extra null to follow the string data). Thankfully, it's easy to avoid this bug by simply remembering whether the last character seen was a space.
Here's a C++11 version:
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
std::string remove_excessive_ws(std::string str)
{
bool seen_space = false;
auto end{ std::remove_if(str.begin(), str.end(),
[&seen_space](unsigned char ch) {
bool is_space = std::isspace(ch);
std::swap(seen_space, is_space);
return seen_space && is_space;
})};
// adjust end to remove end whitespace
if (end != str.begin() && std::isspace(static_cast<unsigned char>(end[-1]))) {
--end;
}
str.erase(end, str.end());
return str;
}
We might want to move seen_space
into the lambda expression in later C++ versions that allow that.
This also is more readable, as we can perform the widening to unsigned int
when calling the lambda, rather than having to write a cast.
Style-wise, I'd normally prefer to name the lambda, and keep the erase-remove call on a single line so that the idiom is obvious:
// Assuming C++17 now
std::string remove_excessive_ws(std::string s)
{
auto is_doubled_space =
[seen_space=false](unsigned char c) mutable {
return std::exchange(seen_space, std::isspace(c))
&& seen_space;
};
s.erase(std::remove_if(s.begin(), s.end(), is_doubled_space), s.end());
// remove trailing whitespace
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}
// convert all whitespace into ordinary space character
std::replace_if(s.begin(), s.end(),
[](unsigned char c) { return std::isspace(c); }, ' ');
return s;
}
$endgroup$
Define the problem
It's not clear from the description what's considered "excessive" whitespace. From experimenting, it seems that the idea is to collapse multiple whitespace to a single whitespace character, except at the end of the string, where whitespace is to be completely removed. (Whitespace at the beginning of string seems to treated the same as inner whitespace). It's also not clear which whitespace character should be kept - the example code keeps the last one, but is that a requirement, or just an implementation choice?
Missing include
std::begin()
and std::end()
are declared in <iterator>
. However, there's no reason not to use the begin()
and end()
member functions of std::string
here, as we're not operating on generic values.
When an argument is to be copied, pass by value
We don't need to copy str
into result
:
std::string remove_excessive_ws(std::string str)
Bug
Like all the <cctype>
functions, std::isspace()
requires that its argument be either EOF
or representable as unsigned char
. Converting a (possibly signed) char
direct to unsigned int
can sign-extend to an out-of-range value. We need to convert char
to unsigned char
before widening to unsigned int
:
static_cast<unsigned char>(ch)
Bug
Prior to C++11, accessing the character after the end of the string is undefined behaviour (C++11 requires an extra null to follow the string data). Thankfully, it's easy to avoid this bug by simply remembering whether the last character seen was a space.
Here's a C++11 version:
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
std::string remove_excessive_ws(std::string str)
{
bool seen_space = false;
auto end{ std::remove_if(str.begin(), str.end(),
[&seen_space](unsigned char ch) {
bool is_space = std::isspace(ch);
std::swap(seen_space, is_space);
return seen_space && is_space;
})};
// adjust end to remove end whitespace
if (end != str.begin() && std::isspace(static_cast<unsigned char>(end[-1]))) {
--end;
}
str.erase(end, str.end());
return str;
}
We might want to move seen_space
into the lambda expression in later C++ versions that allow that.
This also is more readable, as we can perform the widening to unsigned int
when calling the lambda, rather than having to write a cast.
Style-wise, I'd normally prefer to name the lambda, and keep the erase-remove call on a single line so that the idiom is obvious:
// Assuming C++17 now
std::string remove_excessive_ws(std::string s)
{
auto is_doubled_space =
[seen_space=false](unsigned char c) mutable {
return std::exchange(seen_space, std::isspace(c))
&& seen_space;
};
s.erase(std::remove_if(s.begin(), s.end(), is_doubled_space), s.end());
// remove trailing whitespace
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}
// convert all whitespace into ordinary space character
std::replace_if(s.begin(), s.end(),
[](unsigned char c) { return std::isspace(c); }, ' ');
return s;
}
edited 2 days ago
answered Feb 26 at 8:46
Toby SpeightToby Speight
25.2k741116
25.2k741116
$begingroup$
Personally I try really hard to avoid mutating lambdas. Unfortunately for this problem this would preclude usingstd::remove_if
. Instead, some kind of reduction or prefix scan would need to be used, and is probably not as efficient.
$endgroup$
– Konrad Rudolph
Feb 26 at 17:07
$begingroup$
Yes, that's a good point @Konrad -mutable
lambdas are harder to reason about. It might be that a plain oldfor (auto it = str.begin(); it != str.end(); ++it)
loop (followed byerase
) may actually be the clearest code. But it's good to see a few options to choose from!
$endgroup$
– Toby Speight
2 days ago
add a comment |
$begingroup$
Personally I try really hard to avoid mutating lambdas. Unfortunately for this problem this would preclude usingstd::remove_if
. Instead, some kind of reduction or prefix scan would need to be used, and is probably not as efficient.
$endgroup$
– Konrad Rudolph
Feb 26 at 17:07
$begingroup$
Yes, that's a good point @Konrad -mutable
lambdas are harder to reason about. It might be that a plain oldfor (auto it = str.begin(); it != str.end(); ++it)
loop (followed byerase
) may actually be the clearest code. But it's good to see a few options to choose from!
$endgroup$
– Toby Speight
2 days ago
$begingroup$
Personally I try really hard to avoid mutating lambdas. Unfortunately for this problem this would preclude using
std::remove_if
. Instead, some kind of reduction or prefix scan would need to be used, and is probably not as efficient.$endgroup$
– Konrad Rudolph
Feb 26 at 17:07
$begingroup$
Personally I try really hard to avoid mutating lambdas. Unfortunately for this problem this would preclude using
std::remove_if
. Instead, some kind of reduction or prefix scan would need to be used, and is probably not as efficient.$endgroup$
– Konrad Rudolph
Feb 26 at 17:07
$begingroup$
Yes, that's a good point @Konrad -
mutable
lambdas are harder to reason about. It might be that a plain old for (auto it = str.begin(); it != str.end(); ++it)
loop (followed by erase
) may actually be the clearest code. But it's good to see a few options to choose from!$endgroup$
– Toby Speight
2 days ago
$begingroup$
Yes, that's a good point @Konrad -
mutable
lambdas are harder to reason about. It might be that a plain old for (auto it = str.begin(); it != str.end(); ++it)
loop (followed by erase
) may actually be the clearest code. But it's good to see a few options to choose from!$endgroup$
– Toby Speight
2 days ago
add a comment |
$begingroup$
std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };
As Toby mentioned, if you plan on copying and mutating the copy locally, you should pass the parameter str
by value. It should also be noted that result
will have the same capacity as str
and won't shrink to fit automatically (or take advantage of SSO if allocated).
std::isspace(static_cast<unsigned>(ch))
Be aware that std::isspace
removes spaces (' '
), whitespaces ('n', 'v', 'f', 'r'
), and tabs ('t'
).
You should cast to unsigned char
.
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '')
Is that standards-compliant?
Using the built-in subscript operator on a pointer is standards compliant. From the C++17 standard (n4659), Postfix Expressions § 8.2.1 Subscripting:
A postfix expression followed by an expression in square brackets is a postfix expression. One of the expressions shall be a glvalue of type “array of
T
” or a prvalue of type “pointer toT
” and the other shall be a prvalue of unscoped enumeration or integral type.
When accessing memory out of bounds via the built-in subscript operator, the behavior is undefined. A well-defined approach would be to track the next index and access the next element using std::string::operator[]
(element at size()
returns CharT{}
). std::string
is not a null-terminated sequence and considers the null character (CharT{}
) to be a valid character within a sequence.
using namespace std::string_literals;
std::string str = "ab"s;
std::cout << str << 'n'; // prints "ab"
For a standard library solution on removing duplicates, I would simply pass the predicate to std::unique
. No pointer arithmetic is necessary. Just pass it a binary predicate that checks if both characters are whitespaces:
#include <algorithm>
#include <cctype>
#include <string>
std::string remove_excessive_ws(std::string s)
{
static auto const space_space =
[](unsigned char a, unsigned char b) {
return std::isspace(a) && std::isspace(b);
};
s.erase(std::unique(s.begin(), s.end(), space_space), s.end());
// trim final space
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}
return s;
}
Your function leaves a leading whitespace if one or more exists. Is this intended? Should there be a common character (single space) to merge the different characters std::isspace
catches? If the ultimate goal was to trim all outer whitespace and join non-whitespace tokens with single spaces, I would use abseil's absl::StrSplit()
and absl::StrJoin()
. The resulting string would either take advantage of SSO if small enough or use a more appropriate capacity.
// remove_excess_whitespace
//
// Trims leading and trailing space, whitespace, and tab characters
// such that the resulting string is single space separated.
std::string remove_excess_whitespace(absl::string_view sv) {
return absl::StrJoin(absl::StrSplit(sv, ' ', absl::SkipWhitespace{}), " ");
}
$endgroup$
$begingroup$
Be aware that std::isspace removes spaces (' '), whitespaces ('n', 'v', 'f', 'r'), and tabs ('t') – that's intended.
$endgroup$
– Swordfish
2 days ago
$begingroup$
As for std::string, this behavior is well-defined – yes, but what aboutremove_if()
, is it guaranteed that it won't copy?
$endgroup$
– Swordfish
2 days ago
$begingroup$
std::remove_if
copies individual characters that do not satisfy the predicate. The well-defined behavior I was talking about is accessingstr[str.size()]
. So, instead of checking&ch[1] == ''
, keep an index to the next variable and usestr[index]
(or better, usestd::unique
and don't worry about subscript access for adjacent elements).
$endgroup$
– Snowhawk
2 days ago
$begingroup$
I was talking aboutremove_if()
probably copying the element before passing it to the predicate.
$endgroup$
– Swordfish
2 days ago
$begingroup$
Compilers are pretty good at inlining arguments and lambda predicates. I would be concerned with unnecessary heap allocations.
$endgroup$
– Snowhawk
2 days ago
|
show 1 more comment
$begingroup$
std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };
As Toby mentioned, if you plan on copying and mutating the copy locally, you should pass the parameter str
by value. It should also be noted that result
will have the same capacity as str
and won't shrink to fit automatically (or take advantage of SSO if allocated).
std::isspace(static_cast<unsigned>(ch))
Be aware that std::isspace
removes spaces (' '
), whitespaces ('n', 'v', 'f', 'r'
), and tabs ('t'
).
You should cast to unsigned char
.
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '')
Is that standards-compliant?
Using the built-in subscript operator on a pointer is standards compliant. From the C++17 standard (n4659), Postfix Expressions § 8.2.1 Subscripting:
A postfix expression followed by an expression in square brackets is a postfix expression. One of the expressions shall be a glvalue of type “array of
T
” or a prvalue of type “pointer toT
” and the other shall be a prvalue of unscoped enumeration or integral type.
When accessing memory out of bounds via the built-in subscript operator, the behavior is undefined. A well-defined approach would be to track the next index and access the next element using std::string::operator[]
(element at size()
returns CharT{}
). std::string
is not a null-terminated sequence and considers the null character (CharT{}
) to be a valid character within a sequence.
using namespace std::string_literals;
std::string str = "ab"s;
std::cout << str << 'n'; // prints "ab"
For a standard library solution on removing duplicates, I would simply pass the predicate to std::unique
. No pointer arithmetic is necessary. Just pass it a binary predicate that checks if both characters are whitespaces:
#include <algorithm>
#include <cctype>
#include <string>
std::string remove_excessive_ws(std::string s)
{
static auto const space_space =
[](unsigned char a, unsigned char b) {
return std::isspace(a) && std::isspace(b);
};
s.erase(std::unique(s.begin(), s.end(), space_space), s.end());
// trim final space
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}
return s;
}
Your function leaves a leading whitespace if one or more exists. Is this intended? Should there be a common character (single space) to merge the different characters std::isspace
catches? If the ultimate goal was to trim all outer whitespace and join non-whitespace tokens with single spaces, I would use abseil's absl::StrSplit()
and absl::StrJoin()
. The resulting string would either take advantage of SSO if small enough or use a more appropriate capacity.
// remove_excess_whitespace
//
// Trims leading and trailing space, whitespace, and tab characters
// such that the resulting string is single space separated.
std::string remove_excess_whitespace(absl::string_view sv) {
return absl::StrJoin(absl::StrSplit(sv, ' ', absl::SkipWhitespace{}), " ");
}
$endgroup$
$begingroup$
Be aware that std::isspace removes spaces (' '), whitespaces ('n', 'v', 'f', 'r'), and tabs ('t') – that's intended.
$endgroup$
– Swordfish
2 days ago
$begingroup$
As for std::string, this behavior is well-defined – yes, but what aboutremove_if()
, is it guaranteed that it won't copy?
$endgroup$
– Swordfish
2 days ago
$begingroup$
std::remove_if
copies individual characters that do not satisfy the predicate. The well-defined behavior I was talking about is accessingstr[str.size()]
. So, instead of checking&ch[1] == ''
, keep an index to the next variable and usestr[index]
(or better, usestd::unique
and don't worry about subscript access for adjacent elements).
$endgroup$
– Snowhawk
2 days ago
$begingroup$
I was talking aboutremove_if()
probably copying the element before passing it to the predicate.
$endgroup$
– Swordfish
2 days ago
$begingroup$
Compilers are pretty good at inlining arguments and lambda predicates. I would be concerned with unnecessary heap allocations.
$endgroup$
– Snowhawk
2 days ago
|
show 1 more comment
$begingroup$
std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };
As Toby mentioned, if you plan on copying and mutating the copy locally, you should pass the parameter str
by value. It should also be noted that result
will have the same capacity as str
and won't shrink to fit automatically (or take advantage of SSO if allocated).
std::isspace(static_cast<unsigned>(ch))
Be aware that std::isspace
removes spaces (' '
), whitespaces ('n', 'v', 'f', 'r'
), and tabs ('t'
).
You should cast to unsigned char
.
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '')
Is that standards-compliant?
Using the built-in subscript operator on a pointer is standards compliant. From the C++17 standard (n4659), Postfix Expressions § 8.2.1 Subscripting:
A postfix expression followed by an expression in square brackets is a postfix expression. One of the expressions shall be a glvalue of type “array of
T
” or a prvalue of type “pointer toT
” and the other shall be a prvalue of unscoped enumeration or integral type.
When accessing memory out of bounds via the built-in subscript operator, the behavior is undefined. A well-defined approach would be to track the next index and access the next element using std::string::operator[]
(element at size()
returns CharT{}
). std::string
is not a null-terminated sequence and considers the null character (CharT{}
) to be a valid character within a sequence.
using namespace std::string_literals;
std::string str = "ab"s;
std::cout << str << 'n'; // prints "ab"
For a standard library solution on removing duplicates, I would simply pass the predicate to std::unique
. No pointer arithmetic is necessary. Just pass it a binary predicate that checks if both characters are whitespaces:
#include <algorithm>
#include <cctype>
#include <string>
std::string remove_excessive_ws(std::string s)
{
static auto const space_space =
[](unsigned char a, unsigned char b) {
return std::isspace(a) && std::isspace(b);
};
s.erase(std::unique(s.begin(), s.end(), space_space), s.end());
// trim final space
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}
return s;
}
Your function leaves a leading whitespace if one or more exists. Is this intended? Should there be a common character (single space) to merge the different characters std::isspace
catches? If the ultimate goal was to trim all outer whitespace and join non-whitespace tokens with single spaces, I would use abseil's absl::StrSplit()
and absl::StrJoin()
. The resulting string would either take advantage of SSO if small enough or use a more appropriate capacity.
// remove_excess_whitespace
//
// Trims leading and trailing space, whitespace, and tab characters
// such that the resulting string is single space separated.
std::string remove_excess_whitespace(absl::string_view sv) {
return absl::StrJoin(absl::StrSplit(sv, ' ', absl::SkipWhitespace{}), " ");
}
$endgroup$
std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };
As Toby mentioned, if you plan on copying and mutating the copy locally, you should pass the parameter str
by value. It should also be noted that result
will have the same capacity as str
and won't shrink to fit automatically (or take advantage of SSO if allocated).
std::isspace(static_cast<unsigned>(ch))
Be aware that std::isspace
removes spaces (' '
), whitespaces ('n', 'v', 'f', 'r'
), and tabs ('t'
).
You should cast to unsigned char
.
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '')
Is that standards-compliant?
Using the built-in subscript operator on a pointer is standards compliant. From the C++17 standard (n4659), Postfix Expressions § 8.2.1 Subscripting:
A postfix expression followed by an expression in square brackets is a postfix expression. One of the expressions shall be a glvalue of type “array of
T
” or a prvalue of type “pointer toT
” and the other shall be a prvalue of unscoped enumeration or integral type.
When accessing memory out of bounds via the built-in subscript operator, the behavior is undefined. A well-defined approach would be to track the next index and access the next element using std::string::operator[]
(element at size()
returns CharT{}
). std::string
is not a null-terminated sequence and considers the null character (CharT{}
) to be a valid character within a sequence.
using namespace std::string_literals;
std::string str = "ab"s;
std::cout << str << 'n'; // prints "ab"
For a standard library solution on removing duplicates, I would simply pass the predicate to std::unique
. No pointer arithmetic is necessary. Just pass it a binary predicate that checks if both characters are whitespaces:
#include <algorithm>
#include <cctype>
#include <string>
std::string remove_excessive_ws(std::string s)
{
static auto const space_space =
[](unsigned char a, unsigned char b) {
return std::isspace(a) && std::isspace(b);
};
s.erase(std::unique(s.begin(), s.end(), space_space), s.end());
// trim final space
if (!s.empty() && std::isspace(static_cast<unsigned char>(s.back()))) {
s.pop_back();
}
return s;
}
Your function leaves a leading whitespace if one or more exists. Is this intended? Should there be a common character (single space) to merge the different characters std::isspace
catches? If the ultimate goal was to trim all outer whitespace and join non-whitespace tokens with single spaces, I would use abseil's absl::StrSplit()
and absl::StrJoin()
. The resulting string would either take advantage of SSO if small enough or use a more appropriate capacity.
// remove_excess_whitespace
//
// Trims leading and trailing space, whitespace, and tab characters
// such that the resulting string is single space separated.
std::string remove_excess_whitespace(absl::string_view sv) {
return absl::StrJoin(absl::StrSplit(sv, ' ', absl::SkipWhitespace{}), " ");
}
edited 2 days ago
Toby Speight
25.2k741116
25.2k741116
answered 2 days ago
SnowhawkSnowhawk
5,31211128
5,31211128
$begingroup$
Be aware that std::isspace removes spaces (' '), whitespaces ('n', 'v', 'f', 'r'), and tabs ('t') – that's intended.
$endgroup$
– Swordfish
2 days ago
$begingroup$
As for std::string, this behavior is well-defined – yes, but what aboutremove_if()
, is it guaranteed that it won't copy?
$endgroup$
– Swordfish
2 days ago
$begingroup$
std::remove_if
copies individual characters that do not satisfy the predicate. The well-defined behavior I was talking about is accessingstr[str.size()]
. So, instead of checking&ch[1] == ''
, keep an index to the next variable and usestr[index]
(or better, usestd::unique
and don't worry about subscript access for adjacent elements).
$endgroup$
– Snowhawk
2 days ago
$begingroup$
I was talking aboutremove_if()
probably copying the element before passing it to the predicate.
$endgroup$
– Swordfish
2 days ago
$begingroup$
Compilers are pretty good at inlining arguments and lambda predicates. I would be concerned with unnecessary heap allocations.
$endgroup$
– Snowhawk
2 days ago
|
show 1 more comment
$begingroup$
Be aware that std::isspace removes spaces (' '), whitespaces ('n', 'v', 'f', 'r'), and tabs ('t') – that's intended.
$endgroup$
– Swordfish
2 days ago
$begingroup$
As for std::string, this behavior is well-defined – yes, but what aboutremove_if()
, is it guaranteed that it won't copy?
$endgroup$
– Swordfish
2 days ago
$begingroup$
std::remove_if
copies individual characters that do not satisfy the predicate. The well-defined behavior I was talking about is accessingstr[str.size()]
. So, instead of checking&ch[1] == ''
, keep an index to the next variable and usestr[index]
(or better, usestd::unique
and don't worry about subscript access for adjacent elements).
$endgroup$
– Snowhawk
2 days ago
$begingroup$
I was talking aboutremove_if()
probably copying the element before passing it to the predicate.
$endgroup$
– Swordfish
2 days ago
$begingroup$
Compilers are pretty good at inlining arguments and lambda predicates. I would be concerned with unnecessary heap allocations.
$endgroup$
– Snowhawk
2 days ago
$begingroup$
Be aware that std::isspace removes spaces (' '), whitespaces ('n', 'v', 'f', 'r'), and tabs ('t') – that's intended.
$endgroup$
– Swordfish
2 days ago
$begingroup$
Be aware that std::isspace removes spaces (' '), whitespaces ('n', 'v', 'f', 'r'), and tabs ('t') – that's intended.
$endgroup$
– Swordfish
2 days ago
$begingroup$
As for std::string, this behavior is well-defined – yes, but what about
remove_if()
, is it guaranteed that it won't copy?$endgroup$
– Swordfish
2 days ago
$begingroup$
As for std::string, this behavior is well-defined – yes, but what about
remove_if()
, is it guaranteed that it won't copy?$endgroup$
– Swordfish
2 days ago
$begingroup$
std::remove_if
copies individual characters that do not satisfy the predicate. The well-defined behavior I was talking about is accessing str[str.size()]
. So, instead of checking &ch[1] == ''
, keep an index to the next variable and use str[index]
(or better, use std::unique
and don't worry about subscript access for adjacent elements).$endgroup$
– Snowhawk
2 days ago
$begingroup$
std::remove_if
copies individual characters that do not satisfy the predicate. The well-defined behavior I was talking about is accessing str[str.size()]
. So, instead of checking &ch[1] == ''
, keep an index to the next variable and use str[index]
(or better, use std::unique
and don't worry about subscript access for adjacent elements).$endgroup$
– Snowhawk
2 days ago
$begingroup$
I was talking about
remove_if()
probably copying the element before passing it to the predicate.$endgroup$
– Swordfish
2 days ago
$begingroup$
I was talking about
remove_if()
probably copying the element before passing it to the predicate.$endgroup$
– Swordfish
2 days ago
$begingroup$
Compilers are pretty good at inlining arguments and lambda predicates. I would be concerned with unnecessary heap allocations.
$endgroup$
– Snowhawk
2 days ago
$begingroup$
Compilers are pretty good at inlining arguments and lambda predicates. I would be concerned with unnecessary heap allocations.
$endgroup$
– Snowhawk
2 days ago
|
show 1 more comment
$begingroup$
It's about accessing
(&ch)[1]
. I think it should be legal, but I am not sure.
Yes, it is legal as long as you have null terminated string, i.e. the size
-th character of the string is the null character. The reason is that std::string::end()
returns an iterator that is one short of the element that holds the terminating null character when the string object does have a terminating null character. Hence, (&ch)[1]
will not access anything beyond the terminating null character. You can verify that by printing the value of &ch
as a debugging guide.
Here's a updated version of your posted code that prints additional debugging info.
#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>
std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };
std::for_each(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
});
std::cout << "==========================n";
auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})
};
result.erase(end, std::end(result));
return result;
}
int main()
{
char const *foo{ "Hello World! "};
std::string res = remove_excessive_ws(foo);
std::cout << "n"" << res << ""n";
}
Here's its output.
pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8
==========================
pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8
"Hello World!"
Response to OP's comment
std::string::end()
returns std::string::iterator()
, which is a LegacyRandomAccessIterator. Please see the Member types of std::string
.
LegacyRandomAccessIterator satisfies the requirements of LegacyBidirectionalIterator.
LegacyBidirectionalIterator satisfies the requirments of LegacyForwardIterator.
ForwardIterator satisfies the requirements of LegacyInputIterator.
When a LegacyInputIterator is dereferenced, it evaluates to a reference
. Now, you have to go back to std::string
to see what that means. As you can expect, it is a reference and not a copy.
You can also use a non-const
reference to be doubly sure that you get a reference and not a copy.
auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})
$endgroup$
1
$begingroup$
You can verify that by [...] – Standard reference? Also, Where is written thatfind_if()
won't hand me a copy of thechar
instead a reference to it?
$endgroup$
– Swordfish
2 days ago
1
$begingroup$
Whether it's legal to dereferencestr.end()
depends on the standards version; only since C++11 hasstr.data()
been required to have a null-terminator.
$endgroup$
– Toby Speight
2 days ago
$begingroup$
@TobySpeight, agree in principle. However, if the string object is a null terminated string, my answer is valid. I update the answer to make that clearer.
$endgroup$
– R Sahu
yesterday
add a comment |
$begingroup$
It's about accessing
(&ch)[1]
. I think it should be legal, but I am not sure.
Yes, it is legal as long as you have null terminated string, i.e. the size
-th character of the string is the null character. The reason is that std::string::end()
returns an iterator that is one short of the element that holds the terminating null character when the string object does have a terminating null character. Hence, (&ch)[1]
will not access anything beyond the terminating null character. You can verify that by printing the value of &ch
as a debugging guide.
Here's a updated version of your posted code that prints additional debugging info.
#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>
std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };
std::for_each(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
});
std::cout << "==========================n";
auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})
};
result.erase(end, std::end(result));
return result;
}
int main()
{
char const *foo{ "Hello World! "};
std::string res = remove_excessive_ws(foo);
std::cout << "n"" << res << ""n";
}
Here's its output.
pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8
==========================
pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8
"Hello World!"
Response to OP's comment
std::string::end()
returns std::string::iterator()
, which is a LegacyRandomAccessIterator. Please see the Member types of std::string
.
LegacyRandomAccessIterator satisfies the requirements of LegacyBidirectionalIterator.
LegacyBidirectionalIterator satisfies the requirments of LegacyForwardIterator.
ForwardIterator satisfies the requirements of LegacyInputIterator.
When a LegacyInputIterator is dereferenced, it evaluates to a reference
. Now, you have to go back to std::string
to see what that means. As you can expect, it is a reference and not a copy.
You can also use a non-const
reference to be doubly sure that you get a reference and not a copy.
auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})
$endgroup$
1
$begingroup$
You can verify that by [...] – Standard reference? Also, Where is written thatfind_if()
won't hand me a copy of thechar
instead a reference to it?
$endgroup$
– Swordfish
2 days ago
1
$begingroup$
Whether it's legal to dereferencestr.end()
depends on the standards version; only since C++11 hasstr.data()
been required to have a null-terminator.
$endgroup$
– Toby Speight
2 days ago
$begingroup$
@TobySpeight, agree in principle. However, if the string object is a null terminated string, my answer is valid. I update the answer to make that clearer.
$endgroup$
– R Sahu
yesterday
add a comment |
$begingroup$
It's about accessing
(&ch)[1]
. I think it should be legal, but I am not sure.
Yes, it is legal as long as you have null terminated string, i.e. the size
-th character of the string is the null character. The reason is that std::string::end()
returns an iterator that is one short of the element that holds the terminating null character when the string object does have a terminating null character. Hence, (&ch)[1]
will not access anything beyond the terminating null character. You can verify that by printing the value of &ch
as a debugging guide.
Here's a updated version of your posted code that prints additional debugging info.
#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>
std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };
std::for_each(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
});
std::cout << "==========================n";
auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})
};
result.erase(end, std::end(result));
return result;
}
int main()
{
char const *foo{ "Hello World! "};
std::string res = remove_excessive_ws(foo);
std::cout << "n"" << res << ""n";
}
Here's its output.
pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8
==========================
pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8
"Hello World!"
Response to OP's comment
std::string::end()
returns std::string::iterator()
, which is a LegacyRandomAccessIterator. Please see the Member types of std::string
.
LegacyRandomAccessIterator satisfies the requirements of LegacyBidirectionalIterator.
LegacyBidirectionalIterator satisfies the requirments of LegacyForwardIterator.
ForwardIterator satisfies the requirements of LegacyInputIterator.
When a LegacyInputIterator is dereferenced, it evaluates to a reference
. Now, you have to go back to std::string
to see what that means. As you can expect, it is a reference and not a copy.
You can also use a non-const
reference to be doubly sure that you get a reference and not a copy.
auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})
$endgroup$
It's about accessing
(&ch)[1]
. I think it should be legal, but I am not sure.
Yes, it is legal as long as you have null terminated string, i.e. the size
-th character of the string is the null character. The reason is that std::string::end()
returns an iterator that is one short of the element that holds the terminating null character when the string object does have a terminating null character. Hence, (&ch)[1]
will not access anything beyond the terminating null character. You can verify that by printing the value of &ch
as a debugging guide.
Here's a updated version of your posted code that prints additional debugging info.
#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>
std::string remove_excessive_ws(std::string const &str)
{
std::string result{ str };
std::for_each(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
});
std::cout << "==========================n";
auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type const &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void const*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})
};
result.erase(end, std::end(result));
return result;
}
int main()
{
char const *foo{ "Hello World! "};
std::string res = remove_excessive_ws(foo);
std::cout << "n"" << res << ""n";
}
Here's its output.
pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8
==========================
pointer value: 0x600012be8
pointer value: 0x600012be9
pointer value: 0x600012bea
pointer value: 0x600012beb
pointer value: 0x600012bec
pointer value: 0x600012bed
pointer value: 0x600012bee
pointer value: 0x600012bef
pointer value: 0x600012bf0
pointer value: 0x600012bf1
pointer value: 0x600012bf2
pointer value: 0x600012bf3
pointer value: 0x600012bf4
pointer value: 0x600012bf5
pointer value: 0x600012bf6
pointer value: 0x600012bf7
pointer value: 0x600012bf8
"Hello World!"
Response to OP's comment
std::string::end()
returns std::string::iterator()
, which is a LegacyRandomAccessIterator. Please see the Member types of std::string
.
LegacyRandomAccessIterator satisfies the requirements of LegacyBidirectionalIterator.
LegacyBidirectionalIterator satisfies the requirments of LegacyForwardIterator.
ForwardIterator satisfies the requirements of LegacyInputIterator.
When a LegacyInputIterator is dereferenced, it evaluates to a reference
. Now, you have to go back to std::string
to see what that means. As you can expect, it is a reference and not a copy.
You can also use a non-const
reference to be doubly sure that you get a reference and not a copy.
auto end{ std::remove_if(std::begin(result), std::end(result),
[](std::string::value_type &ch)
{
std::cout << "pointer value: " << reinterpret_cast<void*>(&ch) << std::endl;
return std::isspace(static_cast<unsigned>(ch)) &&
(std::isspace(static_cast<unsigned>((&ch)[1])) || (&ch)[1] == '');
})
edited yesterday
answered 2 days ago
R SahuR Sahu
3,274617
3,274617
1
$begingroup$
You can verify that by [...] – Standard reference? Also, Where is written thatfind_if()
won't hand me a copy of thechar
instead a reference to it?
$endgroup$
– Swordfish
2 days ago
1
$begingroup$
Whether it's legal to dereferencestr.end()
depends on the standards version; only since C++11 hasstr.data()
been required to have a null-terminator.
$endgroup$
– Toby Speight
2 days ago
$begingroup$
@TobySpeight, agree in principle. However, if the string object is a null terminated string, my answer is valid. I update the answer to make that clearer.
$endgroup$
– R Sahu
yesterday
add a comment |
1
$begingroup$
You can verify that by [...] – Standard reference? Also, Where is written thatfind_if()
won't hand me a copy of thechar
instead a reference to it?
$endgroup$
– Swordfish
2 days ago
1
$begingroup$
Whether it's legal to dereferencestr.end()
depends on the standards version; only since C++11 hasstr.data()
been required to have a null-terminator.
$endgroup$
– Toby Speight
2 days ago
$begingroup$
@TobySpeight, agree in principle. However, if the string object is a null terminated string, my answer is valid. I update the answer to make that clearer.
$endgroup$
– R Sahu
yesterday
1
1
$begingroup$
You can verify that by [...] – Standard reference? Also, Where is written that
find_if()
won't hand me a copy of the char
instead a reference to it?$endgroup$
– Swordfish
2 days ago
$begingroup$
You can verify that by [...] – Standard reference? Also, Where is written that
find_if()
won't hand me a copy of the char
instead a reference to it?$endgroup$
– Swordfish
2 days ago
1
1
$begingroup$
Whether it's legal to dereference
str.end()
depends on the standards version; only since C++11 has str.data()
been required to have a null-terminator.$endgroup$
– Toby Speight
2 days ago
$begingroup$
Whether it's legal to dereference
str.end()
depends on the standards version; only since C++11 has str.data()
been required to have a null-terminator.$endgroup$
– Toby Speight
2 days ago
$begingroup$
@TobySpeight, agree in principle. However, if the string object is a null terminated string, my answer is valid. I update the answer to make that clearer.
$endgroup$
– R Sahu
yesterday
$begingroup$
@TobySpeight, agree in principle. However, if the string object is a null terminated string, my answer is valid. I update the answer to make that clearer.
$endgroup$
– R Sahu
yesterday
add a comment |
$begingroup$
It's not standard (but you should have asked this question on stack overflow instead, code review is for working code).
std:string
is not required to end with a null character and dereferencing the end iterator is undefined behavior; if your input ends with a space, it's precisely what you'll end up doing.$endgroup$
– papagaga
Feb 26 at 8:25
$begingroup$
I don't think
std::remove_if()
is allowed to make a copy of each element - if it did, it couldn't (sensibly) work with containers of move-only values.$endgroup$
– Toby Speight
2 days ago
$begingroup$
@TobySpeight Yay, that's reasonable! :)
$endgroup$
– Swordfish
yesterday
1
$begingroup$
Please clarify, does or doesn't the current code work as intended?
$endgroup$
– Mast
yesterday