Verify two numbers from a list will add up to a number in the same list using Python
I am trying to solve this question:
- Given a list of integers, verify whether any two numbers from this list add up to another number in the same list.
I came up with:
def func(l):
    sum_set = set()  # using a set to store sums of two numbers.
    for index in range(len(l)-1):
        counter = 1
        while counter + index < len(l):
            sum = l[index] + l[index+counter]
            sum_set.add(sum)
            counter += 1
    for number in l:
        if number in sum_set:
            return True
    return False

print(func([1,2,3,4]))  # True  (1 + 2 = 3)
print(func([1,2,1,4]))  # True  (1 + 1 = 2)
print(func([1,1,1]))    # False
print(func([1]))        # False
print(func([1,2,5,9]))  # False
print(func([2,1,1]))    # True  (1 + 1 = 2)
print(func([1,-1,0]))   # True  (1 + -1 = 0)
From the tests above, my function appears to work. But there is at least one way in which I can see my approach is lacking:
- I should filter the original list to get rid of duplicate numbers. If a number appears more than twice, the extra occurrences should be disregarded. E.g.
l = [1,1,1,1,2]
should be optimised to l = [1,1,2]
Are there any other aspects of my code that I can improve to make it more Pythonic or more efficient?
asked Apr 2 at 5:45 by Yu Zhang
1 Answer
The thing to be concerned with when writing functions like this is the size of the data it's going to be processing. If you're only ever going to be dealing with small sets of data (less than say 100 items) then there really isn't an issue with the way this is written. At the scale of maybe tens of thousands of items this starts to be a problem.
I'm not a data scientist or anything, I'm just a web developer and I very rarely delve into optimising code, so any advice I'm giving is based on my personal experience + uni, and doesn't use maths libraries in Python like numpy, which you might want to look into depending on what this code is for. Another answerer could likely help more with this specific circumstance; my answer is going to be a bit more general, assuming you're going to be dealing with large amounts of data (otherwise, premature optimisation is sometimes considered the root of all evil):
EDIT: I just noticed that one of your examples includes negative numbers, so the example code below will need to be adapted with that in mind, since it affects the movement of i and j; sorry about that, but the code provided can still be run as an example with positive numbers.
Ignoring the data types used for the most part & how Pythonic the code is, I see two main areas for improvement. The first I'll touch on at the end in a code example, and how helpful it is may depend a bit on your particular data set; but presently, the code has no chance of finishing early, and with random inputs an early exit would give you a substantial speedup, especially if your data is likely to find a match in the early items of the list.
However, the primary areas for improvement can be found by analysing your code using Big-O notation. I'd suggest reading up on the concept if you're unfamiliar with it, but according to Wikipedia:
big O notation is used to classify algorithms according to how their running time or space requirements grow as the input size grows
Given the following section of your code:
for index in range(len(l)-1):          # (1)
    ...
    while counter + index < len(l):    # (2)
        ...
        sum_set.add(sum)               # (3)
        ...
for number in l:                       # (4)
If we look at, for instance, a 100,000-element array containing the numbers between 1 and 100,000 and run your code as written, we end up running the sum_set.add(sum) statement (3) about 4.99 billion times. At just 10,000 elements, on my machine, the code as written takes multiple seconds to finish.
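As a quick sanity check on that figure (my own aside, not part of the original answer), statement (3) runs once per unordered pair of indices, i.e. N*(N-1)/2 times:

def pair_count(n):
    # Number of unordered index pairs (i, j) with i < j in an n-element list.
    return n * (n - 1) // 2

print(pair_count(100000))  # 4999950000 -> ~4.99 billion additions at 100,000 elements
print(pair_count(10000))   # 49995000   -> ~50 million additions at 10,000 elements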
Looking at (1), we see this statement runs through all elements in the list, so it is O(N); the time taken by the outer loop has a linear relationship to the input, e.g. ignoring any constant-time overhead, an input array of 200 elements should take roughly 200x longer than an array with 1 element. Line (2) passes through N-1 elements on the first iteration, then N-2, ... and finally 1 item on the last; that averages out at half as many iterations as the outer loop, but it is still O(N), as it's linearly related to the number of items in the list. As you have an O(N) loop within an O(N) loop, this gives the overall O(N^2) performance.
(4) is tricky to estimate, as it depends on the passed-in data. Iterating through the list l is O(N) again, and if we assume the worst case, each element in sum_set could be unique, i.e. if the passed-in array was something like [1, 11, 111, ...], there would be ~N^2 terms in sum_set, causing this loop to degrade towards O(N^3) performance. Best case, sum_set is very small, but even assuming only 1 item, the loop would still be O(N), as we need to touch each element in l. Additionally, sum_set could potentially become very large, making the loop expensive not only in time but possibly also in memory (although, as you are using a set, there aren't going to be any duplicates, so it will depend entirely on the input data; e.g. with 100,000 elements whose values range between 0 & 100, sum_set only ranges between 0 & 200).
I'd say your suggestion of pre-filtering to remove duplicates is a good idea, ideally something O(N) like the following (though there are likely more optimal approaches):
>>> import collections
>>> input_list = [1,1,1,1,2,2,2,3,3,3]
>>> filtering_dict = collections.defaultdict(int)
>>> for item in input_list:
...     filtering_dict[item] += 1
...
>>> newList = []
>>> for key, value in filtering_dict.items():
...     newList.append(key)
...     if value > 1:
...         newList.append(key)
...
>>> newList
[1, 1, 2, 2, 3, 3]
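A shorter way to express the same O(N) capping, assuming collections.Counter is acceptable (the cap_duplicates helper below is my own illustration, not part of the original answer):

import collections

def cap_duplicates(values, cap=2):
    # Keep at most `cap` occurrences of each distinct value, preserving first-seen order.
    counts = collections.Counter(values)
    capped = []
    for value, count in counts.items():
        capped.extend([value] * min(count, cap))
    return capped

print(cap_duplicates([1, 1, 1, 1, 2, 2, 2, 3, 3, 3]))  # [1, 1, 2, 2, 3, 3]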
I'd then try to take advantage of sorting the array, using an O(N log(N)) sort like Mergesort / Quicksort, or, depending on your data, an O(N) sort like Counting Sort. If you know your data is going to be ordered, you can skip this step. With sorted data, we don't have to use the sum_set set; we can instead pick an index in the array & determine whether it is the total of two other elements. We know that any element we suspect to be our sum has to be made up of elements at lower indexes in the list; i.e. for [1, 2, 3, 4, 5], if we start looking at 3, we know we don't need to consider elements 4 & 5, as they are larger than 3 and so couldn't possibly sum to it. Finally, the halfway point for a number is also relevant; e.g. in [1, 3, 5, 7, 9, 11, 99, 117], if we're looking at 99, we start by trying to add the next-lowest element & the first element, but since 11 < 99/2 we know we won't be able to find a match that adds to 99. On average this should be another speedup, assuming the data isn't perfectly uniform.
Finally, since we aren't pushing results into sum_set & are only checking once for each total, there will be some repetition in our search. However, since we can return immediately upon finding a match, our best/average case just got a lot better.
def func2(l):
    # l = filter_inputs(l)
    # l.sort()
    for index in range(2, len(l)):
        i = 0                  # pointer starting at the smallest element
        j = index - 1          # pointer starting just below l[index]
        half_val = l[index] / 2
        while i < j and l[i] <= half_val and l[j] >= half_val:
            if l[index] > l[i] + l[j]:
                i = i + 1
            elif l[index] < l[i] + l[j]:
                j = j - 1
            else:
                print(str(l[i]) + " + " + str(l[j]) + " = " + str(l[index]))
                return True
    return False
Using timeit to compare func & func2, with code like the following:
from timeit import timeit

timeit('func2(<TEST ARRAY>)', setup="from __main__ import func2; import random", number=20)
# Use the function that's been predefined in the python interpreter,
# pass the array, repeat the test 20 times, and output how many seconds were taken.
# All times listed are in seconds for 20 repeats.

# List of all positive odd numbers up to 9,999 == [x for x in range(1,10000) if x % 2 == 1]
# (absolute worst-case condition for func2, behaves nearly identically to your original function)
# (this is because i, j, and half_val achieve 0% improvement & every value is checked)
#   func2           func
>>> 73.89       >>> 73.86

# All integers between 1 & 9,999 == [x for x in range(1,10000)]
#   func2           func
>>> 0.02        >>> 297.54

# 9,999 random integers between 1 & 10,000 == [random.randint(0,10000) for x in range(1,10000)]
# Re-ran this one about 5 times for func2 since it was so quick;
# with 20 loops its lowest was 0.25 & highest 0.32 seconds taken.
# You'll also need to sort 'l' for this one to work with func2.
#   func2           func
>>> ~0.3        >>> 312.83
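For reference, a self-contained sketch of how the third benchmark might be reproduced (not the answer's exact harness; the data variable and exact numbers here are assumptions, and timings will differ by machine):

from timeit import timeit
import random

# Hypothetical reproduction of the random-integer benchmark; assumes func2 is
# already defined in the current session. The list is sorted so func2's
# two-pointer scan is valid.
data = sorted(random.randint(0, 10000) for _ in range(9999))
print(timeit('func2(data)', globals=globals(), number=20))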
Again, with a low number of entries in l, the cost of removing duplicates & sorting the array would probably cause my function to run slower than yours. None of these speedups change the fact that the overall operation is worst-case O(N^2); however, they should drastically improve the best/average-case scenarios. Additionally, getting a large average speedup on an O(N^2) operation is huge when it comes to a big dataset, as it will be the limiting factor:
E.g. 100,000 items:
Filtering = ~2x O(N) = ~200,000 ops
Mergesort = O(NLog(N)) = ~1.6 million ops
Main Loop = 1/2 O(N^2) = ~5 billion ops
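Those estimates roughly check out; a back-of-the-envelope sketch (mine, not the original answer's) also shows how heavily the main loop dominates:

import math

n = 100000
filtering = 2 * n                   # two O(N) passes: ~200,000 ops
sorting = n * math.log2(n)          # ~1.66 million comparisons for a mergesort
main_loop = n * n / 2               # ~5.0 billion pair checks
print(filtering, round(sorting), round(main_loop))
print(round(main_loop / sorting))   # ~3000x: the O(N^2) loop is the limiting factor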
If you can come up with a better way to take advantage of the data such that you can get O(N^2) down to O(Nlog(N)) or similar, I think that'd be key here for optimizing the worst-case scenario.
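As the EDIT above notes, func2 as written assumes the sorted values are non-negative. A minimal set-based fallback that handles negative numbers while keeping the early return (my own sketch, not part of the original answer) could look like:

def func3(l):
    # O(N^2) pairwise check; set membership gives O(1) average-time lookups,
    # and no assumptions are made about the sign or ordering of the input.
    values = set(l)
    for i in range(len(l) - 1):
        for j in range(i + 1, len(l)):
            if l[i] + l[j] in values:
                return True
    return False

print(func3([1, -1, 0]))    # True  (1 + -1 = 0)
print(func3([1, 2, 5, 9]))  # False

It gives up the halfway-point pruning that func2 relies on, so it is only meant as a fallback for data where negative values actually occur.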
answered Apr 2 at 15:31 by Lovethenakedgun (edited Apr 2 at 16:13)
Comment from Yu Zhang (Apr 3 at 10:49): Thank you so much for your huge effort. Much appreciated.