Model to predict based on frequency of occurrence
I have the following dataset:
+-----------+----------+
| Passenger | Trip     |
+-----------+----------+
| John      | London   |
| Jack      | Paris    |
| Joe       | Sydney   |
| John      | London   |
| John      | London   |
| Jill      | New york |
| Jim       | Sydney   |
| Jack      | Paris    |
| James     | Sydney   |
+-----------+----------+
I am trying to use the scikit-learn library to predict each passenger's most likely next trip based on frequency (in this case, John => London).
As a novice, I am unsure which model or function to use.
Update 2:
If I have over 10 million records, how differently should I approach this problem?
Update 3:
The following code worked for the larger dataset:
# For each passenger, keep only their single most frequent destination
series_px = df_px_dest.groupby('Passenger')['Trip'].apply(lambda x: x.value_counts().head(1))
df_px = series_px.to_frame()
df_px.index = df_px.index.set_names(['UID', 'DEST'])
df_px.reset_index(inplace=True)

def getNextPossibleDestByUserID(name, df=df_px):
    # Look up the precomputed top destination for the given passenger
    return df.query('UID == @name')['DEST'].to_string(index=False)
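To check the lookup end to end, here is a self-contained run on the small table above. The construction of `df_px_dest` is my assumption (in the update it comes from the larger dataset); the rest mirrors the code in the update:

```python
import pandas as pd

# Rebuild the question's toy table (this construction is an assumption;
# the real df_px_dest comes from the larger dataset)
df_px_dest = pd.DataFrame({
    'Passenger': ['John', 'Jack', 'Joe', 'John', 'John', 'Jill', 'Jim', 'Jack', 'James'],
    'Trip': ['London', 'Paris', 'Sydney', 'London', 'London', 'New york', 'Sydney', 'Paris', 'Sydney'],
})

# Keep each passenger's single most frequent destination
series_px = df_px_dest.groupby('Passenger')['Trip'].apply(lambda x: x.value_counts().head(1))
df_px = series_px.to_frame()
df_px.index = df_px.index.set_names(['UID', 'DEST'])
df_px.reset_index(inplace=True)

def getNextPossibleDestByUserID(name, df=df_px):
    # Look up the precomputed top destination for the given passenger
    return df.query('UID == @name')['DEST'].to_string(index=False)

print(getNextPossibleDestByUserID('John'))  # John's most frequent trip is London
```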
My next target is to expose this as an API (maybe using Flask); I will probably raise a new question for that!
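For the Flask idea, a minimal sketch of what such an endpoint might look like. This is not part of the question: the route name, the response shape, and the toy `df_px` here are all my assumptions.

```python
from flask import Flask, jsonify
import pandas as pd

app = Flask(__name__)

# Toy lookup table standing in for the real precomputed df_px
df_px = pd.DataFrame({'UID': ['John', 'Jack'], 'DEST': ['London', 'Paris']})

@app.route('/next-destination/<name>')
def next_destination(name):
    # Same precomputed-lookup idea as in the update above
    dest = df_px.query('UID == @name')['DEST']
    if dest.empty:
        return jsonify({'error': 'unknown passenger'}), 404
    return jsonify({'passenger': name, 'destination': dest.iloc[0]})

# Run with app.run(), then GET /next-destination/John
```

Because the heavy groupby happens once at startup, each request is just a cheap dataframe lookup.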
scikit-learn machine-learning-model
edited 6 mins ago
asked Feb 22 at 1:44
Maddy
464
1 Answer
For something like this, you could go with a simpler approach. One idea is to sample randomly among the cities a given passenger has visited, using the number of times each city was visited as the sampling probabilities.
Here's one way to do that. I've added a few more rows to the dataframe so that the idea comes through more clearly. Say you instead have:
Passenger Trip
0 John London
1 Jack Girona
2 Jack Paris
3 Joe Sydney
4 Joe Amsterdam
5 Joe Barcelona
6 Joe Barcelona
7 John London
8 John Paris
9 Jill Newyork
10 Jim Sydney
11 Jack Paris
12 James Sydney
You could define a function like the following to randomly sample from the existing data in the dataframe:
import numpy as np

def random_sample(df, name):
    # Group the dataframe by Passenger and count the different trips
    g = df.groupby('Passenger').Trip.value_counts()
    # Normalise so the probabilities add up to 1
    freq = g[name] / g[name].sum()
    # Draw a random destination according to those probabilities
    random_name = np.random.choice(a=freq.index, size=1, p=freq.values)[0]
    # Return the likelihood of the chosen destination, and the destination itself
    return freq[random_name], random_name
Usage
Say we want to sample a random destination for, say, Joe, and also know its likelihood. The destinations where Joe has been are:
Trip
Barcelona 2
Amsterdam 1
Sydney 1
We could get, for example:
for _ in range(5):
    freq, dest = random_sample(df, 'Joe')
    print('Chosen destination {} with a probability of {}'.format(dest, freq))
Chosen destination Sydney with a probability of 0.25
Chosen destination Barcelona with a probability of 0.5
Chosen destination Barcelona with a probability of 0.5
Chosen destination Barcelona with a probability of 0.5
Chosen destination Sydney with a probability of 0.25
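If you want the full probability distribution per passenger rather than a single random draw (closer to the "likelihood" the question asks about), `value_counts(normalize=True)` gives it directly. This is a sketch of my own, not part of the original answer:

```python
import pandas as pd

# Same extended example data as in the answer above
df = pd.DataFrame({
    'Passenger': ['John', 'Jack', 'Jack', 'Joe', 'Joe', 'Joe', 'Joe',
                  'John', 'John', 'Jill', 'Jim', 'Jack', 'James'],
    'Trip': ['London', 'Girona', 'Paris', 'Sydney', 'Amsterdam', 'Barcelona', 'Barcelona',
             'London', 'Paris', 'Newyork', 'Sydney', 'Paris', 'Sydney'],
})

# Normalised counts: each passenger's trips as probabilities summing to 1
probs = df.groupby('Passenger').Trip.value_counts(normalize=True)
print(probs['Joe'])
# Joe: Barcelona 0.5, Amsterdam 0.25, Sydney 0.25
```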
Thanks Alex for your answer. For a smaller subset this works perfectly fine. My next step is to try it with a large dataset (say, over 10 million records). Will the same approach work in that case? P.S. I have also updated my question with this constraint. – Maddy, 2 days ago
For that you could do the groupby only once, before calling the function, and pass it in as an extra argument. Should be fast enough. Let me know :) – yatu, 20 hours ago
Updated my question with the groupby approach! – Maddy, 5 mins ago
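yatu's suggestion of grouping once up front can be sketched like this. It is a minimal sketch under my own naming: `trip_counts` and `random_sample_fast` are hypothetical names, and the toy dataframe is chosen so the draw is deterministic.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Passenger': ['John', 'John', 'John', 'Jack', 'Jack'],
    'Trip': ['London', 'London', 'London', 'Paris', 'Girona'],
})

# Do the expensive groupby once, up front...
trip_counts = df.groupby('Passenger').Trip.value_counts()

# ...and pass the precomputed counts in as an extra argument,
# so each call avoids re-grouping the full dataframe
def random_sample_fast(counts, name):
    freq = counts[name] / counts[name].sum()
    dest = np.random.choice(a=freq.index, size=1, p=freq.values)[0]
    return freq[dest], dest

prob, dest = random_sample_fast(trip_counts, 'John')  # John only ever went to London
```

With 10 million records the groupby cost is paid once; each subsequent call is a cheap index lookup plus one random draw.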
answered Feb 22 at 11:20
yatu
1214
New contributor