How to deal with possible data leakage in time series data?
I have historical data on consumers who took out a loan at some point in time. The task is to predict, when a consumer requests a loan, whether that consumer will default.
My issue is that for some customers in the data set, historical transactions are only available after the loan was issued. I believe using data from after the loan event for prediction will cause data leakage.
This is a subtle leakage because it does not involve using information not available at prediction time. My concern is more about the behavioral change once the customer is indebted, which creates a shift in the underlying distribution.
To test my hypothesis, I was wondering whether a good approach would be to check if the two samples, before and after the loan is issued, come from the same distribution.
These are my questions:
Is there really data leakage in the scenario I described?
If yes, can I test for it in any way?
Can a two-sample test provide an answer? Which one? Note that the sample is composed of multivariate data.
Can I do the testing using a machine learning approach? I was thinking of using a mixture model, for instance.
Any suggestions on how best to deal with this problem, other than what I suggested, will be appreciated.
Thanks
machine-learning data-leakage
edited Feb 15 at 2:03
irkinosor
asked Feb 14 at 12:14
irkinosor
1 Answer
There is no need for sample tests.
A customer may already have received several loans, numbered 1 to n - 1. To predict the default rate of the nth request at time t(n), you are allowed to use any information up until t(n). When a user has no transaction history before t(1), the system cannot predict the default rate for her, except maybe based on her age, income, etc. However, for the next loan request at t(2), the system can use the transactions between t(1) and t(2), but still cannot use any transaction that happened after t(2). For any particular prediction at t(n), events that happened after t(n) must never be used.
Regarding "it does not involve using information not available at prediction time": it does involve using information not available at prediction time t(n), since the system is trying to utilize transactions that occur after t(n).
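The cutoff rule in this answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the answerer's code: the `Transaction` record, the `features_at_request` helper, the choice of aggregate features, and the sample dates and amounts are all hypothetical, and the cutoff is taken as strictly before the request time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class Transaction:
    # Hypothetical record layout; real data will have more fields.
    customer_id: str
    timestamp: datetime
    amount: float


def features_at_request(
    transactions: List[Transaction],
    customer_id: str,
    request_time: datetime,
) -> Dict[str, float]:
    """Build features from transactions strictly before request_time.

    Anything at or after the loan request t(n) is ignored, so later
    behavior cannot leak into the prediction made at t(n).
    """
    history = [
        tx
        for tx in transactions
        if tx.customer_id == customer_id and tx.timestamp < request_time
    ]
    total = sum(tx.amount for tx in history)
    return {
        "n_transactions": float(len(history)),
        "total_amount": float(total),
        "mean_amount": total / len(history) if history else 0.0,
    }


# Made-up example: one customer, two loan requests.
transactions = [
    Transaction("c1", datetime(2018, 1, 10), 50.0),
    Transaction("c1", datetime(2018, 3, 5), 150.0),
    Transaction("c1", datetime(2018, 6, 20), 400.0),  # after the second request
]

t1 = datetime(2018, 1, 1)  # first request: no history yet
t2 = datetime(2018, 4, 1)  # second request: history between t(1) and t(2)

first = features_at_request(transactions, "c1", t1)
second = features_at_request(transactions, "c1", t2)
print(first)
print(second)
```

For the first request the history is empty, so only non-transaction attributes such as age or income would be usable. For the second request the two transactions between t(1) and t(2) are used, and the one after t(2) is excluded.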
answered 12 mins ago
P. Esmailian