IDA 2016 Challenge ------------------- ++ About the data + Introduction The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure system (APS) which generates pressurised air that are utilized in various functions in a truck, such as braking and gear changes. The datasets' positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures for components not related to the APS. The data consists of a subset of all available data, selected by experts. + Attributes The attribute names of the data have been anonymized for proprietary reasons. It consists of both single numerical counters and histograms consisting of bins with different conditions. Typically the histograms have open-ended conditions at each end. For example if we measuring the ambient temperature "T" then the histogram could be defined with 4 bins where: bin 1 collect values for temperature T < -20 bin 2 collect values for temperature T >= -20 and T < 0 bin 3 collect values for temperature T >= 0 and T < 20 bin 4 collect values for temperature T > 20 | b1 | b2 | b3 | b4 | ----------------------------- -20 0 20 The attributes are as follows: class, then anonymized operational data. The operational data have an identifier and a bin id, like "Identifier_Bin". In total there are 171 attributes, of which 7 are histogram variabels. Missing values are denoted by "na". + Data characteristics The data comes in two files, one training set which contain the class labels and a test set which has an additional id number for each example and "na" as class label. Both of these sets are sampled from the same source with stratification regarding the class, hence they should contain roughly the same distribution of faulty and non-faulty components. The training set contains 60000 examples in total in which 59000 belong to the negative class and 1000 positive class. The test set contains 16000 examples. ++ Participating + Challenge The challenge task is to come up with a good prediction model for judging whether or not a vehicle faces imminent failure of the specific component or not. The contestants should classify the test set, and send their predictions to "ida2016challenge@dsv.su.se". The prediction should be sent as an attachment, in at "txt" file format, each row in the file should be in the following format: id_1, predicted_class_1 id_2, predicted_class_2 ... id_n, predicted_class_n where the "id_1" should refer to an examples id of the test_set followed by the models prediction for this instance. Further must the competitors also write a paper describing the methods they used when creating their predictive model. It is encouraged that this paper is written in a clear and inspirational style in which they motivate their decision when developing their model. The papers should use the same template as the other papers in the conference, see, "ida2016.blogs.dsv.su.se" for more details. + Awards The top contenders' papers will be published in the IDA 2016 proceedings and will be asked to hold an oral presentation at the conference. Scania who supplied the data, also sponsor a prize of 500 € divided into two awards. The two awards are: "The best result", which goes to the authors/creators of the prediction model which achieves the lowest cost (see metric below) of all participants on the given test set. And the "most innovative and interesting" approach when addressing the challenge. Note that these awards are not mutually exclusive so one approach could win them both. Deadline for sending in classifications and describing paper is 2016-07-01. + Challenge metric The contestants will be measure according to the following cost-metric of miss-classification: Predicted class | True class | | pos | neg | ----------------------------------------- | | | pos | - | Cost_1 | ----------------------------------------- | | | neg | Cost_2 | - | ----------------------------------------- Cost_1 = 10 and cost_2 = 500 The total cost of a prediction model the sum of "Cost_1" multiplied by the number of Instances with type 1 failure and "Cost_2" with the number of instances with type 2 failure, resulting in a "Total_cost". In this case Cost_1 refers to the cost that an unnessecary check needs to be done by an mechanic at an workshop, while Cost_2 refer to the cost of missing a faulty truck, which may cause a breakdown. Total_cost = Cost_1*No_Instances + Cost_2*No_Instances. If you have any question regarding the challenge, please contact: ida2016challenge@dsv.su.se Good luck! Tony Lindgren, Industrial Challenge Chair.