GraBaSS - Graph-based Subspace Search

Author(s)
Marco Neumann
Contributor(s)
Böhm, Klemens
Tichy, Walter
Müller, Emmanuel
Nguyen, Hoang Vu

URI
http://hdl.handle.net/20.500.12424/689903
Abstract
The 21st century is the age of information. Every year, the amount of data accessible to analysts grows: more sensor networks are installed, higher resolutions become possible in both the data dimensions and time, more communication travels over the global network, and new methods for measuring health, environmental, social, and financial parameters are developed. Used in the right, ethical way, this data can have a huge impact on our society, pushing global development and enabling new technologies. Not just since "Big Data" became the buzzword of the year and thousands of new companies began selling data-analysis products has it been clear that this amount of data can only be analyzed with fast computers and clever algorithms.

In the last decades, many researchers have developed methods to group data, predict new data, or find anomalies. But more data does not only mean more data points; it also means more dimensions, and so many methods become slow or stop working altogether. This is known as the curse of dimensionality. When the number of dimensions grows, dimensions can be grouped together because they are similar or describe disjoint features. This process is called feature selection, and the resulting groups can be called subspaces. If the data is projected into these subspaces, standard analysis methods can be used again. Researchers have therefore developed methods for feature selection. They deliver good results for a high number of objects and a high number of dimensions, and some of them come with theoretical guarantees, but they share one problem: they are really slow. With cubic or higher complexity in the number of dimensions, they cannot be applied to today's or future data sets. Many of them also cannot run in parallel, which is highly important today and will become essential in the coming years. Another problem is the parameters of the algorithms. If there are too many parameters that interfere with each other and are not intuitive, analysts simply choose default, random, or dummy values and never obtain a good result. It is also very common to have parameters whose ranges depend heavily on the structure of the input data, not only on its size. The ideal would be a few parameters with independent effects and fixed ranges.

So why is this problem relevant, you may ask. Just use a faster computer and more memory, or wait a day, a week, a month, or even longer. It matters because we are wasting our most important resource. It is not water, not energy, not oil, not gold or lithium. It is not knowledge or intelligence. Our most important resource is time, and we are all running out of it. I believe there is a way to get the relevant information faster, at the very moment you need it. Even if ad-hoc data analysis is not possible today, I believe it is possible in the future. And it will change everything: the way we consume information and media, the art of describing our environment and our society, the way people live and communicate, the methods of research, production, planning and design, even the way we think and decide.

As a first, small step toward this future, I propose a new, faster method for feature selection: Graph-Based Subspace Search, or GraBaSS. It may not be as exact or as mathematically proven as some of its competitors, but it is fast, runs in parallel, its parameters are easy to choose, and its results are intuitive. Depending on the chosen parameters, it can be used for automatic or manual data processing. It is also possible to choose how strong the similarity requirement is, depending on the data set and on how the subspaces will be used later.

GraBaSS is built on the insight that in most subspaces all dimensions are similar to each other, which forms a binary relation. In contrast to other approaches, which often find subspaces bottom-up and require an enormous amount of time and space, GraBaSS uses this binary relation to form a graph. The graph is optimized, and the cliques of the resulting graph form the subspaces. This work also discusses how to decide whether two dimensions are similar and builds a framework around that question. The framework can serve further research and makes it possible to modify GraBaSS for special purposes, e.g. to find subspaces with only linear relationships, or subspaces whose dimensions are a particular transformation of each other.

A dedicated chapter of this thesis discusses the implementation of the algorithm. It explains the choice of programming languages and frameworks and gives practical advice for implementing other methods in a high-performance way. The implementation of GraBaSS is provided as open source, so that everyone can use it to process their own data and learn from the code. Together with GraBaSS, a data backend is provided that stores parsed data like a column store while giving low-level access for good performance. It allows the same parsed data set to be reused for other tasks, such as cluster search or outlier detection, after the subspace analysis. Because dimensions are stored separately, no changes to the storage are required if later tasks operate only on a specific subspace.
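The core idea of the abstract, that pairwise-similar dimensions form a graph whose cliques are the subspaces, can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the similarity measure (absolute Pearson correlation), the threshold, and the plain Bron-Kerbosch clique search are all illustrative assumptions.

```python
# Hypothetical sketch of a graph-based subspace search: each dimension is a
# node, dimensions with high pairwise similarity get an edge, and the maximal
# cliques of the graph are reported as candidate subspaces.
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def build_graph(columns, threshold=0.8):
    """Adjacency sets: edge between i and j when |corr| >= threshold."""
    adj = {i: set() for i in range(len(columns))}
    for i, j in combinations(range(len(columns)), 2):
        if abs(pearson(columns[i], columns[j])) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def maximal_cliques(adj):
    """Bron-Kerbosch without pivoting; fine for a handful of dimensions."""
    cliques = []
    def bk(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            bk(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)
    bk(set(), set(adj), set())
    return cliques

# Toy data: dimensions 0 and 1 are almost linearly related; dimension 2 is not.
cols = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.1, 5.9, 8.0, 10.2],
    [5.0, -1.0, 3.5, 0.0, 2.0],
]
subspaces = maximal_cliques(build_graph(cols))
print(sorted(subspaces))  # [[0, 1], [2]]
```

A real implementation replaces the quadratic all-pairs loop and the exponential clique search with the optimizations the thesis describes; this sketch only shows the shape of the approach.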
Date
2013
Type
Software
Identifier
oai:oai.datacite.org:6501074
doi:10.5281/zenodo.20696
url:http://zenodo.org/record/20696
DOI
10.5281/zenodo.20696
Copyright/License
Open Access
Collections
OAI Harvested Content