by Roberto Mascarenhas Braga
So you like technology and decided to start a company? Great! We decided to write this article to share a few things we have figured out since we founded IPe. Of course there is no recipe or magic formula, but we have noticed traits that many successful businesses have in common, and we have learned quite a bit from our own mistakes too. Here we go...
- Stop talking and start doing
Entrepreneurs are usually eloquent, talkative people. That is great; in fact, it is one of the most desirable traits in someone who runs a business. However, it is common for new business owners to plan too much, talk too much, project too much and forget to get their hands dirty. Don't fall into that trap! As soon as possible, start acting: map potential customers, check out market competitors, build a prototype... Believe us, this helps a lot, and ideas for the new business will start boiling in your head!
- Dedicate yourself to the business
When we started IPe, we were doing several other things. Besides studying (we were still in college), we had internships, freelance gigs, side projects... In our heads, this was a way to support ourselves while the company took off. However, we noticed that even when we managed to juggle our time, we never gave the company the push it deserved, because we had a guaranteed source of income. It is the famous comfort zone: we only outdo ourselves when we are put in extreme situations. Our solution was to gradually drop those activities and dedicate ourselves more and more to the business alone. The money to get by? Read the next topic! :)
- Start with services, focus on products
Services are a great way to start a company: they require no investment in machinery or raw materials, and they are a source of revenue. If you know how to do something your potential customers don't, you can sell it!
However, service companies have to deal with a few problems. Working for third parties is not very gratifying. Entrepreneurs like to set their own deadlines, goals and objectives. In a project for a third party, that is not always possible, and you end up becoming your client's employee. Another point is that it is hard to keep the quality bar high as the company grows.
Our suggestion is to have, from the start, a golden goose: a product idea that will be what really brings money to your company in the long run. The time your team spends on services should be shared with the time spent developing this product. Even better is to find services that bring new ideas to the product or that serve as a test for it. That way you generate revenue for the company without halting product development.
- Get close to other entrepreneurs
Believe us, many more people are going through the same doubts and challenges you are. Talk to friends who have companies and see what they have to say. Go to Sebrae events, universities, incubators. These are great opportunities to think more about your company and about what you can do to make it more interesting. Staying isolated while developing a product does not go well with entrepreneurship. You need to know what is happening around you.
- Work hard, but don't commit to your mistakes
We sometimes tend to think that starting a good business requires a great, revolutionary idea that will make us millions. That is only partly true. More than having the good idea, you need planning, dedication and sweat. A good example is the movie "The Social Network", which tells the story of Facebook. Friendster, Hi5, MySpace, Orkut... There were already several social networks out there. What made Zuckerberg's business take off was the dedication to building something really good (night after night of hard work), incorporating small innovations (reputation by college, relationship status) that helped Facebook stick.
The lesson is: even if your idea is not all that innovative, dedicate yourself to doing it better than what is already on the market. If you do it with passion, your dedication will most likely turn it into good work, and soon your work will have fans who will do the word-of-mouth for you.
However, a big danger is falling so in love with the idea that you stop seeing its problems. Always keep a team of advisors/gurus nearby, friends who can warn you if they see you heading down a very wrong path. Don't be afraid to change course, to think up a new idea or to rethink the business.
Of course there is much more to it, but we hope this article serves as inspiration. The comments are there so we can keep the conversation going!
Open Source Cloud Computing with Hadoop
Have you ever wondered how Google, Facebook and other Internet giants process their massive workloads? Billions of requests are served every day by the biggest players on the Internet, resulting in background processing involving datasets at the petabyte scale. Of course, they rely on Linux and cloud computing to obtain the necessary scalability and performance. The flexibility of Linux combined with the seamless scalability of cloud environments provides the perfect framework for processing huge datasets while eliminating the need for expensive infrastructure and custom proprietary software. Nowadays, Hadoop is one of the best choices in open source cloud computing, offering a platform for large scale data crunching.
Introduction
In this article we introduce and analyze the Hadoop project, which has been embraced by many commercial and scientific initiatives that need to process huge datasets. It provides a full platform for large scale dataset processing in cloud environments and is easily scalable, since it can be deployed on heterogeneous cluster infrastructure and commodity hardware. As of April 2011, Amazon, AOL, Adobe, eBay, Google, IBM, Twitter, Yahoo and several universities are listed as users on the project's wiki. Maintained by the Apache Foundation, Hadoop comprises a full suite for seamless distributed, scalable computing on huge datasets. It provides base components on top of which new distributed computing subprojects can be implemented. Among its main components are an open source implementation of the MapReduce framework (for distributed data processing) and a data storage solution composed of a distributed filesystem and a data warehouse.
The MapReduce Framework
The MapReduce framework was created and patented by Google in order to run its PageRank algorithm and other applications that support its search engine. The idea behind it was actually introduced many years ago by the first functional programming languages, such as LISP, and basically consists of partitioning a large problem into several "smaller" problems that can be solved separately. The partitioning and the computation of the main problem's final result are handled by two functions: Map and Reduce. In terms of data processing, the Map function takes a large dataset and partitions it into several smaller intermediate datasets that can be processed in parallel by different nodes in a cluster. The Reduce function then takes the separate results of each computation and aggregates them to form the final output. The power of MapReduce can be leveraged by different applications to perform operations such as sorting and statistical analysis on large datasets, which may be mapped into smaller partitions and processed in parallel.
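To make that division of labour concrete, here is a minimal, single-machine sketch of the MapReduce idea in Python. Word counting is the usual toy example; the input chunks and function names are illustrative only, not part of any Hadoop API:

```python
from collections import defaultdict

def map_fn(chunk):
    # Map: turn one partition of the input into (key, value) pairs
    return [(word, 1) for word in chunk.split()]

def reduce_fn(word, counts):
    # Reduce: aggregate every value emitted for a single key
    return word, sum(counts)

# Toy driver. On a real cluster each map_fn call would run on a different
# worker node, and the grouping step (the "shuffle") would move data between nodes.
chunks = ["the quick brown fox", "the lazy dog jumps", "the end"]
grouped = defaultdict(list)
for chunk in chunks:
    for word, one in map_fn(chunk):
        grouped[word].append(one)

result = dict(reduce_fn(word, counts) for word, counts in grouped.items())
print(result)   # {'the': 3, 'quick': 1, 'brown': 1, ...}
```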
Hadoop MapReduce
Hadoop includes a Java implementation of the MapReduce framework, its underlying components and the necessary large scale data storage solutions. Although application programming is mostly done in Java, it provides APIs in different languages, such as Ruby and Python, allowing developers to integrate Hadoop into diverse existing applications. It was first inspired by Google's implementation of MapReduce and the GFS distributed filesystem, absorbing new features as the community proposed new subprojects and improvements. Currently, Yahoo is one of the main contributors to the project, making public the modifications carried out by its internal developers. The basis of Hadoop and its several subprojects is the Core, which provides components and interfaces for distributed I/O and filesystems. The Avro data serialization system is also an important building block, providing cross-language RPC and persistent data storage.
On top of the Core sits the actual implementation of MapReduce and its APIs, including Hadoop Streaming, which allows flexible development of Map and Reduce functions in any desired language. A MapReduce cluster is composed of a master node and a cloud of several worker nodes. The nodes in this cluster may be any Java-enabled platform, but large Hadoop installations are mostly run on Linux due to its flexibility, reliability and lower TCO. The master node manages the worker nodes, receiving jobs and distributing the workload across the nodes. In Hadoop terminology, the master node runs the JobTracker, responsible for handling incoming jobs and allocating nodes for performing separate tasks. Worker nodes run TaskTrackers, which offer virtual task slots that are allocated to specific map or reduce tasks depending on their access to the necessary input data and overall availability. Hadoop offers a web management interface that allows administrators to obtain information on the status of jobs and individual nodes in the cloud. It also allows fast and easy scaling through the addition of cheap worker nodes without disrupting regular operations.
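As a rough illustration of the Streaming interface mentioned above, the pair of scripts below counts HTTP status codes in web server access logs. A streaming mapper and reducer simply read lines from standard input and write tab-separated key/value pairs to standard output; the log layout assumed here (status code as the 9th whitespace-separated field) and the script names are assumptions for the example. Such scripts would typically be passed to the hadoop-streaming jar via its -mapper and -reducer options.

```python
#!/usr/bin/env python
# mapper.py -- emit one ("status_code", 1) pair per access-log line.
# Assumes a common log format where the status code is the 9th field.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) > 8:
        print("%s\t1" % fields[8])
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts for each status code.
# Hadoop sorts the mapper output by key, so equal keys arrive consecutively.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, total))
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, total))
```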
HDFS: A distributed filesystem
The main use of the MapReduce framework is in processing large volumes of data, and before any processing takes place it is necessary to first store this data in some volume accessible by the MapReduce cluster. However, it is impractical to store such large data sets on local filesystems, and much more impractical to synchronize the data across the worker nodes in the cluster. In order to address this issue, Hadoop also provides the Hadoop Distributed Filesystem (HDFS), which easily scales across the several nodes in a MapReduce cluster, leveraging the storage capacity of each node to provide storage volumes in the petabyte scale. It eliminates the need for expensive dedicated storage area network solutions while offering similar scalability and performance. HDFS runs on top of the Core and is perfectly integrated into the MapReduce APIs provided by Hadoop. It is also accessible via command line utilities and the Thrift API, which provides interfaces for various programming languages, such as Perl, C++, Python and Ruby. Furthermore, a FUSE (Filesystem in Userspace) driver can be used to mount HDFS as a standard filesystem.
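As a small illustration of that last point, once HDFS has been mounted through the FUSE driver it can be read with ordinary file operations; the mount point and directory below are hypothetical:

```python
import os

HDFS_MOUNT = "/mnt/hdfs"                                 # hypothetical FUSE mount point
LOG_DIR = os.path.join(HDFS_MOUNT, "user/demo/logs")     # hypothetical HDFS directory

# Plain listdir/open calls work because FUSE makes HDFS look like a
# standard local filesystem to every application.
for name in sorted(os.listdir(LOG_DIR)):
    with open(os.path.join(LOG_DIR, name)) as f:
        first_line = f.readline().rstrip()
    print("%s: %s" % (name, first_line))
```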
In a typical HDFS+MapReduce cluster, the master node runs a NameNode, while the rest of the (worker) nodes run DataNodes. The NameNode manages HDFS volumes and is queried by clients to carry out standard filesystem operations such as adding, copying, moving or deleting files. The DataNodes do the actual data storage, receiving commands from the NameNode and performing operations on locally stored data. In order to increase performance and optimize network communications, HDFS implements rack awareness. This feature enables the distributed filesystem and the MapReduce environment to determine which worker nodes are connected to the same switch (i.e. in the same rack), distributing data and allocating tasks in such a way that communication takes place between nodes in the same rack without overloading the network core. HDFS and MapReduce automatically manage which pieces of a given file are stored on each node, allocating nodes for processing these data accordingly. When the JobTracker receives a new job, it first queries the DataNodes of worker nodes in the same rack, allocating a task slot if the node has the necessary data stored locally. If no available slots are found in the rack, the JobTracker then allocates the first free slot it finds.
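The scheduling preference just described can be summarized with a small sketch. This is not Hadoop's actual JobTracker code, only an illustration of the data-locality order it follows (node-local slot first, then rack-local, then any free slot); all names are made up for the example:

```python
def pick_slot(free_slots, data_nodes, rack_of):
    """free_slots: nodes with an open TaskTracker slot.
    data_nodes: nodes that store the task's input block locally.
    rack_of:    mapping from node name to rack identifier."""
    # 1. Prefer a node that already holds the input data (no network transfer).
    for node in free_slots:
        if node in data_nodes:
            return node
    # 2. Otherwise prefer a node in the same rack as the data (stays off the core).
    data_racks = {rack_of[n] for n in data_nodes}
    for node in free_slots:
        if rack_of[node] in data_racks:
            return node
    # 3. Fall back to any free slot in the cluster.
    return free_slots[0] if free_slots else None

# Example: the data lives on "w3" (rack "r2"); "w3" is busy, so the rack-local
# node "w4" is chosen before the off-rack node "w1".
print(pick_slot(["w1", "w4"], ["w3"], {"w1": "r1", "w3": "r2", "w4": "r2"}))  # -> w4
```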
Hive: A petabyte scale database
On top of the HDFS distributed filesystem, Hadoop implements Hive, a distributed data warehouse solution. Hive actually started as an internal project at Facebook and has since evolved into a full-blown project of its own, maintained by the Apache Foundation. It provides ETL (Extract, Transform and Load) features and QL, a query language similar to standard SQL. Hive queries are translated into MapReduce jobs run on table data stored on HDFS volumes. This allows Hive to process queries involving huge datasets with performance comparable to MapReduce jobs while providing the abstraction level of a database. Its benefits are most apparent when running queries over large datasets that do not change frequently. For example, Facebook relies on Hive to store user data, run statistical analysis, process logs and generate reports.
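As a rough sketch of what using Hive looks like in practice, the query below (the table and column names are hypothetical) aggregates successful page hits per day. One common way to run a single statement from a script is the command-line client's -e flag; Hive then plans the query as one or more MapReduce jobs over the table's files on HDFS:

```python
import subprocess

# Hypothetical table: web_logs(url STRING, status INT, dt STRING), stored on HDFS.
query = """
SELECT dt, COUNT(*) AS hits
FROM web_logs
WHERE status = 200
GROUP BY dt;
"""

# "hive -e" executes one QL statement and prints the aggregated result.
subprocess.check_call(["hive", "-e", query])
```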
Conclusion
We have given a brief overview of the main features and components of Hadoop. Leveraging the power of cloud computing, many large companies rely on this project to perform their day-to-day data processing. This is yet another example of open source software being used to build large-scale applications while keeping costs low. However, we have only scratched the surface of the fascinating infrastructure behind Hadoop and its many possible uses. In future articles we will see how to set up a basic Hadoop cluster and how to use it for interesting applications such as log parsing and statistical analysis.
Further Reading
If you are interested in learning more about Hadoop's architecture, administration and application development, these are the best places to start:
- Hadoop: The Definitive Guide, Tom White, O'Reilly Media/Yahoo Press, 2nd edition, 2010
- Apache Hadoop Project homepage: http://hadoop.apache.org/
Originally published at: http://www.linuxjournal.com/content/open-source-cloud-computing-hadoop
Cloud Forensics: Challenges and Possible Solutions
Introduction
The cloud computing paradigm basically consists of moving applications, services and data to virtual distributed infrastructures. It is further characterized by the provision of computing power and storage as a service, responding to varying client demands. In this paradigm, instead of deploying its own applications and infrastructure, a client relies on a cloud service provider to obtain such resources as services on demand. This dramatically improves the availability, reliability and scalability of applications and systems, while reducing the total cost of ownership. On the other hand, the client does not have direct access to the underlying information resources and has to implicitly trust the provider.
Digital information resources are commonly used to perpetrate illegal actions and crimes, which may leave behind relevant evidence. Digital forensics and investigation methods are employed in order to obtain such evidence and solve digital crimes. Forensic analysis of information resources may focus on different layers (e.g. file systems, network, volatile memory) to obtain the information necessary to elucidate a given malicious activity. For example, an investigator's needs may range from retrieving a single deleted file to reconstructing network activity and running processes. Several methods have been proposed to address different forensic purposes, all of them requiring physical access to the resources involved.
It is clearly impractical to track data, services and communications as they migrate between different underlying resources in a cloud computing environment. It may even be impossible to determine where a given piece of data is stored or which system originated a given network connection. Since it is practically impossible to physically access the virtual distributed data and communications of a cloud computing environment, regular forensics methods are not effective in such scenarios. Moreover, depending on implementation specific characteristics, no adequate log generation mechanisms may be available, increasing the difficulties in analysing and auditing malicious activities in these systems.
Since it is impossible to obtain physical access to cloud computing resources, one viable approach is a digital forensics framework that acts as a middleware between cloud applications and the underlying computing and storage resources. It would capture relevant information on filesystem, network and operating system operations as they are requested by applications and processed by the cloud infrastructure. This information would be aggregated at a central repository for further analysis in case of future investigations. Being completely application agnostic and software based, this solution is compatible with current environments and may be easily deployed.
Digital Forensics
Digital forensics aims at unearthing such evidence and providing valuable data on the possibly malicious actions conducted on a digital resource. Note that by digital evidence we mean any information stored, generated or transmitted by digital information resources, e.g. images, documents, network traffic and files. The different steps of a digital investigation process may be summarized as follows:
1) Preservation: The first step of any investigation process, digital or not, is to preserve evidence and subsequently collect it in conditions useful for further examination. Once an incident is identified, actions must be taken to ensure that the relevant affected digital resources remain unaltered, preserving the same state as immediately after the incident occurred. These actions may vary according to the type of digital source and the collection process.
2) Collection: This process usually involves copying the evidence to the system where it will actually be analysed. Note that precautions must be taken not to modify the evidence data as it is copied, since that would affect the accuracy of the investigation or even render the evidence useless. Collection is performed under two main scenarios: live and post-mortem. The classical (post-mortem) collection process involves capturing evidence (mostly filesystem data) from a system that was previously taken offline (often right after the incident occurred). In a live collection process, digital evidence is collected from a running system, which may be actively serving client requests. This process aims at taking a snapshot of the system's current state, allowing the investigator to analyse items such as RAM contents and volatile operating system parameters.
3) Validation and Identification: The validation phase consists of ensuring that the acquired evidence was not corrupted during the collection process, providing accurate data for the investigation. Validation is usually carried out using hashing techniques, which allow the investigator to efficiently detect modifications in the collected evidence by comparing its hashes with the hashes of the original data (see the sketch below). The same hashing techniques can be used for identification, a process that aims at assigning unique tags to the collected evidence, allowing the investigator to determine to which system or resource the evidence is related.
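A minimal sketch of the validation step, assuming a SHA-256 digest was recorded when the evidence was first acquired; the file name and stored digest are placeholders:

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    # Hash the file in chunks so arbitrarily large evidence images fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

recorded_at_acquisition = "..."   # digest noted during collection (placeholder)
if file_digest("evidence/disk_image.dd") == recorded_at_acquisition:
    print("validated: the working copy matches the acquired evidence")
else:
    print("validation failed: the copy was modified or corrupted")
```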
Here we will focus on the preservation, collection, validation and identification of evidence in cloud environments, since the other steps can be carried out with standard tools.
Network Forensics
Traditionally, digital forensics has been performed on filesystems and individual files. However, since most current information resources are connected to a network (be it local or the Internet) and many applications have moved to the web, much valuable evidence is lost when only locally stored or processed data is considered. Network forensics methods have been developed to address this problem, providing means of analysing network traffic and identifying relevant evidence within it. It has also been observed that network forensics may help increase information security in networked systems. Network traffic may provide useful information on the activities conducted on a given networked information resource, help determine remote attackers' identities and techniques, and even contain information (such as documents and messages) that could not be retrieved from a local storage device.
Cloud Computing
The constant and rapid growth in the volume of data and communications in current networks and information systems raises the need for efficient scalable data storage and processing techniques. In order to address this issue, a new paradigm commonly called cloud computing was introduced. In this section we present the main characteristics of current cloud computing models and widely adopted architectures.
The main characteristic of the cloud computing paradigm is the transparent decentralization of data processing and storage through clustered environments that seamlessly scale to fit the constantly increasing demand, offering high performance and achieving efficient response times. Cloud computing based applications and systems run on virtual environments that may be distributed across several different physical information resources, migrating over different resources in order to respond to client demand. Such virtual environments are ubiquitously accessible through networks and provide diverse applications as network services. This approach has several advantages over the traditional centralized client server architecture, since it eliminates single points of failure (improving availability and reliability) and provides seamless scalability for constantly growing demands.
Architectures and models for cloud computing address mainly two scenarios: massive data processing and providing services for end users. Frameworks such as MapReduce focus on storing and processing large volumes of data, acting as a back-end service for data intensive systems and providing information for end user applications. Such environments are usually not directly accessed by end users. In this article, we focus on frameworks such as Amazon EC2, which provide applications and full operating systems as services. These frameworks are commonly referred to as market-oriented cloud computing and leverage technologies such as virtual machines (VMs) and storage area networks (SANs) to provide end users with ubiquitous access to virtual systems, which may be physically hosted in dynamically changing locations.
A common architecture for market-oriented clouds is based mainly on VM and SAN technology and provides access to virtual systems hosted on dynamic physical resources. End users access a front-end interface that redirects them to their respective VMs, which are dynamically allocated on different physical systems depending on the Service Level Agreement (SLA). The VM data is stored in background SANs, which offer transparent access to data across different systems. A similar approach is taken to provide virtual cloud applications: a cloud service provider runs the application it is serving on VMs that dynamically migrate across its underlying infrastructure.
Challenges and Issues in Cloud Forensics
Although cloud computing has several benefits to end users, it poses new challenges to digital forensics as regular digital forensic techniques cannot be applied in such environments. In this section we point out the main issues in conducting digital forensics on cloud computing environments and the challenges that have yet to be addressed by current methods.
The collection process of a digital investigation requires physical access to the digital resource, since the compromised or potentially malicious operating system of a digital resource cannot be trusted to provide honest responses. However, in a cloud computing environment it is impossible to determine exactly where a VM (or application) was executed, as it dynamically migrates across physical systems. Notice that it is necessary to access implementation-specific metadata in order to determine where each VM (or application) is running and to correlate a VM (or application) to its user. Thus, an investigator cannot perform live collection and extract data such as operating system state or RAM contents.
Deeper issues are present in the storage back-end, which is usually composed of one or more storage area networks with distributed disk drives. Once again, it is impossible to determine where the data pertaining to a specific application or VM is physically stored. Moreover, even if one can track down the exact disk array that stores such data, it is impractical to forensically reconstruct disk array data without accessing the controller and its potentially compromised metadata. This renders cloud filesystem forensics virtually impossible.
Network forensics is also severely affected by the inherently dynamic and massive nature of cloud computing environments. It is clearly infeasible to capture all traffic originating from and directed at a cloud environment. Furthermore, the dynamically changing allocation of VMs makes it impossible to track the exact origin of network traffic and activities inside the cloud environment.
A Framework for Cloud Forensics
Bearing in mind the various issues in cloud computing forensics, we describe a potential framework for digital forensics in cloud computing environments that addresses the main evidence collection requirements. The objective here is to establish some guidelines towards efficient cloud forensics analysis.
The main concept of this framework is to collect relevant evidence as it is generated by cloud based applications and aggregate it at a central system for further analysis. It addresses three aspects of digital forensic evidence collection: filesystem, network and operating system state. It is completely software based, being composed of an evidence broker middleware running on each physical resource and a central evidence repository. We consider that each evidence broker has previously registered a digital certificate with the evidence repository (which also acts as a certificate authority), subsequently using this certificate to sign evidence data.
The evidence brokers run between the VMs or cloud applications and their underlying physical resources, intercepting any important filesystem, network or operating system operations. They can be implemented with minor modifications to the underlying infrastructure through techniques such as API hooking and loadable kernel modules. Once an operation considered to be relevant evidence is performed, the evidence broker captures data on this operation. It then computes a suitable hash (SHA-1 or the more recent SHA-2) of the captured data and creates an evidence ticket composed of the captured data, its hash, a timestamp and a description (specifying whether the evidence pertains to a filesystem, network or operating system event). The evidence ticket is then signed and sent to the evidence repository.
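A simplified sketch of how a broker might assemble and sign a ticket. For brevity it uses an HMAC with a shared key as a stand-in for the certificate-based signatures described above; the field names, key and captured event are illustrative assumptions, not part of any existing implementation:

```python
import hashlib, hmac, json, time

BROKER_KEY = b"hypothetical-broker-secret"   # stand-in for the broker's registered certificate

def make_ticket(captured: str, kind: str) -> dict:
    ticket = {
        "data": captured,                                        # description of the intercepted operation
        "hash": hashlib.sha256(captured.encode()).hexdigest(),   # SHA-2 digest of the captured data
        "timestamp": time.time(),
        "description": kind,                                     # "filesystem", "network" or "os"
    }
    payload = json.dumps(ticket, sort_keys=True).encode()
    ticket["signature"] = hmac.new(BROKER_KEY, payload, hashlib.sha256).hexdigest()
    return ticket

# Example: a file deletion intercepted on a virtual filesystem.
ticket = make_ticket("unlink /var/www/uploads/shell.php by uid 33", "filesystem")
```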
Filesystem evidence is collected for each operation performed on a virtual filesystem hosted in the cloud environment. The captured data consists of a description of the operation and its parameters, which can be obtained directly from the filesystem system call. The actual file contents are not captured, since it would be impractical to transfer and store such a large volume of data.
In order to collect network traffic, the evidence brokers use network intrusion detection techniques to identify attack signatures and patterns that may indicate relevant activities. The packets that correspond to a given signature or pattern are fully captured (including their payload) and sent in the evidence ticket.
Operating system information is constantly captured by the evidence brokers and sent to the evidence repository. Evidence brokers keep track of sensitive OS security alerts, processes, changes to user databases and previously determined log files. Any activity concerning these items is captured and sent to the evidence repository.
Upon receiving an evidence ticket, the evidence repository first verifies its authenticity, rejecting it if the verification fails. Otherwise, it verifies its integrity and stores the evidence in a central database if it was not altered. The hash is used as an index field that uniquely identifies the piece of evidence and allows fast searches. Moreover, the database record contains the timestamp and description of the event to which the evidence is related, allowing analysis mechanisms to construct timelines. The evidence repository must be hosted on infrastructure independent from the cloud environment in order to ensure the preservation and reliability of the evidence.
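And a matching sketch of the repository side, continuing the HMAC assumption from the broker example: verify the signature, re-check the hash, and store the ticket indexed by that hash (here in a local SQLite database, purely for illustration):

```python
import hashlib, hmac, json, sqlite3

def verify_and_store(ticket: dict, key: bytes, db: sqlite3.Connection) -> bool:
    signature = ticket.pop("signature", "")
    payload = json.dumps(ticket, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return False    # authenticity check failed: reject the ticket
    if hashlib.sha256(ticket["data"].encode()).hexdigest() != ticket["hash"]:
        return False    # integrity check failed: the evidence was altered
    db.execute(
        "INSERT OR IGNORE INTO evidence(hash, ts, description, data) VALUES (?, ?, ?, ?)",
        (ticket["hash"], ticket["timestamp"], ticket["description"], ticket["data"]),
    )
    db.commit()
    return True

db = sqlite3.connect("evidence.db")
db.execute("CREATE TABLE IF NOT EXISTS evidence"
           "(hash TEXT PRIMARY KEY, ts REAL, description TEXT, data TEXT)")
# Usage (with a ticket produced by the broker sketch above):
#   accepted = verify_and_store(ticket, BROKER_KEY, db)
```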
Conclusion
The growing adoption of cloud computing based applications and services has rendered current digital forensics methods ineffective. The virtual, decentralized nature of cloud computing resources makes it impossible to apply regular investigation and forensic techniques, raising the need for new methodologies adapted to this paradigm. The digital forensics and investigation framework for cloud environments described in this article aims at collecting and aggregating the necessary forensic data as it is generated by the applications and underlying infrastructure. It seems that this framework can be implemented entirely in software with minor alterations to the underlying infrastructure and operating systems. Furthermore, being application agnostic, it may be readily used with current applications. However, it has not been implemented, and is still waiting for any adventurous, talented programmers out there who would take on the challenge.
What Is Ubiquitous Computing
Since I started my final graduation project this semester, I am going to post a little about what I have been studying: Ubiquitous Computing (which I will affectionately call U.C.).
Context
The world is evolving toward a new computing paradigm: Ubiquitous Computing. It consists of the transparent interaction of several computers with a user (pervasive computing) wherever the user may be (mobile computing).
Before ubiquitous computing emerged, there were two other generations of computing:
- 1st generation: Mainframe - one computer for many users; and
- 2nd generation: PC - one computer for each user.
The technological advances in the hardware created during these two generations, together with the creation, spread and evolution of computer networks, made the emergence of U.C. possible.
As mentioned before, U.C. is composed of two other kinds of computing:
- Pervasive Computing
- Mobile Computing
Pervasive Computing
First, let us understand what being pervasive means.
Definition: To be pervasive is to spread out, to diffuse everywhere, to propagate or extend completely through diverse channels, technologies, systems, devices...
The goal of pervasive computing is to make the use of the computer transparent to the user. In other words, it aims to let a person without deep technical knowledge enjoy the benefits brought by the computing systems distributed in an environment.
An example of a pervasive computing system is a smart home. In it, the devices know each other and can interact. This kind of house has a physical, software-controlled hub (a sort of mainframe) that is able to interact with and control the other devices previously installed in the house. Note that the computers present there (temperature sensors, lighting controllers, security cameras...) are spread around the environment, and the user moves among them without necessarily noticing they are there and without necessarily knowing they are controlled by software. Also note that, although controllable and configurable, this environment does not necessarily adapt to each user's needs; that is, it does not interact with context.
Mobile Computing
Mobile computing is what allows a user to use a computer while moving anywhere. It makes it possible (assuming standard temperature and pressure and a perfectly spherical, pink Earth) to use services provided by a computer wherever one goes and while on the move.
Ubiquitous Computing
The spread of the PC, the Internet and LANs, together with smart devices and sensor networks, made possible environments in which users are immersed in technology. U.C. comes to make these environments able to interact with the user's needs in a transparent, personalized way wherever the user goes. In other words, it makes devices communicate with each other, making them more useful and easier to use than each device would be on its own.
To reach this goal, the first challenge of ubiquitous computing is to be able to interact with a user wherever they are and to use the resources available around them in a transparent, distributed way.
Among the many applications of U.C., we can highlight the following:
- Smart homes
- Health monitoring
- Environmental monitoring
- Intelligent transportation systems
In future posts whose titles start with U.C. (look in the "menu" at the top of the right-hand column or in the search box at the top of the blog) I will explain more about Ubiquitous Computing, its applications, limitations, requirements... For now I just wanted to give a panoramic view of what it is.
See you in the next post!