
Automated data collection with R : a practical guide to web scraping and text mining / Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis

Resource type: Book (Online)
Language: English
Publisher: Chichester : Wiley, 2014
Description: Online resource (XXII, 452 pp.)
ISBN:
  • 9781322236414
  • 1322236410
  • 9781118834787
  • 9781118834800
Additional physical formats: 9781322236414 | 9781118834817 | Also published as: Automated data collection with R. Print edition. Chichester : Wiley, 2015. xxii, 452 pages
DDC classification:
  • 006.312
  • 006.3/12
  • COM021030
RVK: ST 250
LOC classification:
  • QA76.9 .D343 M865 2014
  • QA76.9.D343
Contents:
Automated Data Collection with R; Contents; Preface; What you won't learn from reading this book; Why R?; Recommended reading to get started with R; Typographic conventions; The book's website; Disclaimer; Acknowledgments; 1 Introduction; 1.1 Case study: World Heritage Sites in Danger; 1.2 Some remarks on web data quality; 1.3 Technologies for disseminating, extracting, and storing web data; 1.3.1 Technologies for disseminating content on the Web; 1.3.2 Technologies for information extraction from web documents; 1.3.3 Technologies for data storage; 1.4 Structure of the book
Part One A Primer on Web and Data Technologies; 2 HTML; 2.1 Browser presentation and source code; 2.2 Syntax rules; 2.2.1 Tags, elements, and attributes; 2.2.2 Tree structure; 2.2.3 Comments; 2.2.4 Reserved and special characters; 2.2.5 Document type definition; 2.2.6 Spaces and line breaks; 2.3 Tags and attributes; 2.3.1 The anchor tag <a>; 2.3.2 The metadata tag <meta>; 2.3.3 The external reference tag <link>; 2.3.4 Emphasizing tags <b>, <i>, <strong>; 2.3.5 The paragraphs tag <p>; 2.3.6 Heading tags <h1>, <h2>, <h3>, ...; 2.3.7 Listing content with <ul>, <ol>, and <dl>
2.3.8 The organizational tags <div> and <span>; 2.3.9 The <form> tag and its companions; 2.3.10 The foreign script tag <script>; 2.3.11 Table tags <table>, <tr>, <td>, and <th>; 2.4 Parsing; 2.4.1 What is parsing?; 2.4.2 Discarding nodes; 2.4.3 Extracting information in the building process; Summary; Further reading; Problems; 3 XML and JSON; 3.1 A short example XML document; 3.2 XML syntax rules; 3.2.1 Elements and attributes; 3.2.2 XML structure; 3.2.3 Naming and special characters; 3.2.4 Comments and character data; 3.2.5 XML syntax summary; 3.3 When is an XML document well formed or valid?
3.4 XML extensions and technologies; 3.4.1 Namespaces; 3.4.2 Extensions of XML; 3.4.3 Example: Really Simple Syndication; 3.4.4 Example: scalable vector graphics; 3.5 XML and R in practice; 3.5.1 Parsing XML; 3.5.2 Basic operations on XML documents; 3.5.3 From XML to data frames or lists; 3.5.4 Event-driven parsing; 3.6 A short example JSON document; 3.7 JSON syntax rules; 3.8 JSON and R in practice; Summary; Further reading; Problems; 4 XPath; 4.1 XPath - a query language for web documents; 4.2 Identifying node sets with XPath; 4.2.1 Basic structure of an XPath query; 4.2.2 Node relations
4.2.3 XPath predicates; 4.3 Extracting node elements; 4.3.1 Extending the fun argument; 4.3.2 XML namespaces; 4.3.3 Little XPath helper tools; Summary; Further reading; Problems; 5 HTTP; 5.1 HTTP fundamentals; 5.1.1 A short conversation with a web server; 5.1.2 URL syntax; 5.1.3 HTTP messages; 5.1.4 Request methods; 5.1.5 Status codes; 5.1.6 Header fields; 5.2 Advanced features of HTTP; 5.2.1 Identification; 5.2.2 Authentication; 5.2.3 Proxies; 5.3 Protocols beyond HTTP; 5.3.1 HTTP Secure; 5.3.2 FTP; 5.4 HTTP in action; 5.4.1 The libcurl library; 5.4.2 Basic request methods
5.4.3 A low-level function of RCurl
Machine generated contents note: Dedication -- Table of Contents -- List of Figures -- List of Tables -- Preface -- 1 Introduction -- 1.1 Case Study: World Heritage Sites in Danger -- 1.2 Some Remarks on Web Data Quality -- 1.3 Technologies for Disseminating, Extracting and Storing Web Data -- 1.3.1 Technologies for disseminating content on the Web -- 1.4 Structure of the Book
Part One A Primer on Web and Data Technologies -- 2 HTML -- 2.1 Browser Presentation and Source Code -- 2.2 Syntax Rules -- 2.3 Tags and Attributes -- 2.4 Parsing -- Summary -- Further Reading -- Problems -- 3 XML and JSON -- 3.1 A Short Example XML Document -- 3.2 XML Syntax Rules -- 3.3 When Is an XML Document Well-formed or Valid? -- 3.4 XML Extensions and Technologies -- 3.5 XML and R in Practice -- 3.6 A Short Example JSON Document -- 3.7 JSON Syntax Rules -- 3.8 JSON and R in Practice -- Summary -- Further Reading -- Problems -- 4 XPath -- 4.1 XPath - a Querying Language for Web Documents -- 4.2 Identifying Node Sets with XPath -- 4.3 Extracting Node Elements -- Summary -- Further Reading -- Problems -- 5 HTTP -- 5.1 HTTP Fundamentals -- 5.2 Advanced Features of HTTP -- 5.3 Protocols beyond HTTP -- 5.4 HTTP in Action -- Summary -- Further Reading -- Problems -- 6 AJAX -- 6.1 JavaScript -- 6.2 XHR -- 6.3 Exploring AJAX with Web Developer Tools -- Summary -- Further Reading -- Problems -- 7 SQL and Relational Databases -- 7.1 Overview and Terminology -- 7.2 Relational Databases -- 7.3 SQL: a Language to Communicate with Databases -- 7.4 Databases in Action -- Summary -- Further Reading -- Problems -- 8 Regular Expressions and String Functions -- 8.1 Regular Expressions -- 8.2 String Processing -- 8.3 A Word on Character Encodings -- Summary -- Further Reading -- Problems
Part Two A Practical Toolbox for Web Scraping and Text Mining -- 9 Scraping the Web -- 9.1 Retrieval Scenarios -- 9.2 Extraction Strategies -- 9.3 Web Scraping: Good Practice -- 9.4 Valuable Sources of Inspiration -- Summary -- Further Reading -- Problems -- 10 Statistical Text Processing -- 10.1 The running example: classifying press releases of the British government -- 10.2 Processing Textual Data -- 10.3 Supervised Learning Techniques -- 10.4 Unsupervised Learning Techniques -- Summary -- Further reading -- 11 Managing Data Projects -- 11.1 Interacting with the File System -- 11.2 Processing Multiple Documents/Links -- 11.3 Organizing Scraping Procedures -- 11.4 Executing R Scripts on a Regular Basis
Part Three A Bag of Case Studies -- 12 Collaboration Networks in the U.S. Senate -- 12.1 Information on the Bills -- 12.2 Information on the Senators -- 12.3 Analyzing the network structure -- 12.4 Conclusion -- 13 Parsing Information from Semi-Structured Documents -- 13.1 Downloading Data from the FTP Server -- 13.2 Parsing Semi-Structured Text Data -- 13.3 Visualizing station and temperature data -- 14 Predicting the 2014 Academy Awards using Twitter -- 14.1 Twitter APIs: Overview -- 14.2 Twitter-based Forecast of the 2014 Academy Awards -- 14.3 Conclusion -- 15 Mapping the Geographic Distribution of Names -- 15.1 Developing a Data Collection Strategy -- 15.2 Web Site Inspection -- 15.3 Data Retrieval and Information Extraction -- 15.4 Mapping Names -- 15.5 Automating the Process -- 15.6 Summary -- 16 Gathering Data on Mobile Phones -- 16.1 Page Exploration -- 16.2 Scraping Procedure -- 16.3 Graphical Analysis -- 16.4 Data storage -- 17 Analyzing Sentiments of Product Reviews -- 17.1 Introduction -- 17.2 Collecting the data -- 17.3 Analyzing the Data -- 17.4 Conclusion -- References -- Bibliography -- Indices -- General Index -- Package Index -- Function Index
Summary: A hands-on guide to web scraping and text mining for both beginners and experienced users of R. Introduces fundamental concepts of the main architecture of the web and databases and covers HTTP, HTML, XML, JSON, and SQL. Provides basic techniques to query web documents and data sets (XPath and regular expressions). An extensive set of exercises is presented to guide the reader through each technique. Explores both supervised and unsupervised techniques as well as advanced techniques such as data scraping and text management. Case studies are featured throughout, along with examples for each technique presented. R code and solutions to exercises featured in the book are provided on a supporting website.
PPN: 816332533
Package identifier: ZDB-26-MYL | ZDB-30-PAD | ZDB-30-PQE
No physical items for this record